Allow linalg.lstsq to use svd to compute the result for rank deficient matrices. #125110

Closed
wants to merge 237 commits

Commits on Apr 28, 2024

  1. Add logic for lstsq to be able to use the SVD driver as a backend for when matrices are rank deficient.
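
    A minimal usage illustration (not this PR's new code path) of how an SVD-based LAPACK driver already handles a rank-deficient system on CPU via `torch.linalg.lstsq`:

    ```python
    import torch

    A = torch.randn(6, 4, dtype=torch.float64)
    A[:, 3] = A[:, 0] + A[:, 1]                  # make A rank deficient (rank 3)
    B = torch.randn(6, 1, dtype=torch.float64)

    # "gelsd"/"gelss" are SVD-based drivers and cope with rank-deficient A
    sol = torch.linalg.lstsq(A, B, driver="gelsd")
    print(sol.rank)                              # reported numerical rank: 3
    ```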
    ZelboK committed Apr 28, 2024
    7372645
  2. Formatting.

    ZelboK committed Apr 28, 2024
    99e7cfb
  3. run lintrunner -a

    ZelboK committed Apr 28, 2024
    e0fec86
  4. Update aten/src/ATen/native/BatchLinearAlgebra.cpp

    Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
    ZelboK and lezcano committed Apr 28, 2024
    bb20952
  5. Address comments. Clean up use of zeros and utilize higher level function linalg_svd for computation.
    ZelboK committed Apr 28, 2024
    b6d6086
  6. 755e7d9
  7. Formatting.

    ZelboK committed Apr 28, 2024
    6e8b3fd
  8. c71e504

Commits on Apr 29, 2024

  1. d5b0174
  2. da81459
  3. Set rank for svd workflow.

    ZelboK committed Apr 29, 2024
    c856b9e

Commits on Apr 30, 2024

  1. Update aten/src/ATen/native/BatchLinearAlgebra.cpp

    Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
    ZelboK and lezcano committed Apr 30, 2024
    de502bc

Commits on May 1, 2024

  1. 3006f30

Commits on May 13, 2024

  1. 428f02a

Commits on May 14, 2024

  1. 4eab1c3
  2. fec9793
  3. lint

    ZelboK committed May 14, 2024
    489afbe

Commits on May 19, 2024

  1. [export] handle aliased/unused params for unflattening (pytorch#125758)

    Aliased and unused params are currently an issue for strict-mode export. For a model like this:
    ```
    def __init__(self):
        # ...
        self.alpha = nn.Parameter(torch.randn(4))
        self.beta = self.alpha
        self.gamma = self.alpha
    def forward(self, x):
        return x + self.beta
    ```
    Dynamo will trace only 1 parameter (beta) and assign a dynamo name (e.g. `L__self___beta`) which can be difficult to match to the correct FQN in the original eager module. This leads to export graph signature potentially having the incorrect target FQN for the parameter, leading to downstream issues unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module).
    
    This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they're unused. Still, only the used tensors will appear in the graph's forward pass.
    
    Another existing issue is that weight sharing is not maintained in unflattening (all params/buffers are re-cloned); handle this by checking tensor ids too.
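
    A small hypothetical sketch of detecting such aliasing by object identity (illustrative only, not the actual export code):

    ```python
    import torch
    import torch.nn as nn

    class M(nn.Module):
        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.randn(4))
            self.beta = self.alpha          # aliased parameter

        def forward(self, x):
            return x + self.beta

    aliases = {}
    for name, p in M().named_parameters(remove_duplicate=False):
        aliases.setdefault(id(p), []).append(name)
    print([names for names in aliases.values() if len(names) > 1])  # [['alpha', 'beta']]
    ```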
    Pull Request resolved: pytorch#125758
    Approved by: https://github.com/zhxchen17
    pianpwk authored and ZelboK committed May 19, 2024
    93d2573
  2. Enable epilogue fusion benchmarking internally (pytorch#125455)

    Differential Revision: [D56920738](https://our.internmc.facebook.com/intern/diff/D56920738)
    Pull Request resolved: pytorch#125455
    Approved by: https://github.com/Chillee
    eellison authored and ZelboK committed May 19, 2024
    4f024c8
  3. Fanatically correct real tensor cloning for propagate_real_tensors (pytorch#126175)
    
    Internal xref:
    https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/
    
    Previously I did it in a crappy way using clone_input in the callback,
    but this results in tensors that don't have quite the same
    size/stride/storage offset and there was an internal test case where
    not having completely accurate information was causing a downstream
    problem in propagation.  So now I make real tensors as similar to their
    fake equivalents as much as possible.  Though... I don't bother with
    autograd lol.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126175
    Approved by: https://github.com/albanD
    ezyang authored and ZelboK committed May 19, 2024
    a2e8b90
  4. [reland][dynamo][disable] Move disable impl to its own __call__ method (pytorch#126191)
    
    Pull Request resolved: pytorch#126191
    Approved by: https://github.com/yoyoyocmu, https://github.com/yanboliang, https://github.com/fegin
    anijain2305 authored and ZelboK committed May 19, 2024
    10b10f2
  5. [easy][dynamo] Use disable_dynamo for torch.manual_seed (pytorch#126192)

    Pull Request resolved: pytorch#126192
    Approved by: https://github.com/yanboliang
    ghstack dependencies: pytorch#126191
    anijain2305 authored and ZelboK committed May 19, 2024
    f209865
  6. Revert "[inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)"

    This reverts commit 037615b.
    
    Reverted pytorch#124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](pytorch#124021 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    a95b7e9
  7. 50b88b0
  8. Remove use of USE_C10D (pytorch#126120)

    As per https://github.com/pytorch/pytorch/blob/main/torch/CMakeLists.txt#L271, USE_DISTRIBUTED and USE_C10D are equivalent. I was cleaning up this usage in another PR, so it is also cleaned up here.
    
    Pull Request resolved: pytorch#126120
    Approved by: https://github.com/aaronenyeshi
    briancoutinho authored and ZelboK committed May 19, 2024
    37f84cb
  9. [torch/distributed] Bugfix: wait for all child procs to exit before c… (pytorch#125969)
    
    Observed Problem
    ---------------------
    
    When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully.
    
    This results in misleading warning log messages towards the end of the job like the one below:
    
    ```
    W0510 14:52:48.185934  672413 api.py:513] Closing process 675171 via signal SIGTERM
    W0510 14:52:48.185984  672413 api.py:513] Closing process 675172 via signal SIGTERM
    W0510 14:52:48.186013  672413 api.py:513] Closing process 675174 via signal SIGTERM
    # <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->
    
    I0510 14:52:48.229119  672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    I0510 14:52:48.229161  672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
    I0510 14:52:48.229395  672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
    I0510 14:52:48.257544  672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
    I0510 14:52:48.568198  672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
    I0510 14:52:48.568989  672413 distributed.py:202] Finished running `main`
    ```
    
    Root Cause
    ------------------
    
    I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`.
    
    `torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for **at-least-one** child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`.
    
    `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited.
    
    Fix
    ---------
    
    The fix is simple: keep looping, continuing to call `pc.join()`, until it returns `True`.
    
    > **NOTE**: that the indefinite blocking is NOT an issue since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it already did all the checking to validate that the entrypoint functions either return successfully or that one of them has failed. So we are really just waiting for the unix process to exit after running the entrypoint function.
    
    > **NOTE**: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop and the debug logging will show at most `nproc_per_node` times so no log spamming is observed.
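
    A minimal sketch of that join-until-done loop using the public `torch.multiprocessing` API (illustrative only; the actual fix lives in `torch.distributed.elastic.multiprocessing.api`):

    ```python
    import torch.multiprocessing as mp

    def trainer(rank):
        pass  # the entrypoint/user function

    if __name__ == "__main__":
        pc = mp.spawn(trainer, nprocs=4, join=False)  # returns a ProcessContext
        # pc.join() returns False while only some children have exited,
        # so keep calling it until every child process is gone.
        while not pc.join():
            pass
    ```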
    
    Pull Request resolved: pytorch#125969
    Approved by: https://github.com/d4l3k
    kiukchung authored and ZelboK committed May 19, 2024
    00b9974
  10. Allow for trailing 'a' in sm_arch (pytorch#126185)

    # Summary
    I was getting
    ``` Shell
    File "/home/drisspg/meta/pytorch/torch/cuda/__init__.py", line 312, in _lazy_init
        raise DeferredCudaCallError(msg) from e
    torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: invalid literal for int() with base 10: '90a'
    ```
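
    A hedged illustration of parsing a capability string that may carry a trailing letter such as `90a` (`parse_sm_arch` is a hypothetical helper, not the code this PR touches):

    ```python
    def parse_sm_arch(arch: str) -> int:
        """Accept plain '86' as well as suffixed forms like '90a'."""
        return int(arch[:-1]) if arch and arch[-1].isalpha() else int(arch)

    assert parse_sm_arch("90a") == 90
    assert parse_sm_arch("86") == 86
    ```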
    
    Pull Request resolved: pytorch#126185
    Approved by: https://github.com/Skylion007
    drisspg authored and ZelboK committed May 19, 2024
    1dfe2d1
  11. [pipelining] Add manual pipeline stage (pytorch#126123)

    Add `ManualPipelineStage` under `_PipelineStage.py`
    
    Fix some type hints since `args_recv_info` can contain more than one RecvInfo. Previously the hint was `Tuple[InputInfo]` which meant it is a tuple of size 1. This is different from `List[InputInfo]` which can contain any number of items. I needed to update to `Tuple[InputInfo, ...]` to make the number of items flexible.
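
    For illustration, the difference between the two hints (`InputInfo` here stands in for the actual class):

    ```python
    from typing import Tuple

    class InputInfo: ...

    exactly_one: Tuple[InputInfo]       # a tuple containing exactly one InputInfo
    any_length: Tuple[InputInfo, ...]   # a tuple containing any number of InputInfo items
    ```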
    
    Pull Request resolved: pytorch#126123
    Approved by: https://github.com/kwen2501
    H-Huang authored and ZelboK committed May 19, 2024
    ed27236
  12. Refactor make_fx to better support hop subgraph tracing (pytorch#125267)

    Code movement + minor rewrites. We extract the states of make_fx out and encapsulate them into a _MakefxTracer class. This allows us to create a new make_fx_tracer when tracing subgraphs, the actual logic for tracing subgraph is in the next diff.
    
    Test Plan:
    Existing tests.
    
    Pull Request resolved: pytorch#125267
    Approved by: https://github.com/Chillee
    ydwu4 authored and ZelboK committed May 19, 2024
    636ea1c
  13. Support trace_subgraph in _MakefxTracer (pytorch#125363)

    Adds trace_subgraph to _MakefxTracer; the motivation is in pytorch#122972. Also migrates all existing usage of reenter_make_fx to the new sub-tracer. Previously, the torch function mode for creating torch_fn metadata wouldn't be re-entered when we're in ProxyTensorMode (since it's inside of __torch_function__). This PR reconstructs the torch function mode based on the parent tracer's config and re-enters it so the metadata is shown in the graph.
    
    **Test Plan:**
    Existing tests. We have a bunch of make_fx tests for cond, map and while_loop. Also remove expected failure for torch_fn since reenter_make_fx is able to re-construct torch function modes.
    
    Also fixes pytorch#124643
    
    Pull Request resolved: pytorch#125363
    Approved by: https://github.com/Chillee
    ghstack dependencies: pytorch#125267
    ydwu4 authored and ZelboK committed May 19, 2024
    a745003
  14. 976f0f2
  15. Set dtype when copying empty tensor (pytorch#126124)

    Summary: Forward fix D57251348
    
    Test Plan: `buck2 test 'fbcode//mode/dev' fbcode//executorch/kernels/test:aten_op_copy_test`
    
    Differential Revision: D57304360
    
    Pull Request resolved: pytorch#126124
    Approved by: https://github.com/bdhirsh
    huydhn authored and ZelboK committed May 19, 2024
    b3f0fce
  16. [BE] Abstract out strings to top of file (pytorch#125640)

    Summary:
    Move const strings to the top of the file. This is in preparation for tooling to
    make use of shared constants (e.g. version string). A non-functional change.
    Ideally we want these const strings to be available from both C++ and Python - but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change.
    
    Test Plan:
    python test/distributed/test_c10d_nccl.py NCCLTraceTest
    
    Pull Request resolved: pytorch#125640
    Approved by: https://github.com/wconstab
    c-p-i-o authored and ZelboK committed May 19, 2024
    aa17484
  17. [Inductor] Flex attention supports dynamic shape (pytorch#125994)

    ## static shapes perf
    ```
    | Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
    |---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
    | Average |     0.692 |              |             |             |             |            |             |                |
    | Max     |     0.855 |           16 |          16 |        4096 |        4096 |         64 | head_bias   | torch.bfloat16 |
    | Min     |     0.419 |            8 |          16 |         512 |         512 |        256 | noop        | torch.bfloat16 |
    ```
    ## dynamic shapes perf
    ```
    | Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
    |---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
    | Average |     0.670 |              |             |             |             |            |               |                |
    | Max     |     0.864 |           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |
    | Min     |     0.376 |            8 |          16 |         512 |         512 |        256 | relative_bias | torch.bfloat16 |
    ```
    
    Pull Request resolved: pytorch#125994
    Approved by: https://github.com/Chillee
    yanboliang authored and ZelboK committed May 19, 2024
    685b207
  18. Add missing type uint16, uint32, and uint64 to TensorHash in LTC. (pytorch#125972)
    
    If I do:
    
    ```
    xla_device = xm.xla_device()
    xla_tensor_0 = torch.tensor(42, dtype=torch.uint32).to(xla_device)
    ```
    
    I got the error:
    
    ```
    RuntimeError: false INTERNAL ASSERT FAILED at "/ansible/pytorch/torch/csrc/lazy/core/hash.h":139, please report a bug to PyTorch. Unsupported scalar type:UInt16
    ```
    
    This PR intends to fix this issue.
    The data type can be found in pytorch/c10/core/ScalarType.h.
    Pull Request resolved: pytorch#125972
    Approved by: https://github.com/JackCaoG
    vanbasten23 authored and ZelboK committed May 19, 2024
    b959b4f
  19. Add some type annotations to python stream and event classes (pytorch#126171)
    
    For recent device agnostic code changes, we need type hinting on the parent classes for better tooling support.
    
    Pull Request resolved: pytorch#126171
    Approved by: https://github.com/ezyang
    cyyever authored and ZelboK committed May 19, 2024
    86d560a
  20. 074173b
  21. [Inductor] Skip test_nll_loss_backward for intel GPU. (pytorch#126157)

    Skip this test case due to behavior of Triton `mask_load` that is not aligned with CUDA. We submitted issue pytorch#126173 to elaborate on the root cause. We intend to skip this case for XPU first, as we need some time to fix the issue and complete full validation before updating the Triton commit pin for Intel GPU.
    
    Pull Request resolved: pytorch#126157
    Approved by: https://github.com/EikanWang, https://github.com/peterbell10, https://github.com/desertfire
    etaf authored and ZelboK committed May 19, 2024
    d0688dd
  22. a749763
  23. 11aea9e
  24. Adjust number of repeats when using --warm-start-latency benchmark flag (pytorch#125917)
    
    Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat
    
    Pull Request resolved: pytorch#125917
    Approved by: https://github.com/desertfire
    masnesral authored and ZelboK committed May 19, 2024
    32fdb75
  25. [benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (pytorch#125953)
    
    Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.
    
    Pull Request resolved: pytorch#125953
    Approved by: https://github.com/desertfire
    ghstack dependencies: pytorch#125917
    masnesral authored and ZelboK committed May 19, 2024
    38e2661
  26. Add a few "warm start" smoketest runs to CI (pytorch#125955)

    Summary:
    Not sure which to choose, so my criteria was:
    1) We care about huggingface as part of internal milestones
    2) This handful of models seems to particularly benefit from caching
    Pull Request resolved: pytorch#125955
    Approved by: https://github.com/desertfire
    ghstack dependencies: pytorch#125917, pytorch#125953
    masnesral authored and ZelboK committed May 19, 2024
    bc9f57b
  27. [audio hash update] update the pinned audio hash (pytorch#126248)

    This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
    Update the pinned audio hash.
    Pull Request resolved: pytorch#126248
    Approved by: https://github.com/pytorchbot
    pytorchupdatebot authored and ZelboK committed May 19, 2024
    ce7a832
  28. Add force_disable_caches to the docs (pytorch#126184)

    Pull Request resolved: pytorch#126184
    Approved by: https://github.com/msaroufim
    oulgen authored and ZelboK committed May 19, 2024
    3512895
  29. [inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)

    This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC pytorch#125683 for more background info.
    1. Cpp template infrastructure
    Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
    2. Initial FP32 gemm template
    This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
    3. Correctness and performance
    The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed for a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are the details.
    
    Static shapes
    | Benchmark | torchbench | huggingface | timm_models |
    |------------|-------------|--------------|--------------|
    | Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
    | Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
    | Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
    
    Key models being sped up:
    drq: 1.14x
    soft_act: 1.12x
    cait_m36_384: 1.18x
    
    Dynamic shapes
    | Benchmark | torchbench | huggingface | timm_models |
    | --- | --- | --- | --- |
    | Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
    | Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
    | Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
    
    Key models being sped up:
    BERT_pytorch: 1.22x
    pyhpc_turbulent: 1.13x
    soft_actor_critic: 1.77x
    BlenderbotForCausalLM: 1.09x
    cait_m36_384: 1.17x
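
    A hedged sketch of how a user would exercise this path from Python; whether the CPP GEMM template is actually selected depends on Inductor configuration not shown here:

    ```python
    import torch

    linear = torch.nn.Linear(256, 256)                   # constant weight, as in the template's setting
    compiled = torch.compile(linear, mode="max-autotune")
    out = compiled(torch.randn(32, 256))                  # M (batch dim) may be static or dynamic
    ```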
    
    Pull Request resolved: pytorch#124021
    Approved by: https://github.com/jansel
    jgong5 authored and ZelboK committed May 19, 2024
    b743f89
  30. [CUDA] [CI] Add cu124 docker images (pytorch#125944)

    Fixes issues encountered in pytorch#121956
    
    Pull Request resolved: pytorch#125944
    Approved by: https://github.com/atalman
    nWEIdia authored and ZelboK committed May 19, 2024
    170380e
  31. Don't assert about pending when we are peeking (pytorch#126239)

    Internal xref https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/
    
    In particular, when we're collecting forward metadata, we aren't going
    to discharge any of the pending, so we'll be continuously collecting
    more and more pending symbols that we may not be able to resolve.  This
    is fine.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126239
    Approved by: https://github.com/lezcano
    ezyang authored and ZelboK committed May 19, 2024
    f33cc7a
  32. [AOTI][torchgen] Update NativeFunctionsGroup mapping (pytorch#125962)

    Summary: When looking up for what backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. Previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, and that's why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.
    Pull Request resolved: pytorch#125962
    Approved by: https://github.com/chenyang78
    desertfire authored and ZelboK committed May 19, 2024
    8dc8ae9
  33. [AOTI][torchgen] Add a few more fallback ops (pytorch#126013)

    Summary: They appear in some unit tests.
    
    Pull Request resolved: pytorch#126013
    Approved by: https://github.com/chenyang78
    ghstack dependencies: pytorch#125962
    desertfire authored and ZelboK committed May 19, 2024
    5bc525c
  34. [Memory Snapshot] Add recordAnnotations to capture record_function annotations (pytorch#124179)
    
    Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.
    
    Test Plan:
    CI
    
    New Snapshot Generated:
    devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle
    
    Snippet of Snapshot device_traces show `ProfilerStep#0`, and `## forward ##` annotations:
    ```
    [[{'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168556,
       'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]},
      {'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168738,
       'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]},
      {'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168865,
       'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]},
      {'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168920,
       'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]},
      {'action': 'alloc',
       'addr': 140166073581568,
       'size': 3211264,
       'stream': 0,
       'time_us': 1713558427172978,
       'frames': [{'name': '_conv_forward',
         'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv
    ```
    
    Differential Revision: D55941362
    
    Pulled By: aaronenyeshi
    
    Pull Request resolved: pytorch#124179
    Approved by: https://github.com/zdevito
    aaronenyeshi authored and ZelboK committed May 19, 2024
    4e3dfb0
  35. Enable UFMT on test/test_fake_tensor.py, `test/test_flop_counter.py` and some files (pytorch#125747)
    
    Part of: pytorch#123062
    
    Ran lintrunner on:
    
    - test/test_fake_tensor.py
    - test/test_flop_counter.py
    - test/test_function_schema.py
    - test/test_functional_autograd_benchmark.py
    - test/test_functional_optim.py
    - test/test_functionalization_of_rng_ops.py
    
    Detail:
    
    ```bash
    $ lintrunner -a --take UFMT --all-files
    ok No lint issues.
    Successfully applied all patches.
    ```
    
    Pull Request resolved: pytorch#125747
    Approved by: https://github.com/malfet
    shink authored and ZelboK committed May 19, 2024
    68c29aa
  36. [Inductor] Generalize new introduced device-bias code. (pytorch#126261)

    We found some Inductor test case failures when enabling Inductor UT for Intel GPU. The root cause is newly introduced Inductor device-bias code from recent community PRs, which causes different behaviors between Intel GPU and CUDA. This PR generalizes this code to align the behaviors.
    
    Pull Request resolved: pytorch#126261
    Approved by: https://github.com/EikanWang, https://github.com/peterbell10
    etaf authored and ZelboK committed May 19, 2024
    cd60801
  37. [export] Cover more cases to copy tensor conversions. (pytorch#125628)

    Summary:
    Previously we tried to convert all .to() calls to to_copy in the graph; now a user reports that other methods like .float() are not covered: pytorch/PiPPy#1104 (comment)
    
    I think fundamentally .float() should look similar to .to() in export, and this diff tries to expand the coverage of the tensor conversion methods here.
    
    Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion
    
    Differential Revision: D56951634
    
    Pull Request resolved: pytorch#125628
    Approved by: https://github.com/tugsbayasgalan
    zhxchen17 authored and ZelboK committed May 19, 2024
    a48463e
  38. Revert "[Memory Snapshot] Add recordAnnotations to capture record_function annotations (pytorch#124179)"
    
    This reverts commit 187aeae.
    
    Reverted pytorch#124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 https://hud.pytorch.org/pytorch/pytorch/commit/187aeaeabf612824c2d0e9be72f80ce6612760d4, test was skipped due to bad TD ([comment](pytorch#124179 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    ff266cd
  39. [CI] 3 procs non cuda (pytorch#125932)

    Too lazy to figure out actual time reduction here, I'll figure it out later.  Also I'd rather get an average of a couple of runs on trunk rather than just this one PR
    Things got faster. Source? Trust me bro
    
    * rel to pytorch#125598
    
    Pull Request resolved: pytorch#125932
    Approved by: https://github.com/ZainRizvi
    clee2000 authored and ZelboK committed May 19, 2024
    e49ccce
  40. Forward fix lint after pytorch#125747 (pytorch#126295)

    Pull Request resolved: pytorch#126295
    Approved by: https://github.com/atalman
    clee2000 authored and ZelboK committed May 19, 2024
    0df5ed0
  41. Faster int8 quantized (pytorch#125704)

    Or my journey to learn how to write fast Metal kernels (more details would be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf) )
    
    Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`)
    
    Before the change, on M2 Pro I get 50 tokens per sec
    After adding a very naive
    ```metal
    template<typename T>
    kernel void int8pack_mm(
        constant T                 * A              [[buffer(0)]],
        constant char              * B              [[buffer(1)]],
        constant T                 * scales         [[buffer(2)]],
        device   T                 * outputData     [[buffer(3)]],
        constant uint3             & sizes          [[buffer(4)]],
        uint                         thread_index   [[thread_position_in_grid]]) {
        const uint lda = sizes.y;
        const uint ldc = sizes.z;
        const uint m = thread_index / sizes.z; // 0..sizes.x-1
        const uint n = thread_index % sizes.z; // 0..sizes.z-1
        constant T *A_ptr = A + m * lda;
        constant char *B_ptr = B + n * lda;
    
        float rc = 0.0;
        for(uint k = 0; k < sizes.y;  k++) {
          const auto a_val = float(A_ptr[k]);
          const auto b_val = float(B_ptr[k]);
          rc += a_val * b_val;
        }
        outputData[thread_index] = T(rc * float(scales[n]));
    }
    ```
    Perf dropped down to a sad 15 tokens per second.
    Replacing inner loop with vectorized operations
    ```metal
        float rc = 0.0;
        for(uint k = 0; k < sizes.y/4;  k++) {
          const auto a_val = float4(A_ptr[k]);
          const auto b_val = float4(B_ptr[k]);
          rc += dot(a_val, b_val);
        }
    ```
    Perf jumps back up to 53 tokens per second, but it's a bit of a lie when it comes to llama2-7B perf.
    
    The next step in unlocking the performance was to replace a 1D grid with a 2D one, but limit the thread group size to a single row, which results in much better data locality (which unfortunately is not observable with `stories110M` anymore, as its small model size and Python runtime overhead hide the perf gain).
    
    There were several unsuccessful attempts at caching inputs in thread-local memory or using `float4x4` to speed up computation. But the key to unlocking the perf was a comment in https://github.com/ml-explore/mlx/blob/631dfbe67309fb630795cd612739cbe54c75e222/mlx/backend/metal/kernels/gemv.metal#L184
    which hinted at exploiting both SIMD groups and thread-local caches, and resulted in a 5x jump in performance compared to the initial vectorization approach and a 3x perf jump in the end-to-end llama7b test.
    Pull Request resolved: pytorch#125704
    Approved by: https://github.com/mikekgfb
    malfet authored and ZelboK committed May 19, 2024
    12f2960
  42. [DTensor] Turn on foreach implementation of optimizer for DTensor by default (pytorch#123394)
    
    Append DTensor to the optimizer `_foreach_supported_types` and turn on foreach implementation of optimizer for DTensor if not specified by the users.
    
    Pull Request resolved: pytorch#123394
    Approved by: https://github.com/wanchaol
    wz337 authored and ZelboK committed May 19, 2024
    9b24e7f
  43. [Dynamo] SizeVariable supports hasattr (pytorch#126222)

    Pull Request resolved: pytorch#126222
    Approved by: https://github.com/williamwen42, https://github.com/anijain2305
    yanboliang authored and ZelboK committed May 19, 2024
    22c50a3
  44. CMake: Improve check and report of Magma (pytorch#117858)

    - Only search for magma if it is used (GPU builds)
    - Don't report it was not found when it isn't searched for
    - Don't report if magma is disabled (currently: "MAGMA not found. Compiling without MAGMA support" is reported)
    
    Pull Request resolved: pytorch#117858
    Approved by: https://github.com/malfet
    Flamefire authored and ZelboK committed May 19, 2024
    35117bf
  45. [onnx.export] Avoid linear loop over symbol_dim_map (pytorch#123029)

    This PR is part of an effort to speed up torch.onnx.export (pytorch#121422).
    
    - Doing a reverse look-up in `symbol_dim_map` incurs a linear cost in number of symbols. This happens for each node, so incurs a quadratic cost to the whole export.
    - Add a reverse look-up `dim_symbol_map` that is kept in parallel with `symbol_dim_map`. This avoids the linear-time look-up, which would otherwise create quadratic export time complexity.
    - This is a highly pragmatic solution. If someone more familiar with the code base has a better solution, I'm interested to hear about it.
    - Resolves (9) in pytorch#121422.
    
    (partial fix of pytorch#121422)
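
    An illustrative Python sketch of the idea (the actual change is in the C++ exporter; `record` is a hypothetical helper): keep the reverse dict updated alongside the forward one so look-ups in either direction are constant time.

    ```python
    symbol_dim_map = {}   # symbol -> dim
    dim_symbol_map = {}   # dim -> symbol, maintained in parallel

    def record(symbol, dim):
        symbol_dim_map[symbol] = dim
        dim_symbol_map[dim] = symbol   # avoids scanning all symbols for every node
    ```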
    
    Pull Request resolved: pytorch#123029
    Approved by: https://github.com/justinchuby
    gustavla authored and ZelboK committed May 19, 2024
    3ca1ae4
  46. [easy] Remove aot_config from pre_compile returns, rename fw_metadata in post_compile (pytorch#125854)
    
    This field never changes so pre_compile doesn't need to return it again: remove it just for a cleaner refactor.
    
    As @aorenste  points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrapper's pre_compile's have run. I want to make this clear in the code, so I renamed the arg in post_compile.
    
    Wrappers that need the exact metadata that they were passed in pre_compile need to save that fw_metadata properly themselves.
    
    Currently, wrappers come in two categories:
    
    1. Wrappers that modify fw_metadata, but then never use fw_metadata in post compile
    2. Wrappers that never modify fw_metadata, and only consume the "final" fw_metadata.
    
    So none of the behaviors will change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata. I'll do that in a separate PR.
    
    Pull Request resolved: pytorch#125854
    Approved by: https://github.com/aorenste, https://github.com/bdhirsh
    jamesjwu authored and ZelboK committed May 19, 2024
    39b2795
  47. Reland '[Inductor] GEMM shape padding improvements (pytorch#118522)' (pytorch#125773)
    
    Relanding just the pad-in-a-single-pass portion of [the PR](pytorch#118522), not including the transpose logic.
    
    This was previously accepted and reviewed.
    
    Pull Request resolved: pytorch#125773
    Approved by: https://github.com/shunting314
    ghstack dependencies: pytorch#125772
    eellison authored and ZelboK committed May 19, 2024
    b3b9f72
  48. Skip padding cost of fusible/planable inputs (pytorch#125780)

    For mm inputs which are not inputs of the graph, assume that we can memory-plan them in the aten.cat and exclude the padding cost from the benchmarking comparison. Technically we also have to do a small amount of zero-writing, but that should be relatively small and is encompassed in the weighting of the padding time by `1.1`.
    
    Pull Request resolved: pytorch#125780
    Approved by: https://github.com/shunting314
    ghstack dependencies: pytorch#125772, pytorch#125773
    eellison authored and ZelboK committed May 19, 2024
    0ce75f9
  49. Forward fix failures for torch.export switch to predispatch (pytorch#126081)
    
    Summary:
    Fixes:
    - executorch test
    - torchrec test
    
    Test Plan: CI
    
    Differential Revision: D57282304
    
    Pull Request resolved: pytorch#126081
    Approved by: https://github.com/angelayi
    tugsbayasgalan authored and ZelboK committed May 19, 2024
    0f2db1c
  50. Beef up error message for pending assert failure (pytorch#126212)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126212
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    6b733b2
  51. Enable UFMT format on test/test_utils.py (pytorch#125996)

    Fixes some files in pytorch#123062
    
    Run lintrunner on files:
    test/test_utils.py
    
    ```bash
    $ lintrunner -a --take UFMT --all-files
    ok No lint issues.
    Successfully applied all patches.
    ```
    Pull Request resolved: pytorch#125996
    Approved by: https://github.com/ezyang
    hippocookie authored and ZelboK committed May 19, 2024
    1480537
  52. Fix aarch64 debug build with GCC (pytorch#126290)

    By working around GCC's quirks in instantiating templates that require immediate values.
    Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define __OPTIMIZE__ if invoked with anything but -O0).
    
    Fixes pytorch#126283
    
    Pull Request resolved: pytorch#126290
    Approved by: https://github.com/atalman, https://github.com/seemethere
    malfet authored and ZelboK committed May 19, 2024
    adf9cc7
  53. Fix public binding to actually traverse modules (pytorch#126103)

    The current call passes in `['/actual/path']` to os.walk, which is a string pointing to no path and thus silently leads to an empty traversal.
    There is an unused function just above that handles that, so I guess this is what was supposed to be called.
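
    For illustration, `os.walk` silently yields an empty traversal for a path that does not exist, which is why the bug went unnoticed:

    ```python
    import os

    # No error is raised for a nonexistent path; the walk is simply empty.
    print(list(os.walk("no/such/dir")))  # []
    ```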
    
    Pull Request resolved: pytorch#126103
    Approved by: https://github.com/suo
    albanD authored and ZelboK committed May 19, 2024
    147ba73
  54. [FSDP] Fixed docs for inter/intra node PG helpers (pytorch#126288)

    1. This fixes an issue where we had 9 ranks in one node and 7 in the other.
    2. This makes the notation more explicit that `[0, 7]` is `[0, 1, ..., 7]`.
    
    Pull Request resolved: pytorch#126288
    Approved by: https://github.com/weifengpy
    awgu authored and ZelboK committed May 19, 2024
    9397380
  55. Revert "Fix aarch64 debug build with GCC (pytorch#126290)"

    This reverts commit a961e1a.
    
    Reverted pytorch#126290 on behalf of https://github.com/malfet due to Indeed lint is broken :/ ([comment](pytorch#126290 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    921a824
  56. Parametrize test_dim_reduction (pytorch#126292)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126292
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    0658670
  57. [DCP] overwrites existing checkpoint by default (pytorch#125877)

    Checks for existing checkpoints and overwrites, based on an `overwrite` flag
    
    Differential Revision: [D57186174](https://our.internmc.facebook.com/intern/diff/D57186174/)
    
    Pull Request resolved: pytorch#125877
    Approved by: https://github.com/fegin
    LucasLLC authored and ZelboK committed May 19, 2024
    b5e6220
  58. Fix public api allowlist logical merge conflict (pytorch#126321)

    Skip the newly added bad API from pytorch#126212 to keep CI green.
    
    Pull Request resolved: pytorch#126321
    Approved by: https://github.com/ezyang
    albanD authored and ZelboK committed May 19, 2024
    a0a6bbc
  59. 2 rocm shards on trunk.yml (pytorch#125933)

    after test removal for windows cpu + avx related configs, it's going to be the long pole for trunk
    
    Just checked: without rocm, avg tts for trunk is 2.5 hrs last week; with rocm it's about 3
    
    Pull Request resolved: pytorch#125933
    Approved by: https://github.com/ZainRizvi
    clee2000 authored and ZelboK committed May 19, 2024
    910f26f
  60. [FSDP2] allow meta tensors during loading state dict and cpu offloading (pytorch#126267)
    
    unit test: ``pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py``
    
    with meta init and cpu offloading, we have meta tensors after `model.load_state_dict(assign=True, strict=False)`. This PR avoids calling `.cpu` on meta tensors; otherwise it's a runtime error
    
    Pull Request resolved: pytorch#126267
    Approved by: https://github.com/awgu
    weifengpy authored and ZelboK committed May 19, 2024
    5b4dea2
  61. [dynamo] Detect monkeypatching on nn module forward method (pytorch#126203)
    
    An alternative was pytorch#124975. Though it was safer because it added guards for every inlined function, it caused guard overhead of > 20% for a few models. The overhead of this PR is minimal for the common unpatched case.
    
    Fixes an internal issue - [fb.workplace.com/groups/1075192433118967/permalink/1411067766198097](https://fb.workplace.com/groups/1075192433118967/permalink/1411067766198097/)
    
    Pull Request resolved: pytorch#126203
    Approved by: https://github.com/ezyang
    anijain2305 authored and ZelboK committed May 19, 2024
    3a3f8a9
  62. [onnx.export] Avoid unnecessary copy of debug_names (pytorch#123026)

    This PR is part of an effort to speed up torch.onnx.export (pytorch#121422).
    
    - The `auto debug_names = ` incurs a copy, whereas `const auto& debug_names` does not.
    - However, this one requires us to be careful, since calls to `setDebugName` change `debug_names` and invalidate the `exist_name` iterator. So if we simply change `auto` to `const auto&`, then between that line and `find` we would have corrupted the iterator by calling `output[i]->setDebugName`. This change aims to be functionally equivalent to the original, which is why we first get the Value pointer, then call `output[i]->setDebugName`, and finally call `setDebugName` on the found value. It is possible that functionally it is OK to simply call `output[i]->setDebugName` first and then do the find and the second `setDebugName`, but this would not be identical to the current behavior.
    - Resolves (2) in pytorch#121422.
    Pull Request resolved: pytorch#123026
    Approved by: https://github.com/justinchuby
    gustavla authored and ZelboK committed May 19, 2024
    569ee1e
  63. 6243a43
  64. Improve Storage copy_ size mismatch error message (pytorch#126280)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126280
    Approved by: https://github.com/mikaylagawarecki
    ezyang authored and ZelboK committed May 19, 2024
    9e2b899
  65. 0d9def0
  66. Remove Caffe2 python code (pytorch#126035)

    Follows the recent changes of Caffe2.
    
    Pull Request resolved: pytorch#126035
    Approved by: https://github.com/r-barnes, https://github.com/Skylion007
    cyyever authored and ZelboK committed May 19, 2024
    eb5e9ed
  67. Enable UFMT on test/test_datapipe.py (pytorch#124994)

    Part of: pytorch#123062
    
    Ran lintrunner on:
    
    - `test/test_datapipe.py`
    
    Detail:
    
    ```bash
    $ lintrunner -a --take UFMT --all-files
    ok No lint issues.
    Successfully applied all patches.
    ```
    
    Co-authored-by: Edward Z. Yang <ezyang@fb.com>
    Pull Request resolved: pytorch#124994
    Approved by: https://github.com/mikaylagawarecki
    shink authored and ZelboK committed May 19, 2024
    05eff35
  68. Remove expected failure in test_eager_transforms.py (pytorch#125883)

    Seems to be supported now
    
    CC @tinglvv @nWEIdia @Aidyn-A
    
    Pull Request resolved: pytorch#125883
    Approved by: https://github.com/Chillee, https://github.com/Aidyn-A
    eqy authored and ZelboK committed May 19, 2024
    75add2f
  69. [optim] Fix: wrong ASGD implementation (pytorch#125440)

    > previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor.
    
    - [x] Ill-founded assumption that every param will have the same step.
    - [x] Different implementation between `foreach=True` and `foreach=False`.
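
    For illustration, Python list multiplication creates several references to one tensor rather than several tensors, which is what makes the simplification quoted above valid for values that never change:

    ```python
    import torch

    t = torch.zeros(3)
    shared = [t] * 4                                   # one tensor, four references
    independent = [torch.zeros(3) for _ in range(4)]   # four separate tensors

    shared[0].add_(1)
    print(shared[1])        # tensor([1., 1., 1.]) -- all entries alias the same storage
    independent[0].add_(1)
    print(independent[1])   # tensor([0., 0., 0.])
    ```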
    Pull Request resolved: pytorch#125440
    Approved by: https://github.com/janeyx99
    david20571015 authored and ZelboK committed May 19, 2024
    079d3f5
  70. Fix triton codegen main do_bench_gpu import error (pytorch#126213)

    Summary:
    Encountered module import error when running triton kernel file.
    
    The cause seems to be D57215950 which changed "do_bench" to "do_bench_gpu" for torch._inductor.runtime.runtime_utils
    
    However, in the codegen, instead we have "from triton.testing import do_bench", so the line below should be reverted back to "do_bench".
    
    Test Plan:
    LOGLEVEL=DEBUG TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 CUDA_VISIBLE_DEVICES=5 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT='/home/adelesun/mts_profiling/outputs/profile_output.txt' TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_CACHE_DIR='/home/adelesun/mts_profiling/code' TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata buck2 run mode/opt                 -c=python.package_style=inplace                 -c fbcode.enable_gpu_sections=true                 -c fbcode.platform=platform010                 -c fbcode.nvcc_arch=v100,a100,h100                 -c fbcode.split-dwarf=true                 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark                 --  --local-model /home/adelesun/mts_profiling/inputs/offsite_cvr_model_526372970_793.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR 2>&1 | tee /home/adelesun/mts_profiling/outputs/benchmark_output.txt
    
    bento console --kernel=aetk --file=/home/adelesun/mts_profiling/code/op/copmbxfunzmywemwmg66lnlcx4apvn2f2vsi3glgisausgfvit4g.py
    
    file ran successfully
    
    Differential Revision: D57345619
    
    Pull Request resolved: pytorch#126213
    Approved by: https://github.com/shunting314
    adelesun authored and ZelboK committed May 19, 2024
    0e22566
  71. Configuration menu
    Copy the full SHA
    8cc3b81 View commit details
    Browse the repository at this point in the history
  72. [dynamo] graph break on issubclass call with non-const args (pytorch#125943)
    
    Fixes pytorch#125942
    
    Pull Request resolved: pytorch#125943
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#125882
    williamwen42 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    972f76f View commit details
    Browse the repository at this point in the history
  73. [dynamo] fix pytorch#93624 (pytorch#125945)

    Fixes pytorch#93624 but also requires jcmgray/autoray#20 to be fixed.
    
    Pull Request resolved: pytorch#125945
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#125882, pytorch#125943
    williamwen42 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    2524635 View commit details
    Browse the repository at this point in the history
  74. Configuration menu
    Copy the full SHA
    d3d25a3 View commit details
    Browse the repository at this point in the history
  75. [FSDP2] support fully_shard(model_on_meta, cpu_offload) (pytorch#126305)

    Support `fully_shard(model_on_meta, cpu_offload)` when the `fully_shard` call is placed outside of the `torch.device("meta")` context manager.
    
    Pull Request resolved: pytorch#126305
    Approved by: https://github.com/awgu
    ghstack dependencies: pytorch#126267
    weifengpy authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    592bc1f View commit details
    Browse the repository at this point in the history
  76. Add VariableTracker.debug_repr (pytorch#126299)

    Now you can print arbitrary values at compile time with `comptime.print()`.
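
    A minimal usage sketch (the `torch._dynamo.comptime` import path is an assumption here; the value is printed while Dynamo traces the function, not at runtime):

    ```python
    import torch
    from torch._dynamo.comptime import comptime

    @torch.compile
    def f(x):
        y = x + 1
        comptime.print(y)  # prints y's compile-time representation via debug_repr
        return y * 2

    f(torch.randn(3))
    ```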
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126299
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#126292
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    3fe0c6d View commit details
    Browse the repository at this point in the history
  77. Also remove compile_time_strobelight_meta frame when generating stack (pytorch#126289)
    
    I think I also need to fix this in fbcode, leaving that for future work.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126289
    Approved by: https://github.com/yanboliang
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    421b23d View commit details
    Browse the repository at this point in the history
  78. Make propagate_real_tensor more safe (pytorch#126281)

    Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/
    
    There are a few improvements here, which luckily fix some xfails:
    
    * In general, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode; this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
    * We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
    * I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126281
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    db73a01 View commit details
    Browse the repository at this point in the history
  79. Switched from parameter in can_cast to from_. (pytorch#126030)

    Fixes pytorch#126012.
    
    `from` is a reserved keyword in Python, so we can't expose the C++ impl with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs.
    
    If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
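
    For illustration, a quick sketch of both spellings (the keyword form assumes this rename is in place; note the change is reverted further down in this commit list):

    ```python
    import torch

    # Positional usage is unchanged:
    torch.can_cast(torch.int64, torch.float32)           # True

    # Keyword usage with the renamed parameter (sketch):
    torch.can_cast(from_=torch.int64, to=torch.float32)  # True
    ```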
    
    Pull Request resolved: pytorch#126030
    Approved by: https://github.com/albanD
    tringwald authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    fed9d93 View commit details
    Browse the repository at this point in the history
  80. [easy][dynamo][inline-inbuilt-nn-modules] Change test to check for params (pytorch#126316)
    
    Pull Request resolved: pytorch#126316
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8f9fa47 View commit details
    Browse the repository at this point in the history
  81. [Export] Allow ExportedProgram to take empty decomp table (pytorch#126142)
    
    **As title.**
    Still, `ep.run_decompositions()` will use `core_aten_decompositions()` by default. Cases like `ep.run_decompositions(get_decompositions([]))` will use an empty table and go with [`aot_autograd_decompositions`](https://github.com/pytorch/pytorch/blob/04877dc430a6e93765471b28f422bf3e81d02c9e/torch/_functorch/aot_autograd.py#L456-459) only.
    
    **Motivation**
    We didn't have a clean way to pass in an empty decomp table. Since we've made `pre_dispatch` export the default and `ep.run_decompositions` still goes through `aot_export_module(..., pre_dispatch=False)`, allowing an empty table would make that kind of blank-table control easier.
    
    **Testing**
    CI
    Also looked through all the references in fbcode. The only concern I have is whether we should update [this example](https://github.com/pytorch/pytorch/blob/04877dc430a6e93765471b28f422bf3e81d02c9e/torch/onnx/_internal/exporter.py#L817) or not.
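
    A rough sketch of the two paths described above (the module and shapes are made up for illustration):

    ```python
    import torch
    from torch.export import export
    from torch._decomp import get_decompositions

    class M(torch.nn.Module):
        def forward(self, x):
            return torch.nn.functional.silu(x)

    ep = export(M(), (torch.randn(4),))

    ep_core = ep.run_decompositions()                          # default: core ATen table
    ep_blank = ep.run_decompositions(get_decompositions([]))   # empty table, per this change
    ```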
    Pull Request resolved: pytorch#126142
    Approved by: https://github.com/angelayi
    StellarrZ authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    7a4c6b9 View commit details
    Browse the repository at this point in the history
  82. [optim] add fused_adagrad support for CPU device (pytorch#124905)

    Add fused Adagrad (`_fused_adagrad`) kernel support for CPU.
    
    ## Bench result:
    32 core/sockets ICX
    Test Scripts:
    https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
    https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
    ```
    Tensor Size: 262144, Num Tensor 4, Num Threads: 1
    _single_tensor_adagrad time: 0.2500 seconds
    _fused_adagrad time: 0.0933 seconds
    Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
    _single_tensor_adagrad time: 2.8819 seconds
    _fused_adagrad time: 1.7591 seconds
    ```
    ## Test Plan:
    ```
    python test_optim.py -k test_fused_matches_forloop
    python test_optim.py -k test_fused_large_tensor
    python test_optim.py -k test_can_load_older_state_dict
    python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
    python test_torch.py -k test_grad_scaling_autocast_fused
    python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
    ```
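
    A minimal sketch of what the fused CPU path could look like from the user side (assuming the `fused=True` flag is exposed on `torch.optim.Adagrad`, as with the other fused optimizers):

    ```python
    import torch

    model = torch.nn.Linear(128, 64)
    opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)  # assumed flag

    x = torch.randn(32, 128)
    model(x).sum().backward()
    opt.step()
    opt.zero_grad()
    ```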
    
    Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
    Pull Request resolved: pytorch#124905
    Approved by: https://github.com/jgong5, https://github.com/janeyx99
    zhuhaozhe authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    6944593 View commit details
    Browse the repository at this point in the history
  83. [Inductor][Flex-attention] Make num_head support dynamic (pytorch#126342)
    
    Fixes #ISSUE_NUMBER
    
    Pull Request resolved: pytorch#126342
    Approved by: https://github.com/drisspg
    yanboliang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    7754cc1 View commit details
    Browse the repository at this point in the history
  84. [dynamo][inline-inbuilt-nn-modules] Change test to not depend on id of mod instance (pytorch#126314)
    
    Pull Request resolved: pytorch#126314
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303, pytorch#126316
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    009b5b6 View commit details
    Browse the repository at this point in the history
  85. [dynamo][inline-inbuilt-nn-modules] Add and update test_modules.py for inlining work (pytorch#126327)
    
    Pull Request resolved: pytorch#126327
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303, pytorch#126316, pytorch#126314
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    ae2fdc8 View commit details
    Browse the repository at this point in the history
  86. [inductor] [FX graph cache] Ignore unbacked symints in guards expression (pytorch#126251)
    
    Summary: Found a unit test that was causing an assertion failure during an attempt to use unbacked symints in the guards expression, but it turns out unbacked symints can't affect guards anyway, so we can just filter them out. Also in this diff: test_torchinductor_dynamic_shapes.py was not configured to exercise the codecache because the TestCase setUp method was inadvertently skipping the setUp of the immediate parent class.
    
    Pull Request resolved: pytorch#126251
    Approved by: https://github.com/peterbell10
    masnesral authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    675c49f View commit details
    Browse the repository at this point in the history
  87. Revert "Switched from parameter in can_cast to from_. (pytorch#126030)"

    This reverts commit 06d6bb4.
    
    Reverted pytorch#126030 on behalf of https://github.com/huydhn due to Sorry for reverting your change but i need to revert it to avoid a diff train conflict with pytorch#125995.  Please help rebase and I will reland the change ([comment](pytorch#126030 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    930e757 View commit details
    Browse the repository at this point in the history
  88. [inductor][cpp] epilogue support for gemm template (pytorch#126019)

    As part of pytorch#125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.
    
    Pull Request resolved: pytorch#126019
    Approved by: https://github.com/jansel
    jgong5 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    88643f1 View commit details
    Browse the repository at this point in the history
  89. [TEST][Dynamo] fix test_deviceguard.py (pytorch#126240)

    The `test_device_guard.py` was improperly set up, so there were failures on multi-GPU machines. By design the `DeviceGuard` should keep `idx` the same even after it was applied.
    
    Pull Request resolved: pytorch#126240
    Approved by: https://github.com/jansel
    Aidyn-A authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    4417b4c View commit details
    Browse the repository at this point in the history
  90. Revert "Remove deprecated _aminmax operator (pytorch#125995)"

    This reverts commit 0116ffa.
    
    Reverted pytorch#125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](pytorch#125995 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    f30d086 View commit details
    Browse the repository at this point in the history
  91. Configuration menu
    Copy the full SHA
    b8c08b6 View commit details
    Browse the repository at this point in the history
  92. [DeviceMesh] Fix hash and eq not match (pytorch#123572)

    Fixes pytorch#121799
    
    We fix the DeviceMesh hash such that two meshes are considered equal if they have the same mesh and the same parent_mesh.
    Examples can be found here: pytorch#121799
    
    This is also needed to unblock pytorch#123394.
    
    Pull Request resolved: pytorch#123572
    Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
    wz337 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    479f3f9 View commit details
    Browse the repository at this point in the history
  93. [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (pytorch#126068)
    
    As part of pytorch#125683, this PR adds the initial bf16/fp16 gemm template support with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.
    
    Pull Request resolved: pytorch#126068
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#126019
    jgong5 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    e974908 View commit details
    Browse the repository at this point in the history
  94. Initial implementation of AdaRound (pytorch#126153)

    Summary:
    This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568
    
    This algorithm is going to be used by multiple people, hence we need to make it an official implementation.
    
    Differential Revision: D57227565
    
    Pull Request resolved: pytorch#126153
    Approved by: https://github.com/jerryzh168
    kwanghoon-meta authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    a4250cc View commit details
    Browse the repository at this point in the history
  95. Revert "[optim] Fix: wrong ASGD implementation (pytorch#125440)"

    This reverts commit 2c5ad9a.
    
    Reverted pytorch#125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](pytorch#125440 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    195d01c View commit details
    Browse the repository at this point in the history
  96. Revert "Initial implementation of AdaRound (pytorch#126153)"

    This reverts commit 175c18a.
    
    Reverted pytorch#126153 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit because there are more than one lint issues, torch/optim/asgd.py is just the last one ([comment](pytorch#126153 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    4397921 View commit details
    Browse the repository at this point in the history
  97. Add Lowering for FlexAttention Backwards (pytorch#125515)

    # Summary
    #### What does this PR do?
    It enables Inductor to actually generate the fused flex attention kernel for the backward pass.
    
    I did some other things along the way:
    - Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwd kernel.
    - The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think we should update to a newer version in a follow-up. Notably, the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
    - I didn't correctly register the decomp table + IndexMode when I landed pytorch#123902; this remedies that.
    - The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specific to "future causal" attention.
    - The main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
    - I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications).
    - I updated the benchmark to also profile bwd performance.
    
    ### Benchmark Numbers:
    _The current implementation is not parallelizing over ctx length in the bwd_
    FWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.991 |                    |             |                |
    | Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
    | Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |
    
    BWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.291 |                    |             |                |
    | Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
    | Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |
    
    <details>
    
    <summary>Full Data</summary>
    
    | shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
    |---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
    | (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
    | (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
    | (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
    | (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
    | (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
    | (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
    | (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
    | (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
    | (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
    | (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
    | (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
    | (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
    | (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
    | (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
    | (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
    | (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
    | (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
    | (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
    | (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
    | (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
    | (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
    | (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
    | (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
    | (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
    | (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
    | (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
    | (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
    | (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
    | (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
    | (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
    | (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
    | (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
    | (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
    | (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
    | (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
    | (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
    | (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
    | (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
    | (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
    | (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
    | (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
    | (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
    | (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
    | (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
    | (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
    | (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
    | (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
    | (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
    | (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
    | (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
    | (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
    | (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
    | (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
    | (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
    | (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
    | (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
    | (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
    | (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
    | (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
    | (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
    | (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
    | (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
    | (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
    | (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
    | (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
    | (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
    | (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
    | (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
    | (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
    | (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
    | (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
    | (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
    | (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
    | (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
    | (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
    | (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
    | (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
    | (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
    | (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
    | (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
    | (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
    | (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
    | (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
    | (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
    | (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
    | (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
    | (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
    | (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
    | (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
    | (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
    | (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
    | (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
    | (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
    | (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
    | (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
    | (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
    | (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
    | (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
    | (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
    | (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
    | (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
    | (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
    | (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
    | (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
    | (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
    | (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
    | (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
    | (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |
    
    </details>
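
    For orientation, a sketch of the kind of score_mod callables named in the table above (the `(score, b, h, q_idx, kv_idx)` signature and how they are handed to flex attention are assumptions here, not code from this PR):

    ```python
    import torch

    def noop(score, b, h, q_idx, kv_idx):
        return score

    def causal_mask(score, b, h, q_idx, kv_idx):
        # mask out future positions by pushing their scores to -inf
        return torch.where(q_idx >= kv_idx, score, torch.full_like(score, float("-inf")))

    def relative_bias(score, b, h, q_idx, kv_idx):
        return score + (q_idx - kv_idx)

    def head_bias(score, b, h, q_idx, kv_idx):
        return score + 2 * h
    ```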
    
    Pull Request resolved: pytorch#125515
    Approved by: https://github.com/Chillee
    drisspg authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    22db67f View commit details
    Browse the repository at this point in the history
  98. [dynamo] Delete extra testing of cpp guard manager (pytorch#126343)

    The CPP guard manager has been on for a few weeks now. This separate testing was part of the phased rollout while the cpp guard manager was not yet enabled. It is no longer needed.
    
    Pull Request resolved: pytorch#126343
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303, pytorch#126316, pytorch#126314, pytorch#126327
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8dced59 View commit details
    Browse the repository at this point in the history
  99. fix the device type for with_comms decorator (pytorch#125798)

    Found by @yifuwang: it looks like we are wrongly using
    self.device_type="cuda" for the gloo backend, which is triggering some
    flakiness, e.g. pytorch#125366.
    
    Pull Request resolved: pytorch#125798
    Approved by: https://github.com/yifuwang
    wanchaol authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    c73f90c View commit details
    Browse the repository at this point in the history
  100. Add mode to MemoryDep to track atomic accumulates (pytorch#123223)

    And allow fusion of buffers where writes are only atomic accumulates.
    This allows fusing of ops like
    
      _unsafe_index_put(_unsafe_index_put(a, ...), ...)
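
    A toy reproduction of that nested pattern (sketch only; `aten._unsafe_index_put` and its `accumulate` flag exist, but whether the two writes actually fuse depends on this change):

    ```python
    import torch

    @torch.compile
    def scatter_add_twice(a, idx1, v1, idx2, v2):
        # two accumulate-only writes of the kind the new MemoryDep mode can fuse
        a = torch.ops.aten._unsafe_index_put(a, [idx1], v1, accumulate=True)
        a = torch.ops.aten._unsafe_index_put(a, [idx2], v2, accumulate=True)
        return a

    out = scatter_add_twice(torch.zeros(8),
                            torch.tensor([0, 1]), torch.ones(2),
                            torch.tensor([1, 2]), torch.ones(2))
    ```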
    
    Pull Request resolved: pytorch#123223
    Approved by: https://github.com/peterbell10
    isuruf authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    9f09eae View commit details
    Browse the repository at this point in the history
  101. [c10d] Add an option for NAN check on every collective (pytorch#125726)

    Summary:
    The NaN check is done through a device-side assert, without needing a copy
    from GPU to CPU.
    Test Plan:
    Unit test for collectives that should experience a runtime error
    
    (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
    test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
    failed.
    [rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
    checkForNan: device-side assert triggered
    
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
    failed.
    [rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
    checkForNan: device-side assert triggered
    
    .
    ----------------------------------------------------------------------
    Ran 1 test in 7.723s
    
    OK
    
    Tags:
    
    Pull Request resolved: pytorch#125726
    Approved by: https://github.com/kwen2501
    shuqiangzhang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    2ba6d37 View commit details
    Browse the repository at this point in the history
  102. Generate runtime asserts when propagate real tensor is used (pytorch#126287)
    
    This means that propagate real tensor is no longer unsound: if the
    route we took at compile time diverges from what happens at runtime, you
    will get a runtime assert.
    
    Also add structured trace logs for these.
    
    Also fix bug where xreplace with int range is not guaranteed to return
    a sympy expression.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126287
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8989a88 View commit details
    Browse the repository at this point in the history
  103. [ez] fix exported diff mismatch (pytorch#126357)

    Fixes the following issue:
    D55803461 differs from the exported PR: pytorch#123658
    
    ⚠️ this PR needs to be skipped on diff train!
    
    Pull Request resolved: pytorch#126357
    Approved by: https://github.com/huydhn, https://github.com/fegin
    izaitsevfb authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    1473472 View commit details
    Browse the repository at this point in the history
  104. [Add sliding window attention bias] (pytorch#126061)

    Summary:
    This PR implements sliding-window attention and updates "aten._flash_attention_forward/_flash_attention_backward" to expose the window_size_left and window_size_right arguments. With these kwargs added we can dispatch to the FAv2 impl if the necessary constraints are met.
    
    These arguments will eventually be provided to "aten.sdpa_flash", but for now they are needed by xformers in their effort to directly use the PyTorch FAv2 impl instead of building their own.
    
    Test Plan:
    Use the default aten.sdpa_flash tests since we've added optional arguments set to the previous default value: -1, /*window_size_left*/
    
    Using buck2 build --flagfile fbcode//mode/dev-nosan fbcode//caffe2/caffe2/fb/predictor/tests:inference_context_test
    
    Differential Revision: D56938087
    
    Pull Request resolved: pytorch#126061
    Approved by: https://github.com/drisspg, https://github.com/desertfire
    lvaleriu authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    64fb6ed View commit details
    Browse the repository at this point in the history
  105. Fix lint failures coming from pytorch#126035 (pytorch#126378)

    MYPY somehow shows lots of local failures for me.  The issue is tracked in pytorch#126361.  This is only to keep trunk sane.  These two lines were added by pytorch#126035 as an attempt to fix lint there, but didn't seem to help.
    Pull Request resolved: pytorch#126378
    Approved by: https://github.com/kit1980
    huydhn authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    7dab5f7 View commit details
    Browse the repository at this point in the history
  106. [1/N] Non-Tensor: Scalar Support: Enable aot compile to support aten operations with scalar input like alpha (pytorch#124177)
    
    Some operations have a scalar input parameter, like `torch.add(a, b, alpha=2.0)`.  Currently, aot compile does not support such a case because it requires the signature of the captured graph to align with the operation's signature. This means that some inputs in the captured graph may be scalars (float, int, bool, etc.). This breaks the assumption of `compile_fx_aot`, which assumes all the example inputs are tensors - https://github.com/pytorch/pytorch/blob/0f6ce45bcbd7026c00da43db0317ede10830378b/torch/_inductor/compile_fx.py#L1048
    
    This PR intends to support such cases by allowing a non-aligned signature and filtering out the non-Tensor parameters.
    
    Captured graph for `torch.add(a, b, alpha=2.0)`
    
    ```
    opcode         name      target           args              kwargs
    -------------  --------  ---------------  ----------------  --------------
    placeholder    arg0_1    arg0_1           ()                {}
    placeholder    arg1_1    arg1_1           ()                {}
    call_function  add       aten.add.Tensor  (arg0_1, arg1_1)  {'alpha': 2.0}
    output         output_1  output           ((add,),)         {}
    ```
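
    A small sketch of capturing such a graph (using `torch.export` here only to show the scalar staying as a kwarg; the AOT compile entry point itself is not shown):

    ```python
    import torch

    class Add(torch.nn.Module):
        def forward(self, a, b):
            return torch.add(a, b, alpha=2.0)

    ep = torch.export.export(Add(), (torch.randn(2), torch.randn(2)))
    print(ep.graph)  # aten.add.Tensor(..., alpha = 2.0), matching the graph above
    ```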
    
    Pull Request resolved: pytorch#124177
    Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/jgong5
    EikanWang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    fb2c753 View commit details
    Browse the repository at this point in the history
  107. [Doc] Add deprecated autocast comments for doc (pytorch#126062)

    # Motivation
    We generalized the device-agnostic API `torch.amp.autocast` in [pytorch#125103](pytorch#125103).  After that,
    - `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and
    - `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)`
    
    whether in eager mode or JIT mode.
    Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast`, and **strongly recommend** that developers use the device-agnostic `torch.amp.autocast` API.
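
    For illustration, the two spellings side by side (CPU and bfloat16 chosen so the snippet runs anywhere):

    ```python
    import torch

    x, w = torch.randn(8, 8), torch.randn(8, 8)

    # Recommended device-agnostic API:
    with torch.amp.autocast("cpu", dtype=torch.bfloat16):
        y = x @ w

    # Older device-specific spelling, now deprecated but equivalent:
    with torch.cpu.amp.autocast(dtype=torch.bfloat16):
        y = x @ w
    ```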
    
    Pull Request resolved: pytorch#126062
    Approved by: https://github.com/eqy, https://github.com/albanD
    guangyey authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    45d93f9 View commit details
    Browse the repository at this point in the history
  108. Revert "Fix lint failures coming from pytorch#126035 (pytorch#126378)"

    This reverts commit 5fa1f4c.
    
    Reverted pytorch#126378 on behalf of https://github.com/huydhn due to Trying to add yet another lint fix from https://hud.pytorch.org/pr/pytorch/pytorch/126357 and will reland this ([comment](pytorch#126378 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    75289f2 View commit details
    Browse the repository at this point in the history
  109. Revert "Add Lowering for FlexAttention Backwards (pytorch#125515)"

    This reverts commit 95b9e98.
    
    Reverted pytorch#125515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the newly added test runs out of memory https://hud.pytorch.org/pytorch/pytorch/commit/95b9e981c3ab68fc17f78b8a6bbfd9569745ae4c ([comment](pytorch#125515 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    dd2f8d1 View commit details
    Browse the repository at this point in the history
  110. Fix lint failures coming from pytorch#126035 (pytorch#126378)

    MYPY somehow shows lots of local failures for me.  The issue is tracked in pytorch#126361.  This is only to keep trunk sane.  These two lines were added by pytorch#126035 as an attempt to fix lint there, but didn't seem to help.
    
    Pull Request resolved: pytorch#126378
    Approved by: https://github.com/kit1980
    huydhn authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    adc0551 View commit details
    Browse the repository at this point in the history
  111. [Traceable FSDP2] Add all_gather_into_tensor out variant (pytorch#126334)
    
    This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`.
    
    It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather, makes the input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, the AllGather op will then create a brand-new output buffer (instead of reusing it), thus significantly increasing the memory usage.
    
    The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users.
    
    Pull Request resolved: pytorch#126334
    Approved by: https://github.com/yifuwang, https://github.com/wanchaol
    yf225 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    64efc14 View commit details
    Browse the repository at this point in the history
  112. Configuration menu
    Copy the full SHA
    0ddafc0 View commit details
    Browse the repository at this point in the history
  113. [Reopen] Upgrade submodule oneDNN to v3.4.2 (pytorch#126137)

    Reopen of pytorch#122472
    
    ## Improvements
    This upgrade fixes the following issues:
    - pytorch#120982
    
    This upgrade brings the following new features:
    - Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (pytorch#114450)
    
    ## Validation results on CPU
    Original results with oneDNN v3.4.1 are here: pytorch#122472 (comment)
    
    Need to rerun validation and update results.
    
    Co-authored-by: Sunita Nadampalli <nadampal@amazon.com>
    Pull Request resolved: pytorch#126137
    Approved by: https://github.com/jgong5, https://github.com/snadampal, https://github.com/atalman
    Xia-Weiwen authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8288174 View commit details
    Browse the repository at this point in the history
  114. [FSDP2] Supported set_all_reduce_gradients=False for HSDP (pytorch#126166)
    
    **Context**
    For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients).
    - FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`.
    - FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`.
    
    For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2).
    - FSDP2 offers (1) without any intervention like mentioned above.
    - FSDP2 offers (3) via `module.set_requires_gradient_sync()` like mentioned above.
    - FSDP2 offers (2) via `module.set_requires_all_reduce()` similar to `set_requires_gradient_sync()`.
    
    **Overview**
    For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:
    ```
    for microbatch_idx, microbatch in enumerate(microbatches):
        is_last_microbatch = microbatch_idx == len(microbatches) - 1
        model.set_requires_all_reduce(is_last_microbatch)
        # Run forward/backward
    ```
    
    This PR also makes the minor change of making the `recurse: bool` argument in these setter methods kwarg-only.
    
    **Developer Notes**
    We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. We prefer to avoid this alternative for now because it introduces more complexity to do extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back to the last microbatch's reduce output.
    
    Pull Request resolved: pytorch#126166
    Approved by: https://github.com/weifengpy, https://github.com/wanchaol
    ghstack dependencies: pytorch#126067, pytorch#126070, pytorch#126161
    awgu authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    5dd875a View commit details
    Browse the repository at this point in the history
  115. Fix aarch64 debug build with GCC (pytorch#126290)

    By working around GCC's quirks in instantiating templates that require immediate values.
    Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).
    
    Fixes pytorch#126283
    
    Pull Request resolved: pytorch#126290
    Approved by: https://github.com/atalman, https://github.com/seemethere
    malfet authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    cebb5df View commit details
    Browse the repository at this point in the history
  116. Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (pytorch#126336)
    
    Fixes pytorch#125504
    Fixes pytorch#126252
    Fixes pytorch#126296
    Fixes pytorch#126330
    
    This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR.
    
    Pull Request resolved: pytorch#126336
    Approved by: https://github.com/huydhn, https://github.com/pruthvistony
    jithunnair-amd authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    60fb3ef View commit details
    Browse the repository at this point in the history
  117. [ROCm] amax hipblaslt integration (pytorch#125921)

    AMAX is coming as part of ROCm 6.2. This code adds that functionality.
    
    Pull Request resolved: pytorch#125921
    Approved by: https://github.com/eqy, https://github.com/lezcano
    alugorey authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    9df7bda View commit details
    Browse the repository at this point in the history
  118. Commit 19dfbce (no commit message)
  119. [AOTI][torchgen] Support at::Generator via C shim (pytorch#126181)

    Summary: Support at::Generator which is used by many random number generator ops
    Pull Request resolved: pytorch#126181
    Approved by: https://github.com/chenyang78
    desertfire authored and ZelboK committed May 19, 2024
  120. [AOTI] Refactor some fallback op util functions (pytorch#126182)

    Summary: Move some util functions for cpp kernel naming and missing arg filling from FallbackKernel to ExternKernel, since they are useful for ExternKernel in general.
    
    Pull Request resolved: pytorch#126182
    Approved by: https://github.com/chenyang78
    ghstack dependencies: pytorch#126181
    desertfire authored and ZelboK committed May 19, 2024
  121. [AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (pytorch#126183)
    
    Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes pytorch#121809
    
    Pull Request resolved: pytorch#126183
    Approved by: https://github.com/angelayi
    ghstack dependencies: pytorch#126181, pytorch#126182
    desertfire authored and ZelboK committed May 19, 2024
  122. [AOTI][refactor] Add aoti_torch_item as a util function (pytorch#126352)

    Summary: The logic has been repeated several times in the code, so it's worth writing a common util function.
    
    Pull Request resolved: pytorch#126352
    Approved by: https://github.com/chenyang78
    ghstack dependencies: pytorch#126181, pytorch#126182, pytorch#126183
    desertfire authored and ZelboK committed May 19, 2024
  123. Commit 667af78 (no commit message)
  124. [BE][FSDP] Remove unnecessary warnings (pytorch#126365)

    As title
    
    Differential Revision: [D57419704](https://our.internmc.facebook.com/intern/diff/D57419704/)
    
    Pull Request resolved: pytorch#126365
    Approved by: https://github.com/awgu, https://github.com/Skylion007
    ghstack dependencies: pytorch#126362
    fegin authored and ZelboK committed May 19, 2024
  125. [onnx.export] Cache SetGraphInputTypeReliable (pytorch#124912)

    This PR is part of an effort to speed up torch.onnx.export (pytorch#121422).
    
    - For each node that is processed in onnx.export, a check is run to see if all inputs are "reliable" (static shape, etc.). This value does not change, so it is much faster to cache it on the first computation. The caching is added to the ConstantMap state.
    - Resolves (6) in pytorch#121422.
    - Also see pytorch#123028 with a similar addition of a cache state.
    
    (partial fix of pytorch#121545)
    
    Pull Request resolved: pytorch#124912
    Approved by: https://github.com/justinchuby
    gustavla authored and ZelboK committed May 19, 2024
  126. Remove redundant serialization code (pytorch#126249)

    After pytorch#123308, we no longer need separate serialization path to handle different types that exist in the `nn_module` metadata. This PR cleans up the redundant code.
    Pull Request resolved: pytorch#126249
    Approved by: https://github.com/angelayi
    jiashenC authored and ZelboK committed May 19, 2024
  127. Commit b24a9e3 (no commit message)
  128. xpu: implement xpu serialization (pytorch#125530)

    Fixes: pytorch#125529
    
    BC-breaking note:
    The deprecated `async` argument to `Storage.cuda` and `Storage.hpu` has been removed. Use `non_blocking` instead.
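    As a minimal migration sketch (assuming a CPU storage; whether a CUDA device is actually available is incidental to the API change):
    
    ```
    import torch
    
    s = torch.randn(4).untyped_storage()
    # The removed keyword was spelled `async`; use non_blocking instead:
    s_cuda = s.cuda(non_blocking=True)
    ```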
    
    CC: @jbschlosser, @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @albanD
    
    Pull Request resolved: pytorch#125530
    Approved by: https://github.com/guangyey, https://github.com/albanD
    dvrogozh authored and ZelboK committed May 19, 2024
  129. Don't install inplace_methods on MockHandler, not needed (pytorch#126398)
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126398
    Approved by: https://github.com/jansel, https://github.com/peterbell10
    ezyang authored and ZelboK committed May 19, 2024
  130. Make 'pytest test/inductor/test_memory_planning.py' work (pytorch#126397)
    
    There's still another naughty direct test_* import, I'm out of patience
    right now though.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126397
    Approved by: https://github.com/peterbell10, https://github.com/int3
    ezyang authored and ZelboK committed May 19, 2024
  131. Switched from parameter in can_cast to from_. (pytorch#126030)

    Fixes pytorch#126012.
    
    `from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs.
    
    If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
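    A small usage sketch of the renamed keyword (the dtype pairs are just examples):
    
    ```
    import torch
    
    # Positional form, unchanged:
    torch.can_cast(torch.int64, torch.float32)            # True
    # Keyword form is now spelled `from_`, since `from` is reserved in Python:
    torch.can_cast(from_=torch.float64, to=torch.int64)   # False under the default casting rule
    ```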
    
    Pull Request resolved: pytorch#126030
    Approved by: https://github.com/albanD
    tringwald authored and ZelboK committed May 19, 2024
  132. [Traceable FSDP2] Use DTensor.from_local() in _from_local_no_grad when compile (pytorch#126346)
    
    As discussed before, for now Dynamo is not able to support the DTensor constructor, so instead we have to use `DTensor.from_local()`.
    
    This won't affect eager and it's a compile-only change.
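    For reference, a hedged sketch of the `DTensor.from_local()` call being routed to (the mesh setup and import paths are illustrative assumptions):
    
    ```
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed._tensor import DTensor, Shard
    
    mesh = init_device_mesh("cuda", (8,))        # assumes an 8-rank job
    local_shard = torch.randn(16, 32)            # this rank's local shard
    dt = DTensor.from_local(local_shard, mesh, [Shard(0)])
    ```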
    
    Pull Request resolved: pytorch#126346
    Approved by: https://github.com/awgu
    yf225 authored and ZelboK committed May 19, 2024
  133. Fix strict default value in StateDictOptions (pytorch#125998)

    Fixes pytorch#125992
    
    The default value of the parameter `strict` should be `True`.
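    A short sketch of where the default matters (`model` and `loaded_state_dict` are placeholders):
    
    ```
    from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict
    
    # strict now correctly defaults to True; pass strict=False to tolerate mismatched keys.
    set_model_state_dict(
        model,
        model_state_dict=loaded_state_dict,
        options=StateDictOptions(strict=False),
    )
    ```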
    
    Pull Request resolved: pytorch#125998
    Approved by: https://github.com/fegin
    shink authored and ZelboK committed May 19, 2024
  134. Print export warning only once in capture_pre_autograd (pytorch#126403)

    Summary: Missed this in D57163341
    
    Test Plan: CI
    
    Differential Revision: D57442088
    
    Pull Request resolved: pytorch#126403
    Approved by: https://github.com/zhxchen17
    tarun292 authored and ZelboK committed May 19, 2024
  135. [compiled autograd] Fix LoggingTensor flaky test (pytorch#126144)

    LoggingTensor fails consistently when the root logger level is INFO or lower.
    By default, the root logger should be at WARNING.
    But triton driver initialization will overwrite the root logger to INFO, which causes flakiness: pytorch#126143
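    A minimal sketch of the interaction, using only the standard-library logging module:
    
    ```
    import logging
    
    root = logging.getLogger()
    # The expected default is WARNING; if something (e.g. the triton driver init) bumps it
    # to INFO, LoggingTensor-based tests capture extra records and become flaky.
    root.setLevel(logging.WARNING)   # restore the expected default
    ```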
    
    Pull Request resolved: pytorch#126144
    Approved by: https://github.com/jansel
    xmfan authored and ZelboK committed May 19, 2024
  136. [inductor] Clear cache on ctx manager exit (pytorch#126146)

    FIXES pytorch#126128.
    
    Right now, we only clear the cache on ctx manager enter, so the state is stale unless we call fresh_inductor_cache again; this is usually fine in tests.
    
    Cue the compiled autograd tests when going from TestCompiledAutograd -> TestAutogradWithCompiledAutograd:
    TestCompiledAutograd uses the ctx manager, but TestAutogradWithCompiledAutograd doesn't.
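    For context, a hedged usage sketch of the context manager in question (the `torch._inductor.utils` import path is its current location and may move between releases):
    
    ```
    import torch
    from torch._inductor.utils import fresh_inductor_cache
    
    def fn(x):
        return (x * 2).sin()
    
    with fresh_inductor_cache():
        torch.compile(fn)(torch.randn(8))
    # With this PR, the temporary cache state is also cleared here on exit, so later
    # tests that don't use the context manager start from a clean cache.
    ```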
    
    Pull Request resolved: pytorch#126146
    Approved by: https://github.com/jgong5, https://github.com/oulgen
    ghstack dependencies: pytorch#126144
    xmfan authored and ZelboK committed May 19, 2024
  137. [compiled autograd] clear compiled_autograd_verbose once test is done (pytorch#126148)
    
    The verbose flag leaks into tests run afterwards.
    
    Pull Request resolved: pytorch#126148
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#126144, pytorch#126146
    xmfan authored and ZelboK committed May 19, 2024
  138. Commit 19e7924 (no commit message)
  139. Eliminate some C++11 checks (pytorch#126308)

    Test Plan: Sandcastle
    
    Reviewed By: palmje
    
    Differential Revision: D57246912
    
    Pull Request resolved: pytorch#126308
    Approved by: https://github.com/Skylion007
    r-barnes authored and ZelboK committed May 19, 2024
  140. Add prefix option to CapabilityBasedPartitioner (pytorch#126382)

    Summary: Add prefix arg so that users can provide the submodule name to partitioner.
    
    Test Plan: https://fburl.com/anp/2kue4qp9
    
    Differential Revision: D57416926
    
    Pull Request resolved: pytorch#126382
    Approved by: https://github.com/SherlockNoMad
    hongyang-zhao authored and ZelboK committed May 19, 2024
  141. Import MKL via //third-party/mkl targets (pytorch#126371)

    Summary:
    This is a step towards upgrading the MKL library and using a buckified targets rather than importing from TP2.
    
    - Add new `//third-party/mkl:mkl_xxx` targets that are currently aliases to `third-party//IntelComposerXE:mkl_xxx`.
    - Switch usage of `external_deps = [("IntelComposerXE", None, "mkl_xxx")]` to `deps = ["fbsource//third-party/mkl:mkl_xxx"]`
    
    Note that this only changes `mkl_xxx` references in `IntelComposerXE` but not references to "svml" or "ipp*".
    
    Test Plan: sandcastle
    
    Differential Revision: D57360438
    
    Pull Request resolved: pytorch#126371
    Approved by: https://github.com/bertmaher
    MatzeB authored and ZelboK committed May 19, 2024
  142. [c10d] add pg_name and pg_desc to logger (pytorch#126409)

    Summary:
    This should further improve our debuggability
    
    Tags:
    
    Pull Request resolved: pytorch#126409
    Approved by: https://github.com/XilunWu
    shuqiangzhang authored and ZelboK committed May 19, 2024
  143. Use object identity for deepcopy memo (pytorch#126126)

    Copy of pytorch#126089, with some additional fixes & tests
    
    Partial fix for pytorch#125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation.
    
    The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes:
    * (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
    * (still kind of wrong): views won't actually alias each other after deepcopying.
    * (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias.
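    A small illustration of the identity-based behavior described above (shapes and strides are arbitrary):
    
    ```
    import copy
    import torch
    
    a = torch.arange(16.0)
    b = a.as_strided((4, 4), (4, 1))   # a view into `a`, but a distinct tensor object
    
    a2, b2 = copy.deepcopy([a, b])
    # Per-identity copying: b2 is no longer silently made equal to a2, though b2 also
    # no longer aliases a2's storage after the copy.
    ```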
    
    BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.
    
    Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306)
    Pull Request resolved: pytorch#126126
    Approved by: https://github.com/ezyang
    davidberard98 authored and ZelboK committed May 19, 2024
  144. Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (pytorch#126068)"
    
    This reverts commit 927e631.
    
    Reverted pytorch#126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR pytorch#124021 is going to be revert ([comment](pytorch#126019 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  145. Revert "[inductor][cpp] epilogue support for gemm template (pytorch#126019)"
    
    This reverts commit 7844c20.
    
    Reverted pytorch#126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR pytorch#124021 is going to be revert ([comment](pytorch#126019 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  146. Revert "[inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)"

    This reverts commit f060b0c.
    
    Reverted pytorch#124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](pytorch#124021 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  147. Add Lowering for FlexAttention Backwards (pytorch#125515)

    # Summary
    #### What does this PR do?
    It enables Inductor to actually generate the fused flex attention kernel for the backwards
    
    I did some other things along the way:
    - Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
    - The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
    - I didn't correctly register the decomp table + IndexMode when I landed pytorch#123902; this remedies that.
    - The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specifically for "future causal" attention.
    - The main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
    - I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
    - I updated the benchmark to also profile bwds performance
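    As a hedged usage sketch (not taken from this PR; the public import path shown below may differ at this commit), a score_mod receives (score, batch, head, q_idx, kv_idx) and returns a modified score, and backward through the compiled op exercises the fused kernel added here:
    
    ```
    import torch
    from torch.nn.attention.flex_attention import flex_attention  # path is an assumption
    
    def rel_bias(score, b, h, q_idx, kv_idx):
        return score + (kv_idx - q_idx)          # simple relative-position bias
    
    q, k, v = (torch.randn(2, 16, 512, 64, device="cuda", requires_grad=True) for _ in range(3))
    out = torch.compile(flex_attention)(q, k, v, score_mod=rel_bias)
    out.sum().backward()
    ```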
    
    ### Benchmark Numbers:
    _The current implementation is not parallelizing over ctx length in the bwd_
    FWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.991 |                    |             |                |
    | Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
    | Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |
    
    BWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.291 |                    |             |                |
    | Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
    | Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |
    
    <details>
    
    <summary>Full Data</summary>
    
    | shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
    |---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
    | (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
    | (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
    | (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
    | (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
    | (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
    | (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
    | (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
    | (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
    | (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
    | (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
    | (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
    | (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
    | (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
    | (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
    | (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
    | (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
    | (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
    | (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
    | (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
    | (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
    | (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
    | (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
    | (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
    | (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
    | (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
    | (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
    | (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
    | (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
    | (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
    | (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
    | (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
    | (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
    | (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
    | (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
    | (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
    | (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
    | (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
    | (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
    | (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
    | (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
    | (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
    | (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
    | (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
    | (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
    | (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
    | (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
    | (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
    | (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
    | (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
    | (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
    | (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
    | (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
    | (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
    | (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
    | (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
    | (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
    | (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
    | (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
    | (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
    | (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
    | (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
    | (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
    | (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
    | (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
    | (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
    | (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
    | (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
    | (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
    | (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
    | (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
    | (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
    | (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
    | (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
    | (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
    | (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
    | (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
    | (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
    | (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
    | (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
    | (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
    | (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
    | (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
    | (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
    | (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
    | (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
    | (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
    | (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
    | (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
    | (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
    | (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
    | (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
    | (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
    | (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
    | (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
    | (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
    | (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
    | (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
    | (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
    | (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
    | (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
    | (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
    | (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
    | (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
    | (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
    | (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
    | (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
    | (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
    | (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |
    
    </details>
    
    Pull Request resolved: pytorch#125515
    Approved by: https://github.com/Chillee
    drisspg authored and ZelboK committed May 19, 2024
  148. Fix documentation for register_fake_class (pytorch#126422)

    Pull Request resolved: pytorch#126422
    Approved by: https://github.com/angelayi
    ydwu4 authored and ZelboK committed May 19, 2024
  149. [export] Delete predispatch tests (pytorch#126459)

    Deleting predispatch tests as we moved export to predispatch already
    Pull Request resolved: pytorch#126459
    Approved by: https://github.com/tugsbayasgalan
    angelayi authored and ZelboK committed May 19, 2024
  150. [DeviceMesh] Supported N groups in from_group (pytorch#126258)

    **Overview**
    This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).
    
    This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.
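    A hedged sketch of the new call for the HSDP case (`replicate_pg`, `shard_pg`, and the sizes are placeholders; argument names follow the description above):
    
    ```
    import torch
    from torch.distributed.device_mesh import DeviceMesh
    
    # replicate_size * shard_size == world size; one ProcessGroup per mesh dimension.
    hsdp_mesh = DeviceMesh.from_group(
        [replicate_pg, shard_pg],
        device_type="cuda",
        mesh=torch.arange(replicate_size * shard_size).reshape(replicate_size, shard_size),
        mesh_dim_names=("replicate", "shard"),
    )
    ```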
    
    <details>
    <summary> Old Approach </summary>
    
    **Overview**
    - This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
        - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
    - This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.
    
    </details>
    
    Pull Request resolved: pytorch#126258
    Approved by: https://github.com/wanchaol
    awgu authored and ZelboK committed May 19, 2024
  151. [easy] Fix typing for map_location docs in torch.load (pytorch#125473)

    Currently it incorrectly lists `Callable[[Tensor, str], Tensor]` as a possible type signature; this should be `Callable[[Storage, str], Storage]`.
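    A quick sketch of the corrected signature in use (the file path is a placeholder):
    
    ```
    import torch
    
    # map_location receives (Storage, original_location_str) and returns a Storage:
    obj = torch.load("checkpoint.pt", map_location=lambda storage, loc: storage)  # keep on CPU
    ```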
    
    <img width="716" alt="Screenshot 2024-05-03 at 12 09 54 PM" src="https://github.com/pytorch/pytorch/assets/35276741/b8946f95-8297-445f-a9d9-570b8a3caab1">
    
    Pull Request resolved: pytorch#125473
    Approved by: https://github.com/albanD
    mikaylagawarecki authored and ZelboK committed May 19, 2024
  152. [doc] expose torch.Tensor.xpu API to doc (pytorch#126383)

    # Motivation
    The docstring for `torch.Tensor.xpu` was added [here](https://github.com/pytorch/pytorch/blob/d61a81a9e76688ac8f338a6cfba932bf7779e5ce/torch/_tensor_docs.py#L1434) but is not exposed in the public docs, unlike [torch.Tensor.cuda](https://pytorch.org/docs/stable/generated/torch.Tensor.cuda.html#torch.Tensor.cuda). This PR exposes the documentation for `torch.Tensor.xpu` in the public docs.
    
    Pull Request resolved: pytorch#126383
    Approved by: https://github.com/albanD
    guangyey authored and ZelboK committed May 19, 2024
  153. Add symbolic_shape_specialization structured trace (pytorch#126450)

    This is typically the information you want when diagnosing why something
    overspecialized in dynamic shapes.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126450
    Approved by: https://github.com/albanD
    ezyang authored and ZelboK committed May 19, 2024
  154. Make inductor scheduler graph extension configurable (pytorch#125578)

    This patch makes the inductor scheduler graph extension configurable.
    It enables ease of debugging by changing the graph format (dot, png, etc.).
    
    Particularly, it's very convenient to work with the graph interactively using tools like https://github.com/tintinweb/vscode-interactive-graphviz
    
    Pull Request resolved: pytorch#125578
    Approved by: https://github.com/Chillee
    AlexDenisov authored and ZelboK committed May 19, 2024
  155. [FSDP2][Test] Fix _test_clip_grad_norm (pytorch#126457)

    Fixes #ISSUE_NUMBER
    We need to compare ref_total_norm to total_norm.full_tensor().
    Example:
    ```
    iter_idx:0, rank:0,\
    ref_total_norm=tensor(1052.5934, device='cuda:0'),\
    total_norm=DTensor(local_tensor=482.0861511230469, device_mesh=DeviceMesh([0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2.0),)),\
    total_norm.full_tensor()=tensor(1052.5934, device='cuda:0')
    ```
    
    Pull Request resolved: pytorch#126457
    Approved by: https://github.com/awgu
    wz337 authored and ZelboK committed May 19, 2024
  156. dont pad 0 dim mm inputs (pytorch#126475)

    Otherwise you get an error in constant_pad_nd.
    
    Pull Request resolved: pytorch#126475
    Approved by: https://github.com/huydhn
    ghstack dependencies: pytorch#125772, pytorch#125773, pytorch#125780
    eellison authored and ZelboK committed May 19, 2024
  157. c10d: add Collectives abstraction (pytorch#125978)

    This adds a new `Collectives` API for doing distributed collectives operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives.
    
    Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit
    
    The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow up PR.
    
    Test plan:
    
    ```
    python test/distributed/test_collectives.py -v
    ```
    
    This tests both functionality using multiple threads as well as timeout behavior.
    
    Pull Request resolved: pytorch#125978
    Approved by: https://github.com/shuqiangzhang
    d4l3k authored and ZelboK committed May 19, 2024
  158. Add dist_pp shortcut to TORCH_LOGS (pytorch#126322)

    The distributed log category already includes pipelining, since it's under the torch.distributed umbrella.
    
    So both TORCH_LOGS=distributed and TORCH_LOGS=dist_pp will enable PP
    logs.
    Pull Request resolved: pytorch#126322
    Approved by: https://github.com/kwen2501
    wconstab authored and ZelboK committed May 19, 2024
  159. [dtensor] refactor view ops to use OpStrategy (pytorch#126011)

    As titled. Some ops require adjustment of output shape argument. In rule-based sharding prop, global output shape was inferred in the rule (in `view_ops.py`). In strategy-based sharding prop, it is now obtained from propagated out_tensor_meta (in `sharding_prop.py`).
    
    Pull Request resolved: pytorch#126011
    Approved by: https://github.com/wanchaol, https://github.com/XilunWu
    tianyu-l authored and ZelboK committed May 19, 2024
  160. [XPU] call empty_cache for dynamo tests (pytorch#126377)

    When running a batch of models, the lack of an `empty_cache()` call would result in OOM for subsequent models.
    
    This PR unifies the `empty_cache` call for both CUDA and XPU.
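    A hedged sketch of the device-agnostic pattern this refers to (the helper name is illustrative, not the benchmark harness's actual function):
    
    ```
    import torch
    
    def empty_gpu_cache(device: str) -> None:
        # Release cached allocator blocks between models so later models don't OOM.
        if device == "cuda":
            torch.cuda.empty_cache()
        elif device == "xpu":
            torch.xpu.empty_cache()
    ```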
    
    Pull Request resolved: pytorch#126377
    Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
    Stonepia authored and ZelboK committed May 19, 2024
  161. Refactor partitioner and clean it up (pytorch#126318)

    Pull Request resolved: pytorch#126318
    Approved by: https://github.com/anijain2305
    Chillee authored and ZelboK committed May 19, 2024
  162. [DTensor] Turn on foreach implementation for clip_grad_norm_ for DTensor by default (pytorch#126423)
    
    Fixes #ISSUE_NUMBER
    
    Pull Request resolved: pytorch#126423
    Approved by: https://github.com/awgu
    wz337 authored and ZelboK committed May 19, 2024
  163. Fix cummax and cummin lowering for empty case (pytorch#126461)

    Pull Request resolved: pytorch#126461
    Approved by: https://github.com/peterbell10
    isuruf authored and ZelboK committed May 19, 2024
  164. [Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (pytorch#122593)
    
    **Description**
    Lower the qlinear binary post op pattern to Inductor. Use post op sum (in-place) if the extra input has the same dtype as output. Otherwise, it uses binary add.
    
    **Supported linear-binary(-unary) patterns**
    ```
        linear(X)   extra input
               \   /
                Add
                 |
            Optional(relu)
                 |
                 Y
    
    1. int8-mixed-fp32
    +---+---------------+-----------+------------------------------+---------+
    | # | Add type      | Quant out | Pattern                      | Post op |
    +---+---------------+-----------+------------------------------+---------+
    | 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
    +---+---------------+-----------+------------------------------+---------+
    | 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
    +---+---------------+-----------+------------------------------+---------+
    
    2. int8-mixed-bf16
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | # | X2 dtype | Add type      | Quant out | Pattern                                          | Post op |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 1 | BF16     | In-/out-place | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 2 | BF16     | In-/out-place | No        | linear + bf16 -> (relu)                          | sum     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 3 | FP32     | Out-place     | Yes       | linear + fp32 -> (relu) -> q                     | add     |
    |   |          | In-place right|           |                                                  |         |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 4 | FP32     | Out-place     | No        | linear + fp32 -> (relu)                          | sum     |
    |   |          | In-place right|           |                                                  |         |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 5 | FP32     | In-place left | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 6 | FP32     | In-place left | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    ```
    Note:
    (1) The positions of linear and the extra input can be swapped.
    (2) We don't insert q-dq before the extra input of linear-add by recipe. But if q-dq is found at the
    extra input, we don't match that pattern, because we cannot match all these patterns in 3 passes.
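    For orientation, a minimal eager-mode module that produces the linear-add(-relu) pattern above (the quantization recipe itself is applied separately by the X86Inductor quantizer):
    
    ```
    import torch
    
    class LinearAddRelu(torch.nn.Module):
        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            self.linear = torch.nn.Linear(in_features, out_features)
    
        def forward(self, x, extra):
            # linear(X) + extra input, followed by the optional unary post op (relu)
            return torch.relu(self.linear(x) + extra)
    ```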
    
    **Test plan**
    python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
    python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add
    
    Pull Request resolved: pytorch#122593
    Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
    Xia-Weiwen authored and ZelboK committed May 19, 2024
  165. variable search spaces for gemm autotuning (pytorch#126220)

    Add a switch to change the gemm autotuning search space between the default (the current set of hardcoded configs) and an exhaustive search space that enumerates all block sizes in [16, 32, 64, 128, 256], stages in [1, 2, 3, 4, 5], and warps in [2, 4, 6].
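    A hedged sketch of flipping that switch; the config name and value below are assumptions for illustration, not quoted from this PR:
    
    ```
    import torch
    import torch._inductor.config as inductor_config
    
    inductor_config.max_autotune_gemm_search_space = "EXHAUSTIVE"   # assumed name/value
    
    @torch.compile(mode="max-autotune")
    def matmul(a, b):
        return a @ b
    ```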
    
    Pull Request resolved: pytorch#126220
    Approved by: https://github.com/eellison
    nmacchioni authored and ZelboK committed May 19, 2024
  166. save the reciprocal of weights for welford_reduce (pytorch#125148)

    Save the reciprocal of the weights for welford_reduce to avoid redundant divisions and improve performance; `weight_recps` will be inserted into the generated vectorized kernel.
    
    Generated code:
    
    - Before:
    
    ```
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
    }
    ```
    
    - After:
    
    ```
    static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
    }
    ```
    
    Performance:
    
    - Single core:
    
    Op | shape | eager/ms | inductor/ms | optimized inductor/ms
    -- | -- | -- | -- | --
    layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208
    var | (56, 384, 1024) | 21.752 | 13.258 | 13.102
    
    - 4 cores:
    
    Op | shape | eager/ms | inductor/ms | optimized inductor/ms
    -- | -- | -- | -- | --
    layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223
    var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163
    
    Pull Request resolved: pytorch#125148
    Approved by: https://github.com/jgong5, https://github.com/peterbell10
    CaoE authored and ZelboK committed May 19, 2024
  167. [Submodule] Remove zstd dependency (pytorch#126485)

    After searching in the codebase, it seems that zstd is not in use now.
    
    Pull Request resolved: pytorch#126485
    Approved by: https://github.com/ezyang
    cyyever authored and ZelboK committed May 19, 2024
  168. Update ops handler documentation some more (pytorch#126480)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126480
    Approved by: https://github.com/peterbell10
    ghstack dependencies: pytorch#126292, pytorch#126299
    ezyang authored and ZelboK committed May 19, 2024
  169. [FSDP2] Fixed 2D clip grad norm test (pytorch#126497)

    This fixes pytorch#126484.
    
    We change from a transformer to an MLP stack since the transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise the `(S(0), R)` placement.
    
    Pull Request resolved: pytorch#126497
    Approved by: https://github.com/weifengpy, https://github.com/wz337
    awgu authored and ZelboK committed May 19, 2024
  170. Default to env variable instead of config value for precompile parallelism (pytorch#126333)
    
    Previously, we would default to the config `compile_threads`, which controls the number of forks we use for async compile. It defaults to 1 in fbcode because fork() has known safety issues. In precompilation, we are using threads, which have no such safety issues and should strictly improve compile time. There isn't really any reason to reduce the thread count except for testing, and it doesn't make sense to share the same value used for determining forks.
    
    This change makes it default to using as many threads as needed unless the env variable is set.
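    
    A minimal sketch of the selection logic described above (the environment variable name and helper are illustrative, not the actual config names):
    
    ```
    import os
    
    def precompile_worker_count() -> int:
        # Hypothetical env override for precompile parallelism; otherwise use
        # as many threads as there are CPUs, independent of compile_threads.
        env = os.environ.get("TORCHINDUCTOR_PRECOMPILE_THREADS")
        if env is not None:
            return int(env)
        return os.cpu_count() or 1
    ```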
    
    Differential Revision: [D57473023](https://our.internmc.facebook.com/intern/diff/D57473023)
    Pull Request resolved: pytorch#126333
    Approved by: https://github.com/nmacchioni
    eellison authored and ZelboK committed May 19, 2024
  171. Delete refactored function, move changes over (pytorch#126407)

    Oops, in pytorch#125610 I moved this function to runtime_wrappers.py, but forgot to delete the old one. pytorch#126234 then modified the old copy, which had no effect, so I'm applying the change correctly now and deleting the function as I intended.
    
    Pull Request resolved: pytorch#126407
    Approved by: https://github.com/eellison
    jamesjwu authored and ZelboK committed May 19, 2024
  172. [optim] Fix: wrong ASGD implementation (pytorch#126375)

    This PR is based on pytorch#125440, additionally merging the latest main branch and fixing the lint failures from pytorch#126361.
    
    Pull Request resolved: pytorch#126375
    Approved by: https://github.com/janeyx99
    david20571015 authored and ZelboK committed May 19, 2024
  173. 0be8b0f
  174. Remove removed ruff rule TRY200 (pytorch#126256)

    My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema.
    
    From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/
    
    > This rule has been removed and its documentation is only available for historical reasons.
    >
    > This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.
    
    and we are currently explicitly ignoring B904.
    
    Pull Request resolved: pytorch#126256
    Approved by: https://github.com/Skylion007
    ringohoffman authored and ZelboK committed May 19, 2024
  175. [Perf] Vectorize more dtype for int4mm (pytorch#126512)

    It used to be vectorized only for f16, but there is no reason not to do the same for bf16 or f32.
    
    Spiritual followup of pytorch#125290
    
    Pull Request resolved: pytorch#126512
    Approved by: https://github.com/Skylion007
    malfet authored and ZelboK committed May 19, 2024
  176. [inductor] fix unbacked case in pointwise + reduction vertical fusion (

    …pytorch#125982)
    
    ```
    $ INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 python test/inductor/test_unbacked_symints.py -k test_vertical_pointwise_reduction_fusion
    
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1953, in fuse_nodes_once
        for node1, node2 in self.get_possible_fusions():
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2010, in get_possible_fusions
        check_all_pairs(node_grouping)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1997, in check_all_pairs
        if self.can_fuse(node1, node2):
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2252, in can_fuse
        return self.get_backend(device).can_fuse_vertical(node1, node2)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 39, in can_fuse_vertical
        return self._triton_scheduling.can_fuse_vertical(node1, node2)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3237, in can_fuse
        if not all(
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3238, in <genexpr>
        TritonKernel.is_compatible((numel2, rnumel2), n.get_ranges())
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1543, in is_compatible
        cls._split_iteration_ranges(groups, lengths)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1507, in _split_iteration_ranges
        while current_group < len(remaining) and sv.size_hint(remaining[current_group]) == 1:
      File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 442, in size_hint
        return int(out)
      File "/home/colinpeppler/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/core/expr.py", line 320, in __int__
        raise TypeError("Cannot convert symbols to int")
    torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
    TypeError: Cannot convert symbols to int
    ```
    
    Where the unbacked symints show up:
    ```
    > /data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py(1506)_split_iteration_ranges()
    (Pdb) print(groups)
    (1, 512*u0)
    (Pdb) print(lengths)
    ([u0, 32, 16], [])
    ```
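    
    For illustration, a small standalone reproduction of the failure mode in the traceback (`u0` stands in for the unbacked symint; this is not the scheduler code itself):
    
    ```
    import sympy
    
    # size_hint() eventually calls int() on an expression that still contains
    # an unbacked symbol, which sympy refuses to convert.
    u0 = sympy.Symbol("u0", integer=True, positive=True)
    expr = 512 * u0
    try:
        int(expr)
    except TypeError as e:
        print(e)  # Cannot convert symbols to int
    ```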
    
    Pull Request resolved: pytorch#125982
    Approved by: https://github.com/jansel
    ColinPeppler authored and ZelboK committed May 19, 2024
  177. Workflow for uploading additional test stats on workflow dispatch (py…

    …torch#126080)
    
    This is kind of an experiment for uploading test stats during the run, and also for the test dashboard so it can recalculate the info.
    
    Adds a workflow that is callable via workflow dispatch for uploading additional test stats.
    Adds a script that only calculates the additional info.
    
    Pull Request resolved: pytorch#126080
    Approved by: https://github.com/ZainRizvi
    clee2000 authored and ZelboK committed May 19, 2024
  178. Allow tensor subclasses and add `torch.serialization.add_safe_globals…

    …` that allows users to allowlist classes for `weights_only` load (pytorch#124331)
    
    #### Conditions for allowlisting tensor subclasses
    We allow tensor subclass types that
    (1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
    (2) Use the generic `tp_alloc`
    (3) Are in a module that *has been imported by the user*
    to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict
    
    The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`
    
    *Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution.
    
    The rationale for the 3 conditions above is as follows:
    
    The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)
    
    https://github.com/pytorch/pytorch/blob/4e66aaa01092ddc8822bbca315b673329c76f4cd/torch/_tensor.py#L57-L71
    
    `as_subclass` is implemented with a call to `THPVariable_NewWithVar`
    
    that will eventually call `tp_alloc` here
    https://github.com/pytorch/pytorch/blob/4e66aaa01092ddc8822bbca315b673329c76f4cd/torch/csrc/autograd/python_variable.cpp#L2053
    
    The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`
    
    **Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**
    
    ### How do we check something is a tensor subclass/constraints around imports
    
    In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`).
    
    This PR also allowlisted  `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)
    
    ### API for allow listing
    This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enable users to allowlist globals they have deemed safe and manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe).
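    
    A short usage sketch of the allowlisting API, assuming a user-defined subclass `MyTensor` (class and file names are illustrative):
    
    ```
    import torch
    
    class MyTensor(torch.Tensor):
        pass
    
    # Allowlist the subclass so weights_only=True loading can rebuild it.
    torch.serialization.add_safe_globals([MyTensor])
    
    t = torch.randn(2).as_subclass(MyTensor)
    torch.save(t, "checkpoint.pt")
    obj = torch.load("checkpoint.pt", weights_only=True)
    ```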
    
    Next steps:
    - Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)
    
    Pull Request resolved: pytorch#124331
    Approved by: https://github.com/albanD
    mikaylagawarecki authored and ZelboK committed May 19, 2024
  179. 39f5adb
  180. [quant][pt2e] Allow multi users without output observers (pytorch#126487

    )
    
    Summary: The PT2E quantization flow does not support unquantized
    outputs yet. To work around this, users may wish to remove the
    output observer from their graphs. However, this fails currently
    in some cases because the `PortNodeMetaForQDQ` pass is too
    restrictive, for example:
    
    ```
    conv -> obs -------> output0
             \\-> add -> output1
    ```
    
    Previously we expected conv to always have exactly 1 user,
    which is the observer. When the observer is removed, however,
    conv now has 2 users, and this fails the check.
    
    ```
    conv -------> output0
      \\-> add -> output1
    ```
    
    This commit relaxes the error into a warning to enable
    this workaround.
    
    Test Plan:
    python test/test_quantization.py TestQuantizePT2E.test_multi_users_without_output_observer
    
    Reviewers: jerryzh168
    
    Subscribers: jerryzh168, supriyar
    
    Differential Revision: [D57472601](https://our.internmc.facebook.com/intern/diff/D57472601)
    Pull Request resolved: pytorch#126487
    Approved by: https://github.com/tarun292
    andrewor14 authored and ZelboK committed May 19, 2024
  181. Add coms metadata to execution trace (ET) (pytorch#126317)

    Add Execution Trace communication collective metadata.
    For the specification, see pytorch#124674
    
    New fields look like
    ```
        {
          "id": 80, "name": "record_param_comms", "ctrl_deps": 79,
          "inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]},                             "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]},
          "attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""},
      {"name": "collective_name", "type": "string", "value": "allreduce"},
      {"name": "dtype", "type": "string", "value": "Float"},
      {"name": "in_msg_nelems", "type": "uint64", "value": 100},
      {"name": "out_msg_nelems", "type": "uint64", "value": 100},
      {"name": "in_split_size", "type": "string", "value": "[]"},
      {"name": "out_split_size", "type": "string", "value": "[]"},
      {"name": "global_rank_start", "type": "uint64", "value": 0},
      {"name": "global_rank_stride", "type": "uint64", "value": 1},
      {"name": "pg_name", "type": "string", "value": "0"},
      {"name": "pg_desc", "type": "string", "value": "default_pg"},
      {"name": "pg_size", "type": "uint64", "value": 2}]
     }
    ```
    
    ## Unit Test
    Added a new unit test to check that the collected execution trace has the right attributes
    
    `touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace`
    
    ```
    STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    [rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
    indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
    indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
    [rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
    STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    [rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
    STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
    [rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
    STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    Execution trace saved at /tmp/tmpy01ngc3w.et.json
    Execution trace saved at /tmp/tmptf8543k4.et.json
    ok
    
    ----------------------------------------------------------------------
    ```
    
    Also ran the profiler unit test
    `touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler`
    
    ```
    STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    [rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    Trace saved to /tmp/tmpdrw_cmcu.json
    Trace saved to /tmp/tmpnio7ec9j.json
    ok
    
    ----------------------------------------------------------------------
    Ran 1 test in 19.772s
    
    OK
    ```
    
    Pull Request resolved: pytorch#126317
    Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise
    briancoutinho authored and ZelboK committed May 19, 2024
  182. Revert "Remove redundant serialization code (pytorch#126249)"

    This reverts commit aab448e.
    
    Reverted pytorch#126249 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing sigmoid/frontend:serialization_test internally ([comment](pytorch#126249 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  183. Revert "Fix aarch64 debug build with GCC (pytorch#126290)"

    This reverts commit 91bf952.
    
    Reverted pytorch#126290 on behalf of https://github.com/huydhn due to There seems to be a mis-match closing curly bracket here and it breaks some internal build in D57474505 ([comment](pytorch#126290 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  184. Initial implementation of AdaRound (pytorch#126153)

    Summary:
    This is an implementation of AdaRound from a paper https://arxiv.org/abs/2004.10568
    
    This algorithm is going to be used by multiple people, hence we need make it official implementation.
    
    Differential Revision: D57227565
    
    Pull Request resolved: pytorch#126153
    Approved by: https://github.com/jerryzh168, https://github.com/huydhn
    kwanghoon-meta authored and ZelboK committed May 19, 2024
  185. [distributed] Add cpp-httplib to pytorch (pytorch#126470)

    Adds https://github.com/yhirose/cpp-httplib such that we are able to use https for host to host communication in distributed (specifically torchrun)
    
    Todo: We likely need to add cpp-httplib somewhere in the build (cmake/bazel) but first we should write the code for it.
    Pull Request resolved: pytorch#126470
    Approved by: https://github.com/d4l3k, https://github.com/Skylion007
    PaliC authored and ZelboK committed May 19, 2024
  186. [BE][Ez]: Use NotADirectoryError in tensorboard writer (pytorch#126534)

    Slightly improve exception typing for the tensorboard writer.
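    
    A minimal sketch of the kind of check this improves (illustrative only, not the writer's actual code):
    
    ```
    import os
    
    def _check_log_dir(path: str) -> None:
        # Raise the more specific NotADirectoryError when the log location
        # exists but is not a directory, instead of a generic error.
        if os.path.exists(path) and not os.path.isdir(path):
            raise NotADirectoryError(f"{path} exists but is not a directory")
    ```
    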
    Pull Request resolved: pytorch#126534
    Approved by: https://github.com/ezyang
    Skylion007 authored and ZelboK committed May 19, 2024
  187. Revert "[FSDP2] Fixed 2D clip grad norm test (pytorch#126497)"

    This reverts commit 3f28906.
    
    Reverted pytorch#126497 on behalf of https://github.com/jeanschmidt due to reverting to check if might have introduced inductor cuda 12 issues ([comment](pytorch#126497 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  188. [ROCm] enable faster_load_save for Fused_SGD (pytorch#125456)

    Reopened due to a rebase error. Fixes pytorch#117599
    
    The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, is passing with this PR
    
    HSA Async copy / host wait on completion signal is resolved in MultiTensorApply.cuh
    
    ```
    :4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
    :4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
    :3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
    ```
    
    Pull Request resolved: pytorch#125456
    Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
    petrex authored and ZelboK committed May 19, 2024
  189. Experimental prototype for converting torch.jit.trace modules to expo…

    …rt (pytorch#124449)
    
    Differential Revision: [D56440613](https://our.internmc.facebook.com/intern/diff/D56440613)
    
    We want to do this for the following reasons:
    1. There is a current limitation in export tracing for torch.jit.trace'd modules that cannot be easily upstreamed
    2. We need to run internal CI regularly to understand feature gaps and continuously track them
    3. Multiple people will be working on this prototype, so it is better to have a checked-in version so we don't always run into merge conflicts.
    
    Pull Request resolved: pytorch#124449
    Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
    tugsbayasgalan authored and ZelboK committed May 19, 2024
  190. a1245dd
  191. [AOTI] config target platform (pytorch#126306)

    Test Plan: AOTI compile stories15M for Android
    
    Differential Revision: D57392830
    
    Pull Request resolved: pytorch#126306
    Approved by: https://github.com/desertfire
    manuelcandales authored and ZelboK committed May 19, 2024
  192. Fix issue of lowering nn.linear ops with kwargs (pytorch#126331)

    Summary: Support kwarg bias for nn.linear quantization
    
    Differential Revision: D57403190
    
    Pull Request resolved: pytorch#126331
    Approved by: https://github.com/ZhengkaiZ, https://github.com/huydhn
    yihanhemeta authored and ZelboK committed May 19, 2024
  193. [inductor] Load python modules using importlib (pytorch#126454)

    The `compile` + `exec` workflow is susceptible to behavior drifting from
    a "normal" import; use importlib instead to avoid this.
    
    In particular, here annotations were being stored as strings due to
    `from __future__ import annotations` in the scope calling `compile`.
    Triton cares about annotations on global variables, and this makes it
    much easier to reliably code-gen them.
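    
    A minimal sketch of loading generated source through the import machinery rather than `compile` + `exec` (the module name, path, and contents are illustrative):
    
    ```
    import importlib.util, pathlib
    
    # Write a tiny "generated" module, then load it through the import
    # machinery so module-level semantics (e.g. how annotations are
    # evaluated) match a normal import rather than compile()/exec().
    src = pathlib.Path("generated_kernel.py")
    src.write_text("x: int = 1\n")
    
    spec = importlib.util.spec_from_file_location("generated_kernel", src)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    print(mod.__annotations__)  # {'x': <class 'int'>}
    ```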
    
    Pull Request resolved: pytorch#126454
    Approved by: https://github.com/peterbell10
    amjames authored and ZelboK committed May 19, 2024
  194. edbd215
  195. Added error checks for invalid inputs on thnn_conv2d (pytorch#121906)

    Fixes pytorch#121188
    Prevent Segmentation Fault in 'torch._C._nn.thnn_conv2d'
    
    Previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using `TORCH_CHECK`). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format.
    
    Additionally, this commit includes tests to cover the three referenced cases.
    
    Pull Request resolved: pytorch#121906
    Approved by: https://github.com/janeyx99
    Martim03 authored and ZelboK committed May 19, 2024
  196. Fix aarch64 debug build with GCC (pytorch#126290)

    By working around GCC's quirks in instantiating templates that require immediate values.
    Provides an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).
    
    Test plan (after change was reverted): ssh into aarch64 runner and rebuild given file with `-O0`
    
    Fixes pytorch#126283
    
    Pull Request resolved: pytorch#126290
    Approved by: https://github.com/atalman, https://github.com/seemethere
    malfet authored and ZelboK committed May 19, 2024
  197. Remove dist_ prefix from TORCH_LOGS shortcuts (pytorch#126499)

    e.g. dist_ddp -> ddp
    
    'distributed' shortcut remains unchanged
    
    Feedback has been that it is not appealing to have the dist_ prefix,
    and the main reason for it was to keep the distributed shortcuts grouped
    together in the help menu.  It's nice to have shorter shortcuts.
    Pull Request resolved: pytorch#126499
    Approved by: https://github.com/XilunWu, https://github.com/kwen2501
    ghstack dependencies: pytorch#126322
    wconstab authored and ZelboK committed May 19, 2024
  198. Tool for scouting exportability in one shot (pytorch#126471)

    Summary:
    Tool for scouting exportability issues in one shot.
    
    - Collect sample inputs for all submodules by running eager inference with a forward_pre_hook (see the sketch after this list).
    - Start from the root module and recursively try exporting child modules if the current module's export fails.
    
    Limitations:
    - Only works for an nn.Module that contains a tree-like submodule structure; this doesn't work for a flattened GraphModule.
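    
    A standalone sketch of the sample-input collection idea from the first bullet (toy model; not the tool's actual code):
    
    ```
    import torch
    import torch.nn as nn
    
    samples = {}
    
    def make_hook(name):
        def hook(module, args):
            # Record the positional inputs the first time each submodule runs.
            samples.setdefault(name, args)
        return hook
    
    model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
    handles = [m.register_forward_pre_hook(make_hook(n)) for n, m in model.named_modules()]
    model(torch.randn(2, 4))
    for h in handles:
        h.remove()
    ```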
    
    TODO: support dynamic_dims
    
    Sample output: https://docs.google.com/spreadsheets/d/1jnixrqBTYbWO_y6AaKA13XqOZmeB1MQAMuWL30dGoOg/edit?usp=sharing
    
    ```
    exportability_report =
            {
                '': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
                'submod_1': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
                'submod_2': None
            }
    ```
    
    Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestExportTools
    
    Differential Revision: D57466486
    
    Pull Request resolved: pytorch#126471
    Approved by: https://github.com/zhxchen17
    SherlockNoMad authored and ZelboK committed May 19, 2024
  199. [torch-distributed] Make log directory creation idempotent (pytorch#1…

    …26496)
    
    Summary:
    https://docs.python.org/3/library/os.html#os.makedirs
    > If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists.
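    
    A one-line sketch of the idempotent behavior this relies on (the path is illustrative):
    
    ```
    import os
    
    # Creating the log directory is now a no-op if it already exists,
    # instead of raising FileExistsError.
    os.makedirs("/tmp/torchelastic/logs", exist_ok=True)
    ```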
    
    Test Plan: Existing tests
    
    Differential Revision: D57471577
    
    Pull Request resolved: pytorch#126496
    Approved by: https://github.com/d4l3k
    ktsiam authored and ZelboK committed May 19, 2024
  200. [AOTI] Flag to include aoti sources when building lite interpreter (p…

    …ytorch#126572)
    
    Summary:
    Added a USE_LITE_AOTI cmake flag, which is turned OFF by default.
    When it is turned on, the AOTI sources (inductor_core_resources) are included when building the lite interpreter.
    
    Test Plan:
    ```
    ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON
    ```
    
    Differential Revision: D57394078
    
    Pull Request resolved: pytorch#126572
    Approved by: https://github.com/malfet
    manuelcandales authored and ZelboK committed May 19, 2024
  201. [Pipelining] Fix 1f1b schedule (pytorch#126419)

    This schedule was running fine locally but failing (hanging) on CI.
    
    After analysis (https://fburl.com/gdoc/xt80h1gd), it seems like the
    schedule was not correct previously but may still work depending on the
    runtime.
    
    The fix bundles together fwd-recv(s->s+1) and bwd-send(s+1->s) into one
    coalesced group so they would not block each other.
    
    Design drawing
    <img width="803" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/906a9a66-39ae-4a6a-bc1a-18b77eaaa784">
    
    Flight recorder traces show the same coalescing pattern as designed
    <img width="1013" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/ab10646e-eaef-4191-83dd-73f448876c27">
    
    Pull Request resolved: pytorch#126419
    Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
    wconstab authored and ZelboK committed May 19, 2024
  202. b6caa15
  203. gitmodules: switch cpp-httplib to https (pytorch#126580)

    Fixes issue introduced in pytorch#126470 (comment)
    
    Test plan:
    
    CI
    Pull Request resolved: pytorch#126580
    Approved by: https://github.com/PaliC, https://github.com/jeffdaily
    d4l3k authored and ZelboK committed May 19, 2024
  204. [pipelining] Follow improvements in export.unflatten (pytorch#126217)

    Previously, we made a copy of `torch.export.unflatten` in pippy/_unflatten.py.
    
    But it turns out to be too hard to track bug fixes and improvements in the upstream version. For example, `torch.export.unflatten` recently added support for tied parameters, which is something pipelining needs.
    
    Now that we have moved into pytorch, we reference `torch.export.unflatten` instead of maintaining a copy.
    
    Pull Request resolved: pytorch#126217
    Approved by: https://github.com/H-Huang
    kwen2501 authored and ZelboK committed May 19, 2024
  205. [Submodule] Remove third-party CUB (pytorch#126540)

    Because it was last updated 4 years ago, and all supported CUDA versions now provide CUB.
    
    Pull Request resolved: pytorch#126540
    Approved by: https://github.com/Skylion007
    cyyever authored and ZelboK committed May 19, 2024
  206. [halide-backend] Refactor codegen/triton.py into codegen/simd.py (pyt…

    …orch#126415)
    
    This PR is primarily just moving stuff around. It creates a new
    common base class for TritonCodegen and the (upcoming) HalideCodegen.
    
    Pull Request resolved: pytorch#126415
    Approved by: https://github.com/shunting314
    jansel authored and ZelboK committed May 19, 2024
  207. Faster(?) FP16 gemv kernel (pytorch#126297)

    Differential Revision: [D57369266](https://our.internmc.facebook.com/intern/diff/D57369266/)
    
    **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D57369266/)!
    Pull Request resolved: pytorch#126297
    Approved by: https://github.com/malfet
    swolchok authored and ZelboK committed May 19, 2024
  208. [2/N] Non-Tensor: Scalar Support: Add scalar to the cache for eager-t…

    …hrough-torch.compile (pytorch#124070)
    
    Add scalar information to the kernel configuration.
    
    #### Additional Context
    Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b` and then `alpha`. The same order is used for cache loading.
    
    However, the orchestration mechanism does not support kwargs because the order of kwargs is not meaningful. For example, the `out` of `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)` may come before `approximate`. We will support this in subsequent PRs.
    
    Pull Request resolved: pytorch#124070
    Approved by: https://github.com/jansel, https://github.com/jgong5
    EikanWang authored and ZelboK committed May 19, 2024
  209. Map float8 types to uint8 for allgather (pytorch#126556)

    # Summary
    Different take on this one:
    pytorch#126338
    
    We should probably not allow this mapping for 'compute' ops e.g. reductions
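    
    An illustrative sketch of the mapping idea (the helper name is made up; only byte-moving collectives are assumed):
    
    ```
    import torch
    
    def view_for_allgather(t: torch.Tensor) -> torch.Tensor:
        # float8 elements are single bytes, so reinterpreting them as uint8 is
        # safe for collectives that only move data (e.g. all_gather), but not
        # for reductions, which would do arithmetic on the wrong type.
        if t.dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
            return t.view(torch.uint8)
        return t
    ```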
    
    ### Corresponding fp8 PR
    pytorch-labs/float8_experimental#263
    
    Pull Request resolved: pytorch#126556
    Approved by: https://github.com/wanchaol
    drisspg authored and ZelboK committed May 19, 2024
  210. [Traceable FSDP2] Change from register_multi_grad_hook to per-tensor …

    …backward hook (pytorch#126350)
    
    As discussed with Andrew before, under compile we will register a per-tensor backward hook instead of a multi-grad hook, because it's difficult for Dynamo to support `register_multi_grad_hook` (or anything `.grad_fn` related). We expect both to have the same underlying behavior, ~~and we will add integration test (in subsequent PR) to show that compile and eager has same numerics.~~
    
    As discussed below, we will change the eager path to use per-tensor backward hooks as well.
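    
    A small standalone sketch contrasting the two hook styles (toy tensors, not the FSDP2 code):
    
    ```
    import torch
    
    x = torch.randn(3, requires_grad=True)
    y = torch.randn(3, requires_grad=True)
    
    # Per-tensor hooks: fire individually as each gradient is computed.
    x.register_hook(lambda grad: grad)
    y.register_hook(lambda grad: grad)
    
    # Multi-grad hook: fires once after all listed tensors have gradients.
    torch.autograd.graph.register_multi_grad_hook((x, y), lambda grads: None)
    
    (x.sum() + y.sum()).backward()
    ```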
    
    Pull Request resolved: pytorch#126350
    Approved by: https://github.com/awgu
    yf225 authored and ZelboK committed May 19, 2024
  211. b10f3dd
  212. Refactor variables / function names related to non-strict export (pyt…

    …orch#126458)
    
    Improve variable and function naming for better clarity: `non strict` --> `aten`.
    Pull Request resolved: pytorch#126458
    Approved by: https://github.com/angelayi
    jiashenC authored and ZelboK committed May 19, 2024
  213. Updated test_torch.py to use new OptimizerInfo infrastructure (pytorc…

    …h#125538)
    
    Fixes pytorch#123451 (only addresses test_torch.py cases)
    
    This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure.
    
    I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.
    
    ```
    $ lintrunner test/test_cuda.py
    ok No lint issues.
    ```
    
    Pull Request resolved: pytorch#125538
    Approved by: https://github.com/janeyx99
    gambiTarun authored and ZelboK committed May 19, 2024
  214. Forward fix the failed new test from D57474327 (pytorch#126596)

    Summary: TSIA. The two look the same to me, but buck was failing with the following error when `with torch._inductor.utils.fresh_inductor_cache()` is used:
    
    ```
    _________________________ ReproTests.test_issue126128 __________________________
    
    self = <caffe2.test.dynamo.test_repros.ReproTests testMethod=test_issue126128>
    
        def test_issue126128(self):
            def fn():
                x = torch.randn(1, 10)
                y = torch.randn(10, 1)
                return torch.mm(x, y).sum()
    
            def fn2():
                x = torch.randn(10, 100)
                y = torch.randn(100, 10)
                return torch.mm(x, y).sum()
    
    >       with torch._inductor.utils.fresh_inductor_cache():
    E       AttributeError: module 'torch._inductor' has no attribute 'utils'
    ```
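    
    One way to avoid that AttributeError (a hedged sketch of the general pattern, not necessarily the exact forward fix) is to import the submodule explicitly instead of relying on attribute access:
    
    ```
    from torch._inductor.utils import fresh_inductor_cache
    
    # Importing the submodule directly guarantees it is loaded, regardless of
    # whether torch._inductor has already imported .utils as a side effect.
    with fresh_inductor_cache():
        pass
    ```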
    
    Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_issue126128'`
    
    Differential Revision: D57516676
    
    Pull Request resolved: pytorch#126596
    Approved by: https://github.com/xmfan
    huydhn authored and ZelboK committed May 19, 2024
  215. Cached required_fw_nodes creation (pytorch#126613)

    Pull Request resolved: pytorch#126613
    Approved by: https://github.com/anijain2305
    Chillee authored and ZelboK committed May 19, 2024
  216. Revert "[Dynamo] Treat integers stored on nn.Modules as dynamic (pyto…

    …rch#126466)"
    
    This reverts commit 6bb9d60.
    
    Reverted pytorch#126466 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the ONNX test failure looks legit, not flaky, as it starts failing in trunk https://hud.pytorch.org/pytorch/pytorch/commit/6bb9d6080d33c817fcbf9e5ae8a59b76812a53d2 ([comment](pytorch#126466 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  217. Remove unnecessary implementations from MockHandler (pytorch#126511)

    Dead implementations are confusing and can cause bugs when people
    accidentally hit them. Better for them to be missing.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126511
    Approved by: https://github.com/peterbell10, https://github.com/lezcano
    ezyang authored and ZelboK committed May 19, 2024
  218. UFMT torch.utils._sympy.functions (pytorch#126553)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126553
    Approved by: https://github.com/lezcano, https://github.com/Skylion007
    ghstack dependencies: pytorch#126511
    ezyang authored and ZelboK committed May 19, 2024
  219. Update hf_BirdBird periodic-dynamo-benchmarks results (pytorch#126414)

    Can't repro this regression. Also, nothing in the faulty PR range would cause it only for 1 model. The job is still causing noise, so we should mute it. I think just updating the graph break count is better than skipping the model here since it's still passing.
    
    Pull Request resolved: pytorch#126414
    Approved by: https://github.com/ezyang
    xmfan authored and ZelboK committed May 19, 2024
  220. Replace torch.library.impl_abstract with torch.library.register_fake (p…

    …ytorch#126606)
    
    To remove the disruptive warning:
    ```
          warnings.warn("torch.library.impl_abstract was renamed to "
                        "torch.library.register_fake. Please use that instead; "
                        "we will remove torch.library.impl_abstract in a future "
                        "version of PyTorch.",
                        DeprecationWarning, stacklevel=2)
    ```
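    
    For reference, a minimal usage sketch of the replacement API (the custom op `mylib::my_op` is a toy definition for illustration):
    
    ```
    import torch
    
    # Define a toy custom op, then register its fake/meta implementation with
    # the new API name.
    torch.library.define("mylib::my_op", "(Tensor x) -> Tensor")
    
    @torch.library.register_fake("mylib::my_op")
    def _(x):
        # Describe the output (shape/dtype) without computing it.
        return torch.empty_like(x)
    ```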
    
    Pull Request resolved: pytorch#126606
    Approved by: https://github.com/ezyang
    cyyever authored and ZelboK committed May 19, 2024