Allow linalg.lstsq to use svd to compute the result for rank deficient matrices. #125110

Closed
wants to merge 237 commits

Commits on Apr 28, 2024

  1. Add logic for lstsq to be able to use the SVD driver as a backend for when matrices are rank deficient.
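
    A minimal usage illustration (not this PR's new code path) of how an SVD-based LAPACK driver already handles a rank-deficient system on CPU via `torch.linalg.lstsq`:

    ```python
    import torch

    A = torch.randn(6, 4, dtype=torch.float64)
    A[:, 3] = A[:, 0] + A[:, 1]                  # make A rank deficient (rank 3)
    B = torch.randn(6, 1, dtype=torch.float64)

    # "gelsd"/"gelss" are SVD-based drivers and cope with rank-deficient A
    sol = torch.linalg.lstsq(A, B, driver="gelsd")
    print(sol.rank)                              # reported numerical rank: 3
    ```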
    ZelboK committed Apr 28, 2024
    7372645
  2. Formatting.

    ZelboK committed Apr 28, 2024
    99e7cfb
  3. run lintrunner -a

    ZelboK committed Apr 28, 2024
    e0fec86
  4. Update aten/src/ATen/native/BatchLinearAlgebra.cpp

    Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
    ZelboK and lezcano committed Apr 28, 2024
    bb20952
  5. Address comments. Clean up use of zeros and utilize higher level function linalg_svd for computation.
    ZelboK committed Apr 28, 2024
    b6d6086
  6. 755e7d9
  7. Formatting.

    ZelboK committed Apr 28, 2024
    6e8b3fd
  8. c71e504

Commits on Apr 29, 2024

  1. d5b0174
  2. da81459
  3. Set rank for svd workflow.

    ZelboK committed Apr 29, 2024
    c856b9e

Commits on Apr 30, 2024

  1. Update aten/src/ATen/native/BatchLinearAlgebra.cpp

    Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
    ZelboK and lezcano committed Apr 30, 2024
    de502bc

Commits on May 1, 2024

  1. 3006f30

Commits on May 13, 2024

  1. 428f02a

Commits on May 14, 2024

  1. 4eab1c3
  2. fec9793
  3. lint

    ZelboK committed May 14, 2024
    489afbe

Commits on May 19, 2024

  1. [export] handle aliased/unused params for unflattening (pytorch#125758)

    Aliased and unused params are currently an issue for strict-mode export. For a model like this:
    ```
    def __init__(self):
        # ...
        self.alpha = nn.Parameter(torch.randn(4))
        self.beta = self.alpha
        self.gamma = self.alpha
    def forward(self, x):
        return x + self.beta
    ```
    Dynamo will trace only 1 parameter (beta) and assign a dynamo name (e.g. `L__self___beta`) which can be difficult to match to the correct FQN in the original eager module. This leads to export graph signature potentially having the incorrect target FQN for the parameter, leading to downstream issues unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module).
    
    This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they're unused. Still, only the used tensors will appear in the graph's forward pass.
    
    Another existing issue is that weight sharing is not maintained in unflattening (all params/buffers are re-cloned); handle this by checking tensor ids too.
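
    A small hypothetical sketch of detecting such aliasing by object identity (illustrative only, not the actual export code):

    ```python
    import torch
    import torch.nn as nn

    class M(nn.Module):
        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.randn(4))
            self.beta = self.alpha          # aliased parameter

        def forward(self, x):
            return x + self.beta

    aliases = {}
    for name, p in M().named_parameters(remove_duplicate=False):
        aliases.setdefault(id(p), []).append(name)
    print([names for names in aliases.values() if len(names) > 1])  # [['alpha', 'beta']]
    ```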
    Pull Request resolved: pytorch#125758
    Approved by: https://github.com/zhxchen17
    pianpwk authored and ZelboK committed May 19, 2024
    93d2573
  2. Enable epilogue fusion benchmarking internally (pytorch#125455)

    Differential Revision: [D56920738](https://our.internmc.facebook.com/intern/diff/D56920738)
    Pull Request resolved: pytorch#125455
    Approved by: https://github.com/Chillee
    eellison authored and ZelboK committed May 19, 2024
    4f024c8
  3. Fanatically correct real tensor cloning for propagate_real_tensors (pytorch#126175)
    
    Internal xref:
    https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/
    
    Previously I did it in a crappy way using clone_input in the callback,
    but this results in tensors that don't have quite the same
    size/stride/storage offset and there was an internal test case where
    not having completely accurate information was causing a downstream
    problem in propagation.  So now I make real tensors as similar to their
    fake equivalents as much as possible.  Though... I don't bother with
    autograd lol.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126175
    Approved by: https://github.com/albanD
    ezyang authored and ZelboK committed May 19, 2024
    a2e8b90
  4. [reland][dynamo][disable] Move disable impl to its own __call__ method (pytorch#126191)
    
    Pull Request resolved: pytorch#126191
    Approved by: https://github.com/yoyoyocmu, https://github.com/yanboliang, https://github.com/fegin
    anijain2305 authored and ZelboK committed May 19, 2024
    10b10f2
  5. [easy][dynamo] Use disable_dynamo for torch.manual_seed (pytorch#126192)

    Pull Request resolved: pytorch#126192
    Approved by: https://github.com/yanboliang
    ghstack dependencies: pytorch#126191
    anijain2305 authored and ZelboK committed May 19, 2024
    f209865
  6. Revert "[inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)"

    This reverts commit 037615b.
    
    Reverted pytorch#124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](pytorch#124021 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    a95b7e9
  7. 50b88b0
  8. Remove use of USE_C10D (pytorch#126120)

    As per https://github.com/pytorch/pytorch/blob/main/torch/CMakeLists.txt#L271, USE_DISTRIBUTED and USE_C10D are equivalent. I was cleaning up this usage in another PR, so it is also cleaned up here.
    
    Pull Request resolved: pytorch#126120
    Approved by: https://github.com/aaronenyeshi
    briancoutinho authored and ZelboK committed May 19, 2024
    37f84cb
  9. [torch/distributed] Bugfix: wait for all child procs to exit before c… (pytorch#125969)
    
    Observed Problem
    ---------------------
    
    When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully.
    
    This results in misleading warning log messages towards the end of the job like the one below:
    
    ```
    W0510 14:52:48.185934  672413 api.py:513] Closing process 675171 via signal SIGTERM
    W0510 14:52:48.185984  672413 api.py:513] Closing process 675172 via signal SIGTERM
    W0510 14:52:48.186013  672413 api.py:513] Closing process 675174 via signal SIGTERM
    # <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->
    
    I0510 14:52:48.229119  672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    I0510 14:52:48.229161  672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
    I0510 14:52:48.229395  672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
    I0510 14:52:48.257544  672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
    I0510 14:52:48.568198  672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
    I0510 14:52:48.568989  672413 distributed.py:202] Finished running `main`
    ```
    
    Root Cause
    ------------------
    
    I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`.
    
    `torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for **at-least-one** child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`.
    
    `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited.
    
    Fix
    ---------
    
    The fix is simple: keep looping, continuing to call `pc.join()`, until it returns `True`.
    
    > **NOTE**: that the indefinite blocking is NOT an issue since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it already did all the checking to validate that the entrypoint functions either return successfully or that one of them has failed. So we are really just waiting for the unix process to exit after running the entrypoint function.
    
    > **NOTE**: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop and the debug logging will show at most `nproc_per_node` times so no log spamming is observed.
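
    A minimal sketch of that join-until-done loop using the public `torch.multiprocessing` API (illustrative only; the actual fix lives in `torch.distributed.elastic.multiprocessing.api`):

    ```python
    import torch.multiprocessing as mp

    def trainer(rank):
        pass  # the entrypoint/user function

    if __name__ == "__main__":
        pc = mp.spawn(trainer, nprocs=4, join=False)  # returns a ProcessContext
        # pc.join() returns False while only some children have exited,
        # so keep calling it until every child process is gone.
        while not pc.join():
            pass
    ```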
    
    Pull Request resolved: pytorch#125969
    Approved by: https://github.com/d4l3k
    kiukchung authored and ZelboK committed May 19, 2024
    00b9974
  10. Allow for trailing 'a' in sm_arch (pytorch#126185)

    # Summary
    I was getting
    ``` Shell
    File "/home/drisspg/meta/pytorch/torch/cuda/__init__.py", line 312, in _lazy_init
        raise DeferredCudaCallError(msg) from e
    torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: invalid literal for int() with base 10: '90a'
    ```
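
    A hedged illustration of parsing a capability string that may carry a trailing letter such as `90a` (`parse_sm_arch` is a hypothetical helper, not the code this PR touches):

    ```python
    def parse_sm_arch(arch: str) -> int:
        """Accept plain '86' as well as suffixed forms like '90a'."""
        return int(arch[:-1]) if arch and arch[-1].isalpha() else int(arch)

    assert parse_sm_arch("90a") == 90
    assert parse_sm_arch("86") == 86
    ```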
    
    Pull Request resolved: pytorch#126185
    Approved by: https://github.com/Skylion007
    drisspg authored and ZelboK committed May 19, 2024
    1dfe2d1
  11. [pipelining] Add manual pipeline stage (pytorch#126123)

    Add `ManualPipelineStage` under `_PipelineStage.py`
    
    Fix some type hints since `args_recv_info` can contain more than one RecvInfo. Previously the hint was `Tuple[InputInfo]` which meant it is a tuple of size 1. This is different from `List[InputInfo]` which can contain any number of items. I needed to update to `Tuple[InputInfo, ...]` to make the number of items flexible.
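
    For illustration, the difference between the two hints (`InputInfo` here stands in for the actual class):

    ```python
    from typing import Tuple

    class InputInfo: ...

    exactly_one: Tuple[InputInfo]       # a tuple containing exactly one InputInfo
    any_length: Tuple[InputInfo, ...]   # a tuple containing any number of InputInfo items
    ```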
    
    Pull Request resolved: pytorch#126123
    Approved by: https://github.com/kwen2501
    H-Huang authored and ZelboK committed May 19, 2024
    ed27236
  12. Refactor make_fx to better support hop subgraph tracing (pytorch#125267)

    Code movement + minor rewrites. We extract the states of make_fx out and encapsulate them into a _MakefxTracer class. This allows us to create a new make_fx_tracer when tracing subgraphs, the actual logic for tracing subgraph is in the next diff.
    
    Test Plan:
    Existing tests.
    
    Pull Request resolved: pytorch#125267
    Approved by: https://github.com/Chillee
    ydwu4 authored and ZelboK committed May 19, 2024
    636ea1c
  13. Support trace_subgraph in _MakefxTracer (pytorch#125363)

    Adds trace_subgraph to _MakefxTracer; the motivation is in pytorch#122972. Also migrates all existing usage of reenter_make_fx to the new sub-tracer. Previously, the torch function mode for creating torch_fn metadata wouldn't be re-entered when we're in ProxyTensorMode (since it's inside of __torch_function__). This PR reconstructs the torch function mode based on the parent tracer's config and re-enters it so the metadata is shown in the graph.
    
    **Test Plan:**
    Existing tests. We have a bunch of make_fx tests for cond, map and while_loop. Also remove expected failure for torch_fn since reenter_make_fx is able to re-construct torch function modes.
    
    Also fixes pytorch#124643
    
    Pull Request resolved: pytorch#125363
    Approved by: https://github.com/Chillee
    ghstack dependencies: pytorch#125267
    ydwu4 authored and ZelboK committed May 19, 2024
    a745003
  14. 976f0f2
  15. Set dtype when copying empty tensor (pytorch#126124)

    Summary: Forward fix D57251348
    
    Test Plan: `buck2 test 'fbcode//mode/dev' fbcode//executorch/kernels/test:aten_op_copy_test`
    
    Differential Revision: D57304360
    
    Pull Request resolved: pytorch#126124
    Approved by: https://github.com/bdhirsh
    huydhn authored and ZelboK committed May 19, 2024
    b3f0fce
  16. [BE] Abstract out strings to top of file (pytorch#125640)

    Summary:
    Move const strings to the top of the file. This is in preparation for tooling to
    make use of shared constants (e.g. version string). A non-functional change.
    Ideally we want these const strings to be available from both C++ and Python - but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change.
    
    Test Plan:
    python test/distributed/test_c10d_nccl.py NCCLTraceTest
    
    Pull Request resolved: pytorch#125640
    Approved by: https://github.com/wconstab
    c-p-i-o authored and ZelboK committed May 19, 2024
    aa17484
  17. [Inductor] Flex attention supports dynamic shape (pytorch#125994)

    ## static shapes perf
    ```
    | Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
    |---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
    | Average |     0.692 |              |             |             |             |            |             |                |
    | Max     |     0.855 |           16 |          16 |        4096 |        4096 |         64 | head_bias   | torch.bfloat16 |
    | Min     |     0.419 |            8 |          16 |         512 |         512 |        256 | noop        | torch.bfloat16 |
    ```
    ## dynamic shapes perf
    ```
    | Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
    |---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
    | Average |     0.670 |              |             |             |             |            |               |                |
    | Max     |     0.864 |           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |
    | Min     |     0.376 |            8 |          16 |         512 |         512 |        256 | relative_bias | torch.bfloat16 |
    ```
    
    Pull Request resolved: pytorch#125994
    Approved by: https://github.com/Chillee
    yanboliang authored and ZelboK committed May 19, 2024
    685b207
  18. Add missing type uint16, uint32, and uint64 to TensorHash in LTC. (pytorch#125972)
    
    If I do:
    
    ```
    xla_device = xm.xla_device()
    xla_tensor_0 = torch.tensor(42, dtype=torch.uint32).to(xla_device)
    ```
    
    I got the error:
    
    ```
    RuntimeError: false INTERNAL ASSERT FAILED at "/ansible/pytorch/torch/csrc/lazy/core/hash.h":139, please report a bug to PyTorch. Unsupported scalar type:UInt16
    ```
    
    This PR intends to fix this issue.
    The data type can be found in pytorch/c10/core/ScalarType.h.
    Pull Request resolved: pytorch#125972
    Approved by: https://github.com/JackCaoG
    vanbasten23 authored and ZelboK committed May 19, 2024
    b959b4f
  19. Add some type annotations to python stream and event classes (pytorch#126171)
    
    For recent device agnostic code changes, we need type hinting on the parent classes for better tooling support.
    
    Pull Request resolved: pytorch#126171
    Approved by: https://github.com/ezyang
    cyyever authored and ZelboK committed May 19, 2024
    86d560a
  20. 074173b
  21. [Inductor] Skip test_nll_loss_backward for intel GPU. (pytorch#126157)

    Skip this test case due to behavior of Triton `mask_load` that is not aligned with CUDA. We submitted issue pytorch#126173 to elaborate on the root cause. We intend to skip this case for XPU first, as we need some time to fix the issue and complete full validation before updating the Triton commit pin for Intel GPU.
    
    Pull Request resolved: pytorch#126157
    Approved by: https://github.com/EikanWang, https://github.com/peterbell10, https://github.com/desertfire
    etaf authored and ZelboK committed May 19, 2024
    d0688dd
  22. a749763
  23. 11aea9e
  24. Adjust number of repeats when using --warm-start-latency benchmark flag (pytorch#125917)
    
    Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat
    
    Pull Request resolved: pytorch#125917
    Approved by: https://github.com/desertfire
    masnesral authored and ZelboK committed May 19, 2024
    32fdb75
  25. [benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (pytorch#125953)
    
    Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.
    
    Pull Request resolved: pytorch#125953
    Approved by: https://github.com/desertfire
    ghstack dependencies: pytorch#125917
    masnesral authored and ZelboK committed May 19, 2024
    38e2661
  26. Add a few "warm start" smoketest runs to CI (pytorch#125955)

    Summary:
    Not sure which to choose, so my criteria was:
    1) We care about huggingface as part of internal milestones
    2) This handful of models seems to particularly benefit from caching
    Pull Request resolved: pytorch#125955
    Approved by: https://github.com/desertfire
    ghstack dependencies: pytorch#125917, pytorch#125953
    masnesral authored and ZelboK committed May 19, 2024
    bc9f57b
  27. [audio hash update] update the pinned audio hash (pytorch#126248)

    This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
    Update the pinned audio hash.
    Pull Request resolved: pytorch#126248
    Approved by: https://github.com/pytorchbot
    pytorchupdatebot authored and ZelboK committed May 19, 2024
    ce7a832
  28. Add force_disable_caches to the docs (pytorch#126184)

    Pull Request resolved: pytorch#126184
    Approved by: https://github.com/msaroufim
    oulgen authored and ZelboK committed May 19, 2024
    3512895
  29. [inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)

    This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC pytorch#125683 for more background info.
    1. Cpp template infrastructure
    Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
    2. Initial FP32 gemm template
    This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
    3. Correctness and performance
    The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed for a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are the details.
    
    Static shapes
    | Benchmark | torchbench | huggingface | timm_models |
    |------------|-------------|--------------|--------------|
    | Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
    | Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
    | Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
    
    Key models being sped up:
    drq: 1.14x
    soft_act: 1.12x
    cait_m36_384: 1.18x
    
    Dynamic shapes
    | Benchmark | torchbench | huggingface | timm_models |
    | --- | --- | --- | --- |
    | Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
    | Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
    | Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
    
    Key models being sped up:
    BERT_pytorch: 1.22x
    pyhpc_turbulent: 1.13x
    soft_actor_critic: 1.77x
    BlenderbotForCausalLM: 1.09x
    cait_m36_384: 1.17x
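
    A hedged sketch of how a user would exercise this path from Python; whether the CPP GEMM template is actually selected depends on Inductor configuration not shown here:

    ```python
    import torch

    linear = torch.nn.Linear(256, 256)                   # constant weight, as in the template's setting
    compiled = torch.compile(linear, mode="max-autotune")
    out = compiled(torch.randn(32, 256))                  # M (batch dim) may be static or dynamic
    ```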
    
    Pull Request resolved: pytorch#124021
    Approved by: https://github.com/jansel
    jgong5 authored and ZelboK committed May 19, 2024
    b743f89
  30. [CUDA] [CI] Add cu124 docker images (pytorch#125944)

    Fixes issues encountered in pytorch#121956
    
    Pull Request resolved: pytorch#125944
    Approved by: https://github.com/atalman
    nWEIdia authored and ZelboK committed May 19, 2024
    170380e
  31. Don't assert about pending when we are peeking (pytorch#126239)

    Internal xref https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/
    
    In particular, when we're collecting forward metadata, we aren't going
    to discharge any of the pending, so we'll be continuously collecting
    more and more pending symbols that we may not be able to resolve.  This
    is fine.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126239
    Approved by: https://github.com/lezcano
    ezyang authored and ZelboK committed May 19, 2024
    f33cc7a
  32. [AOTI][torchgen] Update NativeFunctionsGroup mapping (pytorch#125962)

    Summary: When looking up for what backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. Previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, and that's why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.
    Pull Request resolved: pytorch#125962
    Approved by: https://github.com/chenyang78
    desertfire authored and ZelboK committed May 19, 2024
    8dc8ae9
  33. [AOTI][torchgen] Add a few more fallback ops (pytorch#126013)

    Summary: They appear in some unit tests.
    
    Pull Request resolved: pytorch#126013
    Approved by: https://github.com/chenyang78
    ghstack dependencies: pytorch#125962
    desertfire authored and ZelboK committed May 19, 2024
    5bc525c
  34. [Memory Snapshot] Add recordAnnotations to capture record_function annotations (pytorch#124179)
    
    Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.
    
    Test Plan:
    CI
    
    New Snapshot Generated:
    devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle
    
    Snippet of Snapshot device_traces show `ProfilerStep#0`, and `## forward ##` annotations:
    ```
    [[{'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168556,
       'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]},
      {'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168738,
       'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]},
      {'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168865,
       'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]},
      {'action': 'user_defined',
       'addr': 0,
       'size': 0,
       'stream': 0,
       'time_us': 1713558427168920,
       'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]},
      {'action': 'alloc',
       'addr': 140166073581568,
       'size': 3211264,
       'stream': 0,
       'time_us': 1713558427172978,
       'frames': [{'name': '_conv_forward',
         'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv
    ```
    
    Differential Revision: D55941362
    
    Pulled By: aaronenyeshi
    
    Pull Request resolved: pytorch#124179
    Approved by: https://github.com/zdevito
    aaronenyeshi authored and ZelboK committed May 19, 2024
    4e3dfb0
  35. Enable UFMT on test/test_fake_tensor.py, `test/test_flop_counter.py` and some files (pytorch#125747)
    
    Part of: pytorch#123062
    
    Ran lintrunner on:
    
    - test/test_fake_tensor.py
    - test/test_flop_counter.py
    - test/test_function_schema.py
    - test/test_functional_autograd_benchmark.py
    - test/test_functional_optim.py
    - test/test_functionalization_of_rng_ops.py
    
    Detail:
    
    ```bash
    $ lintrunner -a --take UFMT --all-files
    ok No lint issues.
    Successfully applied all patches.
    ```
    
    Pull Request resolved: pytorch#125747
    Approved by: https://github.com/malfet
    shink authored and ZelboK committed May 19, 2024
    68c29aa
  36. [Inductor] Generalize new introduced device-bias code. (pytorch#126261)

    We found some Inductor test case failures when enabling Inductor UT for Intel GPU. The root cause is newly introduced Inductor device-bias code from recent community PRs, which causes different behaviors between Intel GPU and CUDA. This PR generalizes this code to align the behaviors.
    
    Pull Request resolved: pytorch#126261
    Approved by: https://github.com/EikanWang, https://github.com/peterbell10
    etaf authored and ZelboK committed May 19, 2024
    cd60801
  37. [export] Cover more cases to copy tensor conversions. (pytorch#125628)

    Summary:
    Previously we tried to convert all .to() calls to to_copy in the graph; now a user reports that other methods like .float() are not covered: pytorch/PiPPy#1104 (comment)
    
    I think fundamentally .float() should look similar to .to() in export, and this diff tries to expand the coverage of the tensor conversion methods here.
    
    Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion
    
    Differential Revision: D56951634
    
    Pull Request resolved: pytorch#125628
    Approved by: https://github.com/tugsbayasgalan
    zhxchen17 authored and ZelboK committed May 19, 2024
    a48463e
  38. Revert "[Memory Snapshot] Add recordAnnotations to capture record_function annotations (pytorch#124179)"
    
    This reverts commit 187aeae.
    
    Reverted pytorch#124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 https://hud.pytorch.org/pytorch/pytorch/commit/187aeaeabf612824c2d0e9be72f80ce6612760d4, test was skipped due to bad TD ([comment](pytorch#124179 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    ff266cd
  39. [CI] 3 procs non cuda (pytorch#125932)

    Too lazy to figure out actual time reduction here, I'll figure it out later.  Also I'd rather get an average of a couple of runs on trunk rather than just this one PR
    Things got faster. Source? Trust me bro
    
    * rel to pytorch#125598
    
    Pull Request resolved: pytorch#125932
    Approved by: https://github.com/ZainRizvi
    clee2000 authored and ZelboK committed May 19, 2024
    e49ccce
  40. Forward fix lint after pytorch#125747 (pytorch#126295)

    Pull Request resolved: pytorch#126295
    Approved by: https://github.com/atalman
    clee2000 authored and ZelboK committed May 19, 2024
    0df5ed0
  41. Faster int8 quantized (pytorch#125704)

    Or my journey to learn how to write fast Metal kernels (more details would be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf) )
    
    Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`)
    
    Before the change, on M2 Pro I get 50 tokens per sec
    After adding a very naive
    ```metal
    template<typename T>
    kernel void int8pack_mm(
        constant T                 * A              [[buffer(0)]],
        constant char              * B              [[buffer(1)]],
        constant T                 * scales         [[buffer(2)]],
        device   T                 * outputData     [[buffer(3)]],
        constant uint3             & sizes          [[buffer(4)]],
        uint                         thread_index   [[thread_position_in_grid]]) {
        const uint lda = sizes.y;
        const uint ldc = sizes.z;
        const uint m = thread_index / sizes.z; // 0..sizes.x-1
        const uint n = thread_index % sizes.z; // 0..sizes.z-1
        constant T *A_ptr = A + m * lda;
        constant char *B_ptr = B + n * lda;
    
        float rc = 0.0;
        for(uint k = 0; k < sizes.y;  k++) {
          const auto a_val = float(A_ptr[k]);
          const auto b_val = float(B_ptr[k]);
          rc += a_val * b_val;
        }
        outputData[thread_index] = T(rc * float(scales[n]));
    }
    ```
    Perf dropped down to a sad 15 tokens per second.
    Replacing inner loop with vectorized operations
    ```metal
        float rc = 0.0;
        for(uint k = 0; k < sizes.y/4;  k++) {
          const auto a_val = float4(A_ptr[k]);
          const auto b_val = float4(B_ptr[k]);
          rc += dot(a_val, b_val);
        }
    ```
    Perf jumps back up to 53 tokens per second, but it's a bit of a lie when it comes to llama2-7B perf.
    
    The next step in unlocking the performance was to replace a 1D grid with a 2D one, but limit the thread group size to a single row, which results in much better data locality (which unfortunately is not observable with `stories110M` anymore, as its small model size and Python runtime overhead hide the perf gain).
    
    There were several unsuccessful attempts at caching inputs in thread-local memory or using `float4x4` to speed up computation. But the key to unlocking the perf was a comment in https://github.com/ml-explore/mlx/blob/631dfbe67309fb630795cd612739cbe54c75e222/mlx/backend/metal/kernels/gemv.metal#L184
    which hinted at exploiting both SIMD groups and thread-local caches, and resulted in a 5x jump in performance compared to the initial vectorization approach and a 3x perf jump in the end-to-end llama7b test.
    Pull Request resolved: pytorch#125704
    Approved by: https://github.com/mikekgfb
    malfet authored and ZelboK committed May 19, 2024
    12f2960
  42. [DTensor] Turn on foreach implementation of optimizer for DTensor by default (pytorch#123394)
    
    Append DTensor to the optimizer `_foreach_supported_types` and turn on foreach implementation of optimizer for DTensor if not specified by the users.
    
    Pull Request resolved: pytorch#123394
    Approved by: https://github.com/wanchaol
    wz337 authored and ZelboK committed May 19, 2024
    9b24e7f
  43. [Dynamo] SizeVariable supports hasattr (pytorch#126222)

    Pull Request resolved: pytorch#126222
    Approved by: https://github.com/williamwen42, https://github.com/anijain2305
    yanboliang authored and ZelboK committed May 19, 2024
    22c50a3
  44. CMake: Improve check and report of Magma (pytorch#117858)

    - Only search for magma if it is used (GPU builds)
    - Don't report it was not found when it isn't searched for
    - Don't report if magma is disabled (currently: "MAGMA not found. Compiling without MAGMA support" is reported)
    
    Pull Request resolved: pytorch#117858
    Approved by: https://github.com/malfet
    Flamefire authored and ZelboK committed May 19, 2024
    35117bf
  45. [onnx.export] Avoid linear loop over symbol_dim_map (pytorch#123029)

    This PR is part of an effort to speed up torch.onnx.export (pytorch#121422).
    
    - Doing a reverse look-up in `symbol_dim_map` incurs a linear cost in number of symbols. This happens for each node, so incurs a quadratic cost to the whole export.
    - Add a reverse look-up `dim_symbol_map` that is kept in parallel with `symbol_dim_map`. This avoids the linear-time look-up, which would otherwise create quadratic export time complexity.
    - This is a highly pragmatic solution. If someone more familiar with the code base has a better solution, I'm interested to hear about it.
    - Resolves (9) in pytorch#121422.
    
    (partial fix of pytorch#121422)
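
    An illustrative Python sketch of the idea (the actual change is in the C++ exporter; `record` is a hypothetical helper): keep the reverse dict updated alongside the forward one so look-ups in either direction are constant time.

    ```python
    symbol_dim_map = {}   # symbol -> dim
    dim_symbol_map = {}   # dim -> symbol, maintained in parallel

    def record(symbol, dim):
        symbol_dim_map[symbol] = dim
        dim_symbol_map[dim] = symbol   # avoids scanning all symbols for every node
    ```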
    
    Pull Request resolved: pytorch#123029
    Approved by: https://github.com/justinchuby
    gustavla authored and ZelboK committed May 19, 2024
    3ca1ae4
  46. [easy] Remove aot_config from pre_compile returns, rename fw_metadata in post_compile (pytorch#125854)
    
    This field never changes so pre_compile doesn't need to return it again: remove it just for a cleaner refactor.
    
    As @aorenste  points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrapper's pre_compile's have run. I want to make this clear in the code, so I renamed the arg in post_compile.
    
    Wrappers that need the exact metadata that they were passed in pre_compile need to save that fw_metadata properly themselves.
    
    Currently, wrappers come in two categories:
    
    1. Wrappers that modify fw_metadata, but then never use fw_metadata in post compile
    2. Wrappers that never modify fw_metadata, and only consume the "final" fw_metadata.
    
    So none of the behaviors will change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata. I'll do that in a separate PR.
    
    Pull Request resolved: pytorch#125854
    Approved by: https://github.com/aorenste, https://github.com/bdhirsh
    jamesjwu authored and ZelboK committed May 19, 2024
    39b2795
  47. Reland '[Inductor] GEMM shape padding improvements (pytorch#118522)' (pytorch#125773)
    
    Relanding just the pad-in-a-single-pass portion of [the PR](pytorch#118522), not including the transpose logic.
    
    This was previously accepted and reviewed.
    
    Pull Request resolved: pytorch#125773
    Approved by: https://github.com/shunting314
    ghstack dependencies: pytorch#125772
    eellison authored and ZelboK committed May 19, 2024
    b3b9f72
  48. Skip padding cost of fusible/planable inputs (pytorch#125780)

    For mm inputs which are not inputs of the graph, assume that we can memory-plan them in the aten.cat and exclude the padding cost from the benchmarking comparison. Technically we also have to do a small amount of zero-writing, but that should be relatively small and is encompassed in the weighting of the padding time by `1.1`.
    
    Pull Request resolved: pytorch#125780
    Approved by: https://github.com/shunting314
    ghstack dependencies: pytorch#125772, pytorch#125773
    eellison authored and ZelboK committed May 19, 2024
    0ce75f9
  49. Forward fix failures for torch.export switch to predispatch (pytorch#126081)
    
    Summary:
    Fixes:
    - executorch test
    - torchrec test
    
    Test Plan: CI
    
    Differential Revision: D57282304
    
    Pull Request resolved: pytorch#126081
    Approved by: https://github.com/angelayi
    tugsbayasgalan authored and ZelboK committed May 19, 2024
    0f2db1c
  50. Beef up error message for pending assert failure (pytorch#126212)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126212
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    6b733b2
  51. Enable UFMT format on test/test_utils.py (pytorch#125996)

    Fixes some files in pytorch#123062
    
    Run lintrunner on files:
    test/test_utils.py
    
    ```bash
    $ lintrunner -a --take UFMT --all-files
    ok No lint issues.
    Successfully applied all patches.
    ```
    Pull Request resolved: pytorch#125996
    Approved by: https://github.com/ezyang
    hippocookie authored and ZelboK committed May 19, 2024
    1480537
  52. Fix aarch64 debug build with GCC (pytorch#126290)

    By working around GCC's quirks in instantiating templates that require immediate values.
    Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define __OPTIMIZE__ if invoked with anything but -O0).
    
    Fixes pytorch#126283
    
    Pull Request resolved: pytorch#126290
    Approved by: https://github.com/atalman, https://github.com/seemethere
    malfet authored and ZelboK committed May 19, 2024
    adf9cc7
  53. Fix public binding to actually traverse modules (pytorch#126103)

    The current call passes in `['/actual/path']` to os.walk, which is a string pointing to no path and thus silently leads to an empty traversal.
    There is an unused function just above that handles that, so I guess this is what was supposed to be called.
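
    For illustration, `os.walk` silently yields an empty traversal for a path that does not exist, which is why the bug went unnoticed:

    ```python
    import os

    # No error is raised for a nonexistent path; the walk is simply empty.
    print(list(os.walk("no/such/dir")))  # []
    ```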
    
    Pull Request resolved: pytorch#126103
    Approved by: https://github.com/suo
    albanD authored and ZelboK committed May 19, 2024
    147ba73
  54. [FSDP] Fixed docs for inter/intra node PG helpers (pytorch#126288)

    1. This fixes an issue where we had 9 ranks in one node and 7 in the other.
    2. This makes the notation more explicit that `[0, 7]` is `[0, 1, ..., 7]`.
    
    Pull Request resolved: pytorch#126288
    Approved by: https://github.com/weifengpy
    awgu authored and ZelboK committed May 19, 2024
    9397380
  55. Revert "Fix aarch64 debug build with GCC (pytorch#126290)"

    This reverts commit a961e1a.
    
    Reverted pytorch#126290 on behalf of https://github.com/malfet due to Indeed lint is broken :/ ([comment](pytorch#126290 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    921a824
  56. Parametrize test_dim_reduction (pytorch#126292)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126292
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    0658670
  57. [DCP] overwrites existing checkpoint by default (pytorch#125877)

    Checks for existing checkpoints and overwrites, based on an `overwrite` flag
    
    Differential Revision: [D57186174](https://our.internmc.facebook.com/intern/diff/D57186174/)
    
    Pull Request resolved: pytorch#125877
    Approved by: https://github.com/fegin
    LucasLLC authored and ZelboK committed May 19, 2024
    b5e6220
  58. Fix public api allowlist logical merge conflict (pytorch#126321)

    Skip the newly added bad API from pytorch#126212 to keep CI green.
    
    Pull Request resolved: pytorch#126321
    Approved by: https://github.com/ezyang
    albanD authored and ZelboK committed May 19, 2024
    a0a6bbc
  59. 2 rocm shards on trunk.yml (pytorch#125933)

    after test removal for windows cpu + avx related configs, it's going to be the long pole for trunk
    
    Just checked: without rocm, avg tts for trunk is 2.5 hrs last week; with rocm it's about 3
    
    Pull Request resolved: pytorch#125933
    Approved by: https://github.com/ZainRizvi
    clee2000 authored and ZelboK committed May 19, 2024
    910f26f
  60. [FSDP2] allow meta tensors during loading state dict and cpu offloading (pytorch#126267)
    
    unit test: ``pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py``
    
    with meta init and cpu offloading, we have meta tensors after `model.load_state_dict(assign=True, strict=False)`. This PR avoids calling `.cpu` on meta tensors; otherwise it's a runtime error
    
    Pull Request resolved: pytorch#126267
    Approved by: https://github.com/awgu
    weifengpy authored and ZelboK committed May 19, 2024
    5b4dea2
  61. [dynamo] Detect monkeypatching on nn module forward method (pytorch#126203)
    
    An alternative was pytorch#124975. Though it was safer because it added guards for every inlined function, it caused guard overhead of > 20% for a few models. The overhead of this PR is minimal for the common unpatched case.
    
    Fixes an internal issue - [fb.workplace.com/groups/1075192433118967/permalink/1411067766198097](https://fb.workplace.com/groups/1075192433118967/permalink/1411067766198097/)
    
    Pull Request resolved: pytorch#126203
    Approved by: https://github.com/ezyang
    anijain2305 authored and ZelboK committed May 19, 2024
    3a3f8a9
  62. [onnx.export] Avoid unnecessary copy of debug_names (pytorch#123026)

    This PR is part of an effort to speed up torch.onnx.export (pytorch#121422).
    
    - The `auto debug_names = ` incurs a copy, whereas `const auto& debug_names` does not.
    - However, this one requires us to be careful, since calls to `setDebugName` change `debug_names` and invalidate the `exist_name` iterator. So if we simply change `auto` to `const auto&`, then between that line and `find` we would have corrupted the iterator by calling `output[i]->setDebugName`. This change aims to be functionally equivalent to the original, which is why we first get the Value pointer, then call `output[i]->setDebugName`, and finally call `setDebugName` on the found value. It is possible that functionally it is OK to simply call `output[i]->setDebugName` first and then do the find and the second `setDebugName`, but this would not be identical to the current behavior.
    - Resolves (2) in pytorch#121422.
    Pull Request resolved: pytorch#123026
    Approved by: https://github.com/justinchuby
    gustavla authored and ZelboK committed May 19, 2024
    569ee1e
  63. 6243a43
  64. Improve Storage copy_ size mismatch error message (pytorch#126280)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126280
    Approved by: https://github.com/mikaylagawarecki
    ezyang authored and ZelboK committed May 19, 2024
    9e2b899
  65. 0d9def0
  66. Remove Caffe2 python code (pytorch#126035)

    Follows the recent changes of Caffe2.
    
    Pull Request resolved: pytorch#126035
    Approved by: https://github.com/r-barnes, https://github.com/Skylion007
    cyyever authored and ZelboK committed May 19, 2024
    eb5e9ed
  67. Enable UFMT on test/test_datapipe.py (pytorch#124994)

    Part of: pytorch#123062
    
    Ran lintrunner on:
    
    - `test/test_datapipe.py`
    
    Detail:
    
    ```bash
    $ lintrunner -a --take UFMT --all-files
    ok No lint issues.
    Successfully applied all patches.
    ```
    
    Co-authored-by: Edward Z. Yang <ezyang@fb.com>
    Pull Request resolved: pytorch#124994
    Approved by: https://github.com/mikaylagawarecki
    shink authored and ZelboK committed May 19, 2024
    05eff35
  68. Remove expected failure in test_eager_transforms.py (pytorch#125883)

    Seems to be supported now
    
    CC @tinglvv @nWEIdia @Aidyn-A
    
    Pull Request resolved: pytorch#125883
    Approved by: https://github.com/Chillee, https://github.com/Aidyn-A
    eqy authored and ZelboK committed May 19, 2024
    75add2f
  69. [optim] Fix: wrong ASGD implementation (pytorch#125440)

    > previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor.
    
    - [x] Ill-founded assumption that every param will have the same step.
    - [x] Different implementation between `foreach=True` and `foreach=False`.
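
    For illustration, Python list multiplication creates several references to one tensor rather than several tensors, which is what makes the simplification quoted above valid for values that never change:

    ```python
    import torch

    t = torch.zeros(3)
    shared = [t] * 4                                   # one tensor, four references
    independent = [torch.zeros(3) for _ in range(4)]   # four separate tensors

    shared[0].add_(1)
    print(shared[1])        # tensor([1., 1., 1.]) -- all entries alias the same storage
    independent[0].add_(1)
    print(independent[1])   # tensor([0., 0., 0.])
    ```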
    Pull Request resolved: pytorch#125440
    Approved by: https://github.com/janeyx99
    david20571015 authored and ZelboK committed May 19, 2024
    079d3f5
  70. Fix triton codegen main do_bench_gpu import error (pytorch#126213)

    Summary:
    Encountered module import error when running triton kernel file.
    
    The cause seems to be D57215950 which changed "do_bench" to "do_bench_gpu" for torch._inductor.runtime.runtime_utils
    
    However, in the codegen, instead we have "from triton.testing import do_bench", so the line below should be reverted back to "do_bench".
    
    Test Plan:
    LOGLEVEL=DEBUG TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 CUDA_VISIBLE_DEVICES=5 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT='/home/adelesun/mts_profiling/outputs/profile_output.txt' TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_CACHE_DIR='/home/adelesun/mts_profiling/code' TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata buck2 run mode/opt                 -c=python.package_style=inplace                 -c fbcode.enable_gpu_sections=true                 -c fbcode.platform=platform010                 -c fbcode.nvcc_arch=v100,a100,h100                 -c fbcode.split-dwarf=true                 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark                 --  --local-model /home/adelesun/mts_profiling/inputs/offsite_cvr_model_526372970_793.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR 2>&1 | tee /home/adelesun/mts_profiling/outputs/benchmark_output.txt
    
    bento console --kernel=aetk --file=/home/adelesun/mts_profiling/code/op/copmbxfunzmywemwmg66lnlcx4apvn2f2vsi3glgisausgfvit4g.py
    
    file ran successfully
    
    Differential Revision: D57345619
    
    Pull Request resolved: pytorch#126213
    Approved by: https://github.com/shunting314
    adelesun authored and ZelboK committed May 19, 2024
    0e22566
  71. Configuration menu
    Copy the full SHA
    8cc3b81 View commit details
    Browse the repository at this point in the history
  72. [dynamo] graph break on issubclass call with non-const args (pytorch#125943)
    
    Fixes pytorch#125942
    
    Pull Request resolved: pytorch#125943
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#125882
    williamwen42 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    972f76f View commit details
    Browse the repository at this point in the history
  73. [dynamo] fix pytorch#93624 (pytorch#125945)

    Fixes pytorch#93624 but also requires jcmgray/autoray#20 to be fixed.
    
    Pull Request resolved: pytorch#125945
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#125882, pytorch#125943
    williamwen42 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    2524635 View commit details
    Browse the repository at this point in the history
  74. Configuration menu
    Copy the full SHA
    d3d25a3 View commit details
    Browse the repository at this point in the history
  75. [FSDP2] support fully_shard(model_on_meta, cpu_offload) (pytorch#126305)

    Support `fully_shard(model_on_meta, cpu_offload)` when the `fully_shard` call is placed outside of the `torch.device("meta")` context manager.
    
    Pull Request resolved: pytorch#126305
    Approved by: https://github.com/awgu
    ghstack dependencies: pytorch#126267
    weifengpy authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    592bc1f View commit details
    Browse the repository at this point in the history
  76. Add VariableTracker.debug_repr (pytorch#126299)

    Now you can print arbitrary values at compile time with `comptime.print()`.
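
    A minimal usage sketch (the `torch._dynamo.comptime` import path is an assumption here; the value is printed while Dynamo traces the function, not at runtime):

    ```python
    import torch
    from torch._dynamo.comptime import comptime

    @torch.compile
    def f(x):
        y = x + 1
        comptime.print(y)  # prints y's compile-time representation via debug_repr
        return y * 2

    f(torch.randn(3))
    ```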
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126299
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#126292
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    3fe0c6d View commit details
    Browse the repository at this point in the history
  77. Also remove compile_time_strobelight_meta frame when generating stack (pytorch#126289)
    
    I think I also need to fix this in fbcode, leaving that for future work.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126289
    Approved by: https://github.com/yanboliang
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    421b23d View commit details
    Browse the repository at this point in the history
  78. Make propagate_real_tensor more safe (pytorch#126281)

    Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/
    
    There are a few improvements here, which luckily fix some xfails:
    
    * In general, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode; this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
    * We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
    * I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126281
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    db73a01 View commit details
    Browse the repository at this point in the history
  79. Switched from parameter in can_cast to from_. (pytorch#126030)

    Fixes pytorch#126012.
    
    `from` is a reserved keyword in Python, so we can't expose the C++ impl with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs.
    
    If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
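
    For illustration, a quick sketch of both spellings (the keyword form assumes this rename is in place; note the change is reverted further down in this commit list):

    ```python
    import torch

    # Positional usage is unchanged:
    torch.can_cast(torch.int64, torch.float32)           # True

    # Keyword usage with the renamed parameter (sketch):
    torch.can_cast(from_=torch.int64, to=torch.float32)  # True
    ```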
    
    Pull Request resolved: pytorch#126030
    Approved by: https://github.com/albanD
    tringwald authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    fed9d93 View commit details
    Browse the repository at this point in the history
  80. [easy][dynamo][inline-inbuilt-nn-modules] Change test to check for params (pytorch#126316)
    
    Pull Request resolved: pytorch#126316
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8f9fa47 View commit details
    Browse the repository at this point in the history
  81. [Export] Allow ExportedProgram to take empty decomp table (pytorch#126142)
    
    **As title.**
    Still, `ep.run_decompositions()` will use `core_aten_decompositions()` by default. Cases like `ep.run_decompositions(get_decompositions([]))` will use an empty table and go with [`aot_autograd_decompositions`](https://github.com/pytorch/pytorch/blob/04877dc430a6e93765471b28f422bf3e81d02c9e/torch/_functorch/aot_autograd.py#L456-459) only.
    
    **Motivation**
    We didn't have a clean way to pass in an empty decomp table. Since we've made `pre_dispatch` export the default and `ep.run_decompositions` still goes through `aot_export_module(..., pre_dispatch=False)`, allowing an empty table would make that kind of blank-table control easier.
    
    **Testing**
    CI
    Also looked through all the references in fbcode. The only concern I have is whether we should update [this example](https://github.com/pytorch/pytorch/blob/04877dc430a6e93765471b28f422bf3e81d02c9e/torch/onnx/_internal/exporter.py#L817) or not.
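
    A rough sketch of the two paths described above (the module and shapes are made up for illustration):

    ```python
    import torch
    from torch.export import export
    from torch._decomp import get_decompositions

    class M(torch.nn.Module):
        def forward(self, x):
            return torch.nn.functional.silu(x)

    ep = export(M(), (torch.randn(4),))

    ep_core = ep.run_decompositions()                          # default: core ATen table
    ep_blank = ep.run_decompositions(get_decompositions([]))   # empty table, per this change
    ```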
    Pull Request resolved: pytorch#126142
    Approved by: https://github.com/angelayi
    StellarrZ authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    7a4c6b9 View commit details
    Browse the repository at this point in the history
  82. [optim] add fused_adagrad support for CPU device (pytorch#124905)

    Add fused Adagrad (`_fused_adagrad`) kernel support for CPU.
    
    ## Bench result:
    32 core/sockets ICX
    Test Scripts:
    https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
    https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
    ```
    Tensor Size: 262144, Num Tensor 4, Num Threads: 1
    _single_tensor_adagrad time: 0.2500 seconds
    _fused_adagrad time: 0.0933 seconds
    Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
    _single_tensor_adagrad time: 2.8819 seconds
    _fused_adagrad time: 1.7591 seconds
    ```
    ## Test Plan:
    ```
    python test_optim.py -k test_fused_matches_forloop
    python test_optim.py -k test_fused_large_tensor
    python test_optim.py -k test_can_load_older_state_dict
    python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
    python test_torch.py -k test_grad_scaling_autocast_fused
    python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
    ```
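
    A minimal sketch of what the fused CPU path could look like from the user side (assuming the `fused=True` flag is exposed on `torch.optim.Adagrad`, as with the other fused optimizers):

    ```python
    import torch

    model = torch.nn.Linear(128, 64)
    opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)  # assumed flag

    x = torch.randn(32, 128)
    model(x).sum().backward()
    opt.step()
    opt.zero_grad()
    ```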
    
    Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
    Pull Request resolved: pytorch#124905
    Approved by: https://github.com/jgong5, https://github.com/janeyx99
    zhuhaozhe authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    6944593 View commit details
    Browse the repository at this point in the history
  83. [Inductor][Flex-attention] Make num_head support dynamic (pytorch#126342)
    
    Fixes #ISSUE_NUMBER
    
    Pull Request resolved: pytorch#126342
    Approved by: https://github.com/drisspg
    yanboliang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    7754cc1 View commit details
    Browse the repository at this point in the history
  84. [dynamo][inline-inbuilt-nn-modules] Change test to not depend on id of mod instance (pytorch#126314)
    
    Pull Request resolved: pytorch#126314
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303, pytorch#126316
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    009b5b6 View commit details
    Browse the repository at this point in the history
  85. [dynamo][inline-inbuilt-nn-modules] Add and update test_modules.py for inlining work (pytorch#126327)
    
    Pull Request resolved: pytorch#126327
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303, pytorch#126316, pytorch#126314
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    ae2fdc8 View commit details
    Browse the repository at this point in the history
  86. [inductor] [FX graph cache] Ignore unbacked symints in guards expression (pytorch#126251)
    
    Summary: Found a unit test that was causing an assertion failure during an attempt to use unbacked symints in the guards expression, but it turns out unbacked symints can't affect guards anyway, so we can just filter them out. Also in this diff: test_torchinductor_dynamic_shapes.py was not configured to exercise the codecache because the TestCase setUp method was inadvertently skipping the setUp of the immediate parent class.
    
    Pull Request resolved: pytorch#126251
    Approved by: https://github.com/peterbell10
    masnesral authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    675c49f View commit details
    Browse the repository at this point in the history
  87. Revert "Switched from parameter in can_cast to from_. (pytorch#126030)"

    This reverts commit 06d6bb4.
    
    Reverted pytorch#126030 on behalf of https://github.com/huydhn due to Sorry for reverting your change but i need to revert it to avoid a diff train conflict with pytorch#125995.  Please help rebase and I will reland the change ([comment](pytorch#126030 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    930e757 View commit details
    Browse the repository at this point in the history
  88. [inductor][cpp] epilogue support for gemm template (pytorch#126019)

    As part of pytorch#125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.
    
    Pull Request resolved: pytorch#126019
    Approved by: https://github.com/jansel
    jgong5 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    88643f1 View commit details
    Browse the repository at this point in the history
  89. [TEST][Dynamo] fix test_deviceguard.py (pytorch#126240)

    The `test_device_guard.py` was improperly set up, so there were failures on multi-GPU machines. By design the `DeviceGuard` should keep `idx` the same even after it was applied.
    
    Pull Request resolved: pytorch#126240
    Approved by: https://github.com/jansel
    Aidyn-A authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    4417b4c View commit details
    Browse the repository at this point in the history
  90. Revert "Remove deprecated _aminmax operator (pytorch#125995)"

    This reverts commit 0116ffa.
    
    Reverted pytorch#125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](pytorch#125995 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    f30d086 View commit details
    Browse the repository at this point in the history
  91. Configuration menu
    Copy the full SHA
    b8c08b6 View commit details
    Browse the repository at this point in the history
  92. [DeviceMesh] Fix hash and eq not match (pytorch#123572)

    Fixes pytorch#121799
    
    We fix the DeviceMesh hash such that two meshes are considered equal if they have the same mesh and the same parent_mesh.
    Examples can be found here: pytorch#121799
    
    This is also needed to unblock pytorch#123394.
    
    Pull Request resolved: pytorch#123572
    Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
    wz337 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    479f3f9 View commit details
    Browse the repository at this point in the history
  93. [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (pytorch#126068)
    
    As part of pytorch#125683, this PR adds the initial bf16/fp16 gemm template support with the micro-gemm implemented via fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet; that will be added in the next PR.
    
    Pull Request resolved: pytorch#126068
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#126019
    jgong5 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    e974908 View commit details
    Browse the repository at this point in the history
  94. Initial implementation of AdaRound (pytorch#126153)

    Summary:
    This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568
    
    This algorithm is going to be used by multiple people, hence we need to make it an official implementation.
    
    Differential Revision: D57227565
    
    Pull Request resolved: pytorch#126153
    Approved by: https://github.com/jerryzh168
    kwanghoon-meta authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    a4250cc View commit details
    Browse the repository at this point in the history
  95. Revert "[optim] Fix: wrong ASGD implementation (pytorch#125440)"

    This reverts commit 2c5ad9a.
    
    Reverted pytorch#125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](pytorch#125440 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    195d01c View commit details
    Browse the repository at this point in the history
  96. Revert "Initial implementation of AdaRound (pytorch#126153)"

    This reverts commit 175c18a.
    
    Reverted pytorch#126153 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit because there are more than one lint issues, torch/optim/asgd.py is just the last one ([comment](pytorch#126153 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    4397921 View commit details
    Browse the repository at this point in the history
  97. Add Lowering for FlexAttention Backwards (pytorch#125515)

    # Summary
    #### What does this PR do?
    It enables Inductor to actually generate the fused flex attention kernel for the backward pass.
    
    I did some other things along the way:
    - Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwd kernel.
    - The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think we should update to a newer version in a follow-up. Notably, the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
    - I didn't correctly register the decomp table + IndexMode when I landed pytorch#123902; this remedies that.
    - The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specific to "future causal" attention.
    - The main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
    - I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications).
    - I updated the benchmark to also profile bwd performance.
    
    ### Benchmark Numbers:
    _The current implementation is not parallelizing over ctx length in the bwd_
    FWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.991 |                    |             |                |
    | Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
    | Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |
    
    BWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.291 |                    |             |                |
    | Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
    | Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |
    
    <details>
    
    <summary>Full Data</summary>
    
    | shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
    |---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
    | (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
    | (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
    | (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
    | (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
    | (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
    | (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
    | (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
    | (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
    | (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
    | (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
    | (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
    | (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
    | (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
    | (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
    | (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
    | (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
    | (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
    | (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
    | (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
    | (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
    | (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
    | (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
    | (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
    | (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
    | (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
    | (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
    | (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
    | (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
    | (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
    | (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
    | (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
    | (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
    | (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
    | (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
    | (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
    | (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
    | (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
    | (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
    | (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
    | (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
    | (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
    | (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
    | (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
    | (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
    | (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
    | (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
    | (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
    | (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
    | (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
    | (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
    | (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
    | (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
    | (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
    | (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
    | (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
    | (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
    | (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
    | (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
    | (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
    | (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
    | (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
    | (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
    | (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
    | (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
    | (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
    | (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
    | (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
    | (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
    | (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
    | (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
    | (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
    | (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
    | (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
    | (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
    | (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
    | (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
    | (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
    | (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
    | (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
    | (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
    | (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
    | (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
    | (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
    | (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
    | (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
    | (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
    | (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
    | (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
    | (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
    | (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
    | (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
    | (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
    | (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
    | (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
    | (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
    | (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
    | (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
    | (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
    | (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
    | (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
    | (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
    | (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
    | (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
    | (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
    | (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
    | (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
    | (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
    | (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |
    
    </details>
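
    For orientation, a sketch of the kind of score_mod callables named in the table above (the `(score, b, h, q_idx, kv_idx)` signature and how they are handed to flex attention are assumptions here, not code from this PR):

    ```python
    import torch

    def noop(score, b, h, q_idx, kv_idx):
        return score

    def causal_mask(score, b, h, q_idx, kv_idx):
        # mask out future positions by pushing their scores to -inf
        return torch.where(q_idx >= kv_idx, score, torch.full_like(score, float("-inf")))

    def relative_bias(score, b, h, q_idx, kv_idx):
        return score + (q_idx - kv_idx)

    def head_bias(score, b, h, q_idx, kv_idx):
        return score + 2 * h
    ```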
    
    Pull Request resolved: pytorch#125515
    Approved by: https://github.com/Chillee
    drisspg authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    22db67f View commit details
    Browse the repository at this point in the history
  98. [dynamo] Delete extra testing of cpp guard manager (pytorch#126343)

    The CPP guard manager has been on for a few weeks now. This separate testing was part of the phased rollout while the cpp guard manager was not yet enabled. It is no longer needed.
    
    Pull Request resolved: pytorch#126343
    Approved by: https://github.com/williamwen42
    ghstack dependencies: pytorch#126303, pytorch#126316, pytorch#126314, pytorch#126327
    anijain2305 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8dced59 View commit details
    Browse the repository at this point in the history
  99. fix the device type for with_comms decorator (pytorch#125798)

    Found by @yifuwang: it looks like we are wrongly using
    self.device_type="cuda" for the gloo backend, which is triggering some
    flakiness, e.g. pytorch#125366.
    
    Pull Request resolved: pytorch#125798
    Approved by: https://github.com/yifuwang
    wanchaol authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    c73f90c View commit details
    Browse the repository at this point in the history
  100. Add mode to MemoryDep to track atomic accumulates (pytorch#123223)

    And allow fusion of buffers where writes are only atomic accumulates.
    This allows fusing of ops like
    
      _unsafe_index_put(_unsafe_index_put(a, ...), ...)
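
    A toy reproduction of that nested pattern (sketch only; `aten._unsafe_index_put` and its `accumulate` flag exist, but whether the two writes actually fuse depends on this change):

    ```python
    import torch

    @torch.compile
    def scatter_add_twice(a, idx1, v1, idx2, v2):
        # two accumulate-only writes of the kind the new MemoryDep mode can fuse
        a = torch.ops.aten._unsafe_index_put(a, [idx1], v1, accumulate=True)
        a = torch.ops.aten._unsafe_index_put(a, [idx2], v2, accumulate=True)
        return a

    out = scatter_add_twice(torch.zeros(8),
                            torch.tensor([0, 1]), torch.ones(2),
                            torch.tensor([1, 2]), torch.ones(2))
    ```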
    
    Pull Request resolved: pytorch#123223
    Approved by: https://github.com/peterbell10
    isuruf authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    9f09eae View commit details
    Browse the repository at this point in the history
  101. [c10d] Add an option for NAN check on every collective (pytorch#125726)

    Summary:
    The NaN check is done through a device-side assert, without needing a copy
    from GPU to CPU.
    Test Plan:
    Unit test for collectives that should experience a runtime error
    
    (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
    test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
    failed.
    [rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
    checkForNan: device-side assert triggered
    
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
    failed.
    /home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
    checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
    failed.
    [rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
    checkForNan: device-side assert triggered
    
    .
    ----------------------------------------------------------------------
    Ran 1 test in 7.723s
    
    OK
    
    Tags:
    
    Pull Request resolved: pytorch#125726
    Approved by: https://github.com/kwen2501
    shuqiangzhang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    2ba6d37 View commit details
    Browse the repository at this point in the history
  102. Generate runtime asserts when propagate real tensor is used (pytorch#126287)
    
    This means that propagate real tensor is no longer unsound: if the
    route we took at compile time diverges from what happens at runtime, you
    will get a runtime assert.
    
    Also add structured trace logs for these.
    
    Also fix bug where xreplace with int range is not guaranteed to return
    a sympy expression.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126287
    Approved by: https://github.com/Skylion007
    ezyang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8989a88 View commit details
    Browse the repository at this point in the history
  103. [ez] fix exported diff mismatch (pytorch#126357)

    Fixes the following issue:
    D55803461 differs from the exported PR: pytorch#123658
    
    ⚠️ this PR needs to be skipped on diff train!
    
    Pull Request resolved: pytorch#126357
    Approved by: https://github.com/huydhn, https://github.com/fegin
    izaitsevfb authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    1473472 View commit details
    Browse the repository at this point in the history
  104. [Add sliding window attention bias] (pytorch#126061)

    Summary:
    This PR implements sliding-window attention and updates "aten._flash_attention_forward/_flash_attention_backward" to expose the window_size_left and window_size_right arguments. With these kwargs added we can dispatch to the FAv2 impl if the necessary constraints are met.
    
    These arguments will eventually be provided to "aten.sdpa_flash", but for now they are needed by xformers in their effort to directly use the PyTorch FAv2 impl instead of building their own.
    
    Test Plan:
    Use the default aten.sdpa_flash tests since we've added optional arguments set to the previous default value: -1, /*window_size_left*/
    
    Using buck2 build --flagfile fbcode//mode/dev-nosan fbcode//caffe2/caffe2/fb/predictor/tests:inference_context_test
    
    Differential Revision: D56938087
    
    Pull Request resolved: pytorch#126061
    Approved by: https://github.com/drisspg, https://github.com/desertfire
    lvaleriu authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    64fb6ed View commit details
    Browse the repository at this point in the history
  105. Fix lint failures coming from pytorch#126035 (pytorch#126378)

    MYPY somehow shows lots of local failures for me.  The issue is tracked in pytorch#126361.  This is only to keep trunk sane.  These two lines were added by pytorch#126035 as an attempt to fix lint there, but didn't seem to help.
    Pull Request resolved: pytorch#126378
    Approved by: https://github.com/kit1980
    huydhn authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    7dab5f7 View commit details
    Browse the repository at this point in the history
  106. [1/N] Non-Tensor: Scalar Support: Enable aot compile to support aten operations with scalar input like alpha (pytorch#124177)
    
    Some operations have a scalar input parameter, like `torch.add(a, b, alpha=2.0)`.  Currently, aot compile does not support such a case because it requires the signature of the captured graph to align with the operation's signature. This means that some inputs in the captured graph may be scalars (float, int, bool, etc.). This breaks the assumption of `compile_fx_aot`, which assumes all the example inputs are tensors - https://github.com/pytorch/pytorch/blob/0f6ce45bcbd7026c00da43db0317ede10830378b/torch/_inductor/compile_fx.py#L1048
    
    This PR intends to support such cases by allowing a non-aligned signature and filtering out the non-Tensor parameters.
    
    Captured graph for `torch.add(a, b, alpha=2.0)`
    
    ```
    opcode         name      target           args              kwargs
    -------------  --------  ---------------  ----------------  --------------
    placeholder    arg0_1    arg0_1           ()                {}
    placeholder    arg1_1    arg1_1           ()                {}
    call_function  add       aten.add.Tensor  (arg0_1, arg1_1)  {'alpha': 2.0}
    output         output_1  output           ((add,),)         {}
    ```
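
    A small sketch of capturing such a graph (using `torch.export` here only to show the scalar staying as a kwarg; the AOT compile entry point itself is not shown):

    ```python
    import torch

    class Add(torch.nn.Module):
        def forward(self, a, b):
            return torch.add(a, b, alpha=2.0)

    ep = torch.export.export(Add(), (torch.randn(2), torch.randn(2)))
    print(ep.graph)  # aten.add.Tensor(..., alpha = 2.0), matching the graph above
    ```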
    
    Pull Request resolved: pytorch#124177
    Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/jgong5
    EikanWang authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    fb2c753 View commit details
    Browse the repository at this point in the history
  107. [Doc] Add deprecated autocast comments for doc (pytorch#126062)

    # Motivation
    We generalized the device-agnostic API `torch.amp.autocast` in [pytorch#125103](pytorch#125103).  After that,
    - `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and
    - `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)`
    
    whether in eager mode or JIT mode.
    Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast`, and **strongly recommend** that developers use the device-agnostic `torch.amp.autocast` API.
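
    For illustration, the two spellings side by side (CPU and bfloat16 chosen so the snippet runs anywhere):

    ```python
    import torch

    x, w = torch.randn(8, 8), torch.randn(8, 8)

    # Recommended device-agnostic API:
    with torch.amp.autocast("cpu", dtype=torch.bfloat16):
        y = x @ w

    # Older device-specific spelling, now deprecated but equivalent:
    with torch.cpu.amp.autocast(dtype=torch.bfloat16):
        y = x @ w
    ```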
    
    Pull Request resolved: pytorch#126062
    Approved by: https://github.com/eqy, https://github.com/albanD
    guangyey authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    45d93f9 View commit details
    Browse the repository at this point in the history
  108. Revert "Fix lint failures coming from pytorch#126035 (pytorch#126378)"

    This reverts commit 5fa1f4c.
    
    Reverted pytorch#126378 on behalf of https://github.com/huydhn due to Trying to add yet another lint fix from https://hud.pytorch.org/pr/pytorch/pytorch/126357 and will reland this ([comment](pytorch#126378 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    75289f2 View commit details
    Browse the repository at this point in the history
  109. Revert "Add Lowering for FlexAttention Backwards (pytorch#125515)"

    This reverts commit 95b9e98.
    
    Reverted pytorch#125515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the newly added test runs out of memory https://hud.pytorch.org/pytorch/pytorch/commit/95b9e981c3ab68fc17f78b8a6bbfd9569745ae4c ([comment](pytorch#125515 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    dd2f8d1 View commit details
    Browse the repository at this point in the history
  110. Fix lint failures coming from pytorch#126035 (pytorch#126378)

    MYPY somehow shows lots of local failures for me.  The issue is tracked in pytorch#126361.  This is only to keep trunk sane.  These two lines were added by pytorch#126035 as an attempt to fix lint there, but didn't seem to help.
    
    Pull Request resolved: pytorch#126378
    Approved by: https://github.com/kit1980
    huydhn authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    adc0551 View commit details
    Browse the repository at this point in the history
  111. [Traceable FSDP2] Add all_gather_into_tensor out variant (pytorch#126334)
    
    This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`.
    
    It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather, makes the input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, the AllGather op will then create a brand-new output buffer (instead of reusing it), thus significantly increasing the memory usage.
    
    The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users.
    
    Pull Request resolved: pytorch#126334
    Approved by: https://github.com/yifuwang, https://github.com/wanchaol
    yf225 authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    64efc14 View commit details
    Browse the repository at this point in the history
  112. Configuration menu
    Copy the full SHA
    0ddafc0 View commit details
    Browse the repository at this point in the history
  113. [Reopen] Upgrade submodule oneDNN to v3.4.2 (pytorch#126137)

    Reopen of pytorch#122472
    
    ## Improvements
    This upgrade fixes the following issues:
    - pytorch#120982
    
    This upgrade brings the following new features:
    - Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (pytorch#114450)
    
    ## Validation results on CPU
    Original results with oneDNN v3.4.1 are here: pytorch#122472 (comment)
    
    Need to rerun validation and update results.
    
    Co-authored-by: Sunita Nadampalli <nadampal@amazon.com>
    Pull Request resolved: pytorch#126137
    Approved by: https://github.com/jgong5, https://github.com/snadampal, https://github.com/atalman
    Xia-Weiwen authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    8288174 View commit details
    Browse the repository at this point in the history
  114. [FSDP2] Supported set_all_reduce_gradients=False for HSDP (pytorch#126166)
    
    **Context**
    For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients).
    - FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`.
    - FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`.
    
    For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2).
    - FSDP2 offers (1) without any intervention like mentioned above.
    - FSDP2 offers (3) via `module.set_requires_gradient_sync()` like mentioned above.
    - FSDP2 offers (2) via `module.set_requires_all_reduce()` similar to `set_requires_gradient_sync()`.
    
    **Overview**
    For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:
    ```
    for microbatch_idx, microbatch in enumerate(microbatches):
        is_last_microbatch = microbatch_idx == len(microbatches) - 1
        model.set_requires_all_reduce(is_last_microbatch)
        # Run forward/backward
    ```
    
    This PR also makes the minor change of making the `recurse: bool` argument in these setter methods kwarg-only.
    
    **Developer Notes**
    We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. We prefer to avoid this alternative for now because it introduces more complexity to do extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back to the last microbatch's reduce output.
    
    Pull Request resolved: pytorch#126166
    Approved by: https://github.com/weifengpy, https://github.com/wanchaol
    ghstack dependencies: pytorch#126067, pytorch#126070, pytorch#126161
    awgu authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    5dd875a View commit details
    Browse the repository at this point in the history
  115. Fix aarch64 debug build with GCC (pytorch#126290)

    By working around GCC's quirks in instantiating templates that require immediate values.
    Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).
    
    Fixes pytorch#126283
    
    Pull Request resolved: pytorch#126290
    Approved by: https://github.com/atalman, https://github.com/seemethere
    malfet authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    cebb5df View commit details
    Browse the repository at this point in the history
  116. Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (pytorch#126336)
    
    Fixes pytorch#125504
    Fixes pytorch#126252
    Fixes pytorch#126296
    Fixes pytorch#126330
    
    This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR.
    
    Pull Request resolved: pytorch#126336
    Approved by: https://github.com/huydhn, https://github.com/pruthvistony
    jithunnair-amd authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    60fb3ef View commit details
    Browse the repository at this point in the history
  117. [ROCm] amax hipblaslt integration (pytorch#125921)

    AMAX is coming as part of ROCm 6.2. This code adds that functionality.
    
    Pull Request resolved: pytorch#125921
    Approved by: https://github.com/eqy, https://github.com/lezcano
    alugorey authored and ZelboK committed May 19, 2024
    Configuration menu
    Copy the full SHA
    9df7bda View commit details
    Browse the repository at this point in the history
  118. Commit 19dfbce (no commit message)
  119. [AOTI][torchgen] Support at::Generator via C shim (pytorch#126181)

    Summary: Support at::Generator which is used by many random number generator ops
    Pull Request resolved: pytorch#126181
    Approved by: https://github.com/chenyang78
    desertfire authored and ZelboK committed May 19, 2024
  120. [AOTI] Refactor some fallback op util functions (pytorch#126182)

    Summary: Move some util functions for cpp kernel naming and missing arg filling from FallbackKernel to ExternKernel, since they are useful for ExternKernel in general.
    
    Pull Request resolved: pytorch#126182
    Approved by: https://github.com/chenyang78
    ghstack dependencies: pytorch#126181
    desertfire authored and ZelboK committed May 19, 2024
  121. [AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (pytorch#126183)
    
    Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes pytorch#121809
    
    Pull Request resolved: pytorch#126183
    Approved by: https://github.com/angelayi
    ghstack dependencies: pytorch#126181, pytorch#126182
    desertfire authored and ZelboK committed May 19, 2024
  122. [AOTI][refactor] Add aoti_torch_item as a util function (pytorch#126352)

    Summary: The logic has been repeated several times in the code, so it's worth writing a common util function.
    
    Pull Request resolved: pytorch#126352
    Approved by: https://github.com/chenyang78
    ghstack dependencies: pytorch#126181, pytorch#126182, pytorch#126183
    desertfire authored and ZelboK committed May 19, 2024
  123. Commit 667af78 (no commit message)
  124. [BE][FSDP] Remove unnecessary warnings (pytorch#126365)

    As title
    
    Differential Revision: [D57419704](https://our.internmc.facebook.com/intern/diff/D57419704/)
    
    Pull Request resolved: pytorch#126365
    Approved by: https://github.com/awgu, https://github.com/Skylion007
    ghstack dependencies: pytorch#126362
    fegin authored and ZelboK committed May 19, 2024
  125. [onnx.export] Cache SetGraphInputTypeReliable (pytorch#124912)

    This PR is part of an effort to speed up torch.onnx.export (pytorch#121422).
    
    - For each node that is processed in onnx.export, a check is run to see if all inputs are "reliable" (static shape, etc.). This value does not change, so it is much faster to cache it on the first computation. The caching is added to the ConstantMap state.
    - Resolves (6) in pytorch#121422.
    - Also see pytorch#123028 with a similar addition of a cache state.
    
    (partial fix of pytorch#121545)
    
    Pull Request resolved: pytorch#124912
    Approved by: https://github.com/justinchuby
    gustavla authored and ZelboK committed May 19, 2024
  126. Remove redundant serialization code (pytorch#126249)

    After pytorch#123308, we no longer need separate serialization path to handle different types that exist in the `nn_module` metadata. This PR cleans up the redundant code.
    Pull Request resolved: pytorch#126249
    Approved by: https://github.com/angelayi
    jiashenC authored and ZelboK committed May 19, 2024
  127. Commit b24a9e3 (no commit message)
  128. xpu: implement xpu serialization (pytorch#125530)

    Fixes: pytorch#125529
    
    BC-breaking note:
    The deprecated `async` argument to `Storage.cuda` and `Storage.hpu` has been removed. Use `non_blocking` instead.
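    As a minimal migration sketch (assuming a CPU storage; whether a CUDA device is actually available is incidental to the API change):
    
    ```
    import torch
    
    s = torch.randn(4).untyped_storage()
    # The removed keyword was spelled `async`; use non_blocking instead:
    s_cuda = s.cuda(non_blocking=True)
    ```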
    
    CC: @jbschlosser, @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @albanD
    
    Pull Request resolved: pytorch#125530
    Approved by: https://github.com/guangyey, https://github.com/albanD
    dvrogozh authored and ZelboK committed May 19, 2024
  129. Don't install inplace_methods on MockHandler, not needed (pytorch#126398)
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126398
    Approved by: https://github.com/jansel, https://github.com/peterbell10
    ezyang authored and ZelboK committed May 19, 2024
  130. Make 'pytest test/inductor/test_memory_planning.py' work (pytorch#126397)
    
    There's still another naughty direct test_* import, I'm out of patience
    right now though.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126397
    Approved by: https://github.com/peterbell10, https://github.com/int3
    ezyang authored and ZelboK committed May 19, 2024
  131. Switched from parameter in can_cast to from_. (pytorch#126030)

    Fixes pytorch#126012.
    
    `from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs.
    
    If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
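    A small usage sketch of the renamed keyword (the dtype pairs are just examples):
    
    ```
    import torch
    
    # Positional form, unchanged:
    torch.can_cast(torch.int64, torch.float32)            # True
    # Keyword form is now spelled `from_`, since `from` is reserved in Python:
    torch.can_cast(from_=torch.float64, to=torch.int64)   # False under the default casting rule
    ```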
    
    Pull Request resolved: pytorch#126030
    Approved by: https://github.com/albanD
    tringwald authored and ZelboK committed May 19, 2024
  132. [Traceable FSDP2] Use DTensor.from_local() in _from_local_no_grad when compile (pytorch#126346)
    
    As discussed before, for now Dynamo is not able to support the DTensor constructor, so instead we have to use `DTensor.from_local()`.
    
    This won't affect eager and it's a compile-only change.
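    For reference, a hedged sketch of the `DTensor.from_local()` call being routed to (the mesh setup and import paths are illustrative assumptions):
    
    ```
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed._tensor import DTensor, Shard
    
    mesh = init_device_mesh("cuda", (8,))        # assumes an 8-rank job
    local_shard = torch.randn(16, 32)            # this rank's local shard
    dt = DTensor.from_local(local_shard, mesh, [Shard(0)])
    ```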
    
    Pull Request resolved: pytorch#126346
    Approved by: https://github.com/awgu
    yf225 authored and ZelboK committed May 19, 2024
  133. Fix strict default value in StateDictOptions (pytorch#125998)

    Fixes pytorch#125992
    
    The default value of the parameter `strict` should be `True`.
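    A short sketch of where the default matters (`model` and `loaded_state_dict` are placeholders):
    
    ```
    from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict
    
    # strict now correctly defaults to True; pass strict=False to tolerate mismatched keys.
    set_model_state_dict(
        model,
        model_state_dict=loaded_state_dict,
        options=StateDictOptions(strict=False),
    )
    ```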
    
    Pull Request resolved: pytorch#125998
    Approved by: https://github.com/fegin
    shink authored and ZelboK committed May 19, 2024
  134. Print export warning only once in capture_pre_autograd (pytorch#126403)

    Summary: Missed this in D57163341
    
    Test Plan: CI
    
    Differential Revision: D57442088
    
    Pull Request resolved: pytorch#126403
    Approved by: https://github.com/zhxchen17
    tarun292 authored and ZelboK committed May 19, 2024
  135. [compiled autograd] Fix LoggingTensor flaky test (pytorch#126144)

    LoggingTensor fails consistently when the root logger level is INFO or lower.
    By default, the root logger should be at WARNING.
    But triton driver initialization will overwrite the root logger to INFO, which causes flakiness: pytorch#126143
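    A minimal sketch of the interaction, using only the standard-library logging module:
    
    ```
    import logging
    
    root = logging.getLogger()
    # The expected default is WARNING; if something (e.g. the triton driver init) bumps it
    # to INFO, LoggingTensor-based tests capture extra records and become flaky.
    root.setLevel(logging.WARNING)   # restore the expected default
    ```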
    
    Pull Request resolved: pytorch#126144
    Approved by: https://github.com/jansel
    xmfan authored and ZelboK committed May 19, 2024
  136. [inductor] Clear cache on ctx manager exit (pytorch#126146)

    FIXES pytorch#126128.
    
    Right now, we only clear the cache on ctx manager enter, so the state is stale unless we call fresh_inductor_cache again; this is usually fine in tests.
    
    Cue the compiled autograd tests when going from TestCompiledAutograd -> TestAutogradWithCompiledAutograd:
    TestCompiledAutograd uses the ctx manager, but TestAutogradWithCompiledAutograd doesn't.
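    For context, a hedged usage sketch of the context manager in question (the `torch._inductor.utils` import path is its current location and may move between releases):
    
    ```
    import torch
    from torch._inductor.utils import fresh_inductor_cache
    
    def fn(x):
        return (x * 2).sin()
    
    with fresh_inductor_cache():
        torch.compile(fn)(torch.randn(8))
    # With this PR, the temporary cache state is also cleared here on exit, so later
    # tests that don't use the context manager start from a clean cache.
    ```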
    
    Pull Request resolved: pytorch#126146
    Approved by: https://github.com/jgong5, https://github.com/oulgen
    ghstack dependencies: pytorch#126144
    xmfan authored and ZelboK committed May 19, 2024
  137. [compiled autograd] clear compiled_autograd_verbose once test is done (pytorch#126148)
    
    The verbose flag leaks into tests run afterwards.
    
    Pull Request resolved: pytorch#126148
    Approved by: https://github.com/jansel
    ghstack dependencies: pytorch#126144, pytorch#126146
    xmfan authored and ZelboK committed May 19, 2024
  138. Commit 19e7924 (no commit message)
  139. Eliminate some C++11 checks (pytorch#126308)

    Test Plan: Sandcastle
    
    Reviewed By: palmje
    
    Differential Revision: D57246912
    
    Pull Request resolved: pytorch#126308
    Approved by: https://github.com/Skylion007
    r-barnes authored and ZelboK committed May 19, 2024
  140. Add prefix option to CapabilityBasedPartitioner (pytorch#126382)

    Summary: Add prefix arg so that users can provide the submodule name to partitioner.
    
    Test Plan: https://fburl.com/anp/2kue4qp9
    
    Differential Revision: D57416926
    
    Pull Request resolved: pytorch#126382
    Approved by: https://github.com/SherlockNoMad
    hongyang-zhao authored and ZelboK committed May 19, 2024
  141. Import MKL via //third-party/mkl targets (pytorch#126371)

    Summary:
    This is a step towards upgrading the MKL library and using a buckified targets rather than importing from TP2.
    
    - Add new `//third-party/mkl:mkl_xxx` targets that are currently aliases to `third-party//IntelComposerXE:mkl_xxx`.
    - Switch usage of `external_deps = [("IntelComposerXE", None, "mkl_xxx")]` to `deps = ["fbsource//third-party/mkl:mkl_xxx"]`
    
    Note that this only changes `mkl_xxx` references in `IntelComposerXE` but not references to "svml" or "ipp*".
    
    Test Plan: sandcastle
    
    Differential Revision: D57360438
    
    Pull Request resolved: pytorch#126371
    Approved by: https://github.com/bertmaher
    MatzeB authored and ZelboK committed May 19, 2024
  142. [c10d] add pg_name and pg_desc to logger (pytorch#126409)

    Summary:
    This should further improve our debuggability
    
    Tags:
    
    Pull Request resolved: pytorch#126409
    Approved by: https://github.com/XilunWu
    shuqiangzhang authored and ZelboK committed May 19, 2024
  143. Use object identity for deepcopy memo (pytorch#126126)

    Copy of pytorch#126089, with some additional fixes & tests
    
    Partial fix for pytorch#125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation.
    
    The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes:
    * (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
    * (still kind of wrong): views won't actually alias each other after deepcopying.
    * (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias.
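    A small illustration of the identity-based behavior described above (shapes and strides are arbitrary):
    
    ```
    import copy
    import torch
    
    a = torch.arange(16.0)
    b = a.as_strided((4, 4), (4, 1))   # a view into `a`, but a distinct tensor object
    
    a2, b2 = copy.deepcopy([a, b])
    # Per-identity copying: b2 is no longer silently made equal to a2, though b2 also
    # no longer aliases a2's storage after the copy.
    ```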
    
    BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.
    
    Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306)
    Pull Request resolved: pytorch#126126
    Approved by: https://github.com/ezyang
    davidberard98 authored and ZelboK committed May 19, 2024
  144. Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (pytorch#126068)"
    
    This reverts commit 927e631.
    
    Reverted pytorch#126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR pytorch#124021 is going to be revert ([comment](pytorch#126019 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  145. Revert "[inductor][cpp] epilogue support for gemm template (pytorch#126019)"
    
    This reverts commit 7844c20.
    
    Reverted pytorch#126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR pytorch#124021 is going to be revert ([comment](pytorch#126019 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  146. Revert "[inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)"

    This reverts commit f060b0c.
    
    Reverted pytorch#124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](pytorch#124021 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  147. Add Lowering for FlexAttention Backwards (pytorch#125515)

    # Summary
    #### What does this PR do?
    It enables Inductor to actually generate the fused flex attention kernel for the backwards
    
    I did some other things along the way:
    - Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
    - The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
    - I didn't correctly register the decomp table + IndexMode when I landed pytorch#123902; this remedies that.
    - The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specifically for "future causal" attention.
    - The main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
    - I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
    - I updated the benchmark to also profile bwds performance
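    As a hedged usage sketch (not taken from this PR; the public import path shown below may differ at this commit), a score_mod receives (score, batch, head, q_idx, kv_idx) and returns a modified score, and backward through the compiled op exercises the fused kernel added here:
    
    ```
    import torch
    from torch.nn.attention.flex_attention import flex_attention  # path is an assumption
    
    def rel_bias(score, b, h, q_idx, kv_idx):
        return score + (kv_idx - q_idx)          # simple relative-position bias
    
    q, k, v = (torch.randn(2, 16, 512, 64, device="cuda", requires_grad=True) for _ in range(3))
    out = torch.compile(flex_attention)(q, k, v, score_mod=rel_bias)
    out.sum().backward()
    ```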
    
    ### Benchmark Numbers:
    _The current implementation is not parallelizing over ctx length in the bwd_
    FWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.991 |                    |             |                |
    | Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
    | Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |
    
    BWD Speedups
    
    | Type    |   Speedup | shape              | score_mod   | dtype          |
    |---------|-----------|--------------------|-------------|----------------|
    | Average |     0.291 |                    |             |                |
    | Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
    | Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |
    
    <details>
    
    <summary>Full Data</summary>
    
    | shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
    |---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
    | (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
    | (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
    | (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
    | (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
    | (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
    | (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
    | (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
    | (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
    | (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
    | (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
    | (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
    | (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
    | (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
    | (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
    | (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
    | (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
    | (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
    | (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
    | (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
    | (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
    | (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
    | (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
    | (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
    | (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
    | (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
    | (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
    | (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
    | (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
    | (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
    | (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
    | (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
    | (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
    | (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
    | (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
    | (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
    | (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
    | (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
    | (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
    | (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
    | (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
    | (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
    | (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
    | (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
    | (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
    | (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
    | (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
    | (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
    | (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
    | (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
    | (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
    | (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
    | (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
    | (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
    | (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
    | (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
    | (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
    | (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
    | (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
    | (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
    | (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
    | (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
    | (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
    | (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
    | (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
    | (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
    | (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
    | (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
    | (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
    | (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
    | (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
    | (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
    | (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
    | (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
    | (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
    | (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
    | (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
    | (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
    | (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
    | (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
    | (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
    | (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
    | (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
    | (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
    | (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
    | (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
    | (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
    | (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
    | (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
    | (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
    | (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
    | (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
    | (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
    | (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
    | (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
    | (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
    | (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
    | (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
    | (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
    | (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
    | (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
    | (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
    | (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
    | (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
    | (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
    | (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
    | (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
    | (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
    | (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |
    
    </details>
    
    Pull Request resolved: pytorch#125515
    Approved by: https://github.com/Chillee
    drisspg authored and ZelboK committed May 19, 2024
  148. Fix documentation for register_fake_class (pytorch#126422)

    Pull Request resolved: pytorch#126422
    Approved by: https://github.com/angelayi
    ydwu4 authored and ZelboK committed May 19, 2024
  149. [export] Delete predispatch tests (pytorch#126459)

    Deleting predispatch tests as we moved export to predispatch already
    Pull Request resolved: pytorch#126459
    Approved by: https://github.com/tugsbayasgalan
    angelayi authored and ZelboK committed May 19, 2024
  150. [DeviceMesh] Supported N groups in from_group (pytorch#126258)

    **Overview**
    This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).
    
    This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.
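    A hedged sketch of the new call for the HSDP case (`replicate_pg`, `shard_pg`, and the sizes are placeholders; argument names follow the description above):
    
    ```
    import torch
    from torch.distributed.device_mesh import DeviceMesh
    
    # replicate_size * shard_size == world size; one ProcessGroup per mesh dimension.
    hsdp_mesh = DeviceMesh.from_group(
        [replicate_pg, shard_pg],
        device_type="cuda",
        mesh=torch.arange(replicate_size * shard_size).reshape(replicate_size, shard_size),
        mesh_dim_names=("replicate", "shard"),
    )
    ```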
    
    <details>
    <summary> Old Approach </summary>
    
    **Overview**
    - This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
        - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
    - This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.
    
    </details>
    
    Pull Request resolved: pytorch#126258
    Approved by: https://github.com/wanchaol
    awgu authored and ZelboK committed May 19, 2024
  151. [easy] Fix typing for map_location docs in torch.load (pytorch#125473)

    Currently it incorrectly lists `Callable[[Tensor, str], Tensor]` as a possible type signature; this should be `Callable[[Storage, str], Storage]`.
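    A quick sketch of the corrected signature in use (the file path is a placeholder):
    
    ```
    import torch
    
    # map_location receives (Storage, original_location_str) and returns a Storage:
    obj = torch.load("checkpoint.pt", map_location=lambda storage, loc: storage)  # keep on CPU
    ```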
    
    <img width="716" alt="Screenshot 2024-05-03 at 12 09 54 PM" src="https://github.com/pytorch/pytorch/assets/35276741/b8946f95-8297-445f-a9d9-570b8a3caab1">
    
    Pull Request resolved: pytorch#125473
    Approved by: https://github.com/albanD
    mikaylagawarecki authored and ZelboK committed May 19, 2024
  152. [doc] expose torch.Tensor.xpu API to doc (pytorch#126383)

    # Motivation
    The docstring for `torch.Tensor.xpu` was added [here](https://github.com/pytorch/pytorch/blob/d61a81a9e76688ac8f338a6cfba932bf7779e5ce/torch/_tensor_docs.py#L1434) but is not exposed in the public docs, unlike [torch.Tensor.cuda](https://pytorch.org/docs/stable/generated/torch.Tensor.cuda.html#torch.Tensor.cuda). This PR exposes the documentation for `torch.Tensor.xpu` in the public docs.
    
    Pull Request resolved: pytorch#126383
    Approved by: https://github.com/albanD
    guangyey authored and ZelboK committed May 19, 2024
  153. Add symbolic_shape_specialization structured trace (pytorch#126450)

    This is typically the information you want when diagnosing why something
    overspecialized in dynamic shapes.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    Pull Request resolved: pytorch#126450
    Approved by: https://github.com/albanD
    ezyang authored and ZelboK committed May 19, 2024
  154. Make inductor scheduler graph extension configurable (pytorch#125578)

    This patch makes the inductor scheduler graph extension configurable.
    It enables ease of debugging by changing the graph format (dot, png, etc.).
    
    Particularly, it's very convenient to work with the graph interactively using tools like https://github.com/tintinweb/vscode-interactive-graphviz
    
    Pull Request resolved: pytorch#125578
    Approved by: https://github.com/Chillee
    AlexDenisov authored and ZelboK committed May 19, 2024
  155. [FSDP2][Test] Fix _test_clip_grad_norm (pytorch#126457)

    Fixes #ISSUE_NUMBER
    We need to compare ref_total_norm to total_norm.full_tensor().
    Example:
    ```
    iter_idx:0, rank:0,\
    ref_total_norm=tensor(1052.5934, device='cuda:0'),\
    total_norm=DTensor(local_tensor=482.0861511230469, device_mesh=DeviceMesh([0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2.0),)),\
    total_norm.full_tensor()=tensor(1052.5934, device='cuda:0')
    ```
    
    Pull Request resolved: pytorch#126457
    Approved by: https://github.com/awgu
    wz337 authored and ZelboK committed May 19, 2024
  156. dont pad 0 dim mm inputs (pytorch#126475)

    Otherwise you get an error in constant_pad_nd.
    
    Pull Request resolved: pytorch#126475
    Approved by: https://github.com/huydhn
    ghstack dependencies: pytorch#125772, pytorch#125773, pytorch#125780
    eellison authored and ZelboK committed May 19, 2024
  157. c10d: add Collectives abstraction (pytorch#125978)

    This adds a new `Collectives` API for doing distributed collectives operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives.
    
    Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit
    
    The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow up PR.
    
    Test plan:
    
    ```
    python test/distributed/test_collectives.py -v
    ```
    
    This tests both functionality using multiple threads as well as timeout behavior.
    
    Pull Request resolved: pytorch#125978
    Approved by: https://github.com/shuqiangzhang
    d4l3k authored and ZelboK committed May 19, 2024
  158. Add dist_pp shortcut to TORCH_LOGS (pytorch#126322)

    The distributed log category already includes pipelining, since it's under the torch.distributed umbrella.
    
    So both TORCH_LOGS=distributed and TORCH_LOGS=dist_pp will enable PP
    logs.
    Pull Request resolved: pytorch#126322
    Approved by: https://github.com/kwen2501
    wconstab authored and ZelboK committed May 19, 2024
  159. [dtensor] refactor view ops to use OpStrategy (pytorch#126011)

    As titled. Some ops require adjustment of output shape argument. In rule-based sharding prop, global output shape was inferred in the rule (in `view_ops.py`). In strategy-based sharding prop, it is now obtained from propagated out_tensor_meta (in `sharding_prop.py`).
    
    Pull Request resolved: pytorch#126011
    Approved by: https://github.com/wanchaol, https://github.com/XilunWu
    tianyu-l authored and ZelboK committed May 19, 2024
  160. [XPU] call empty_cache for dynamo tests (pytorch#126377)

    When running a batch of models, the lack of an `empty_cache()` call would result in OOM for subsequent models.
    
    This PR unifies the `empty_cache` call for both CUDA and XPU.
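    A hedged sketch of the device-agnostic pattern this refers to (the helper name is illustrative, not the benchmark harness's actual function):
    
    ```
    import torch
    
    def empty_gpu_cache(device: str) -> None:
        # Release cached allocator blocks between models so later models don't OOM.
        if device == "cuda":
            torch.cuda.empty_cache()
        elif device == "xpu":
            torch.xpu.empty_cache()
    ```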
    
    Pull Request resolved: pytorch#126377
    Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
    Stonepia authored and ZelboK committed May 19, 2024
  161. Refactor partitioner and clean it up (pytorch#126318)

    Pull Request resolved: pytorch#126318
    Approved by: https://github.com/anijain2305
    Chillee authored and ZelboK committed May 19, 2024
  162. [DTensor] Turn on foreach implementation for clip_grad_norm_ for DTensor by default (pytorch#126423)
    
    Fixes #ISSUE_NUMBER
    
    Pull Request resolved: pytorch#126423
    Approved by: https://github.com/awgu
    wz337 authored and ZelboK committed May 19, 2024
  163. Fix cummax and cummin lowering for empty case (pytorch#126461)

    Pull Request resolved: pytorch#126461
    Approved by: https://github.com/peterbell10
    isuruf authored and ZelboK committed May 19, 2024
  164. [Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (pytorch#122593)
    
    **Description**
    Lower the qlinear binary post op pattern to Inductor. Use post op sum (in-place) if the extra input has the same dtype as output. Otherwise, it uses binary add.
    
    **Supported linear-binary(-unary) patterns**
    ```
        linear(X)   extra input
               \   /
                Add
                 |
            Optional(relu)
                 |
                 Y
    
    1. int8-mixed-fp32
    +---+---------------+-----------+------------------------------+---------+
    | # | Add type      | Quant out | Pattern                      | Post op |
    +---+---------------+-----------+------------------------------+---------+
    | 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
    +---+---------------+-----------+------------------------------+---------+
    | 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
    +---+---------------+-----------+------------------------------+---------+
    
    2. int8-mixed-bf16
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | # | X2 dtype | Add type      | Quant out | Pattern                                          | Post op |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 1 | BF16     | In-/out-place | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 2 | BF16     | In-/out-place | No        | linear + bf16 -> (relu)                          | sum     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 3 | FP32     | Out-place     | Yes       | linear + fp32 -> (relu) -> q                     | add     |
    |   |          | In-place right|           |                                                  |         |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 4 | FP32     | Out-place     | No        | linear + fp32 -> (relu)                          | sum     |
    |   |          | In-place right|           |                                                  |         |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 5 | FP32     | In-place left | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    | 6 | FP32     | In-place left | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
    +---+----------+---------------+-----------+--------------------------------------------------+---------+
    ```
    Note:
    (1) The positions of linear and the extra input can be swapped.
    (2) We don't insert q-dq before the extra input of linear-add by recipe. But if q-dq is found at the
    extra input, we don't match that pattern, because we cannot match all these patterns in 3 passes.
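    For orientation, a minimal eager-mode module that produces the linear-add(-relu) pattern above (the quantization recipe itself is applied separately by the X86Inductor quantizer):
    
    ```
    import torch
    
    class LinearAddRelu(torch.nn.Module):
        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            self.linear = torch.nn.Linear(in_features, out_features)
    
        def forward(self, x, extra):
            # linear(X) + extra input, followed by the optional unary post op (relu)
            return torch.relu(self.linear(x) + extra)
    ```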
    
    **Test plan**
    python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
    python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add
    
    Pull Request resolved: pytorch#122593
    Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
    Xia-Weiwen authored and ZelboK committed May 19, 2024
  165. variable search spaces for gemm autotuning (pytorch#126220)

    Add a switch to change the gemm autotuning search space between the default (the current set of hardcoded configs) and an exhaustive search space that enumerates all block sizes in [16, 32, 64, 128, 256], stages in [1, 2, 3, 4, 5], and warps in [2, 4, 6].
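    A hedged sketch of flipping that switch; the config name and value below are assumptions for illustration, not quoted from this PR:
    
    ```
    import torch
    import torch._inductor.config as inductor_config
    
    inductor_config.max_autotune_gemm_search_space = "EXHAUSTIVE"   # assumed name/value
    
    @torch.compile(mode="max-autotune")
    def matmul(a, b):
        return a @ b
    ```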
    
    Pull Request resolved: pytorch#126220
    Approved by: https://github.com/eellison
    nmacchioni authored and ZelboK committed May 19, 2024
  166. save the reciprocal of weights for welford_reduce (pytorch#125148)

    Save the reciprocal of the weights for welford_reduce to avoid redundant divisions and improve performance; `weight_recps` will be inserted into the generated vectorized kernel.
    
    Generated code:
    
    - Before:
    
    ```
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
    }
    ```
    
    - After:
    
    ```
    static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
    }
    ```
    
    Performance:
    
    - Single core:
    
    Op | shape | eager/ms | inductor/ms | optimized inductor/ms
    -- | -- | -- | -- | --
    layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208
    var | (56, 384, 1024) | 21.752 | 13.258 | 13.102
    
    - 4 cores:
    
    Op | shape | eager/ms | inductor/ms | optimized inductor/ms
    -- | -- | -- | -- | --
    layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223
    var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163
    
    Pull Request resolved: pytorch#125148
    Approved by: https://github.com/jgong5, https://github.com/peterbell10
    CaoE authored and ZelboK committed May 19, 2024
  167. [Submodule] Remove zstd dependency (pytorch#126485)

    After searching in the codebase, it seems that zstd is not in use now.
    
    Pull Request resolved: pytorch#126485
    Approved by: https://github.com/ezyang
    cyyever authored and ZelboK committed May 19, 2024
  168. Update ops handler documentation some more (pytorch#126480)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126480
    Approved by: https://github.com/peterbell10
    ghstack dependencies: pytorch#126292, pytorch#126299
    ezyang authored and ZelboK committed May 19, 2024
  169. [FSDP2] Fixed 2D clip grad norm test (pytorch#126497)

    This fixes pytorch#126484.
    
    We change from a transformer to an MLP stack since the transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise the `(S(0), R)` placement.
    
    Pull Request resolved: pytorch#126497
    Approved by: https://github.com/weifengpy, https://github.com/wz337
    awgu authored and ZelboK committed May 19, 2024
  170. Default to env variable instead of config value for precompile parallelism (pytorch#126333)
    
    Previously, we would default to the config `compile_threads`, which controls the number of forks we use for async compile. It defaults to 1 in fbcode because fork() has known safety issues. In precompilation, we are using threads, which have no such safety issues and should strictly improve compile time. There isn't really any reason to reduce the thread count except for testing, and it doesn't make sense to share the same value used for determining forks.
    
    This change makes it default to using as many threads as needed unless the env variable is set.
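    
    A minimal sketch of the selection logic described above (the environment variable name and helper are illustrative, not the actual config names):
    
    ```
    import os
    
    def precompile_worker_count() -> int:
        # Hypothetical env override for precompile parallelism; otherwise use
        # as many threads as there are CPUs, independent of compile_threads.
        env = os.environ.get("TORCHINDUCTOR_PRECOMPILE_THREADS")
        if env is not None:
            return int(env)
        return os.cpu_count() or 1
    ```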
    
    Differential Revision: [D57473023](https://our.internmc.facebook.com/intern/diff/D57473023)
    Pull Request resolved: pytorch#126333
    Approved by: https://github.com/nmacchioni
    eellison authored and ZelboK committed May 19, 2024
  171. Delete refactored function, move changes over (pytorch#126407)

    Oops, in pytorch#125610 I moved this function to runtime_wrappers.py, but forgot to delete the old one. pytorch#126234 then modified the old copy, which had no effect, so I'm applying the change correctly now and deleting the function as I intended.
    
    Pull Request resolved: pytorch#126407
    Approved by: https://github.com/eellison
    jamesjwu authored and ZelboK committed May 19, 2024
  172. [optim] Fix: wrong ASGD implementation (pytorch#126375)

    This PR is based on pytorch#125440, additionally merging the latest main branch and fixing the lint failures from pytorch#126361.
    
    Pull Request resolved: pytorch#126375
    Approved by: https://github.com/janeyx99
    david20571015 authored and ZelboK committed May 19, 2024
  173. 0be8b0f
  174. Remove removed ruff rule TRY200 (pytorch#126256)

    My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema.
    
    From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/
    
    > This rule has been removed and its documentation is only available for historical reasons.
    >
    > This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.
    
    and we are currently explicitly ignoring B904.
    
    Pull Request resolved: pytorch#126256
    Approved by: https://github.com/Skylion007
    ringohoffman authored and ZelboK committed May 19, 2024
  175. [Perf] Vectorize more dtype for int4mm (pytorch#126512)

    It used to be vectorized only for f16, but there is no reason not to do the same for bf16 or f32.
    
    Spiritual followup of pytorch#125290
    
    Pull Request resolved: pytorch#126512
    Approved by: https://github.com/Skylion007
    malfet authored and ZelboK committed May 19, 2024
  176. [inductor] fix unbacked case in pointwise + reduction vertical fusion (

    …pytorch#125982)
    
    ```
    $ INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 python test/inductor/test_unbacked_symints.py -k test_vertical_pointwise_reduction_fusion
    
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1953, in fuse_nodes_once
        for node1, node2 in self.get_possible_fusions():
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2010, in get_possible_fusions
        check_all_pairs(node_grouping)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1997, in check_all_pairs
        if self.can_fuse(node1, node2):
      File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2252, in can_fuse
        return self.get_backend(device).can_fuse_vertical(node1, node2)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 39, in can_fuse_vertical
        return self._triton_scheduling.can_fuse_vertical(node1, node2)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3237, in can_fuse
        if not all(
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3238, in <genexpr>
        TritonKernel.is_compatible((numel2, rnumel2), n.get_ranges())
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1543, in is_compatible
        cls._split_iteration_ranges(groups, lengths)
      File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1507, in _split_iteration_ranges
        while current_group < len(remaining) and sv.size_hint(remaining[current_group]) == 1:
      File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 442, in size_hint
        return int(out)
      File "/home/colinpeppler/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/core/expr.py", line 320, in __int__
        raise TypeError("Cannot convert symbols to int")
    torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
    TypeError: Cannot convert symbols to int
    ```
    
    Where the unbacked symints show up:
    ```
    > /data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py(1506)_split_iteration_ranges()
    (Pdb) print(groups)
    (1, 512*u0)
    (Pdb) print(lengths)
    ([u0, 32, 16], [])
    ```
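    
    For illustration, a small standalone reproduction of the failure mode in the traceback (`u0` stands in for the unbacked symint; this is not the scheduler code itself):
    
    ```
    import sympy
    
    # size_hint() eventually calls int() on an expression that still contains
    # an unbacked symbol, which sympy refuses to convert.
    u0 = sympy.Symbol("u0", integer=True, positive=True)
    expr = 512 * u0
    try:
        int(expr)
    except TypeError as e:
        print(e)  # Cannot convert symbols to int
    ```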
    
    Pull Request resolved: pytorch#125982
    Approved by: https://github.com/jansel
    ColinPeppler authored and ZelboK committed May 19, 2024
  177. Workflow for uploading additional test stats on workflow dispatch (py…

    …torch#126080)
    
    This is kind of an experiment for uploading test stats during the run, and also for the test dashboard so it can recalculate the info.
    
    Adds a workflow that is callable via workflow dispatch for uploading additional test stats.
    Adds a script that only calculates the additional info.
    
    Pull Request resolved: pytorch#126080
    Approved by: https://github.com/ZainRizvi
    clee2000 authored and ZelboK committed May 19, 2024
  178. Allow tensor subclasses and add `torch.serialization.add_safe_globals…

    …` that allows users to allowlist classes for `weights_only` load (pytorch#124331)
    
    #### Conditions for allowlisting tensor subclasses
    We allow tensor subclass types that
    (1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
    (2) Use the generic `tp_alloc`
    (3) Are in a module that *has been imported by the user*
    to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict
    
    The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`
    
    *Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution.
    
    The rationale for the 3 conditions above is as follows:
    
    The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)
    
    https://github.com/pytorch/pytorch/blob/4e66aaa01092ddc8822bbca315b673329c76f4cd/torch/_tensor.py#L57-L71
    
    `as_subclass` is implemented with a call to `THPVariable_NewWithVar`
    
    that will eventually call `tp_alloc` here
    https://github.com/pytorch/pytorch/blob/4e66aaa01092ddc8822bbca315b673329c76f4cd/torch/csrc/autograd/python_variable.cpp#L2053
    
    The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`
    
    **Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**
    
    ### How do we check something is a tensor subclass/constraints around imports
    
    In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`).
    
    This PR also allowlisted  `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)
    
    ### API for allow listing
    This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enable users to allowlist globals they have deemed safe and manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe).
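    
    A short usage sketch of the allowlisting API, assuming a user-defined subclass `MyTensor` (class and file names are illustrative):
    
    ```
    import torch
    
    class MyTensor(torch.Tensor):
        pass
    
    # Allowlist the subclass so weights_only=True loading can rebuild it.
    torch.serialization.add_safe_globals([MyTensor])
    
    t = torch.randn(2).as_subclass(MyTensor)
    torch.save(t, "checkpoint.pt")
    obj = torch.load("checkpoint.pt", weights_only=True)
    ```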
    
    Next steps:
    - Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)
    
    Pull Request resolved: pytorch#124331
    Approved by: https://github.com/albanD
    mikaylagawarecki authored and ZelboK committed May 19, 2024
  179. 39f5adb
  180. [quant][pt2e] Allow multi users without output observers (pytorch#126487

    )
    
    Summary: The PT2E quantization flow does not support unquantized
    outputs yet. To work around this, users may wish to remove the
    output observer from their graphs. However, this fails currently
    in some cases because the `PortNodeMetaForQDQ` pass is too
    restrictive, for example:
    
    ```
    conv -> obs -------> output0
             \\-> add -> output1
    ```
    
    Previously we expected conv to always have exactly 1 user,
    which is the observer. When the observer is removed, however,
    conv now has 2 users, and this fails the check.
    
    ```
    conv -------> output0
      \\-> add -> output1
    ```
    
    This commit relaxes the error into a warning to enable
    this workaround.
    
    Test Plan:
    python test/test_quantization.py TestQuantizePT2E.test_multi_users_without_output_observer
    
    Reviewers: jerryzh168
    
    Subscribers: jerryzh168, supriyar
    
    Differential Revision: [D57472601](https://our.internmc.facebook.com/intern/diff/D57472601)
    Pull Request resolved: pytorch#126487
    Approved by: https://github.com/tarun292
    andrewor14 authored and ZelboK committed May 19, 2024
  181. Add coms metadata to execution trace (ET) (pytorch#126317)

    Add Execution Trace communication collective metadata.
    For the specification, see pytorch#124674
    
    New fields look like
    ```
        {
          "id": 80, "name": "record_param_comms", "ctrl_deps": 79,
          "inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]},                             "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]},
          "attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""},
      {"name": "collective_name", "type": "string", "value": "allreduce"},
      {"name": "dtype", "type": "string", "value": "Float"},
      {"name": "in_msg_nelems", "type": "uint64", "value": 100},
      {"name": "out_msg_nelems", "type": "uint64", "value": 100},
      {"name": "in_split_size", "type": "string", "value": "[]"},
      {"name": "out_split_size", "type": "string", "value": "[]"},
      {"name": "global_rank_start", "type": "uint64", "value": 0},
      {"name": "global_rank_stride", "type": "uint64", "value": 1},
      {"name": "pg_name", "type": "string", "value": "0"},
      {"name": "pg_desc", "type": "string", "value": "default_pg"},
      {"name": "pg_size", "type": "uint64", "value": 2}]
     }
    ```
    
    ## Unit Test
    Added a new unit test to check that the collected execution trace has the right attributes
    
    `touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace`
    
    ```
    STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    [rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
    indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
    indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
    [rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
    STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    [rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
    STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
    [rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
    STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    Execution trace saved at /tmp/tmpy01ngc3w.et.json
    Execution trace saved at /tmp/tmptf8543k4.et.json
    ok
    
    ----------------------------------------------------------------------
    ```
    
    Also ran the profiler unit test
    `touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler`
    
    ```
    STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    [rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
    STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
    Trace saved to /tmp/tmpdrw_cmcu.json
    Trace saved to /tmp/tmpnio7ec9j.json
    ok
    
    ----------------------------------------------------------------------
    Ran 1 test in 19.772s
    
    OK
    ```
    
    Pull Request resolved: pytorch#126317
    Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise
    briancoutinho authored and ZelboK committed May 19, 2024
  182. Revert "Remove redundant serialization code (pytorch#126249)"

    This reverts commit aab448e.
    
    Reverted pytorch#126249 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing sigmoid/frontend:serialization_test internally ([comment](pytorch#126249 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  183. Revert "Fix aarch64 debug build with GCC (pytorch#126290)"

    This reverts commit 91bf952.
    
    Reverted pytorch#126290 on behalf of https://github.com/huydhn due to There seems to be a mis-match closing curly bracket here and it breaks some internal build in D57474505 ([comment](pytorch#126290 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  184. Initial implementation of AdaRound (pytorch#126153)

    Summary:
    This is an implementation of AdaRound from a paper https://arxiv.org/abs/2004.10568
    
    This algorithm is going to be used by multiple people, hence we need make it official implementation.
    
    Differential Revision: D57227565
    
    Pull Request resolved: pytorch#126153
    Approved by: https://github.com/jerryzh168, https://github.com/huydhn
    kwanghoon-meta authored and ZelboK committed May 19, 2024
  185. [distributed] Add cpp-httplib to pytorch (pytorch#126470)

    Adds https://github.com/yhirose/cpp-httplib such that we are able to use https for host to host communication in distributed (specifically torchrun)
    
    Todo: We likely need to add cpp-httplib somewhere in the build (cmake/bazel) but first we should write the code for it.
    Pull Request resolved: pytorch#126470
    Approved by: https://github.com/d4l3k, https://github.com/Skylion007
    PaliC authored and ZelboK committed May 19, 2024
  186. [BE][Ez]: Use NotADirectoryError in tensorboard writer (pytorch#126534)

    Slightly improve exception typing for the tensorboard writer.
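    
    A minimal sketch of the kind of check this improves (illustrative only, not the writer's actual code):
    
    ```
    import os
    
    def _check_log_dir(path: str) -> None:
        # Raise the more specific NotADirectoryError when the log location
        # exists but is not a directory, instead of a generic error.
        if os.path.exists(path) and not os.path.isdir(path):
            raise NotADirectoryError(f"{path} exists but is not a directory")
    ```
    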
    Pull Request resolved: pytorch#126534
    Approved by: https://github.com/ezyang
    Skylion007 authored and ZelboK committed May 19, 2024
  187. Revert "[FSDP2] Fixed 2D clip grad norm test (pytorch#126497)"

    This reverts commit 3f28906.
    
    Reverted pytorch#126497 on behalf of https://github.com/jeanschmidt due to reverting to check if might have introduced inductor cuda 12 issues ([comment](pytorch#126497 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  188. [ROCm] enable faster_load_save for Fused_SGD (pytorch#125456)

    Reopened due to a rebase error. Fixes pytorch#117599
    
    The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, is passing with this PR
    
    HSA Async copy / host wait on completion signal is resolved in MultiTensorApply.cuh
    
    ```
    :4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
    :4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
    :3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
    ```
    
    Pull Request resolved: pytorch#125456
    Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
    petrex authored and ZelboK committed May 19, 2024
  189. Experimental prototype for converting torch.jit.trace modules to expo…

    …rt (pytorch#124449)
    
    Differential Revision: [D56440613](https://our.internmc.facebook.com/intern/diff/D56440613)
    
    We want to do this for the following reasons:
    1. There is a current limitation in export tracing for torch.jit.trace'd modules that cannot be easily upstreamed
    2. We need to run internal CI regularly to understand feature gaps and continuously track them
    3. Multiple people will be working on this prototype, so it is better to have a checked-in version so we don't always run into merge conflicts.
    
    Pull Request resolved: pytorch#124449
    Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
    tugsbayasgalan authored and ZelboK committed May 19, 2024
  190. a1245dd
  191. [AOTI] config target platform (pytorch#126306)

    Test Plan: AOTI compile stories15M for Android
    
    Differential Revision: D57392830
    
    Pull Request resolved: pytorch#126306
    Approved by: https://github.com/desertfire
    manuelcandales authored and ZelboK committed May 19, 2024
  192. Fix issue of lowering nn.linear ops with kwargs (pytorch#126331)

    Summary: Support kwarg bias for nn.linear quantization
    
    Differential Revision: D57403190
    
    Pull Request resolved: pytorch#126331
    Approved by: https://github.com/ZhengkaiZ, https://github.com/huydhn
    yihanhemeta authored and ZelboK committed May 19, 2024
  193. [inductor] Load python modules using importlib (pytorch#126454)

    The `compile` + `exec` workflow is susceptible to behavior drifting from
    a "normal" import; use importlib instead to avoid this.
    
    In particular, here annotations were being stored as strings due to
    `from __future__ import annotations` in the scope calling `compile`.
    Triton cares about annotations on global variables, and this makes it
    much easier to reliably code-gen them.
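    
    A minimal sketch of loading generated source through the import machinery rather than `compile` + `exec` (the module name, path, and contents are illustrative):
    
    ```
    import importlib.util, pathlib
    
    # Write a tiny "generated" module, then load it through the import
    # machinery so module-level semantics (e.g. how annotations are
    # evaluated) match a normal import rather than compile()/exec().
    src = pathlib.Path("generated_kernel.py")
    src.write_text("x: int = 1\n")
    
    spec = importlib.util.spec_from_file_location("generated_kernel", src)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    print(mod.__annotations__)  # {'x': <class 'int'>}
    ```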
    
    Pull Request resolved: pytorch#126454
    Approved by: https://github.com/peterbell10
    amjames authored and ZelboK committed May 19, 2024
  194. edbd215
  195. Added error checks for invalid inputs on thnn_conv2d (pytorch#121906)

    Fixes pytorch#121188
    Prevent Segmentation Fault in 'torch._C._nn.thnn_conv2d'
    
    Previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using `TORCH_CHECK`). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format.
    
    Additionally, this commit includes tests to cover the three referenced cases.
    
    Pull Request resolved: pytorch#121906
    Approved by: https://github.com/janeyx99
    Martim03 authored and ZelboK committed May 19, 2024
  196. Fix aarch64 debug build with GCC (pytorch#126290)

    By working around GCC's quirks in instantiating templates that require immediate values.
    Provides an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).
    
    Test plan (after change was reverted): ssh into aarch64 runner and rebuild given file with `-O0`
    
    Fixes pytorch#126283
    
    Pull Request resolved: pytorch#126290
    Approved by: https://github.com/atalman, https://github.com/seemethere
    malfet authored and ZelboK committed May 19, 2024
  197. Remove dist_ prefix from TORCH_LOGS shortcuts (pytorch#126499)

    e.g. dist_ddp -> ddp
    
    'distributed' shortcut remains unchanged
    
    Feedback has been that it is not appealing to have the dist_ prefix,
    and the main reason for it was to keep the distributed shortcuts grouped
    together in the help menu.  It's nice to have shorter shortcuts.
    Pull Request resolved: pytorch#126499
    Approved by: https://github.com/XilunWu, https://github.com/kwen2501
    ghstack dependencies: pytorch#126322
    wconstab authored and ZelboK committed May 19, 2024
  198. Tool for scouting exportability in one shot (pytorch#126471)

    Summary:
    Tool for scouting exportability issues in one shot.
    
    - Collect sample inputs for all submodules by running eager inference with a forward_pre_hook (see the sketch after this list).
    - Start from the root module and recursively try exporting child modules if the current module's export fails.
    
    Limitations:
    - Only works for an nn.Module that contains a tree-like submodule structure; this doesn't work for a flattened GraphModule.
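    
    A standalone sketch of the sample-input collection idea from the first bullet (toy model; not the tool's actual code):
    
    ```
    import torch
    import torch.nn as nn
    
    samples = {}
    
    def make_hook(name):
        def hook(module, args):
            # Record the positional inputs the first time each submodule runs.
            samples.setdefault(name, args)
        return hook
    
    model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
    handles = [m.register_forward_pre_hook(make_hook(n)) for n, m in model.named_modules()]
    model(torch.randn(2, 4))
    for h in handles:
        h.remove()
    ```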
    
    TODO: support dynamic_dims
    
    Sample output: https://docs.google.com/spreadsheets/d/1jnixrqBTYbWO_y6AaKA13XqOZmeB1MQAMuWL30dGoOg/edit?usp=sharing
    
    ```
    exportability_report =
            {
                '': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
                'submod_1': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
                'submod_2': None
            }
    ```
    
    Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestExportTools
    
    Differential Revision: D57466486
    
    Pull Request resolved: pytorch#126471
    Approved by: https://github.com/zhxchen17
    SherlockNoMad authored and ZelboK committed May 19, 2024
  199. [torch-distributed] Make log directory creation idempotent (pytorch#1…

    …26496)
    
    Summary:
    https://docs.python.org/3/library/os.html#os.makedirs
    > If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists.
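    
    A one-line sketch of the idempotent behavior this relies on (the path is illustrative):
    
    ```
    import os
    
    # Creating the log directory is now a no-op if it already exists,
    # instead of raising FileExistsError.
    os.makedirs("/tmp/torchelastic/logs", exist_ok=True)
    ```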
    
    Test Plan: Existing tests
    
    Differential Revision: D57471577
    
    Pull Request resolved: pytorch#126496
    Approved by: https://github.com/d4l3k
    ktsiam authored and ZelboK committed May 19, 2024
  200. [AOTI] Flag to include aoti sources when building lite interpreter (p…

    …ytorch#126572)
    
    Summary:
    Added a USE_LITE_AOTI cmake flag, which is turned OFF by default.
    When it is turned on, the AOTI sources (inductor_core_resources) are included when building the lite interpreter.
    
    Test Plan:
    ```
    ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON
    ```
    
    Differential Revision: D57394078
    
    Pull Request resolved: pytorch#126572
    Approved by: https://github.com/malfet
    manuelcandales authored and ZelboK committed May 19, 2024
  201. [Pipelining] Fix 1f1b schedule (pytorch#126419)

    This schedule was running fine locally but failing (hanging) on CI.
    
    After analysis (https://fburl.com/gdoc/xt80h1gd), it seems like the
    schedule was not correct previously but may still work depending on the
    runtime.
    
    The fix bundles together fwd-recv(s->s+1) and bwd-send(s+1->s) into one
    coalesced group so they would not block each other.
    
    Design drawing
    <img width="803" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/906a9a66-39ae-4a6a-bc1a-18b77eaaa784">
    
    Flight recorder traces show the same coalescing pattern as designed
    <img width="1013" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/ab10646e-eaef-4191-83dd-73f448876c27">
    
    Pull Request resolved: pytorch#126419
    Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
    wconstab authored and ZelboK committed May 19, 2024
  202. b6caa15
  203. gitmodules: switch cpp-httplib to https (pytorch#126580)

    Fixes issue introduced in pytorch#126470 (comment)
    
    Test plan:
    
    CI
    Pull Request resolved: pytorch#126580
    Approved by: https://github.com/PaliC, https://github.com/jeffdaily
    d4l3k authored and ZelboK committed May 19, 2024
  204. [pipelining] Follow improvements in export.unflatten (pytorch#126217)

    Previously, we made a copy of `torch.export.unflatten` in pippy/_unflatten.py.
    
    But it turns out to be too hard to track bug fixes and improvements in the upstream version. For example, `torch.export.unflatten` recently added support for tied parameters, which is something pipelining needs.
    
    Now that we have moved into pytorch, we reference `torch.export.unflatten` instead of maintaining a copy.
    
    Pull Request resolved: pytorch#126217
    Approved by: https://github.com/H-Huang
    kwen2501 authored and ZelboK committed May 19, 2024
  205. [Submodule] Remove third-party CUB (pytorch#126540)

    Because it was last updated 4 years ago, and all supported CUDA versions now provide CUB.
    
    Pull Request resolved: pytorch#126540
    Approved by: https://github.com/Skylion007
    cyyever authored and ZelboK committed May 19, 2024
  206. [halide-backend] Refactor codegen/triton.py into codegen/simd.py (pyt…

    …orch#126415)
    
    This PR is primarily just moving stuff around. It creates a new
    common base class for TritonCodegen and the (upcoming) HalideCodegen.
    
    Pull Request resolved: pytorch#126415
    Approved by: https://github.com/shunting314
    jansel authored and ZelboK committed May 19, 2024
  207. Faster(?) FP16 gemv kernel (pytorch#126297)

    Differential Revision: [D57369266](https://our.internmc.facebook.com/intern/diff/D57369266/)
    
    **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D57369266/)!
    Pull Request resolved: pytorch#126297
    Approved by: https://github.com/malfet
    swolchok authored and ZelboK committed May 19, 2024
  208. [2/N] Non-Tensor: Scalar Support: Add scalar to the cache for eager-t…

    …hrough-torch.compile (pytorch#124070)
    
    Add scalar information to the kernel configuration.
    
    #### Additional Context
    Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b` and then `alpha`. The same order is used for cache loading.
    
    However, the orchestration mechanism does not support kwargs because the order of kwargs is not meaningful. For example, the `out` of `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)` may come before `approximate`. We will support this in subsequent PRs.
    
    Pull Request resolved: pytorch#124070
    Approved by: https://github.com/jansel, https://github.com/jgong5
    EikanWang authored and ZelboK committed May 19, 2024
  209. Map float8 types to uint8 for allgather (pytorch#126556)

    # Summary
    Different take on this one:
    pytorch#126338
    
    We should probably not allow this mapping for 'compute' ops e.g. reductions
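    
    An illustrative sketch of the mapping idea (the helper name is made up; only byte-moving collectives are assumed):
    
    ```
    import torch
    
    def view_for_allgather(t: torch.Tensor) -> torch.Tensor:
        # float8 elements are single bytes, so reinterpreting them as uint8 is
        # safe for collectives that only move data (e.g. all_gather), but not
        # for reductions, which would do arithmetic on the wrong type.
        if t.dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
            return t.view(torch.uint8)
        return t
    ```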
    
    ### Corresponding fp8 PR
    pytorch-labs/float8_experimental#263
    
    Pull Request resolved: pytorch#126556
    Approved by: https://github.com/wanchaol
    drisspg authored and ZelboK committed May 19, 2024
  210. [Traceable FSDP2] Change from register_multi_grad_hook to per-tensor …

    …backward hook (pytorch#126350)
    
    As discussed with Andrew before, under compile we will register a per-tensor backward hook instead of a multi-grad hook, because it's difficult for Dynamo to support `register_multi_grad_hook` (or anything `.grad_fn` related). We expect both to have the same underlying behavior, ~~and we will add integration test (in subsequent PR) to show that compile and eager has same numerics.~~
    
    As discussed below, we will change the eager path to use per-tensor backward hooks as well.
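    
    A small standalone sketch contrasting the two hook styles (toy tensors, not the FSDP2 code):
    
    ```
    import torch
    
    x = torch.randn(3, requires_grad=True)
    y = torch.randn(3, requires_grad=True)
    
    # Per-tensor hooks: fire individually as each gradient is computed.
    x.register_hook(lambda grad: grad)
    y.register_hook(lambda grad: grad)
    
    # Multi-grad hook: fires once after all listed tensors have gradients.
    torch.autograd.graph.register_multi_grad_hook((x, y), lambda grads: None)
    
    (x.sum() + y.sum()).backward()
    ```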
    
    Pull Request resolved: pytorch#126350
    Approved by: https://github.com/awgu
    yf225 authored and ZelboK committed May 19, 2024
  211. b10f3dd
  212. Refactor variables / function names related to non-strict export (pyt…

    …orch#126458)
    
    Improve variable and function naming for better clarity: `non strict` --> `aten`.
    Pull Request resolved: pytorch#126458
    Approved by: https://github.com/angelayi
    jiashenC authored and ZelboK committed May 19, 2024
  213. Updated test_torch.py to use new OptimizerInfo infrastructure (pytorc…

    …h#125538)
    
    Fixes pytorch#123451 (only addresses test_torch.py cases)
    
    This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure.
    
    I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.
    
    ```
    $ lintrunner test/test_cuda.py
    ok No lint issues.
    ```
    
    Pull Request resolved: pytorch#125538
    Approved by: https://github.com/janeyx99
    gambiTarun authored and ZelboK committed May 19, 2024
  214. Forward fix the failed new test from D57474327 (pytorch#126596)

    Summary: TSIA. The two look the same to me, but buck was failing with the following error when `with torch._inductor.utils.fresh_inductor_cache()` is used:
    
    ```
    _________________________ ReproTests.test_issue126128 __________________________
    
    self = <caffe2.test.dynamo.test_repros.ReproTests testMethod=test_issue126128>
    
        def test_issue126128(self):
            def fn():
                x = torch.randn(1, 10)
                y = torch.randn(10, 1)
                return torch.mm(x, y).sum()
    
            def fn2():
                x = torch.randn(10, 100)
                y = torch.randn(100, 10)
                return torch.mm(x, y).sum()
    
    >       with torch._inductor.utils.fresh_inductor_cache():
    E       AttributeError: module 'torch._inductor' has no attribute 'utils'
    ```
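    
    One way to avoid that AttributeError (a hedged sketch of the general pattern, not necessarily the exact forward fix) is to import the submodule explicitly instead of relying on attribute access:
    
    ```
    from torch._inductor.utils import fresh_inductor_cache
    
    # Importing the submodule directly guarantees it is loaded, regardless of
    # whether torch._inductor has already imported .utils as a side effect.
    with fresh_inductor_cache():
        pass
    ```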
    
    Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_issue126128'`
    
    Differential Revision: D57516676
    
    Pull Request resolved: pytorch#126596
    Approved by: https://github.com/xmfan
    huydhn authored and ZelboK committed May 19, 2024
  215. Cached required_fw_nodes creation (pytorch#126613)

    Pull Request resolved: pytorch#126613
    Approved by: https://github.com/anijain2305
    Chillee authored and ZelboK committed May 19, 2024
  216. Revert "[Dynamo] Treat integers stored on nn.Modules as dynamic (pyto…

    …rch#126466)"
    
    This reverts commit 6bb9d60.
    
    Reverted pytorch#126466 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the ONNX test failure looks legit, not flaky, as it starts failing in trunk https://hud.pytorch.org/pytorch/pytorch/commit/6bb9d6080d33c817fcbf9e5ae8a59b76812a53d2 ([comment](pytorch#126466 (comment)))
    pytorchmergebot authored and ZelboK committed May 19, 2024
  217. Remove unnecessary implementations from MockHandler (pytorch#126511)

    Dead implementations are confusing and can cause bugs when people
    accidentally hit them. Better for them to be missing.
    
    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126511
    Approved by: https://github.com/peterbell10, https://github.com/lezcano
    ezyang authored and ZelboK committed May 19, 2024
  218. UFMT torch.utils._sympy.functions (pytorch#126553)

    Signed-off-by: Edward Z. Yang <ezyang@meta.com>
    
    Pull Request resolved: pytorch#126553
    Approved by: https://github.com/lezcano, https://github.com/Skylion007
    ghstack dependencies: pytorch#126511
    ezyang authored and ZelboK committed May 19, 2024
  219. Update hf_BirdBird periodic-dynamo-benchmarks results (pytorch#126414)

    Can't repro this regression. Also, nothing in the faulty PR range would cause it only for 1 model. The job is still causing noise, so we should mute it. I think just updating the graph break count is better than skipping the model here since it's still passing.
    
    Pull Request resolved: pytorch#126414
    Approved by: https://github.com/ezyang
    xmfan authored and ZelboK committed May 19, 2024
  220. Replace torch.library.impl_abstract with torch.library.register_fake (p…

    …ytorch#126606)
    
    To remove the disruptive warning:
    ```
          warnings.warn("torch.library.impl_abstract was renamed to "
                        "torch.library.register_fake. Please use that instead; "
                        "we will remove torch.library.impl_abstract in a future "
                        "version of PyTorch.",
                        DeprecationWarning, stacklevel=2)
    ```
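    
    For reference, a minimal usage sketch of the replacement API (the custom op `mylib::my_op` is a toy definition for illustration):
    
    ```
    import torch
    
    # Define a toy custom op, then register its fake/meta implementation with
    # the new API name.
    torch.library.define("mylib::my_op", "(Tensor x) -> Tensor")
    
    @torch.library.register_fake("mylib::my_op")
    def _(x):
        # Describe the output (shape/dtype) without computing it.
        return torch.empty_like(x)
    ```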
    
    Pull Request resolved: pytorch#126606
    Approved by: https://github.com/ezyang
    cyyever authored and ZelboK committed May 19, 2024