Allow linalg.lstsq to use svd to compute the result for rank deficient matrices. #125110
Commits on Apr 28, 2024
- `7372645` Add logic for lstsq to be able to use the SVD driver as a backend when matrices are rank deficient. (See the sketch below for why SVD handles rank deficiency.)
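For context, a minimal sketch of the underlying math, not this PR's kernel: for a rank-deficient `A`, the minimum-norm least-squares solution comes from inverting only the singular values above a tolerance. On CPU, `torch.linalg.lstsq` already exposes SVD-based LAPACK drivers (`gelsd`, `gelss`) via its `driver=` argument; this PR is about making an SVD path available where the default driver cannot handle rank deficiency.

```python
import torch

def lstsq_via_svd(A, b, rcond=None):
    # Minimum-norm least-squares solution of A @ x = b for a (possibly)
    # rank-deficient 2-D A and 2-D b: invert only singular values above
    # a cutoff, zero out the rest.
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    if rcond is None:
        rcond = max(A.shape) * torch.finfo(A.dtype).eps
    S_inv = torch.where(S > rcond * S.max(), S.reciprocal(), torch.zeros_like(S))
    return Vh.mH @ (S_inv.unsqueeze(-1) * (U.mH @ b))

A = torch.tensor([[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]])  # rank 1
b = torch.rand(3, 1)
x = lstsq_via_svd(A, b)
```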
- `99e7cfb`
- `e0fec86`
- `bb20952` Update aten/src/ATen/native/BatchLinearAlgebra.cpp (co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>)
- `b6d6086` Address comments. Clean up use of zeros and utilize the higher-level function linalg_svd for computation.
- `755e7d9`
- `6e8b3fd`
- `c71e504`
Commits on Apr 29, 2024
- `d5b0174`
- `da81459`
- `c856b9e`
Commits on Apr 30, 2024
- `de502bc` Update aten/src/ATen/native/BatchLinearAlgebra.cpp (co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>)
Commits on May 1, 2024
- `3006f30`
Commits on May 13, 2024
- `428f02a`
Commits on May 14, 2024
- `4eab1c3`
- `fec9793`
- `489afbe`
Commits on May 19, 2024
- `93d2573` [export] handle aliased/unused params for unflattening (pytorch#125758)

  Aliased and unused params are currently an issue for strict-mode export. For a model like this:

  ```python
  def __init__(self):
      # ...
      self.alpha = nn.Parameter(torch.randn(4))
      self.beta = self.alpha
      self.gamma = self.alpha

  def forward(self, x):
      return x + self.beta
  ```

  Dynamo will trace only one parameter (beta) and assign it a Dynamo name (e.g. `L__self___beta`), which can be difficult to match to the correct FQN in the original eager module. This means the export graph signature can record the incorrect target FQN for the parameter, leading to downstream issues in unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module).

  This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they are unused; only the used tensors appear in the graph's forward pass. Another existing issue is that weight sharing is not maintained in unflattening (all params/buffers are re-cloned); handle this by checking tensor ids too (see the identity-check snippet below).

  Pull Request resolved: pytorch#125758. Approved by: https://github.com/zhxchen17
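For illustration, the aliasing in question is visible directly in the eager module's state dict: both FQNs resolve to the same tensor object, which is exactly what an id-based check can exploit (a minimal sketch, not the PR's actual implementation):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.randn(4))
        self.beta = self.alpha  # alias: registered under a second FQN

    def forward(self, x):
        return x + self.beta

sd = M().state_dict(keep_vars=True)
# Both keys exist, and they reference the same tensor object:
print(sd["alpha"] is sd["beta"])  # True
```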
- `4f024c8` Enable epilogue fusion benchmarking internally (pytorch#125455)

  Differential Revision: [D56920738](https://our.internmc.facebook.com/intern/diff/D56920738). Pull Request resolved: pytorch#125455. Approved by: https://github.com/Chillee
- `a2e8b90` Fanatically correct real tensor cloning for propagate_real_tensors (pytorch#126175)

  Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/. Previously I did this in a crappy way using clone_input in the callback, but that results in tensors that don't have quite the same size/stride/storage offset, and there was an internal test case where not having completely accurate information caused a downstream problem in propagation. So now I make real tensors as similar to their fake equivalents as possible. Though... I don't bother with autograd lol.

  Signed-off-by: Edward Z. Yang <ezyang@meta.com>. Pull Request resolved: pytorch#126175. Approved by: https://github.com/albanD
- `10b10f2` [reland][dynamo][disable] Move disable impl to its own __call__ method (pytorch#126191)

  Pull Request resolved: pytorch#126191. Approved by: https://github.com/yoyoyocmu, https://github.com/yanboliang, https://github.com/fegin
- `f209865` [easy][dynamo] Use disable_dynamo for torch.manual_seed (pytorch#126192)

  Pull Request resolved: pytorch#126192. Approved by: https://github.com/yanboliang. ghstack dependencies: pytorch#126191
Revert "[inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)"
This reverts commit 037615b. Reverted pytorch#124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](pytorch#124021 (comment)))
Configuration menu - View commit details
-
Copy full SHA for a95b7e9 - Browse repository at this point
Copy the full SHA a95b7e9View commit details -
Revert "[CUDA] [CI] Add cu124 docker images (pytorch#125944)"
This reverts commit 5fb4a76. Reverted pytorch#125944 on behalf of https://github.com/nWEIdia due to test failure seems related https://hud.pytorch.org/pytorch/pytorch/commit/5fb4a766b88bcf633a23610bd66de0f3020f7c66 https://github.com/pytorch/pytorch/actions/runs/9085206167/job/24972040039 ([comment](pytorch#125944 (comment)))
Configuration menu - View commit details
-
Copy full SHA for 50b88b0 - Browse repository at this point
Copy the full SHA 50b88b0View commit details -
- `37f84cb` Remove use of USE_C10D (pytorch#126120)

  As per https://github.com/pytorch/pytorch/blob/main/torch/CMakeLists.txt#L271, USE_DISTRIBUTED and USE_C10D are equivalent. In another PR I was cleaning this usage up, so also cleaning it up here.

  Pull Request resolved: pytorch#126120. Approved by: https://github.com/aaronenyeshi
- `00b9974` [torch/distributed] Bugfix: wait for all child procs to exit before cleanup (pytorch#125969)

  Observed problem: when `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, it sometimes SIGTERMs the child processes, then exits successfully. This results in misleading warning log messages towards the end of the job, like:

  ```
  W0510 14:52:48.185934 672413 api.py:513] Closing process 675171 via signal SIGTERM
  W0510 14:52:48.185984 672413 api.py:513] Closing process 675172 via signal SIGTERM
  W0510 14:52:48.186013 672413 api.py:513] Closing process 675174 via signal SIGTERM
  # <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->
  I0510 14:52:48.229119 672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
  I0510 14:52:48.229161 672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
  I0510 14:52:48.229395 672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
  I0510 14:52:48.257544 672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
  I0510 14:52:48.568198 672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
  I0510 14:52:48.568989 672413 distributed.py:202] Finished running `main`
  ```

  Root cause: incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`. `ProcessContext.join()` does not wait for ALL child procs to exit; it waits for at-least-one child proc to exit. If only a subset of the child procs have exited it returns `False`; if all have exited it returns `True`. `MultiprocessingContext` was assuming that `join()` blocks indefinitely until all child procs have exited.

  Fix: simply loop, continuing to call `pc.join()` until it returns `True` (see the sketch below).

  > NOTE: the indefinite blocking is NOT an issue, since by the time `MultiprocessingContext` calls `pc.join()` it has already validated that the entrypoint functions either returned successfully or that one of them failed. We are really just waiting for the unix process to exit after running the entrypoint function.

  > NOTE: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the loop body, and the debug logging shows at most `nproc_per_node` times, so no log spamming is observed.

  Pull Request resolved: pytorch#125969. Approved by: https://github.com/d4l3k
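A minimal sketch of the fix described above, assuming `pc` is the `torch.multiprocessing.ProcessContext` whose entrypoints have already been validated:

```python
# join() returns False while only a subset of the children have exited
# and True once all of them have, so keep calling it until it reports
# completion; each call blocks until at least one more child exits.
while not pc.join():
    pass
```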
- `1dfe2d1` Allow for trailing 'a' in sm_arch (pytorch#126185)

  Summary: I was getting

  ```
  File "/home/drisspg/meta/pytorch/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
  torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: invalid literal for int() with base 10: '90a'
  ```

  (A tolerant-parse sketch follows.) Pull Request resolved: pytorch#126185. Approved by: https://github.com/Skylion007
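For context, capability strings like `'90a'` (an arch-specific variant) fail a bare `int()` conversion; a tolerant parse strips the alphabetic suffix first. A minimal sketch with a hypothetical helper name, not PyTorch's actual parser:

```python
def parse_sm_arch(arch: str) -> int:
    # "90a" -> 90, "86" -> 86; int("90a") would raise ValueError
    return int(arch.rstrip("abcdefghijklmnopqrstuvwxyz"))

assert parse_sm_arch("90a") == 90
assert parse_sm_arch("86") == 86
```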
- `ed27236` [pipelining] Add manual pipeline stage (pytorch#126123)

  Add `ManualPipelineStage` under `_PipelineStage.py`. Also fix some type hints: `args_recv_info` can contain more than one RecvInfo, but the previous hint `Tuple[InputInfo]` means a tuple of size exactly 1, unlike `List[InputInfo]`, which can contain any number of items. Updated to `Tuple[InputInfo, ...]` to make the number of items flexible (see the typing example below).

  Pull Request resolved: pytorch#126123. Approved by: https://github.com/kwen2501
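The typing distinction in question, as a self-contained sketch (with a stand-in `InputInfo` class):

```python
from typing import Tuple

class InputInfo: ...

exactly_one: Tuple[InputInfo]       # a 1-tuple: (InputInfo(),)
any_length: Tuple[InputInfo, ...]   # any arity: (), (InputInfo(),), ...

# A type checker flags this, since Tuple[InputInfo] fixes the size at 1:
# bad: Tuple[InputInfo] = (InputInfo(), InputInfo())
```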
- `636ea1c` Refactor make_fx to better support hop subgraph tracing (pytorch#125267)

  Code movement plus minor rewrites. We extract the state of make_fx and encapsulate it in a _MakefxTracer class. This allows us to create a new make_fx tracer when tracing subgraphs; the actual logic for tracing subgraphs is in the next diff. Test Plan: existing tests.

  Pull Request resolved: pytorch#125267. Approved by: https://github.com/Chillee
- `a745003` Support trace_subgraph in _MakefxTracer (pytorch#125363)

  Adds trace_subgraph to _MakefxTracer; the motivation is in pytorch#122972. Also migrates all existing usage of reenter_make_fx to the new sub-tracer. Previously, the torch function mode for creating torch_fn metadata wouldn't be re-entered when we're in ProxyTensorMode (since it's inside of __torch_function__). This PR reconstructs the torch function mode based on the parent tracer's config and re-enters it so the metadata shows up in the graph.

  Test Plan: existing tests; we have a bunch of make_fx tests for cond, map and while_loop. Also removes the expected failure for torch_fn, since reenter_make_fx is able to reconstruct torch function modes. Also fixes pytorch#124643.

  Pull Request resolved: pytorch#125363. Approved by: https://github.com/Chillee. ghstack dependencies: pytorch#125267
- `976f0f2` [Dynamo] Supports torch._C._is_any_autocast_enabled (pytorch#126196)

  Fixes pytorch#126026. Pull Request resolved: pytorch#126196. Approved by: https://github.com/anijain2305
- `b3f0fce` Set dtype when copying empty tensor (pytorch#126124)

  Summary: Forward fix for D57251348. Test Plan: `buck2 test 'fbcode//mode/dev' fbcode//executorch/kernels/test:aten_op_copy_test`. Differential Revision: D57304360.

  Pull Request resolved: pytorch#126124. Approved by: https://github.com/bdhirsh
- `aa17484` [BE] Abstract out strings to top of file (pytorch#125640)

  Summary: Move const strings to the top of the file, in preparation for tooling that makes use of shared constants (e.g. the version string). A non-functional change. Ideally we want these const strings to be available from both C++ and Python, but I haven't figured out how to correctly share things in PyTorch; I'll do this in a subsequent change.

  Test Plan: `python test/distributed/test_c10d_nccl.py NCCLTraceTest`

  Pull Request resolved: pytorch#125640. Approved by: https://github.com/wconstab
- `685b207` [Inductor] Flex attention supports dynamic shape (pytorch#125994)

  Static shapes perf:

  | Type    | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod | dtype          |
  |---------|---------|------------|-----------|-----------|-----------|----------|-----------|----------------|
  | Average | 0.692   |            |           |           |           |          |           |                |
  | Max     | 0.855   | 16         | 16        | 4096      | 4096      | 64       | head_bias | torch.bfloat16 |
  | Min     | 0.419   | 8          | 16        | 512       | 512       | 256      | noop      | torch.bfloat16 |

  Dynamic shapes perf:

  | Type    | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | head_dim | score_mod     | dtype          |
  |---------|---------|------------|-----------|-----------|-----------|----------|---------------|----------------|
  | Average | 0.670   |            |           |           |           |          |               |                |
  | Max     | 0.864   | 16         | 16        | 4096      | 4096      | 64       | relative_bias | torch.bfloat16 |
  | Min     | 0.376   | 8          | 16        | 512       | 512       | 256      | relative_bias | torch.bfloat16 |

  Pull Request resolved: pytorch#125994. Approved by: https://github.com/Chillee
- `b959b4f` Add missing type uint16, uint32, and uint64 to TensorHash in LTC (pytorch#125972)

  If I do:

  ```python
  xla_device = xm.xla_device()
  xla_tensor_0 = torch.tensor(42, dtype=torch.uint32).to(xla_device)
  ```

  I get the error:

  ```
  RuntimeError: false INTERNAL ASSERT FAILED at "/ansible/pytorch/torch/csrc/lazy/core/hash.h":139, please report a bug to PyTorch. Unsupported scalar type:UInt16
  ```

  This PR intends to fix this issue. The data types can be found in pytorch/c10/core/ScalarType.h.

  Pull Request resolved: pytorch#125972. Approved by: https://github.com/JackCaoG
- `86d560a` Add some type annotations to python stream and event classes (pytorch#126171)

  For the recent device-agnostic code changes, we need type hinting on the parent classes for better tooling support.

  Pull Request resolved: pytorch#126171. Approved by: https://github.com/ezyang
- `074173b` Support third-party devices emitting a range for each autograd operator (pytorch#125822)

  Fixes pytorch#125752. Pull Request resolved: pytorch#125822. Approved by: https://github.com/aaronenyeshi
- `d0688dd` [Inductor] Skip test_nll_loss_backward for Intel GPU (pytorch#126157)

  Skip this test case due to behavior of Triton `mask_load` that is not aligned with CUDA. We submitted issue pytorch#126173 to elaborate on the root cause. We intend to skip this case for XPU first, as we need some time to fix the issue and run full validation before updating the Triton commit pin for Intel GPU.

  Pull Request resolved: pytorch#126157. Approved by: https://github.com/EikanWang, https://github.com/peterbell10, https://github.com/desertfire
- `a749763` use statically known instead of suppress guard for ddp stride propagation (pytorch#126234)

  Pull Request resolved: pytorch#126234. Approved by: https://github.com/ezyang
- `11aea9e` Update CUDA out of memory message with private pool info (pytorch#124673)

  Fixes pytorch#121932. Pull Request resolved: pytorch#124673. Approved by: https://github.com/eellison, https://github.com/eqy
- `32fdb75` Adjust number of repeats when using --warm-start-latency benchmark flag (pytorch#125917)

  Summary: In --warm-start-latency mode, we can perform the cache-warmup run just once instead of whatever count was provided with --repeat.

  Pull Request resolved: pytorch#125917. Approved by: https://github.com/desertfire
- `38e2661` [benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (pytorch#125953)

  Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.

  Pull Request resolved: pytorch#125953. Approved by: https://github.com/desertfire. ghstack dependencies: pytorch#125917
Add a few "warm start" smoketest runs to CI (pytorch#125955)
Summary: Not sure which to choose, so my criteria was: 1) We care about huggingface as part of internal milestones 2) This handful of models seems to particularly benefite from caching Pull Request resolved: pytorch#125955 Approved by: https://github.com/desertfire ghstack dependencies: pytorch#125917, pytorch#125953
Configuration menu - View commit details
-
Copy full SHA for bc9f57b - Browse repository at this point
Copy the full SHA bc9f57bView commit details -
- `ce7a832` [audio hash update] update the pinned audio hash (pytorch#126248)

  This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash.

  Pull Request resolved: pytorch#126248. Approved by: https://github.com/pytorchbot
- `3512895` Add force_disable_caches to the docs (pytorch#126184)

  Pull Request resolved: pytorch#126184. Approved by: https://github.com/msaroufim
- `b743f89` [inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)

  This PR adds the Cpp template infrastructure and the initial FP32 GEMM template. See RFC pytorch#125683 for more background info.

  1. Cpp template infrastructure. Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the `MicroGemm` micro-kernel abstraction that can be used by Cpp GEMM templates.
  2. Initial FP32 GEMM template. This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked, a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`), then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU-architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
  3. Correctness and performance. The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in kernels as well as fusions. Perf gains are only observed on a selective number of models compared to the ATen kernels, which are implemented with MKL. The gains are more obvious with dynamic shapes, since MKL only supports packed gemm for static shapes. Details below.

  Static shapes:

  | Benchmark | torchbench | huggingface | timm_models |
  |---|---|---|---|
  | Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
  | Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
  | Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
  | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

  Key models being sped up: drq 1.14x, soft_act 1.12x, cait_m36_384 1.18x.

  Dynamic shapes:

  | Benchmark | torchbench | huggingface | timm_models |
  |---|---|---|---|
  | Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
  | Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
  | Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
  | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

  Key models being sped up: BERT_pytorch 1.22x, pyhpc_turbulent 1.13x, soft_actor_critic 1.77x, BlenderbotForCausalLM 1.09x, cait_m36_384 1.17x.

  Pull Request resolved: pytorch#124021. Approved by: https://github.com/jansel
- `170380e` [CUDA] [CI] Add cu124 docker images (pytorch#125944)

  Fixes issues encountered in pytorch#121956. Pull Request resolved: pytorch#125944. Approved by: https://github.com/atalman
- `f33cc7a` Don't assert about pending when we are peeking (pytorch#126239)

  Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/. In particular, when we're collecting forward metadata, we aren't going to discharge any of the pending symbols, so we'll continuously collect more and more pending symbols that we may not be able to resolve. This is fine.

  Signed-off-by: Edward Z. Yang <ezyang@meta.com>. Pull Request resolved: pytorch#126239. Approved by: https://github.com/lezcano
- `8dc8ae9` [AOTI][torchgen] Update NativeFunctionsGroup mapping (pytorch#125962)

  Summary: When looking up what backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. The previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, which is why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.

  Pull Request resolved: pytorch#125962. Approved by: https://github.com/chenyang78
- `5bc525c` [AOTI][torchgen] Add a few more fallback ops (pytorch#126013)

  Summary: They appear in some unit tests. Pull Request resolved: pytorch#126013. Approved by: https://github.com/chenyang78. ghstack dependencies: pytorch#125962
- `4e3dfb0` [Memory Snapshot] Add recordAnnotations to capture record_function annotations (pytorch#124179)

  Summary: Add new traceEvents into Memory Snapshot for `record_function` annotations. These capture both the profiler's step annotation and user annotations.

  Test Plan: CI. New snapshot generated: devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle. Snippet of the snapshot's device_traces showing `ProfilerStep#0` and `## forward ##` annotations:

  ```
  [[{'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168556, 'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]},
    {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168738, 'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]},
    {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168865, 'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]},
    {'action': 'user_defined', 'addr': 0, 'size': 0, 'stream': 0, 'time_us': 1713558427168920, 'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]},
    {'action': 'alloc', 'addr': 140166073581568, 'size': 3211264, 'stream': 0, 'time_us': 1713558427172978, 'frames': [{'name': '_conv_forward', 'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv
  ```

  Differential Revision: D55941362. Pulled By: aaronenyeshi. Pull Request resolved: pytorch#124179. Approved by: https://github.com/zdevito
- `68c29aa` Enable UFMT on `test/test_fake_tensor.py`, `test/test_flop_counter.py` and some other files (pytorch#125747)

  Part of: pytorch#123062. Ran lintrunner on:

  - test/test_fake_tensor.py
  - test/test_flop_counter.py
  - test/test_function_schema.py
  - test/test_functional_autograd_benchmark.py
  - test/test_functional_optim.py
  - test/test_functionalization_of_rng_ops.py

  Detail:

  ```bash
  $ lintrunner -a --take UFMT --all-files
  ok No lint issues. Successfully applied all patches.
  ```

  Pull Request resolved: pytorch#125747. Approved by: https://github.com/malfet
- `cd60801` [Inductor] Generalize newly introduced device-bias code (pytorch#126261)

  We found some Inductor test case failures when enabling Inductor UT for Intel GPU. The root cause is newly introduced device-bias code from recent community PRs, which causes different behaviors between Intel GPU and CUDA. This PR generalizes this code to align the behaviors.

  Pull Request resolved: pytorch#126261. Approved by: https://github.com/EikanWang, https://github.com/peterbell10
- `a48463e` [export] Cover more cases to copy tensor conversions (pytorch#125628)

  Summary: Previously we tried to convert all .to() calls to to_copy in the graph; now a user reports that other methods like .float() are not covered: pytorch/PiPPy#1104 (comment). Fundamentally, .float() should look similar to .to() in export, and this diff expands the coverage of the tensor conversion methods here.

  Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion. Differential Revision: D56951634.

  Pull Request resolved: pytorch#125628. Approved by: https://github.com/tugsbayasgalan
Revert "[Memory Snapshot] Add recordAnnotations to capture record_fun…
…ction annotations (pytorch#124179)" This reverts commit 187aeae. Reverted pytorch#124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 https://hud.pytorch.org/pytorch/pytorch/commit/187aeaeabf612824c2d0e9be72f80ce6612760d4, test was skipped due to bad TD ([comment](pytorch#124179 (comment)))
Configuration menu - View commit details
-
Copy full SHA for ff266cd - Browse repository at this point
Copy the full SHA ff266cdView commit details -
- `e49ccce` [CI] 3 procs non cuda (pytorch#125932)

  Too lazy to figure out the actual time reduction here; I'll figure it out later. I'd also rather get an average of a couple of runs on trunk than just this one PR. Things got faster. Source? Trust me bro. Related to pytorch#125598.

  Pull Request resolved: pytorch#125932. Approved by: https://github.com/ZainRizvi
- `0df5ed0` Forward fix lint after pytorch#125747 (pytorch#126295)

  Pull Request resolved: pytorch#126295. Approved by: https://github.com/atalman
- `12f2960` Faster int8 quantized (pytorch#125704)

  Or my journey to learn how to write fast Metal kernels (more details will be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf)).

  Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`): before the change, on M2 Pro I get 50 tokens per sec. After adding a very naive

  ```metal
  template<typename T>
  kernel void int8pack_mm(
      constant T * A [[buffer(0)]],
      constant char * B [[buffer(1)]],
      constant T * scales [[buffer(2)]],
      device T * outputData [[buffer(3)]],
      constant uint3 & sizes [[buffer(4)]],
      uint thread_index [[thread_position_in_grid]]) {
    const uint lda = sizes.y;
    const uint ldc = sizes.z;
    const uint m = thread_index / sizes.z; // 0..sizes.x-1
    const uint n = thread_index % sizes.z; // 0..sizes.z-1
    constant T *A_ptr = A + m * lda;
    constant char *B_ptr = B + n * lda;

    float rc = 0.0;
    for(uint k = 0; k < sizes.y; k++) {
      const auto a_val = float(A_ptr[k]);
      const auto b_val = float(B_ptr[k]);
      rc += a_val * b_val;
    }
    outputData[thread_index] = T(rc * float(scales[n]));
  }
  ```

  perf dropped to a sad 15 tokens per second. Replacing the inner loop with vectorized operations

  ```metal
  float rc = 0.0;
  for(uint k = 0; k < sizes.y/4; k++) {
    const auto a_val = float4(A_ptr[k]);
    const auto b_val = float4(B_ptr[k]);
    rc += dot(a_val, b_val);
  }
  ```

  brings perf back up to 53 tokens per second, though that is a bit of a lie when it comes to llama2-7B perf.

  The next step in unlocking the performance was to replace the 1D grid with a 2D one, while limiting the thread group size to a single row, which results in much better data locality (unfortunately no longer observable with `stories110M`, as its small model size and Python runtime overhead hide the perf gain). There were several unsuccessful attempts at caching inputs in thread-local memory or using `float4x4` to speed up computation. The key to unlocking the perf was a comment in https://github.com/ml-explore/mlx/blob/631dfbe67309fb630795cd612739cbe54c75e222/mlx/backend/metal/kernels/gemv.metal#L184 which hinted at exploiting both SIMD groups and thread-local caches, resulting in a 5x jump in performance compared to the initial vectorization approach and a 3x perf jump in the end-to-end llama7b test.

  Pull Request resolved: pytorch#125704. Approved by: https://github.com/mikekgfb
- `9b24e7f` [DTensor] Turn on foreach implementation of optimizer for DTensor by default (pytorch#123394)

  Append DTensor to the optimizer `_foreach_supported_types` and turn on the foreach implementation of the optimizer for DTensor when not specified by the user.

  Pull Request resolved: pytorch#123394. Approved by: https://github.com/wanchaol
- `22c50a3` [Dynamo] SizeVariable supports hasattr (pytorch#126222)

  Pull Request resolved: pytorch#126222. Approved by: https://github.com/williamwen42, https://github.com/anijain2305
- `35117bf` CMake: Improve check and report of Magma (pytorch#117858)

  - Only search for magma if it is used (GPU builds).
  - Don't report that it was not found when it isn't searched for.
  - Don't report when magma is disabled (currently "MAGMA not found. Compiling without MAGMA support" is reported).

  Pull Request resolved: pytorch#117858. Approved by: https://github.com/malfet
- `3ca1ae4` [onnx.export] Avoid linear loop over symbol_dim_map (pytorch#123029)

  This PR is part of an effort to speed up torch.onnx.export (pytorch#121422). Doing a reverse look-up in `symbol_dim_map` incurs a cost linear in the number of symbols; this happens for each node, so it adds a quadratic cost to the whole export. Add a reverse look-up table `dim_symbol_map` that is kept in parallel with `symbol_dim_map`, avoiding the linear-time look-up and the quadratic export-time complexity (a sketch of the pattern follows). This is a highly pragmatic solution; if someone more familiar with the code base has a better approach, I'm interested to hear about it. Resolves (9) in pytorch#121422 (partial fix of pytorch#121422).

  Pull Request resolved: pytorch#123029. Approved by: https://github.com/justinchuby
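The pattern is generic: pay O(1) extra memory per entry to keep a reverse index instead of scanning the forward map per query. A minimal sketch in Python (illustrative names, not the exporter's actual C++ members):

```python
symbol_dim_map = {}  # symbol -> dim
dim_symbol_map = {}  # dim -> symbol, maintained in parallel

def record(symbol, dim):
    symbol_dim_map[symbol] = dim
    dim_symbol_map[dim] = symbol  # enables O(1) reverse lookups

def symbol_for(dim):
    # Previously: scan all of symbol_dim_map (linear per node,
    # quadratic over the whole export).
    return dim_symbol_map.get(dim)
```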
- `39b2795` [easy] Remove aot_config from pre_compile returns, rename fw_metadata in post_compile (pytorch#125854)

  This field never changes, so pre_compile doesn't need to return it again; remove it for a cleaner refactor. As @aorenste points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrappers' pre_compiles have run. To make this clear in the code, I renamed the arg in post_compile. Wrappers that need the exact metadata they were passed in pre_compile must save that fw_metadata themselves. Currently, wrappers come in two categories:

  1. Wrappers that modify fw_metadata but never use fw_metadata in post_compile.
  2. Wrappers that never modify fw_metadata and only consume the "final" fw_metadata.

  So none of the behaviors change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata; I'll do that in a separate PR.

  Pull Request resolved: pytorch#125854. Approved by: https://github.com/aorenste, https://github.com/bdhirsh
- `b3b9f72` Reland '[Inductor] GEMM shape padding improvements (pytorch#118522)' (pytorch#125773)

  Relanding just the pad-in-a-single-pass portion of [the PR](pytorch#118522), not the transpose logic. This was previously accepted and reviewed.

  Pull Request resolved: pytorch#125773. Approved by: https://github.com/shunting314. ghstack dependencies: pytorch#125772
- `0ce75f9` Skip padding cost of fusible/planable inputs (pytorch#125780)

  For mm inputs which are not inputs of the graph, assume that we can memory-plan them in the aten.cat and exclude the padding cost from the benchmarking comparison. Technically we also have to do a small amount of writing zeros, but that should be relatively small and encompassed in the weighting of the padding time by `1.1`.

  Pull Request resolved: pytorch#125780. Approved by: https://github.com/shunting314. ghstack dependencies: pytorch#125772, pytorch#125773
- `0f2db1c` Forward fix failures for torch.export switch to predispatch (pytorch#126081)

  Summary: fixes the executorch test and the torchrec test. Test Plan: CI. Differential Revision: D57282304.

  Pull Request resolved: pytorch#126081. Approved by: https://github.com/angelayi
- `6b733b2` Beef up error message for pending assert failure (pytorch#126212)

  Signed-off-by: Edward Z. Yang <ezyang@meta.com>. Pull Request resolved: pytorch#126212. Approved by: https://github.com/Skylion007
- `1480537` Enable UFMT format on test/test_utils.py (pytorch#125996)

  Fixes some files in pytorch#123062. Ran lintrunner on test/test_utils.py:

  ```bash
  $ lintrunner -a --take UFMT --all-files
  ok No lint issues. Successfully applied all patches.
  ```

  Pull Request resolved: pytorch#125996. Approved by: https://github.com/ezyang
- `adf9cc7` Fix aarch64 debug build with GCC (pytorch#126290)

  Works around GCC's quirks in instantiating templates that require immediate values by providing an alternative implementation for scaling the output when compiled without any optimizations (both GCC and clang define __OPTIMIZE__ if invoked with anything but -O0).

  Fixes pytorch#126283. Pull Request resolved: pytorch#126290. Approved by: https://github.com/atalman, https://github.com/seemethere
- `147ba73` Fix public binding to actually traverse modules (pytorch#126103)

  The current call passes `['/actual/path']` to os.walk, which amounts to a path that does not exist and thus silently leads to an empty traversal (see the reproduction below). There is an unused function just above that handles this, so I guess that is what was supposed to be called.

  Pull Request resolved: pytorch#126103. Approved by: https://github.com/suo
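The failure mode is easy to reproduce: `os.walk` swallows errors by default (its `onerror` hook is `None`), so walking a path that does not exist yields nothing instead of raising:

```python
import os

# A nonexistent path walks to an empty sequence rather than an error,
# which is how the bad argument went unnoticed:
print(list(os.walk("/no/such/path")))  # []
```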
- `9397380` [FSDP] Fixed docs for inter/intra node PG helpers (pytorch#126288)

  1. Fixes an issue where we had 9 ranks in one node and 7 in the other.
  2. Makes the notation more explicit: `[0, 7]` means `[0, 1, ..., 7]`.

  Pull Request resolved: pytorch#126288. Approved by: https://github.com/weifengpy
Revert "Fix aarch64 debug build with GCC (pytorch#126290)"
This reverts commit a961e1a. Reverted pytorch#126290 on behalf of https://github.com/malfet due to Indeed lint is broken :/ ([comment](pytorch#126290 (comment)))
Configuration menu - View commit details
-
Copy full SHA for 921a824 - Browse repository at this point
Copy the full SHA 921a824View commit details -
- `0658670` Parametrize test_dim_reduction (pytorch#126292)

  Signed-off-by: Edward Z. Yang <ezyang@meta.com>. Pull Request resolved: pytorch#126292. Approved by: https://github.com/Skylion007
- `b5e6220` [DCP] overwrites existing checkpoint by default (pytorch#125877)

  Checks for existing checkpoints and overwrites, based on an `overwrite` flag. Differential Revision: [D57186174](https://our.internmc.facebook.com/intern/diff/D57186174/).

  Pull Request resolved: pytorch#125877. Approved by: https://github.com/fegin
- `a0a6bbc` Fix public api allowlist logical merge conflict (pytorch#126321)

  Skip the newly added bad API from pytorch#126212 to keep CI green. Pull Request resolved: pytorch#126321. Approved by: https://github.com/ezyang
- `910f26f` 2 rocm shards on trunk.yml (pytorch#125933)

  After the test removal for windows cpu and avx-related configs, rocm is going to be the long pole for trunk. Just checked: without rocm, the average tts for trunk was 2.5 hrs last week; with rocm it's about 3.

  Pull Request resolved: pytorch#125933. Approved by: https://github.com/ZainRizvi
- `5b4dea2` [FSDP2] allow meta tensors during loading state dict and cpu offloading (pytorch#126267)

  Unit test: `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`. With meta init and cpu offloading, we have meta tensors after `model.load_state_dict(assign=True, strict=False)`. This PR avoids calling `.cpu` on meta tensors, which would otherwise be a runtime error.

  Pull Request resolved: pytorch#126267. Approved by: https://github.com/awgu
- `3a3f8a9` [dynamo] Detect monkeypatching on nn module forward method (pytorch#126203)

  An alternative was pytorch#124975. Though it was safer because it added guards for every inlined function, it caused guard overhead of > 20% for a few models. The overhead of this PR is minimal for the common unpatched case. Fixes an internal issue: https://fb.workplace.com/groups/1075192433118967/permalink/1411067766198097/

  Pull Request resolved: pytorch#126203. Approved by: https://github.com/ezyang
- `569ee1e` [onnx.export] Avoid unnecessary copy of debug_names (pytorch#123026)

  This PR is part of an effort to speed up torch.onnx.export (pytorch#121422). `auto debug_names = ...` infers a copy, whereas `const auto& debug_names` does not. However, this one requires us to be careful, since calls to `setDebugName` change `debug_names` and invalidate the `exist_name` iterator. So if we simply change `auto` to `const auto&`, then between that line and `find` we have corrupted the iterator by calling `output[i]->setDebugName` (a Python analogue of this hazard follows). This change aims to be functionally equivalent to the original, which is why we first get the Value pointer, then call `output[i]->setDebugName`, and finally call `setDebugName` on the found value. It is possible it is functionally OK to call `output[i]->setDebugName` first, then find and call the second `setDebugName`, but that would not be identical to the current behavior. Resolves (2) in pytorch#121422.

  Pull Request resolved: pytorch#123026. Approved by: https://github.com/justinchuby
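The hazard being worked around, mutating a container while holding an iterator into it, has a direct Python analogue that may make the ordering constraint easier to see:

```python
names = {"x": 1, "y": 2}

it = iter(names)
next(it)
names["z"] = 3   # mutation invalidates the live iterator...
try:
    next(it)     # ...so continuing to use it raises
except RuntimeError as e:
    print(e)     # dictionary changed size during iteration
```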
- `6243a43` Warn SDPA users about dropout behavior (pytorch#126294)

  Fixes pytorch#124464. Pull Request resolved: pytorch#126294. Approved by: https://github.com/mikaylagawarecki, https://github.com/drisspg
- `9e2b899` Improve Storage copy_ size mismatch error message (pytorch#126280)

  Signed-off-by: Edward Z. Yang <ezyang@meta.com>. Pull Request resolved: pytorch#126280. Approved by: https://github.com/mikaylagawarecki
- `0d9def0` [CI] Add AMP models in inductor cpu smoketest for performance (pytorch#125830)

  Pull Request resolved: pytorch#125830. Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/huydhn, https://github.com/desertfire, https://github.com/atalman
- `eb5e9ed` Remove Caffe2 python code (pytorch#126035)

  Follows the recent changes to Caffe2. Pull Request resolved: pytorch#126035. Approved by: https://github.com/r-barnes, https://github.com/Skylion007
- `05eff35` Enable UFMT on `test/test_datapipe.py` (pytorch#124994)

  Part of: pytorch#123062. Ran lintrunner on `test/test_datapipe.py`. Detail:

  ```bash
  $ lintrunner -a --take UFMT --all-files
  ok No lint issues. Successfully applied all patches.
  ```

  Co-authored-by: Edward Z. Yang <ezyang@fb.com>. Pull Request resolved: pytorch#124994. Approved by: https://github.com/mikaylagawarecki
- `75add2f` Remove expected failure in `test_eager_transforms.py` (pytorch#125883)

  Seems to be supported now. CC @tinglvv @nWEIdia @Aidyn-A. Pull Request resolved: pytorch#125883. Approved by: https://github.com/Chillee, https://github.com/Aidyn-A
- `079d3f5` [optim] Fix: wrong ASGD implementation (pytorch#125440)

  > Previously, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, this can be simplified using Python list multiplication, which only constructs one tensor (demonstrated below).

  - [x] Fixed the incorrect assumption that every param will have the same step.
  - [x] Fixed the different implementations between `foreach=True` and `foreach=False`.

  Pull Request resolved: pytorch#125440. Approved by: https://github.com/janeyx99
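The list-multiplication point is just standard Python semantics: multiplying a list repeats references to the same object, so one shared tensor can stand in for all entries. A quick demonstration:

```python
import torch

eta = torch.tensor(0.01)
etas = [eta] * 4  # four references, but only one tensor is constructed

print(all(e is etas[0] for e in etas))  # True
print(len({id(e) for e in etas}))       # 1
```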
- `0e22566` Fix triton codegen main do_bench_gpu import error (pytorch#126213)

  Summary: Encountered a module import error when running a triton kernel file. The cause seems to be D57215950, which changed "do_bench" to "do_bench_gpu" for torch._inductor.runtime.runtime_utils. However, the codegen instead has "from triton.testing import do_bench", so the line below should be reverted back to "do_bench".

  Test Plan:

  ```
  LOGLEVEL=DEBUG TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 CUDA_VISIBLE_DEVICES=5 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT='/home/adelesun/mts_profiling/outputs/profile_output.txt' TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_CACHE_DIR='/home/adelesun/mts_profiling/code' TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=v100,a100,h100 -c fbcode.split-dwarf=true caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/adelesun/mts_profiling/inputs/offsite_cvr_model_526372970_793.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR 2>&1 | tee /home/adelesun/mts_profiling/outputs/benchmark_output.txt
  bento console --kernel=aetk --file=/home/adelesun/mts_profiling/code/op/copmbxfunzmywemwmg66lnlcx4apvn2f2vsi3glgisausgfvit4g.py
  ```

  The file ran successfully. Differential Revision: D57345619.

  Pull Request resolved: pytorch#126213. Approved by: https://github.com/shunting314
- `8cc3b81` [dynamo] graph break on const dict KeyError (pytorch#125882)

  Fixes pytorch#125866. Pull Request resolved: pytorch#125882. Approved by: https://github.com/jansel
- `972f76f` [dynamo] graph break on issubclass call with non-const args (pytorch#125943)

  Fixes pytorch#125942. Pull Request resolved: pytorch#125943. Approved by: https://github.com/jansel. ghstack dependencies: pytorch#125882
- `2524635` [dynamo] fix pytorch#93624 (pytorch#125945)

  Fixes pytorch#93624, but also requires jcmgray/autoray#20 to be fixed. Pull Request resolved: pytorch#125945. Approved by: https://github.com/jansel. ghstack dependencies: pytorch#125882, pytorch#125943
- `d3d25a3` [dynamo][inline-inbuilt-nn-modules] Bug fix - Only unspecialized nn modules (pytorch#126303)

  Pull Request resolved: pytorch#126303. Approved by: https://github.com/mlazos, https://github.com/laithsakka
- `592bc1f` [FSDP2] support fully_shard(model_on_meta, cpu_offload) (pytorch#126305)

  Support fully_shard(model_on_meta, cpu_offload) when fully_shard is placed outside of `torch.device("meta")`.

  Pull Request resolved: pytorch#126305. Approved by: https://github.com/awgu. ghstack dependencies: pytorch#126267
- `3fe0c6d` Add VariableTracker.debug_repr (pytorch#126299)

  Now you can print arbitrary values at compile time with comptime.print(). Signed-off-by: Edward Z. Yang <ezyang@meta.com>.

  Pull Request resolved: pytorch#126299. Approved by: https://github.com/jansel. ghstack dependencies: pytorch#126292
- `421b23d` Also remove compile_time_strobelight_meta frame when generating stack (pytorch#126289)

  I think I also need to fix this in fbcode; leaving that for future work. Signed-off-by: Edward Z. Yang <ezyang@meta.com>.

  Pull Request resolved: pytorch#126289. Approved by: https://github.com/yanboliang
- `db73a01` Make propagate_real_tensor more safe (pytorch#126281)

  Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/

  There are a few improvements here, which luckily fix some xfails:

  - In general, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test whether there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to disable only the ambient fake tensor mode; this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
  - We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
  - I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor is initially allocated, so that's what I do now, eliminating the need for a storage copy.

  Signed-off-by: Edward Z. Yang <ezyang@meta.com>. Pull Request resolved: pytorch#126281. Approved by: https://github.com/Skylion007
- `fed9d93` Switched from parameter in can_cast to from_ (pytorch#126030)

  Fixes pytorch#126012. `from` is a reserved keyword in Python, so we can't make the C++ impl available with `from` as a function parameter. This PR changes the name to `from_` and also adjusts the docs (see the example below). If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs; however, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.

  Pull Request resolved: pytorch#126030. Approved by: https://github.com/albanD
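To see why the rename is needed at all: `from` simply cannot appear as a keyword argument in Python, while the positional call has always worked:

```python
import torch

# torch.can_cast(from=torch.int32, to=torch.float64)  # SyntaxError: `from` is reserved
print(torch.can_cast(torch.int32, torch.float64))     # True (positional always works)

# With the rename from this PR, the keyword form becomes expressible:
# torch.can_cast(from_=torch.int32, to=torch.float64)
```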
[easy][dynamo][inline-inbuilt-nn-modules] Change test to check for pa…
…rams (pytorch#126316) Pull Request resolved: pytorch#126316 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#126303
Commit: 8f9fa47
[Export] Allow ExportedProgram to take empty decomp table (pytorch#126142)
**As title.** Still, `ep.run_decompositions()` will use `core_aten_decompositions()` by default. Cases like `ep.run_decompositions(get_decompositions([]))` will use an empty table and go with [`aot_autograd_decompositions`](https://github.com/pytorch/pytorch/blob/04877dc430a6e93765471b28f422bf3e81d02c9e/torch/_functorch/aot_autograd.py#L456-459) only.
**Motivation** We didn't have a clean way to pass in an empty decomp table. Since we've made `pre_dispatch` export the default and `ep.run_decompositions` remains with `aot_export_module(..., pre_dispatch=False)`, allowing an empty table makes blank control easier.
**Testing** CI. Also looked through all the references in fbcode. The only concern I have is whether we should update [this example](https://github.com/pytorch/pytorch/blob/04877dc430a6e93765471b28f422bf3e81d02c9e/torch/onnx/_internal/exporter.py#L817) or not.
Pull Request resolved: pytorch#126142 Approved by: https://github.com/angelayi
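A small sketch of the two paths the message contrasts (the module here is a made-up example):

```python
import torch
from torch.export import export
from torch._decomp import get_decompositions

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.silu(x)

ep = export(M(), (torch.randn(2, 4),))

# Default: decompose with core_aten_decompositions().
ep_core = ep.run_decompositions()

# Now also accepted: an empty table, leaving only
# aot_autograd's built-in decompositions to apply.
ep_empty = ep.run_decompositions(get_decompositions([]))
```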
Commit: 7a4c6b9
[optim] add fused_adagrad support for CPU device (pytorch#124905)
Add fused Adagrad kernel support for CPU.
## Bench result: 32 cores/sockets ICX
Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: pytorch#124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99
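For illustration, opting into the fused CPU path (a sketch, assuming the standard `fused=True` optimizer flag is how this is exposed):

```python
import torch

model = torch.nn.Linear(16, 16)  # stays on CPU
opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)

loss = model(torch.randn(8, 16)).sum()
loss.backward()
opt.step()        # dispatches to the fused CPU kernel
opt.zero_grad()
```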
Commit: 6944593
[Inductor][Flex-attention] Make num_head support dynamic (pytorch#126342)
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#126342 Approved by: https://github.com/drisspg
Commit: 7754cc1
[dynamo][inline-inbuilt-nn-modules] Change test to not depend on id of mod instance (pytorch#126314)
Pull Request resolved: pytorch#126314 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#126303, pytorch#126316
Commit: 009b5b6
[dynamo][inline-inbuilt-nn-modules] Add and update test_modules.py for inlining work (pytorch#126327)
Pull Request resolved: pytorch#126327 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#126303, pytorch#126316, pytorch#126314
Commit: ae2fdc8
[inductor] [FX graph cache] Ignore unbacked symints in guards expression (pytorch#126251)
Summary: Found a unit test that was causing an assertion failure during an attempt to use unbacked symints in the guards expression, but it turns out unbacked symints can't affect guards anyway, so we can just filter them out. Also in this diff: test_torchinductor_dynamic_shapes.py was not configured to exercise the codecache because the TestCase setUp method was inadvertently skipping the setUp of the immediate parent class. Pull Request resolved: pytorch#126251 Approved by: https://github.com/peterbell10
Commit: 675c49f
Revert "Switched from parameter in can_cast to from_. (pytorch#126030)"
This reverts commit 06d6bb4. Reverted pytorch#126030 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert it to avoid a diff train conflict with pytorch#125995. Please help rebase and I will reland the change ([comment](pytorch#126030 (comment)))
Commit: 930e757
[inductor][cpp] epilogue support for gemm template (pytorch#126019)
As part of pytorch#125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: pytorch#126019 Approved by: https://github.com/jansel
Commit: 88643f1
[TEST][Dynamo] fix test_deviceguard.py (pytorch#126240)
The `test_device_guard.py` was improperly set up, so there were failures on multi-GPU machines. By design the `DeviceGuard` should keep `idx` the same even after it was applied. Pull Request resolved: pytorch#126240 Approved by: https://github.com/jansel
Commit: 4417b4c
Revert "Remove deprecated _aminmax operator (pytorch#125995)"
This reverts commit 0116ffa. Reverted pytorch#125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](pytorch#125995 (comment)))
Commit: f30d086
[dynamo][nn module guards] Use TENSOR_MATCH, and not ID_MATCH, for numpy tensors (pytorch#126246)
Fixes the speech_transformer regression here - https://hud.pytorch.org/benchmark/torchbench/inductor_no_cudagraphs?startTime=Tue%2C%2007%20May%202024%2019%3A22%3A54%20GMT&stopTime=Tue%2C%2014%20May%202024%2019%3A22%3A54%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=main&lCommit=02093b6c6ae1046368e2500881d0bb5880873386&rBranch=main&rCommit=b24ad7eab55eaf660893dddae949fc714e434338 Thanks to @eellison and @bdhirsh for isolating the regression to nn module guards. Pull Request resolved: pytorch#126246 Approved by: https://github.com/jansel ghstack dependencies: pytorch#126203
Commit: b8c08b6
[DeviceMesh] Fix hash and eq not match (pytorch#123572)
Fixes pytorch#121799 We fix the DeviceMesh hash such that two meshes are considered equal if they have the same mesh and the same parent_mesh. Examples can be found here: pytorch#121799 Also needed to unblock pytorch#123394 Pull Request resolved: pytorch#123572 Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
Commit: 479f3f9
[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (pytorch#126068)
As part of pytorch#125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet, which will be added in the next PR. Pull Request resolved: pytorch#126068 Approved by: https://github.com/jansel ghstack dependencies: pytorch#126019
Commit: e974908
Initial implementation of AdaRound (pytorch#126153)
Summary: This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568. This algorithm is going to be used by multiple people, hence we need to make it an official implementation. Differential Revision: D57227565 Pull Request resolved: pytorch#126153 Approved by: https://github.com/jerryzh168
Commit: a4250cc
Revert "[optim] Fix: wrong ASGD implementation (pytorch#125440)"
This reverts commit 2c5ad9a. Reverted pytorch#125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](pytorch#125440 (comment)))
Commit: 195d01c
Revert "Initial implementation of AdaRound (pytorch#126153)"
This reverts commit 175c18a. Reverted pytorch#126153 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit because there are more than one lint issues, torch/optim/asgd.py is just the last one ([comment](pytorch#126153 (comment)))
Commit: 4397921
Add Lowering for FlexAttention Backwards (pytorch#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards pass.

I did some other things along the way:
- Abstracted out the 'build_subgraph_buffer' subroutine and made it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
- The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didn't correctly register the decomp table + IndexMode when I landed pytorch#123902; this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated it and then added a test specific for "future causal" attention.
- The main point that I think still needs to be worked out is the store_output call: I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications).
- I updated the benchmark to also profile bwds performance.

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_

FWD Speedups

| Type | Speedup | shape | score_mod | dtype |
|---------|-----------|--------------------|-------------|----------------|
| Average | 0.991 | | | |
| Max | 1.182 | (16, 16, 4096, 64) | noop | torch.bfloat16 |
| Min | 0.796 | (2, 16, 512, 256) | head_bias | torch.bfloat16 |

BWD Speedups

| Type | Speedup | shape | score_mod | dtype |
|---------|-----------|--------------------|-------------|----------------|
| Average | 0.291 | | | |
| Max | 0.652 | (8, 16, 512, 64) | head_bias | torch.bfloat16 |
| Min | 0.073 | (2, 16, 4096, 128) | head_bias | torch.bfloat16 |

<details>
<summary>Full Data</summary>

| shape | score_mod | dtype | fwd_eager_time | fwd_compiled_time | bwd_eager_time | bwd_compiled_time | fwd_speedup | bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64) | noop | torch.bfloat16 | 19.936 | 19.092 | 57.851 | 193.564 | 1.044 | 0.299 |
| (2, 16, 512, 64) | causal_mask | torch.bfloat16 | 19.955 | 19.497 | 57.662 | 206.278 | 1.024 | 0.280 |
| (2, 16, 512, 64) | relative_bias | torch.bfloat16 | 19.455 | 21.297 | 57.674 | 195.219 | 0.913 | 0.295 |
| (2, 16, 512, 64) | head_bias | torch.bfloat16 | 19.958 | 21.289 | 57.674 | 193.859 | 0.938 | 0.298 |
| (2, 16, 512, 128) | noop | torch.bfloat16 | 28.157 | 28.615 | 82.831 | 454.211 | 0.984 | 0.182 |
| (2, 16, 512, 128) | causal_mask | torch.bfloat16 | 28.154 | 28.444 | 83.091 | 432.083 | 0.990 | 0.192 |
| (2, 16, 512, 128) | relative_bias | torch.bfloat16 | 28.722 | 27.897 | 83.175 | 446.789 | 1.030 | 0.186 |
| (2, 16, 512, 128) | head_bias | torch.bfloat16 | 28.299 | 27.673 | 83.052 | 459.179 | 1.023 | 0.181 |
| (2, 16, 512, 256) | noop | torch.bfloat16 | 41.167 | 50.504 | 175.019 | 1083.545 | 0.815 | 0.162 |
| (2, 16, 512, 256) | causal_mask |
torch.bfloat16 | 41.656 | 51.933 | 175.078 | 1171.176 | 0.802 | 0.149 | | (2, 16, 512, 256) | relative_bias | torch.bfloat16 | 41.697 | 50.722 | 175.159 | 1097.312 | 0.822 | 0.160 | | (2, 16, 512, 256) | head_bias | torch.bfloat16 | 41.690 | 52.387 | 175.184 | 1097.336 | 0.796 | 0.160 | | (2, 16, 1024, 64) | noop | torch.bfloat16 | 39.232 | 37.454 | 127.847 | 612.430 | 1.047 | 0.209 | | (2, 16, 1024, 64) | causal_mask | torch.bfloat16 | 39.930 | 39.599 | 127.755 | 665.359 | 1.008 | 0.192 | | (2, 16, 1024, 64) | relative_bias | torch.bfloat16 | 39.417 | 41.304 | 127.902 | 614.990 | 0.954 | 0.208 | | (2, 16, 1024, 64) | head_bias | torch.bfloat16 | 39.965 | 42.034 | 127.953 | 613.273 | 0.951 | 0.209 | | (2, 16, 1024, 128) | noop | torch.bfloat16 | 63.964 | 71.024 | 226.510 | 1637.669 | 0.901 | 0.138 | | (2, 16, 1024, 128) | causal_mask | torch.bfloat16 | 63.843 | 72.451 | 226.750 | 1558.949 | 0.881 | 0.145 | | (2, 16, 1024, 128) | relative_bias | torch.bfloat16 | 64.301 | 70.487 | 226.651 | 1610.063 | 0.912 | 0.141 | | (2, 16, 1024, 128) | head_bias | torch.bfloat16 | 64.033 | 71.394 | 226.676 | 1668.511 | 0.897 | 0.136 | | (2, 16, 1024, 256) | noop | torch.bfloat16 | 129.348 | 141.390 | 507.337 | 4405.175 | 0.915 | 0.115 | | (2, 16, 1024, 256) | causal_mask | torch.bfloat16 | 129.538 | 145.680 | 507.178 | 4768.874 | 0.889 | 0.106 | | (2, 16, 1024, 256) | relative_bias | torch.bfloat16 | 129.438 | 142.782 | 507.004 | 4401.002 | 0.907 | 0.115 | | (2, 16, 1024, 256) | head_bias | torch.bfloat16 | 129.058 | 146.242 | 507.547 | 4434.251 | 0.883 | 0.114 | | (2, 16, 4096, 64) | noop | torch.bfloat16 | 481.606 | 409.120 | 1440.890 | 14147.269 | 1.177 | 0.102 | | (2, 16, 4096, 64) | causal_mask | torch.bfloat16 | 480.227 | 438.847 | 1434.419 | 14973.386 | 1.094 | 0.096 | | (2, 16, 4096, 64) | relative_bias | torch.bfloat16 | 480.831 | 458.104 | 1432.935 | 14193.253 | 1.050 | 0.101 | | (2, 16, 4096, 64) | head_bias | torch.bfloat16 | 480.749 | 452.497 | 1437.040 | 14084.869 | 1.062 | 0.102 | | (2, 16, 4096, 128) | noop | torch.bfloat16 | 872.534 | 848.275 | 2600.895 | 35156.849 | 1.029 | 0.074 | | (2, 16, 4096, 128) | causal_mask | torch.bfloat16 | 872.647 | 868.279 | 2587.581 | 31919.531 | 1.005 | 0.081 | | (2, 16, 4096, 128) | relative_bias | torch.bfloat16 | 871.484 | 827.644 | 2593.989 | 34805.634 | 1.053 | 0.075 | | (2, 16, 4096, 128) | head_bias | torch.bfloat16 | 871.422 | 856.437 | 2602.482 | 35708.591 | 1.017 | 0.073 | | (2, 16, 4096, 256) | noop | torch.bfloat16 | 1904.497 | 1758.183 | 6122.416 | 66754.593 | 1.083 | 0.092 | | (2, 16, 4096, 256) | causal_mask | torch.bfloat16 | 1911.174 | 1762.821 | 6113.207 | 72759.392 | 1.084 | 0.084 | | (2, 16, 4096, 256) | relative_bias | torch.bfloat16 | 1911.254 | 1727.108 | 6123.530 | 66577.988 | 1.107 | 0.092 | | (2, 16, 4096, 256) | head_bias | torch.bfloat16 | 1916.977 | 1801.804 | 6118.158 | 67359.680 | 1.064 | 0.091 | | (8, 16, 512, 64) | noop | torch.bfloat16 | 44.984 | 43.974 | 170.276 | 262.259 | 1.023 | 0.649 | | (8, 16, 512, 64) | causal_mask | torch.bfloat16 | 45.001 | 46.265 | 170.509 | 274.893 | 0.973 | 0.620 | | (8, 16, 512, 64) | relative_bias | torch.bfloat16 | 45.466 | 48.211 | 170.606 | 262.759 | 0.943 | 0.649 | | (8, 16, 512, 64) | head_bias | torch.bfloat16 | 45.481 | 48.435 | 170.267 | 261.265 | 0.939 | 0.652 | | (8, 16, 512, 128) | noop | torch.bfloat16 | 72.565 | 74.736 | 313.220 | 773.126 | 0.971 | 0.405 | | (8, 16, 512, 128) | causal_mask | torch.bfloat16 | 72.015 | 75.755 | 313.311 | 775.513 | 0.951 | 0.404 | | (8, 16, 512, 
128) | relative_bias | torch.bfloat16 | 72.105 | 74.189 | 313.806 | 769.238 | 0.972 | 0.408 | | (8, 16, 512, 128) | head_bias | torch.bfloat16 | 72.005 | 74.364 | 313.509 | 775.237 | 0.968 | 0.404 | | (8, 16, 512, 256) | noop | torch.bfloat16 | 138.656 | 165.453 | 663.707 | 2672.067 | 0.838 | 0.248 | | (8, 16, 512, 256) | causal_mask | torch.bfloat16 | 139.096 | 172.613 | 663.593 | 2926.538 | 0.806 | 0.227 | | (8, 16, 512, 256) | relative_bias | torch.bfloat16 | 139.500 | 168.417 | 663.938 | 2658.629 | 0.828 | 0.250 | | (8, 16, 512, 256) | head_bias | torch.bfloat16 | 139.776 | 173.549 | 662.920 | 2667.266 | 0.805 | 0.249 | | (8, 16, 1024, 64) | noop | torch.bfloat16 | 134.883 | 125.004 | 484.706 | 1195.254 | 1.079 | 0.406 | | (8, 16, 1024, 64) | causal_mask | torch.bfloat16 | 134.297 | 132.875 | 485.420 | 1234.953 | 1.011 | 0.393 | | (8, 16, 1024, 64) | relative_bias | torch.bfloat16 | 134.839 | 139.231 | 485.470 | 1198.556 | 0.968 | 0.405 | | (8, 16, 1024, 64) | head_bias | torch.bfloat16 | 133.822 | 136.449 | 485.608 | 1189.198 | 0.981 | 0.408 | | (8, 16, 1024, 128) | noop | torch.bfloat16 | 235.470 | 234.765 | 886.094 | 2662.944 | 1.003 | 0.333 | | (8, 16, 1024, 128) | causal_mask | torch.bfloat16 | 236.305 | 241.382 | 886.293 | 2646.984 | 0.979 | 0.335 | | (8, 16, 1024, 128) | relative_bias | torch.bfloat16 | 236.414 | 233.980 | 885.250 | 2642.178 | 1.010 | 0.335 | | (8, 16, 1024, 128) | head_bias | torch.bfloat16 | 237.176 | 239.040 | 885.754 | 2665.242 | 0.992 | 0.332 | | (8, 16, 1024, 256) | noop | torch.bfloat16 | 504.445 | 517.855 | 1978.956 | 9592.906 | 0.974 | 0.206 | | (8, 16, 1024, 256) | causal_mask | torch.bfloat16 | 502.428 | 536.002 | 1978.611 | 10607.342 | 0.937 | 0.187 | | (8, 16, 1024, 256) | relative_bias | torch.bfloat16 | 503.396 | 523.960 | 1977.993 | 9539.284 | 0.961 | 0.207 | | (8, 16, 1024, 256) | head_bias | torch.bfloat16 | 503.818 | 536.014 | 1980.131 | 9576.262 | 0.940 | 0.207 | | (8, 16, 4096, 64) | noop | torch.bfloat16 | 1970.139 | 1674.930 | 5750.940 | 16724.134 | 1.176 | 0.344 | | (8, 16, 4096, 64) | causal_mask | torch.bfloat16 | 1959.036 | 1775.056 | 5780.512 | 17390.350 | 1.104 | 0.332 | | (8, 16, 4096, 64) | relative_bias | torch.bfloat16 | 1947.198 | 1773.869 | 5780.643 | 16779.699 | 1.098 | 0.345 | | (8, 16, 4096, 64) | head_bias | torch.bfloat16 | 1963.935 | 1829.502 | 5780.018 | 16703.259 | 1.073 | 0.346 | | (8, 16, 4096, 128) | noop | torch.bfloat16 | 3582.711 | 3362.623 | 10436.069 | 36415.565 | 1.065 | 0.287 | | (8, 16, 4096, 128) | causal_mask | torch.bfloat16 | 3581.504 | 3499.472 | 10346.869 | 36164.959 | 1.023 | 0.286 | | (8, 16, 4096, 128) | relative_bias | torch.bfloat16 | 3589.779 | 3337.849 | 10529.621 | 36261.696 | 1.075 | 0.290 | | (8, 16, 4096, 128) | head_bias | torch.bfloat16 | 3602.265 | 3436.444 | 10458.660 | 36507.790 | 1.048 | 0.286 | | (8, 16, 4096, 256) | noop | torch.bfloat16 | 7695.923 | 7126.275 | 24643.009 | 140949.081 | 1.080 | 0.175 | | (8, 16, 4096, 256) | causal_mask | torch.bfloat16 | 7679.939 | 7186.252 | 24538.105 | 157156.067 | 1.069 | 0.156 | | (8, 16, 4096, 256) | relative_bias | torch.bfloat16 | 7681.374 | 6994.832 | 24549.713 | 140077.179 | 1.098 | 0.175 | | (8, 16, 4096, 256) | head_bias | torch.bfloat16 | 7679.822 | 7212.278 | 24627.823 | 140675.003 | 1.065 | 0.175 | | (16, 16, 512, 64) | noop | torch.bfloat16 | 80.126 | 78.291 | 333.719 | 541.165 | 1.023 | 0.617 | | (16, 16, 512, 64) | causal_mask | torch.bfloat16 | 80.065 | 81.696 | 333.779 | 551.113 | 0.980 | 0.606 | | (16, 16, 512, 64) | relative_bias 
| torch.bfloat16 | 80.138 | 86.715 | 333.364 | 542.118 | 0.924 | 0.615 | | (16, 16, 512, 64) | head_bias | torch.bfloat16 | 80.415 | 85.204 | 333.294 | 536.840 | 0.944 | 0.621 | | (16, 16, 512, 128) | noop | torch.bfloat16 | 134.964 | 138.025 | 607.093 | 1333.102 | 0.978 | 0.455 | | (16, 16, 512, 128) | causal_mask | torch.bfloat16 | 134.192 | 141.523 | 606.269 | 1424.318 | 0.948 | 0.426 | | (16, 16, 512, 128) | relative_bias | torch.bfloat16 | 135.711 | 138.639 | 606.283 | 1327.974 | 0.979 | 0.457 | | (16, 16, 512, 128) | head_bias | torch.bfloat16 | 135.552 | 140.555 | 607.107 | 1347.370 | 0.964 | 0.451 | | (16, 16, 512, 256) | noop | torch.bfloat16 | 275.113 | 315.144 | 1301.583 | 5268.153 | 0.873 | 0.247 | | (16, 16, 512, 256) | causal_mask | torch.bfloat16 | 274.867 | 328.106 | 1302.513 | 5770.594 | 0.838 | 0.226 | | (16, 16, 512, 256) | relative_bias | torch.bfloat16 | 276.052 | 321.770 | 1302.904 | 5241.920 | 0.858 | 0.249 | | (16, 16, 512, 256) | head_bias | torch.bfloat16 | 271.409 | 328.839 | 1302.142 | 5266.037 | 0.825 | 0.247 | | (16, 16, 1024, 64) | noop | torch.bfloat16 | 260.489 | 237.463 | 955.884 | 1817.558 | 1.097 | 0.526 | | (16, 16, 1024, 64) | causal_mask | torch.bfloat16 | 262.378 | 254.350 | 955.280 | 1843.807 | 1.032 | 0.518 | | (16, 16, 1024, 64) | relative_bias | torch.bfloat16 | 261.338 | 268.253 | 956.038 | 1820.036 | 0.974 | 0.525 | | (16, 16, 1024, 64) | head_bias | torch.bfloat16 | 262.153 | 264.156 | 956.023 | 1810.076 | 0.992 | 0.528 | | (16, 16, 1024, 128) | noop | torch.bfloat16 | 476.475 | 461.413 | 1760.578 | 4306.521 | 1.033 | 0.409 | | (16, 16, 1024, 128) | causal_mask | torch.bfloat16 | 473.794 | 479.178 | 1761.277 | 4619.439 | 0.989 | 0.381 | | (16, 16, 1024, 128) | relative_bias | torch.bfloat16 | 473.839 | 463.282 | 1758.692 | 4290.562 | 1.023 | 0.410 | | (16, 16, 1024, 128) | head_bias | torch.bfloat16 | 472.979 | 472.896 | 1763.086 | 4367.931 | 1.000 | 0.404 | | (16, 16, 1024, 256) | noop | torch.bfloat16 | 1014.184 | 1026.764 | 3922.997 | 19104.147 | 0.988 | 0.205 | | (16, 16, 1024, 256) | causal_mask | torch.bfloat16 | 1013.217 | 1039.046 | 3928.382 | 21086.281 | 0.975 | 0.186 | | (16, 16, 1024, 256) | relative_bias | torch.bfloat16 | 1008.519 | 1015.278 | 3922.133 | 18980.652 | 0.993 | 0.207 | | (16, 16, 1024, 256) | head_bias | torch.bfloat16 | 1011.360 | 1047.542 | 3931.245 | 19069.172 | 0.965 | 0.206 | | (16, 16, 4096, 64) | noop | torch.bfloat16 | 3929.850 | 3325.667 | 11411.704 | 23344.280 | 1.182 | 0.489 | | (16, 16, 4096, 64) | causal_mask | torch.bfloat16 | 3885.262 | 3581.544 | 11390.515 | 23725.639 | 1.085 | 0.480 | | (16, 16, 4096, 64) | relative_bias | torch.bfloat16 | 3865.737 | 3537.308 | 11489.901 | 23406.330 | 1.093 | 0.491 | | (16, 16, 4096, 64) | head_bias | torch.bfloat16 | 3880.530 | 3665.249 | 11484.411 | 23299.496 | 1.059 | 0.493 | | (16, 16, 4096, 128) | noop | torch.bfloat16 | 7030.306 | 6745.715 | 20621.264 | 57464.096 | 1.042 | 0.359 | | (16, 16, 4096, 128) | causal_mask | torch.bfloat16 | 7095.414 | 7034.385 | 20410.656 | 61660.511 | 1.009 | 0.331 | | (16, 16, 4096, 128) | relative_bias | torch.bfloat16 | 7084.779 | 6686.497 | 20315.161 | 57243.969 | 1.060 | 0.355 | | (16, 16, 4096, 128) | head_bias | torch.bfloat16 | 7075.367 | 6863.305 | 20494.385 | 58481.953 | 1.031 | 0.350 | | (16, 16, 4096, 256) | noop | torch.bfloat16 | 15612.741 | 14297.482 | 55306.847 | 281161.865 | 1.092 | 0.197 | | (16, 16, 4096, 256) | causal_mask | torch.bfloat16 | 15326.592 | 14263.878 | 55227.806 | 313063.232 | 1.075 | 0.176 | | (16, 
16, 4096, 256) | relative_bias | torch.bfloat16 | 15297.963 | 14007.379 | 54558.029 | 279529.175 | 1.092 | 0.195 | | (16, 16, 4096, 256) | head_bias | torch.bfloat16 | 15216.160 | 14276.027 | 55081.581 | 280996.826 | 1.066 | 0.196 | </details> Pull Request resolved: pytorch#125515 Approved by: https://github.com/Chillee
Commit: 22db67f
[dynamo] Delete extra testing of cpp guard manager (pytorch#126343)
CPP guard manager has been on for a few weeks now. This separate testing was part of phasing when the cpp guard manager was not enabled. Now this is not needed. Pull Request resolved: pytorch#126343 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#126303, pytorch#126316, pytorch#126314, pytorch#126327
Commit: 8dced59
fix the device type for with_comms decorator (pytorch#125798)
Found by @yifuwang: it looks like we were wrongly using self.device_type="cuda" for the gloo backend, which was triggering some flakiness, e.g. pytorch#125366 Pull Request resolved: pytorch#125798 Approved by: https://github.com/yifuwang
Commit: c73f90c
Add mode to MemoryDep to track atomic accumulates (pytorch#123223)
And allow fusion of buffers where writes are only atomic accumulates. This allows fusing of ops like _unsafe_index_put(_unsafe_index_put(a, ...), ...) Pull Request resolved: pytorch#123223 Approved by: https://github.com/peterbell10
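A sketch of the pattern this unlocks (nested accumulating index_puts whose writes are both atomic accumulates):

```python
import torch

def f(x, idx_a, idx_b, src):
    # Both writes are atomic accumulates, so the new MemoryDep mode
    # lets Inductor fuse the two buffers.
    y = torch.ops.aten._unsafe_index_put(x, [idx_a], src, True)
    return torch.ops.aten._unsafe_index_put(y, [idx_b], src, True)

compiled = torch.compile(f)
out = compiled(torch.zeros(8), torch.tensor([0, 1]),
               torch.tensor([1, 2]), torch.ones(2))
```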
Commit: 9f09eae
[c10d] Add an option for NAN check on every collective (pytorch#125726)
Summary: The NaN check is done through a device-side assert, with no copy needed from GPU to CPU.
Test Plan: Unit test for collectives that should experience a runtime error:
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$ python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15: checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)` failed.
(the same assertion fires for threads [1,0,0] through [5,0,0] on each rank)
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during checkForNan: device-side assert triggered
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during checkForNan: device-side assert triggered
Ran 1 test in 7.723s
OK
Pull Request resolved: pytorch#125726 Approved by: https://github.com/kwen2501
Commit: 2ba6d37
Generate runtime asserts when propagate real tensor is used (pytorch#126287)
This means that propagate real tensor is no longer unsound: if the route we took at compile time diverges from runtime, you will get a runtime assert. Also add structured trace logs for these. Also fix a bug where xreplace with int range is not guaranteed to return a sympy expression. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#126287 Approved by: https://github.com/Skylion007
Commit: 8989a88
[ez] fix exported diff mismatch (pytorch#126357)
Fixes the following issue: D55803461 differs from the exported PR: pytorch#123658
⚠️ this PR needs to be skipped on diff train! Pull Request resolved: pytorch#126357 Approved by: https://github.com/huydhn, https://github.com/fegin
Commit: 1473472
[Add sliding window attention bias] (pytorch#126061)
Summary: This PR implements sliding window attention and updates "aten._flash_attention_forward/_flash_attention_backward" to expose the window_size_left and window_size_right arguments. With these kwargs added we can dispatch to the FAv2 impl if the necessary constraints are met. These arguments will eventually be provided to "aten.sdpa_flash", but for now they are needed when called by xformers in their effort to directly use the PyTorch FAv2 impl instead of building their own. Test Plan: Use the default aten.sdpa_flash tests, since we've added optional arguments set to the previous default value: -1 /*window_size_left*/. Using buck2 build --flagfile fbcode//mode/dev-nosan fbcode//caffe2/caffe2/fb/predictor/tests:inference_context_test Differential Revision: D56938087 Pull Request resolved: pytorch#126061 Approved by: https://github.com/drisspg, https://github.com/desertfire
Commit: 64fb6ed
Fix lint failures coming from pytorch#126035 (pytorch#126378)
MYPY somehow shows lots of local failures for me. The issue is tracked in pytorch#126361. This is only to keep trunk sane. These two lines were added by pytorch#126035 as an attempt to fix lint there, but didn't seem to help. Pull Request resolved: pytorch#126378 Approved by: https://github.com/kit1980
Commit: 7dab5f7
[1/N] Non-Tensor: Scalar Support: Enable aot compile to support aten operations with scalar input like alpha (pytorch#124177)
Some operations have a scalar input parameter, like `torch.add(a, b, alpha=2.0)`. Currently, aot compile does not support such a case because it requires the signature of the captured graph to align with the operation's signature. This means that some inputs in the captured graph may be scalars (float, int, bool, etc.). It breaks the assumption of `compile_fx_aot`, as it assumes all the example inputs are tensors - https://github.com/pytorch/pytorch/blob/0f6ce45bcbd7026c00da43db0317ede10830378b/torch/_inductor/compile_fx.py#L1048 This PR intends to support such cases by allowing a non-aligned signature and filtering out the non-Tensor parameters. Captured graph for `torch.add(a, b, alpha=2.0)`:
```
opcode         name      target           args              kwargs
-------------  --------  ---------------  ----------------  --------------
placeholder    arg0_1    arg0_1           ()                {}
placeholder    arg1_1    arg1_1           ()                {}
call_function  add       aten.add.Tensor  (arg0_1, arg1_1)  {'alpha': 2.0}
output         output_1  output           ((add,),)         {}
```
Pull Request resolved: pytorch#124177 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/jgong5
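A sketch of the kind of function this enables AOT-compiling (the `torch._export.aot_compile` entry point is assumed here for illustration):

```python
import torch

def f(a, b):
    # alpha is a non-Tensor (scalar) input that ends up in the
    # captured graph's kwargs rather than as a tensor placeholder.
    return torch.add(a, b, alpha=2.0)

so_path = torch._export.aot_compile(f, (torch.randn(4), torch.randn(4)))
```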
Commit: fb2c753
[Doc] Add deprecated autocast comments for doc (pytorch#126062)
# Motivation
We generalized a device-agnostic API `torch.amp.autocast` in pytorch#125103. After that:
- `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and
- `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)`
no matter whether in eager mode or JIT mode. Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` to **strongly recommend** that developers use `torch.amp.autocast`, which is a device-agnostic API. Pull Request resolved: pytorch#126062 Approved by: https://github.com/eqy, https://github.com/albanD
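For illustration, the recommended migration:

```python
import torch

# Deprecated, device-specific spellings:
#   with torch.cuda.amp.autocast(): ...
#   with torch.cpu.amp.autocast(): ...

# Preferred, device-agnostic spelling:
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    out = torch.mm(torch.randn(4, 4), torch.randn(4, 4))
```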
Commit: 45d93f9
Revert "Fix lint failures coming from pytorch#126035 (pytorch#126378)"
This reverts commit 5fa1f4c. Reverted pytorch#126378 on behalf of https://github.com/huydhn due to Trying to add yet another lint fix from https://hud.pytorch.org/pr/pytorch/pytorch/126357 and will reland this ([comment](pytorch#126378 (comment)))
Commit: 75289f2
Revert "Add Lowering for FlexAttention Backwards (pytorch#125515)"
This reverts commit 95b9e98. Reverted pytorch#125515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the newly added test runs out of memory https://hud.pytorch.org/pytorch/pytorch/commit/95b9e981c3ab68fc17f78b8a6bbfd9569745ae4c ([comment](pytorch#125515 (comment)))
Commit: dd2f8d1
Fix lint failures coming from pytorch#126035 (pytorch#126378)
MYPY somehow shows lots of local failures for me. The issue is tracked in pytorch#126361. This is only to keep trunk sane. These two lines were added by pytorch#126035 as an attempt to fix lint there, but didn't seem to help. Pull Request resolved: pytorch#126378 Approved by: https://github.com/kit1980
Commit: adc0551
[Traceable FSDP2] Add all_gather_into_tensor out variant (pytorch#126334)
This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`. It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather and makes the input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, the AllGather op will then create a brand-new output buffer (instead of reusing), thus significantly increasing the memory usage. The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in the Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users. Pull Request resolved: pytorch#126334 Approved by: https://github.com/yifuwang, https://github.com/wanchaol
Commit: 64efc14
Fix broken link of scikit-learn (pytorch#120972)
The link is broken in https://pytorch.org/docs/main/community/design.html Pull Request resolved: pytorch#120972 Approved by: https://github.com/Skylion007
Commit: 0ddafc0
[Reopen] Upgrade submodule oneDNN to v3.4.2 (pytorch#126137)
Reopen of pytorch#122472
## Improvements
This upgrade fixes the following issues:
- pytorch#120982
This upgrade brings the following new features:
- Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (pytorch#114450)
## Validation results on CPU
Original results with oneDNN v3.4.1 are here: pytorch#122472 (comment) Need to rerun validation and update results. Co-authored-by: Sunita Nadampalli <nadampal@amazon.com> Pull Request resolved: pytorch#126137 Approved by: https://github.com/jgong5, https://github.com/snadampal, https://github.com/atalman
Commit: 8288174
[FSDP2] Supported `set_all_reduce_gradients=False` for HSDP (pytorch#126166)
**Context** For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients).
- FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`.
- FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`).
For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2).
- FSDP2 offers (1) without any intervention like mentioned above.
- FSDP2 offers (3) via `module.set_requires_gradient_sync()` like mentioned above.
- FSDP2 offers (2) via `module.set_requires_all_reduce()` similar to `set_requires_gradient_sync()`.
**Overview** For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:
```
for microbatch_idx, microbatch in enumerate(microbatches):
    is_last_microbatch = microbatch_idx == len(microbatches) - 1
    model.set_requires_all_reduce(is_last_microbatch)
    # Run forward/backward
```
This PR also makes the minor change of making the `recurse: bool` argument in these setter methods kwarg-only.
**Developer Notes** We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. We prefer to avoid this alternative for now because it introduces more complexity to do extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back to the last microbatch's reduce output. Pull Request resolved: pytorch#126166 Approved by: https://github.com/weifengpy, https://github.com/wanchaol ghstack dependencies: pytorch#126067, pytorch#126070, pytorch#126161
Commit: 5dd875a
Fix aarch64 debug build with GCC (pytorch#126290)
By working around GCC's quirks in instantiating templates that require immediate values. Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`). Fixes pytorch#126283 Pull Request resolved: pytorch#126290 Approved by: https://github.com/atalman, https://github.com/seemethere
Commit: cebb5df
Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (pytorch#126336)
Fixes pytorch#125504 Fixes pytorch#126252 Fixes pytorch#126296 Fixes pytorch#126330 This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR. Pull Request resolved: pytorch#126336 Approved by: https://github.com/huydhn, https://github.com/pruthvistony
Commit: 60fb3ef
[ROCm] amax hipblaslt integration (pytorch#125921)
AMAX is coming as part of ROCm 6.2. This code adds that functionality. Pull Request resolved: pytorch#125921 Approved by: https://github.com/eqy, https://github.com/lezcano
Commit: 9df7bda
Add 2nd shard to ROCm trunk workflow for core distributed UTs (pytorch#121716)
Pull Request resolved: pytorch#121716 Approved by: https://github.com/ezyang, https://github.com/huydhn
Commit: 19dfbce
[AOTI][torchgen] Support at::Generator via C shim (pytorch#126181)
Summary: Support at::Generator which is used by many random number generator ops Pull Request resolved: pytorch#126181 Approved by: https://github.com/chenyang78
Commit: 7e7392b
[AOTI] Refactor some fallback op util functions (pytorch#126182)
Summary: Move some util functions for cpp kernel naming and missing arg filling from FallbackKernel to ExternKernel, since they are useful for ExternKernel in general. Pull Request resolved: pytorch#126182 Approved by: https://github.com/chenyang78 ghstack dependencies: pytorch#126181
Commit: 27b7381
[AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (pytorch#126183)
Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes pytorch#121809 Pull Request resolved: pytorch#126183 Approved by: https://github.com/angelayi ghstack dependencies: pytorch#126181, pytorch#126182
Commit: d27e21d
[AOTI][refactor] Add aoti_torch_item as a util function (pytorch#126352)
Summary: The logic has been repeated several times in the code, so it's worth writing a common util function. Pull Request resolved: pytorch#126352 Approved by: https://github.com/chenyang78 ghstack dependencies: pytorch#126181, pytorch#126182, pytorch#126183
Commit: 272b119
[BE][FSDP] Change the logging level to info (pytorch#126362)
As title Differential Revision: [D57419445](https://our.internmc.facebook.com/intern/diff/D57419445/) Pull Request resolved: pytorch#126362 Approved by: https://github.com/awgu, https://github.com/Skylion007
Commit: 667af78
[BE][FSDP] Remove unnecessary warnings (pytorch#126365)
As title Differential Revision: [D57419704](https://our.internmc.facebook.com/intern/diff/D57419704/) Pull Request resolved: pytorch#126365 Approved by: https://github.com/awgu, https://github.com/Skylion007 ghstack dependencies: pytorch#126362
Commit: 08e5a7e
[onnx.export] Cache SetGraphInputTypeReliable (pytorch#124912)
This PR is part of an effort to speed up torch.onnx.export (pytorch#121422).
- For each node that is processed in onnx.export, a check is run to see if all inputs are "reliable" (static shape, etc.). This value does not change, so it is much faster to cache it on the first computation. The caching is added to the ConstantMap state.
- Resolves (6) in pytorch#121422.
- Also see pytorch#123028 with a similar addition of a cache state.
(partial fix of pytorch#121545) Pull Request resolved: pytorch#124912 Approved by: https://github.com/justinchuby
Commit: 2a34465
Remove redundant serialization code (pytorch#126249)
After pytorch#123308, we no longer need separate serialization path to handle different types that exist in the `nn_module` metadata. This PR cleans up the redundant code. Pull Request resolved: pytorch#126249 Approved by: https://github.com/angelayi
Commit: 45a699a
[Dynamo] Support SET_UPDATE (pytorch#126243)
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#126243 Approved by: https://github.com/anijain2305, https://github.com/Skylion007, https://github.com/jansel
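A small sketch of code that exercises the new opcode (CPython 3.9+ compiles the starred set display below to BUILD_SET followed by SET_UPDATE):

```python
import torch

@torch.compile(fullgraph=True)
def f(x, extra):
    s = {1, 2, *extra}  # the unpacking emits the SET_UPDATE bytecode
    return x + len(s)

print(f(torch.randn(3), {2, 3}))
```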
Commit: b24a9e3
xpu: implement xpu serialization (pytorch#125530)
Fixes: pytorch#125529
BC-breaking note: The deprecated "async" argument to Storage.cuda and Storage.hpu has been removed. Use non_blocking instead.
CC: @jbschlosser, @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @albanD Pull Request resolved: pytorch#125530 Approved by: https://github.com/guangyey, https://github.com/albanD
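For illustration, the replacement spelling on a CUDA build (a sketch; `untyped_storage()` is just one way to get a Storage):

```python
import torch

s = torch.randn(4).untyped_storage()
# Previously: s.cuda(async=True)  -- the 'async' argument is now removed.
s_cuda = s.cuda(non_blocking=True)
```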
Commit: a2e563d
Don't install inplace_methods on MockHandler, not needed (pytorch#126398)
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#126398 Approved by: https://github.com/jansel, https://github.com/peterbell10
Commit: 4c93c7a
Make 'pytest test/inductor/test_memory_planning.py' work (pytorch#126397)
There's still another naughty direct test_* import; I'm out of patience right now, though. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#126397 Approved by: https://github.com/peterbell10, https://github.com/int3
Commit: f1897d4
Switched from parameter in can_cast to from_. (pytorch#126030)
Fixes pytorch#126012. `from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as function parameter. This PR changes the name to `from_` and also adjusts the docs. If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then. Pull Request resolved: pytorch#126030 Approved by: https://github.com/albanD
Commit: f4daf9e
[Traceable FSDP2] Use DTensor.from_local() in _from_local_no_grad when compiling (pytorch#126346)
As discussed before, for now Dynamo is not able to support the DTensor constructor, and instead we have to use `DTensor.from_local()`. This won't affect eager; it's a compile-only change. Pull Request resolved: pytorch#126346 Approved by: https://github.com/awgu
Commit: 74ad455
Fix strict default value in StateDictOptions (pytorch#125998)
Fixes pytorch#125992 The default value of the parameter `strict` should be `True`. Pull Request resolved: pytorch#125998 Approved by: https://github.com/fegin
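A quick check of the corrected default (sketch):

```python
from torch.distributed.checkpoint.state_dict import StateDictOptions

opts = StateDictOptions()
assert opts.strict is True  # now matches the documented default
```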
Commit: 255ae5d
Print export warning only once in capture_pre_autograd (pytorch#126403)
Summary: Missed this in D57163341 Test Plan: CI Differential Revision: D57442088 Pull Request resolved: pytorch#126403 Approved by: https://github.com/zhxchen17
Commit: 1367209
[compiled autograd] Fix LoggingTensor flaky test (pytorch#126144)
LoggingTensor fails consistently when the root logger level is INFO or lower. By default, the root logger should be WARNING, but triton driver initialization will overwrite the root logger to INFO, which causes flakiness: pytorch#126143 Pull Request resolved: pytorch#126144 Approved by: https://github.com/jansel
Commit: 80798a7
[inductor] Clear cache on ctx manager exit (pytorch#126146)
FIXES pytorch#126128. Right now, we only clear the cache on ctx manager enter, so state is bad unless we call fresh_inductor_cache again; that's usually fine in tests. Cue the compiled autograd tests when going from TestCompiledAutograd -> TestAutogradWithCompiledAutograd: TestCompiledAutograd uses the ctx manager, but TestAutogradWithCompiledAutograd doesn't. Pull Request resolved: pytorch#126146 Approved by: https://github.com/jgong5, https://github.com/oulgen ghstack dependencies: pytorch#126144
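For illustration, the context-manager usage whose exit path is being fixed (a sketch):

```python
import torch
from torch._inductor.utils import fresh_inductor_cache

def fn(x):
    return x.sin() + 1

with fresh_inductor_cache():
    torch.compile(fn)(torch.randn(4))
# With this fix, the cache state is cleared again on exit, so tests that
# run afterwards without the context manager don't see stale entries.
```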
Commit: b2efbae
[compiled autograd] clear compiled_autograd_verbose once test is done (pytorch#126148)
The verbose flag leaks into tests run afterwards. Pull Request resolved: pytorch#126148 Approved by: https://github.com/jansel ghstack dependencies: pytorch#126144, pytorch#126146
Commit: b29fd1f
add 3.12 inductor CI tests (pytorch#126218)
Pull Request resolved: pytorch#126218 Approved by: https://github.com/huydhn, https://github.com/desertfire
Commit: 19e7924
Eliminate some C++11 checks (pytorch#126308)
Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D57246912 Pull Request resolved: pytorch#126308 Approved by: https://github.com/Skylion007
Commit: cd76785
Add prefix option to CapabilityBasedPartitioner (pytorch#126382)
Summary: Add a prefix arg so that users can provide the submodule name to the partitioner. Test Plan: https://fburl.com/anp/2kue4qp9 Differential Revision: D57416926 Pull Request resolved: pytorch#126382 Approved by: https://github.com/SherlockNoMad
Commit: 2b7ac1e
Import MKL via //third-party/mkl targets (pytorch#126371)
Summary: This is a step towards upgrading the MKL library and using buckified targets rather than importing from TP2.
- Add new `//third-party/mkl:mkl_xxx` targets that are currently aliases to `third-party//IntelComposerXE:mkl_xxx`.
- Switch usage of `external_deps = [("IntelComposerXE", None, "mkl_xxx")]` to `deps = ["fbsource//third-party/mkl:mkl_xxx"]`.
Note that this only changes `mkl_xxx` references in `IntelComposerXE`, not references to "svml" or "ipp*". Test Plan: sandcastle Differential Revision: D57360438 Pull Request resolved: pytorch#126371 Approved by: https://github.com/bertmaher
Commit: 1948225
[c10d] add pg_name and pg_desc to logger (pytorch#126409)
Summary: This should further improve our debuggability. Pull Request resolved: pytorch#126409 Approved by: https://github.com/XilunWu
Commit: 3bbd7fa
Use object identity for deepcopy memo (pytorch#126126)
Copy of pytorch#126089, with some additional fixes & tests.
Partial fix for pytorch#125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation. The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable.
What changes:
* (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
* (still kind of wrong): views won't actually alias each other after deepcopying
* (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias
BC breaking: the C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.
Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306) Pull Request resolved: pytorch#126126 Approved by: https://github.com/ezyang
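A sketch of the behavior change described above:

```python
import copy
import torch

a = torch.arange(4.0)
b = a.detach()  # aliases a's storage, but is a distinct tensor object

l1 = [a, b, a]
l2 = copy.deepcopy(l1)

assert l2[0] is l2[2]      # duplicate references still map to one copy
assert l2[0] is not l2[1]  # aliased-but-distinct tensors now copy separately
```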
Commit: ac162de
Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/…
…o epilogue fusion (pytorch#126068)" This reverts commit 927e631. Reverted pytorch#126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR pytorch#124021 is going to be revert ([comment](pytorch#126019 (comment)))
Commit: fa207b5
Revert "[inductor][cpp] epilogue support for gemm template (pytorch#1…
…26019)" This reverts commit 7844c20. Reverted pytorch#126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR pytorch#124021 is going to be revert ([comment](pytorch#126019 (comment)))
Commit: 8f51cf7
Revert "[inductor][cpp] GEMM template (infra and fp32) (pytorch#124021)"
This reverts commit f060b0c. Reverted pytorch#124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](pytorch#124021 (comment)))
Commit: 2a6c92a
Add Lowering for FlexAttention Backwards (pytorch#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards pass. I did some other things along the way:
- Abstract out the `build_subgraph_buffer` subroutine and make it reusable between flex attention forwards and backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwd kernel.
- The version of the backwards kernel is from a somewhat older version of the Triton tutorial implementation. I think we should update to a newer version in a follow-up. Notably, the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didn't correctly register the decomp table + IndexMode when I landed pytorch#123902; this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated it and added a test specific to "future causal" attention.
- The main point that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications).
- I updated the benchmark to also profile bwd performance.

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_

FWD Speedups

| Type | Speedup | shape | score_mod | dtype |
|---------|-----------|--------------------|-------------|----------------|
| Average | 0.991 | | | |
| Max | 1.182 | (16, 16, 4096, 64) | noop | torch.bfloat16 |
| Min | 0.796 | (2, 16, 512, 256) | head_bias | torch.bfloat16 |

BWD Speedups

| Type | Speedup | shape | score_mod | dtype |
|---------|-----------|--------------------|-------------|----------------|
| Average | 0.291 | | | |
| Max | 0.652 | (8, 16, 512, 64) | head_bias | torch.bfloat16 |
| Min | 0.073 | (2, 16, 4096, 128) | head_bias | torch.bfloat16 |

<details>
<summary>Full Data</summary>

| shape | score_mod | dtype | fwd_eager_time | fwd_compiled_time | bwd_eager_time | bwd_compiled_time | fwd_speedup | bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64) | noop | torch.bfloat16 | 19.936 | 19.092 | 57.851 | 193.564 | 1.044 | 0.299 |
| (2, 16, 512, 64) | causal_mask | torch.bfloat16 | 19.955 | 19.497 | 57.662 | 206.278 | 1.024 | 0.280 |
| (2, 16, 512, 64) | relative_bias | torch.bfloat16 | 19.455 | 21.297 | 57.674 | 195.219 | 0.913 | 0.295 |
| (2, 16, 512, 64) | head_bias | torch.bfloat16 | 19.958 | 21.289 | 57.674 | 193.859 | 0.938 | 0.298 |
| (2, 16, 512, 128) | noop | torch.bfloat16 | 28.157 | 28.615 | 82.831 | 454.211 | 0.984 | 0.182 |
| (2, 16, 512, 128) | causal_mask | torch.bfloat16 | 28.154 | 28.444 | 83.091 | 432.083 | 0.990 | 0.192 |
| (2, 16, 512, 128) | relative_bias | torch.bfloat16 | 28.722 | 27.897 | 83.175 | 446.789 | 1.030 | 0.186 |
| (2, 16, 512, 128) | head_bias | torch.bfloat16 | 28.299 | 27.673 | 83.052 | 459.179 | 1.023 | 0.181 |
| (2, 16, 512, 256) | noop | torch.bfloat16 | 41.167 | 50.504 | 175.019 | 1083.545 | 0.815 | 0.162 |
| (2, 16, 512, 256) | causal_mask | torch.bfloat16 | 41.656 | 51.933 | 175.078 | 1171.176 | 0.802 | 0.149 |
| (2, 16, 512, 256) | relative_bias | torch.bfloat16 | 41.697 | 50.722 | 175.159 | 1097.312 | 0.822 | 0.160 |
| (2, 16, 512, 256) | head_bias | torch.bfloat16 | 41.690 | 52.387 | 175.184 | 1097.336 | 0.796 | 0.160 |
| (2, 16, 1024, 64) | noop | torch.bfloat16 | 39.232 | 37.454 | 127.847 | 612.430 | 1.047 | 0.209 |
| (2, 16, 1024, 64) | causal_mask | torch.bfloat16 | 39.930 | 39.599 | 127.755 | 665.359 | 1.008 | 0.192 |
| (2, 16, 1024, 64) | relative_bias | torch.bfloat16 | 39.417 | 41.304 | 127.902 | 614.990 | 0.954 | 0.208 |
| (2, 16, 1024, 64) | head_bias | torch.bfloat16 | 39.965 | 42.034 | 127.953 | 613.273 | 0.951 | 0.209 |
| (2, 16, 1024, 128) | noop | torch.bfloat16 | 63.964 | 71.024 | 226.510 | 1637.669 | 0.901 | 0.138 |
| (2, 16, 1024, 128) | causal_mask | torch.bfloat16 | 63.843 | 72.451 | 226.750 | 1558.949 | 0.881 | 0.145 |
| (2, 16, 1024, 128) | relative_bias | torch.bfloat16 | 64.301 | 70.487 | 226.651 | 1610.063 | 0.912 | 0.141 |
| (2, 16, 1024, 128) | head_bias | torch.bfloat16 | 64.033 | 71.394 | 226.676 | 1668.511 | 0.897 | 0.136 |
| (2, 16, 1024, 256) | noop | torch.bfloat16 | 129.348 | 141.390 | 507.337 | 4405.175 | 0.915 | 0.115 |
| (2, 16, 1024, 256) | causal_mask | torch.bfloat16 | 129.538 | 145.680 | 507.178 | 4768.874 | 0.889 | 0.106 |
| (2, 16, 1024, 256) | relative_bias | torch.bfloat16 | 129.438 | 142.782 | 507.004 | 4401.002 | 0.907 | 0.115 |
| (2, 16, 1024, 256) | head_bias | torch.bfloat16 | 129.058 | 146.242 | 507.547 | 4434.251 | 0.883 | 0.114 |
| (2, 16, 4096, 64) | noop | torch.bfloat16 | 481.606 | 409.120 | 1440.890 | 14147.269 | 1.177 | 0.102 |
| (2, 16, 4096, 64) | causal_mask | torch.bfloat16 | 480.227 | 438.847 | 1434.419 | 14973.386 | 1.094 | 0.096 |
| (2, 16, 4096, 64) | relative_bias | torch.bfloat16 | 480.831 | 458.104 | 1432.935 | 14193.253 | 1.050 | 0.101 |
| (2, 16, 4096, 64) | head_bias | torch.bfloat16 | 480.749 | 452.497 | 1437.040 | 14084.869 | 1.062 | 0.102 |
| (2, 16, 4096, 128) | noop | torch.bfloat16 | 872.534 | 848.275 | 2600.895 | 35156.849 | 1.029 | 0.074 |
| (2, 16, 4096, 128) | causal_mask | torch.bfloat16 | 872.647 | 868.279 | 2587.581 | 31919.531 | 1.005 | 0.081 |
| (2, 16, 4096, 128) | relative_bias | torch.bfloat16 | 871.484 | 827.644 | 2593.989 | 34805.634 | 1.053 | 0.075 |
| (2, 16, 4096, 128) | head_bias | torch.bfloat16 | 871.422 | 856.437 | 2602.482 | 35708.591 | 1.017 | 0.073 |
| (2, 16, 4096, 256) | noop | torch.bfloat16 | 1904.497 | 1758.183 | 6122.416 | 66754.593 | 1.083 | 0.092 |
| (2, 16, 4096, 256) | causal_mask | torch.bfloat16 | 1911.174 | 1762.821 | 6113.207 | 72759.392 | 1.084 | 0.084 |
| (2, 16, 4096, 256) | relative_bias | torch.bfloat16 | 1911.254 | 1727.108 | 6123.530 | 66577.988 | 1.107 | 0.092 |
| (2, 16, 4096, 256) | head_bias | torch.bfloat16 | 1916.977 | 1801.804 | 6118.158 | 67359.680 | 1.064 | 0.091 |
| (8, 16, 512, 64) | noop | torch.bfloat16 | 44.984 | 43.974 | 170.276 | 262.259 | 1.023 | 0.649 |
| (8, 16, 512, 64) | causal_mask | torch.bfloat16 | 45.001 | 46.265 | 170.509 | 274.893 | 0.973 | 0.620 |
| (8, 16, 512, 64) | relative_bias | torch.bfloat16 | 45.466 | 48.211 | 170.606 | 262.759 | 0.943 | 0.649 |
| (8, 16, 512, 64) | head_bias | torch.bfloat16 | 45.481 | 48.435 | 170.267 | 261.265 | 0.939 | 0.652 |
| (8, 16, 512, 128) | noop | torch.bfloat16 | 72.565 | 74.736 | 313.220 | 773.126 | 0.971 | 0.405 |
| (8, 16, 512, 128) | causal_mask | torch.bfloat16 | 72.015 | 75.755 | 313.311 | 775.513 | 0.951 | 0.404 |
| (8, 16, 512, 128) | relative_bias | torch.bfloat16 | 72.105 | 74.189 | 313.806 | 769.238 | 0.972 | 0.408 |
| (8, 16, 512, 128) | head_bias | torch.bfloat16 | 72.005 | 74.364 | 313.509 | 775.237 | 0.968 | 0.404 |
| (8, 16, 512, 256) | noop | torch.bfloat16 | 138.656 | 165.453 | 663.707 | 2672.067 | 0.838 | 0.248 |
| (8, 16, 512, 256) | causal_mask | torch.bfloat16 | 139.096 | 172.613 | 663.593 | 2926.538 | 0.806 | 0.227 |
| (8, 16, 512, 256) | relative_bias | torch.bfloat16 | 139.500 | 168.417 | 663.938 | 2658.629 | 0.828 | 0.250 |
| (8, 16, 512, 256) | head_bias | torch.bfloat16 | 139.776 | 173.549 | 662.920 | 2667.266 | 0.805 | 0.249 |
| (8, 16, 1024, 64) | noop | torch.bfloat16 | 134.883 | 125.004 | 484.706 | 1195.254 | 1.079 | 0.406 |
| (8, 16, 1024, 64) | causal_mask | torch.bfloat16 | 134.297 | 132.875 | 485.420 | 1234.953 | 1.011 | 0.393 |
| (8, 16, 1024, 64) | relative_bias | torch.bfloat16 | 134.839 | 139.231 | 485.470 | 1198.556 | 0.968 | 0.405 |
| (8, 16, 1024, 64) | head_bias | torch.bfloat16 | 133.822 | 136.449 | 485.608 | 1189.198 | 0.981 | 0.408 |
| (8, 16, 1024, 128) | noop | torch.bfloat16 | 235.470 | 234.765 | 886.094 | 2662.944 | 1.003 | 0.333 |
| (8, 16, 1024, 128) | causal_mask | torch.bfloat16 | 236.305 | 241.382 | 886.293 | 2646.984 | 0.979 | 0.335 |
| (8, 16, 1024, 128) | relative_bias | torch.bfloat16 | 236.414 | 233.980 | 885.250 | 2642.178 | 1.010 | 0.335 |
| (8, 16, 1024, 128) | head_bias | torch.bfloat16 | 237.176 | 239.040 | 885.754 | 2665.242 | 0.992 | 0.332 |
| (8, 16, 1024, 256) | noop | torch.bfloat16 | 504.445 | 517.855 | 1978.956 | 9592.906 | 0.974 | 0.206 |
| (8, 16, 1024, 256) | causal_mask | torch.bfloat16 | 502.428 | 536.002 | 1978.611 | 10607.342 | 0.937 | 0.187 |
| (8, 16, 1024, 256) | relative_bias | torch.bfloat16 | 503.396 | 523.960 | 1977.993 | 9539.284 | 0.961 | 0.207 |
| (8, 16, 1024, 256) | head_bias | torch.bfloat16 | 503.818 | 536.014 | 1980.131 | 9576.262 | 0.940 | 0.207 |
| (8, 16, 4096, 64) | noop | torch.bfloat16 | 1970.139 | 1674.930 | 5750.940 | 16724.134 | 1.176 | 0.344 |
| (8, 16, 4096, 64) | causal_mask | torch.bfloat16 | 1959.036 | 1775.056 | 5780.512 | 17390.350 | 1.104 | 0.332 |
| (8, 16, 4096, 64) | relative_bias | torch.bfloat16 | 1947.198 | 1773.869 | 5780.643 | 16779.699 | 1.098 | 0.345 |
| (8, 16, 4096, 64) | head_bias | torch.bfloat16 | 1963.935 | 1829.502 | 5780.018 | 16703.259 | 1.073 | 0.346 |
| (8, 16, 4096, 128) | noop | torch.bfloat16 | 3582.711 | 3362.623 | 10436.069 | 36415.565 | 1.065 | 0.287 |
| (8, 16, 4096, 128) | causal_mask | torch.bfloat16 | 3581.504 | 3499.472 | 10346.869 | 36164.959 | 1.023 | 0.286 |
| (8, 16, 4096, 128) | relative_bias | torch.bfloat16 | 3589.779 | 3337.849 | 10529.621 | 36261.696 | 1.075 | 0.290 |
| (8, 16, 4096, 128) | head_bias | torch.bfloat16 | 3602.265 | 3436.444 | 10458.660 | 36507.790 | 1.048 | 0.286 |
| (8, 16, 4096, 256) | noop | torch.bfloat16 | 7695.923 | 7126.275 | 24643.009 | 140949.081 | 1.080 | 0.175 |
| (8, 16, 4096, 256) | causal_mask | torch.bfloat16 | 7679.939 | 7186.252 | 24538.105 | 157156.067 | 1.069 | 0.156 |
| (8, 16, 4096, 256) | relative_bias | torch.bfloat16 | 7681.374 | 6994.832 | 24549.713 | 140077.179 | 1.098 | 0.175 |
| (8, 16, 4096, 256) | head_bias | torch.bfloat16 | 7679.822 | 7212.278 | 24627.823 | 140675.003 | 1.065 | 0.175 |
| (16, 16, 512, 64) | noop | torch.bfloat16 | 80.126 | 78.291 | 333.719 | 541.165 | 1.023 | 0.617 |
| (16, 16, 512, 64) | causal_mask | torch.bfloat16 | 80.065 | 81.696 | 333.779 | 551.113 | 0.980 | 0.606 |
| (16, 16, 512, 64) | relative_bias | torch.bfloat16 | 80.138 | 86.715 | 333.364 | 542.118 | 0.924 | 0.615 |
| (16, 16, 512, 64) | head_bias | torch.bfloat16 | 80.415 | 85.204 | 333.294 | 536.840 | 0.944 | 0.621 |
| (16, 16, 512, 128) | noop | torch.bfloat16 | 134.964 | 138.025 | 607.093 | 1333.102 | 0.978 | 0.455 |
| (16, 16, 512, 128) | causal_mask | torch.bfloat16 | 134.192 | 141.523 | 606.269 | 1424.318 | 0.948 | 0.426 |
| (16, 16, 512, 128) | relative_bias | torch.bfloat16 | 135.711 | 138.639 | 606.283 | 1327.974 | 0.979 | 0.457 |
| (16, 16, 512, 128) | head_bias | torch.bfloat16 | 135.552 | 140.555 | 607.107 | 1347.370 | 0.964 | 0.451 |
| (16, 16, 512, 256) | noop | torch.bfloat16 | 275.113 | 315.144 | 1301.583 | 5268.153 | 0.873 | 0.247 |
| (16, 16, 512, 256) | causal_mask | torch.bfloat16 | 274.867 | 328.106 | 1302.513 | 5770.594 | 0.838 | 0.226 |
| (16, 16, 512, 256) | relative_bias | torch.bfloat16 | 276.052 | 321.770 | 1302.904 | 5241.920 | 0.858 | 0.249 |
| (16, 16, 512, 256) | head_bias | torch.bfloat16 | 271.409 | 328.839 | 1302.142 | 5266.037 | 0.825 | 0.247 |
| (16, 16, 1024, 64) | noop | torch.bfloat16 | 260.489 | 237.463 | 955.884 | 1817.558 | 1.097 | 0.526 |
| (16, 16, 1024, 64) | causal_mask | torch.bfloat16 | 262.378 | 254.350 | 955.280 | 1843.807 | 1.032 | 0.518 |
| (16, 16, 1024, 64) | relative_bias | torch.bfloat16 | 261.338 | 268.253 | 956.038 | 1820.036 | 0.974 | 0.525 |
| (16, 16, 1024, 64) | head_bias | torch.bfloat16 | 262.153 | 264.156 | 956.023 | 1810.076 | 0.992 | 0.528 |
| (16, 16, 1024, 128) | noop | torch.bfloat16 | 476.475 | 461.413 | 1760.578 | 4306.521 | 1.033 | 0.409 |
| (16, 16, 1024, 128) | causal_mask | torch.bfloat16 | 473.794 | 479.178 | 1761.277 | 4619.439 | 0.989 | 0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 | 473.839 | 463.282 | 1758.692 | 4290.562 | 1.023 | 0.410 |
| (16, 16, 1024, 128) | head_bias | torch.bfloat16 | 472.979 | 472.896 | 1763.086 | 4367.931 | 1.000 | 0.404 |
| (16, 16, 1024, 256) | noop | torch.bfloat16 | 1014.184 | 1026.764 | 3922.997 | 19104.147 | 0.988 | 0.205 |
| (16, 16, 1024, 256) | causal_mask | torch.bfloat16 | 1013.217 | 1039.046 | 3928.382 | 21086.281 | 0.975 | 0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 | 1008.519 | 1015.278 | 3922.133 | 18980.652 | 0.993 | 0.207 |
| (16, 16, 1024, 256) | head_bias | torch.bfloat16 | 1011.360 | 1047.542 | 3931.245 | 19069.172 | 0.965 | 0.206 |
| (16, 16, 4096, 64) | noop | torch.bfloat16 | 3929.850 | 3325.667 | 11411.704 | 23344.280 | 1.182 | 0.489 |
| (16, 16, 4096, 64) | causal_mask | torch.bfloat16 | 3885.262 | 3581.544 | 11390.515 | 23725.639 | 1.085 | 0.480 |
| (16, 16, 4096, 64) | relative_bias | torch.bfloat16 | 3865.737 | 3537.308 | 11489.901 | 23406.330 | 1.093 | 0.491 |
| (16, 16, 4096, 64) | head_bias | torch.bfloat16 | 3880.530 | 3665.249 | 11484.411 | 23299.496 | 1.059 | 0.493 |
| (16, 16, 4096, 128) | noop | torch.bfloat16 | 7030.306 | 6745.715 | 20621.264 | 57464.096 | 1.042 | 0.359 |
| (16, 16, 4096, 128) | causal_mask | torch.bfloat16 | 7095.414 | 7034.385 | 20410.656 | 61660.511 | 1.009 | 0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 | 7084.779 | 6686.497 | 20315.161 | 57243.969 | 1.060 | 0.355 |
| (16, 16, 4096, 128) | head_bias | torch.bfloat16 | 7075.367 | 6863.305 | 20494.385 | 58481.953 | 1.031 | 0.350 |
| (16, 16, 4096, 256) | noop | torch.bfloat16 | 15612.741 | 14297.482 | 55306.847 | 281161.865 | 1.092 | 0.197 |
| (16, 16, 4096, 256) | causal_mask | torch.bfloat16 | 15326.592 | 14263.878 | 55227.806 | 313063.232 | 1.075 | 0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 | 15297.963 | 14007.379 | 54558.029 | 279529.175 | 1.092 | 0.195 |
| (16, 16, 4096, 256) | head_bias | torch.bfloat16 | 15216.160 | 14276.027 | 55081.581 | 280996.826 | 1.066 | 0.196 |

</details>

Pull Request resolved: pytorch#125515 Approved by: https://github.com/Chillee
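For context, a minimal sketch of the flex attention usage this lowering serves; the import path was still private/in flux at the time of this PR, so treat the module location as an assumption and check your build:
```python
import torch
from torch.nn.attention.flex_attention import flex_attention  # path may differ by version

def rel_bias(score, b, h, q_idx, kv_idx):
    # score_mod: rewrites each attention score and gets fused into the kernel.
    return score + (kv_idx - q_idx)

q, k, v = (torch.randn(2, 16, 512, 64, device="cuda", requires_grad=True)
           for _ in range(3))
out = torch.compile(flex_attention)(q, k, v, score_mod=rel_bias)
out.sum().backward()  # exercises the fused backward lowering added here
```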
Commit: 7cea4a5
Fix documentation for register_fake_class (pytorch#126422)
Pull Request resolved: pytorch#126422 Approved by: https://github.com/angelayi
Commit: 814dbc7
[export] Delete predispatch tests (pytorch#126459)
Deleting predispatch tests, as export has already moved to predispatch. Pull Request resolved: pytorch#126459 Approved by: https://github.com/tugsbayasgalan
Commit: 65f4d4f
[DeviceMesh] Supported N groups in `from_group` (pytorch#126258)
**Overview** This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise). This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.

<details>
<summary>Old Approach</summary>

**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)`, and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
- Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

</details>

Pull Request resolved: pytorch#126258 Approved by: https://github.com/wanchaol
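A hedged sketch of the new call shape described above (process-group creation elided; argument names follow the PR text and may differ from the final API):
```python
import torch
from torch.distributed.device_mesh import DeviceMesh

# Assume 4 ranks arranged as a 2x2 (replicate, shard) grid, with
# `replicate_pg` and `shard_pg` already created via dist.new_group(...).
mesh = torch.arange(4).reshape(2, 2)
device_mesh = DeviceMesh.from_group(
    [replicate_pg, shard_pg],       # one ProcessGroup per mesh dimension
    device_type="cuda",
    mesh=mesh,                      # required here since ndim > 1
    mesh_dim_names=("replicate", "shard"),
)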
Commit: b6d8201
[easy] Fix typing for `map_location` docs in torch.load (pytorch#125473)
Currently the docs incorrectly list `Callable[[Tensor, str], Tensor]` as a possible type signature; this should be `Callable[[Storage, str], Storage]`. Pull Request resolved: pytorch#125473 Approved by: https://github.com/albanD
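For reference, the corrected signature in use (a sketch; `checkpoint.pt` is a placeholder path). The callable receives a storage plus its serialized location string and returns a storage:
```python
import torch

def keep_on_cpu(storage: torch.UntypedStorage, location: str) -> torch.UntypedStorage:
    # Storages are deserialized on CPU; returning them unchanged keeps
    # everything on CPU regardless of where the checkpoint was saved.
    return storage

state = torch.load("checkpoint.pt", map_location=keep_on_cpu)
```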
Commit: b9da19d
[doc] expose torch.Tensor.xpu API to doc (pytorch#126383)
# Motivation
The docstring for `torch.Tensor.xpu` was added [here](https://github.com/pytorch/pytorch/blob/d61a81a9e76688ac8f338a6cfba932bf7779e5ce/torch/_tensor_docs.py#L1434) but not exposed in the public docs, unlike [torch.Tensor.cuda](https://pytorch.org/docs/stable/generated/torch.Tensor.cuda.html#torch.Tensor.cuda). This PR exposes the documentation of `torch.Tensor.xpu` publicly. Pull Request resolved: pytorch#126383 Approved by: https://github.com/albanD
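Once exposed, usage mirrors the documented `torch.Tensor.cuda` (a sketch; requires a build with XPU support and an available XPU device):
```python
import torch

t = torch.randn(4)
t_xpu = t.xpu()          # copy to the current XPU device
t_xpu0 = t.xpu("xpu:0")  # or name a specific device
```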
Commit: 22b4b22
Add symbolic_shape_specialization structured trace (pytorch#126450)
This is typically the information you want when diagnosing why something overspecialized in dynamic shapes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#126450 Approved by: https://github.com/albanD
Commit: 6fc8524
Make inductor scheduler graph extension configurable (pytorch#125578)
This patch makes the inductor scheduler graph extension configurable, which eases debugging by allowing the graph format (dot, png, etc.) to be changed. In particular, it's very convenient to work with the graph interactively using tools like https://github.com/tintinweb/vscode-interactive-graphviz Pull Request resolved: pytorch#125578 Approved by: https://github.com/Chillee
Commit: 54ce306
[FSDP2][Test] Fix _test_clip_grad_norm (pytorch#126457)
We need to compare ref_total_norm to total_norm.full_tensor(). Example:
```
iter_idx:0, rank:0,
ref_total_norm=tensor(1052.5934, device='cuda:0'),
total_norm=DTensor(local_tensor=482.0861511230469, device_mesh=DeviceMesh([0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2.0),)),
total_norm.full_tensor()=tensor(1052.5934, device='cuda:0')
```
Pull Request resolved: pytorch#126457 Approved by: https://github.com/awgu
Commit: 0f31e61
Don't pad 0-dim mm inputs (pytorch#126475)
Otherwise you get an error in constant_pad_nd. Pull Request resolved: pytorch#126475 Approved by: https://github.com/huydhn ghstack dependencies: pytorch#125772, pytorch#125773, pytorch#125780
Commit: 2cbbe21
c10d: add Collectives abstraction (pytorch#125978)
This adds a new `Collectives` API for doing distributed collectives operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives. Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit The standard implementation uses `StoreCollectives`, but other more performant backends will be added in a follow-up PR. Test plan:
```
python test/distributed/test_collectives.py -v
```
This tests both functionality using multiple threads as well as timeout behavior. Pull Request resolved: pytorch#125978 Approved by: https://github.com/shuqiangzhang
Commit: a05c0fa
Add dist_pp shortcut to TORCH_LOGS (pytorch#126322)
The distributed log category already includes pipelining, since it's under the torch.distributed umbrella, so both TORCH_LOGS=distributed and TORCH_LOGS=dist_pp will enable PP logs. Pull Request resolved: pytorch#126322 Approved by: https://github.com/kwen2501
Commit: ae7ee03
[dtensor] refactor view ops to use OpStrategy (pytorch#126011)
As titled. Some ops require adjustment of the output shape argument. In rule-based sharding prop, the global output shape was inferred in the rule (in `view_ops.py`). In strategy-based sharding prop, it is now obtained from the propagated out_tensor_meta (in `sharding_prop.py`). Pull Request resolved: pytorch#126011 Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Commit: c61bdbf
[XPU] call empty_cache for dynamo tests (pytorch#126377)
When running a batch of models, the lack of an `empty_cache()` call would result in OOM for subsequent models. This PR unifies the `empty_cache` call for both CUDA and XPU. Pull Request resolved: pytorch#126377 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
Commit: b1770bd
Refactor partitioner and clean it up (pytorch#126318)
Pull Request resolved: pytorch#126318 Approved by: https://github.com/anijain2305
Commit: c271827
[DTensor] Turn on foreach implementation for clip_grad_norm_ for DTensor by default (pytorch#126423)
Pull Request resolved: pytorch#126423 Approved by: https://github.com/awgu
Commit: 99190da
Fix cummax and cummin lowering for empty case (pytorch#126461)
Pull Request resolved: pytorch#126461 Approved by: https://github.com/peterbell10
Commit: 8221d3d
[Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (pytorch#122593)
**Description**
Lower the qlinear binary post-op pattern to Inductor. Use post-op sum (in-place) if the extra input has the same dtype as the output; otherwise, use binary add.

**Supported linear-binary(-unary) patterns**
```
  linear(X)   extra input
       \        /
          Add
           |
     Optional(relu)
           |
           Y

1. int8-mixed-fp32
+---+---------------+-----------+------------------------------+---------+
| # | Add type      | Quant out | Pattern                      | Post op |
+---+---------------+-----------+------------------------------+---------+
| 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
+---+---------------+-----------+------------------------------+---------+
| 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
+---+---------------+-----------+------------------------------+---------+

2. int8-mixed-bf16
+---+----------+----------------+-----------+--------------------------------------------------+---------+
| # | X2 dtype | Add type       | Quant out | Pattern                                          | Post op |
+---+----------+----------------+-----------+--------------------------------------------------+---------+
| 1 | BF16     | In-/out-place  | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
+---+----------+----------------+-----------+--------------------------------------------------+---------+
| 2 | BF16     | In-/out-place  | No        | linear + bf16 -> (relu)                          | sum     |
+---+----------+----------------+-----------+--------------------------------------------------+---------+
| 3 | FP32     | Out-place      | Yes       | linear + fp32 -> (relu) -> q                     | add     |
|   |          | In-place right |           |                                                  |         |
+---+----------+----------------+-----------+--------------------------------------------------+---------+
| 4 | FP32     | Out-place      | No        | linear + fp32 -> (relu)                          | sum     |
|   |          | In-place right |           |                                                  |         |
+---+----------+----------------+-----------+--------------------------------------------------+---------+
| 5 | FP32     | In-place left  | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
+---+----------+----------------+-----------+--------------------------------------------------+---------+
| 6 | FP32     | In-place left  | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
+---+----------+----------------+-----------+--------------------------------------------------+---------+
```
Note: (1) the positions of linear and the extra input can be swapped. (2) We don't insert q-dq before the extra input of linear-add by recipe, but if q-dq is found at the extra input, we don't match that pattern because we cannot match all these patterns in 3 passes.

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add

Pull Request resolved: pytorch#122593 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
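In eager terms, the matched computation corresponds to a module like the following sketch (the PT2E prepare/convert steps with the X86Inductor quantizer are omitted; shapes are arbitrary):
```python
import torch

class LinearAddRelu(torch.nn.Module):
    """linear(X) + extra input, optionally followed by relu."""

    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x) + extra)
```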
Commit: 747bdea
variable search spaces for gemm autotuning (pytorch#126220)
Add a switch to change the GEMM autotuning search space between the default (the current set of hardcoded configs) and an exhaustive search space that enumerates all block sizes in [16, 32, 64, 128, 256], stages in [1, 2, 3, 4, 5], and warps in [2, 4, 6]. Pull Request resolved: pytorch#126220 Approved by: https://github.com/eellison
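A hedged sketch of flipping the switch; the config name `max_autotune_gemm_search_space` is an assumption based on the PR description, so check `torch/_inductor/config.py` for the exact knob:
```python
import torch
import torch._inductor.config as inductor_config

# Assumed knob: "EXHAUSTIVE" enumerates the full block/stage/warp grid
# described above instead of the default hardcoded config list.
inductor_config.max_autotune_gemm_search_space = "EXHAUSTIVE"

@torch.compile(mode="max-autotune")
def mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a @ b
```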
Commit: f55c0cc
save the reciprocal of weights for welford_reduce (pytorch#125148)
Save the reciprocal of weights for welford_reduce to avoid redundant divisions, improving performance; a `weight_recps` buffer is inserted into the generated vec kernel. Generated code:
- Before:
```
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
}
```
- After:
```
static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
}
```
Performance:
- Single core:

| Op | shape | eager/ms | inductor/ms | optimized inductor/ms |
| -- | -- | -- | -- | -- |
| layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208 |
| var | (56, 384, 1024) | 21.752 | 13.258 | 13.102 |

- 4 cores:

| Op | shape | eager/ms | inductor/ms | optimized inductor/ms |
| -- | -- | -- | -- | -- |
| layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223 |
| var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163 |

Pull Request resolved: pytorch#125148 Approved by: https://github.com/jgong5, https://github.com/peterbell10
Commit: ae3c9ca
[Submodule] Remove zstd dependency (pytorch#126485)
After searching in the codebase, it seems that zstd is not in use now. Pull Request resolved: pytorch#126485 Approved by: https://github.com/ezyang
Commit: 9882241
Update ops handler documentation some more (pytorch#126480)
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#126480 Approved by: https://github.com/peterbell10 ghstack dependencies: pytorch#126292, pytorch#126299
Commit: 7263893
[FSDP2] Fixed 2D clip grad norm test (pytorch#126497)
This fixes pytorch#126484. We switch from a transformer to an MLP stack, since the transformer seems to introduce slight numeric differences when using TP. We include a sequence-parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement. Pull Request resolved: pytorch#126497 Approved by: https://github.com/weifengpy, https://github.com/wz337
Commit: 9a47caa
Default to env variable instead of config value for precompile parallelism (pytorch#126333)
Previously, we would default to the config `compile_threads`, which controls the number of forks we use for async compile. It defaults to 1 in fbcode because fork() has known safety issues. In precompilation, we are using threads, which have no such safety issues and should strictly improve compile time; there isn't really any reason to reduce the count except for testing, and it doesn't make sense to share the same value used for determining forks. This changes the default to use as many threads as needed unless the env variable is set. Differential Revision: [D57473023](https://our.internmc.facebook.com/intern/diff/D57473023) Pull Request resolved: pytorch#126333 Approved by: https://github.com/nmacchioni
Commit: be7b65a
Delete refactored function, move changes over (pytorch#126407)
Oops, in pytorch#125610 I moved this function to runtime_wrappers.py, but forgot to delete the old one. pytorch#126234 then modified it which would do nothing, so I'm applying the change correctly now and deleting the function as I intended. Pull Request resolved: pytorch#126407 Approved by: https://github.com/eellison
Commit: 3f1ccfd
[optim] Fix: wrong ASGD implementation (pytorch#126375)
This PR is based on pytorch#125440, additionally merging the latest main branch and fixing the lint failures from pytorch#126361. Pull Request resolved: pytorch#126375 Approved by: https://github.com/janeyx99
Commit: e1a0676
Early return in _recursive_build if obj is a Tensor (pytorch#125639)
Fix issue pytorch#125551 Pull Request resolved: pytorch#125639 Approved by: https://github.com/ezyang
Commit: 0be8b0f
Remove removed ruff rule TRY200 (pytorch#126256)
My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema. From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/

> This rule has been removed and its documentation is only available for historical reasons.
>
> This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.

and we are currently explicitly ignoring B904. Pull Request resolved: pytorch#126256 Approved by: https://github.com/Skylion007
Commit: bd10ff6
[Perf] Vectorize more dtype for int4mm (pytorch#126512)
It used to be vectorized only for f16, but there is no reason not to do the same for bf16 or f32. A spiritual follow-up of pytorch#125290. Pull Request resolved: pytorch#126512 Approved by: https://github.com/Skylion007
Commit: e24f7b3
[inductor] fix unbacked case in pointwise + reduction vertical fusion (pytorch#125982)
```
$ INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 python test/inductor/test_unbacked_symints.py -k test_vertical_pointwise_reduction_fusion
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1953, in fuse_nodes_once
    for node1, node2 in self.get_possible_fusions():
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2010, in get_possible_fusions
    check_all_pairs(node_grouping)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1997, in check_all_pairs
    if self.can_fuse(node1, node2):
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2252, in can_fuse
    return self.get_backend(device).can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 39, in can_fuse_vertical
    return self._triton_scheduling.can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3237, in can_fuse
    if not all(
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3238, in <genexpr>
    TritonKernel.is_compatible((numel2, rnumel2), n.get_ranges())
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1543, in is_compatible
    cls._split_iteration_ranges(groups, lengths)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1507, in _split_iteration_ranges
    while current_group < len(remaining) and sv.size_hint(remaining[current_group]) == 1:
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 442, in size_hint
    return int(out)
  File "/home/colinpeppler/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/core/expr.py", line 320, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: TypeError: Cannot convert symbols to int
```
Where the unbacked symints show up:
```
> /data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py(1506)_split_iteration_ranges()
(Pdb) print(groups)
(1, 512*u0)
(Pdb) print(lengths)
([u0, 32, 16], [])
```
Pull Request resolved: pytorch#125982 Approved by: https://github.com/jansel
Commit: bb5e037
Workflow for uploading additional test stats on workflow dispatch (pytorch#126080)
This is kind of an experiment for uploading test stats during the run, and also for the test dashboard, so it can recalculate the info. Adds a workflow, callable via workflow dispatch, for uploading additional test stats, and adds a script that only calculates the additional info. Pull Request resolved: pytorch#126080 Approved by: https://github.com/ZainRizvi
Commit: 45a8ba4
Allow tensor subclasses and add `torch.serialization.add_safe_globals` that allows users to allowlist classes for `weights_only` load (pytorch#124331)
#### Conditions for allowlisting tensor subclasses
We allow tensor subclass types that
(1) do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__`, so we check that these are `None`),
(2) use the generic `tp_alloc`, and
(3) are in a module that *has been imported by the user*,
to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict. The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`. *Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function, as this method claims to have no code execution.*

The rationale for the 3 conditions above is as follows: the rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__`, the call to `as_subclass`, and the call to `_set_obj_state`, which calls `setattr`): https://github.com/pytorch/pytorch/blob/4e66aaa01092ddc8822bbca315b673329c76f4cd/torch/_tensor.py#L57-L71

`as_subclass` is implemented with a call to `THPVariable_NewWithVar` that will eventually call `tp_alloc` here: https://github.com/pytorch/pytorch/blob/4e66aaa01092ddc8822bbca315b673329c76f4cd/torch/csrc/autograd/python_variable.cpp#L2053 The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`. **Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling.**

### How do we check something is a tensor subclass / constraints around imports
In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules, but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`). This PR also allowlisted `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`).

### API for allowlisting
This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enables users to allowlist globals they have deemed safe and manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that it is safe).

Next steps:
- Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor`, etc.)

Pull Request resolved: pytorch#124331 Approved by: https://github.com/albanD
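A minimal usage sketch of the new API (the subclass and file name are illustrative):
```python
import torch
from torch.serialization import add_safe_globals

class MyTensor(torch.Tensor):  # hypothetical subclass the user trusts
    pass

add_safe_globals([MyTensor])   # allowlist it for weights_only loading
obj = torch.load("subclass_checkpoint.pt", weights_only=True)
```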
Commit: d0d2d0b
Enable FX graph cache for huggingface and timm benchmarks (pytorch#126205)
Pull Request resolved: pytorch#126205 Approved by: https://github.com/eellison
Commit: 39f5adb
[quant][pt2e] Allow multi users without output observers (pytorch#126487)
Summary: The PT2E quantization flow does not support unquantized outputs yet. To work around this, users may wish to remove the output observer from their graphs. However, this currently fails in some cases because the `PortNodeMetaForQDQ` pass is too restrictive. For example:
```
conv -> obs -------> output0
         \-> add -> output1
```
Previously we expected conv to always have exactly 1 user, which is the observer. When the observer is removed, however, conv now has 2 users, and this fails the check.
```
conv -------> output0
  \-> add -> output1
```
This commit relaxes the error into a warning to enable this workaround.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_multi_users_without_output_observer

Reviewers: jerryzh168
Subscribers: jerryzh168, supriyar
Differential Revision: [D57472601](https://our.internmc.facebook.com/intern/diff/D57472601) Pull Request resolved: pytorch#126487 Approved by: https://github.com/tarun292
Commit: 218756f
Add coms metadata to execution trace (ET) (pytorch#126317)
Add Execution Trace communication collective metadata. For the specification, see pytorch#124674. New fields look like:
```
{
  "id": 80, "name": "record_param_comms", "ctrl_deps": 79,
  "inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]},
  "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]},
  "attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""},
    {"name": "collective_name", "type": "string", "value": "allreduce"},
    {"name": "dtype", "type": "string", "value": "Float"},
    {"name": "in_msg_nelems", "type": "uint64", "value": 100},
    {"name": "out_msg_nelems", "type": "uint64", "value": 100},
    {"name": "in_split_size", "type": "string", "value": "[]"},
    {"name": "out_split_size", "type": "string", "value": "[]"},
    {"name": "global_rank_start", "type": "uint64", "value": 0},
    {"name": "global_rank_stride", "type": "uint64", "value": 1},
    {"name": "pg_name", "type": "string", "value": "0"},
    {"name": "pg_desc", "type": "string", "value": "default_pg"},
    {"name": "pg_size", "type": "uint64", "value": 2}]
}
```

## Unit Test
Added a new unit test to check that the collected execution trace has the right attributes:
`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace`
```
STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
[rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
[rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
[rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Execution trace saved at /tmp/tmpy01ngc3w.et.json
Execution trace saved at /tmp/tmptf8543k4.et.json
ok
----------------------------------------------------------------------
```
Also ran the profiler unit test:
`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler`
```
STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Trace saved to /tmp/tmpdrw_cmcu.json
Trace saved to /tmp/tmpnio7ec9j.json
ok
----------------------------------------------------------------------
Ran 1 test in 19.772s
OK
```
Pull Request resolved: pytorch#126317 Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise
Commit: 45a3349
Revert "Remove redundant serialization code (pytorch#126249)"
This reverts commit aab448e. Reverted pytorch#126249 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing sigmoid/frontend:serialization_test internally ([comment](pytorch#126249 (comment)))
Commit: 2f044a8
Revert "Fix aarch64 debug build with GCC (pytorch#126290)"
This reverts commit 91bf952. Reverted pytorch#126290 on behalf of https://github.com/huydhn due to There seems to be a mismatched closing curly bracket here, and it breaks some internal build in D57474505 ([comment](pytorch#126290 (comment)))
Commit: 02bf7e2
Initial implementation of AdaRound (pytorch#126153)
Summary: This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568 This algorithm is going to be used by multiple people, hence we need to make it an official implementation. Differential Revision: D57227565 Pull Request resolved: pytorch#126153 Approved by: https://github.com/jerryzh168, https://github.com/huydhn
Commit: b2aff20
[distributed] Add cpp-httplib to pytorch (pytorch#126470)
Adds https://github.com/yhirose/cpp-httplib so that we are able to use HTTPS for host-to-host communication in distributed (specifically torchrun). Todo: we likely need to add cpp-httplib somewhere in the build (cmake/bazel), but first we should write the code for it. Pull Request resolved: pytorch#126470 Approved by: https://github.com/d4l3k, https://github.com/Skylion007
Commit: 782792b
[BE][Ez]: Use NotADirectoryError in tensorboard writer (pytorch#126534)
Slightly improves exception typing for the tensorboard writer. Pull Request resolved: pytorch#126534 Approved by: https://github.com/ezyang
Commit: 5182e2e
Revert "[FSDP2] Fixed 2D clip grad norm test (pytorch#126497)"
This reverts commit 3f28906. Reverted pytorch#126497 on behalf of https://github.com/jeanschmidt due to reverting to check if it might have introduced inductor cuda 12 issues ([comment](pytorch#126497 (comment)))
Commit: c81bf77
[ROCm] enable faster_load_save for Fused_SGD (pytorch#125456)
Reopened due to a rebase error. Fixes pytorch#117599. The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, is passing with this PR. The HSA async copy / host wait on completion signal is resolved in MultiTensorApply.cuh:
```
:4:command.cpp    :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```
Pull Request resolved: pytorch#125456 Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
Commit: 6372770
Experimental prototype for converting torch.jit.trace modules to export (pytorch#124449)
Differential Revision: [D56440613](https://our.internmc.facebook.com/intern/diff/D56440613) We want to do this for the following reasons:
1. There is a current limitation in export tracing for torch.jit.trace'd modules that cannot be easily upstreamed.
2. We need to run internal CI regularly to understand feature gaps and continuously track them.
3. Multiple people will be working on this prototype, so it is better to have a checked-in version so we don't always run into merge conflicts.
Pull Request resolved: pytorch#124449 Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
Commit: 04c3751
Disable vulkan test batch_norm_invalid_inputs (pytorch#126571)
Fails flakily, e.g. https://github.com/pytorch/pytorch/actions/runs/9130802617/job/25109131748 and https://github.com/pytorch/pytorch/actions/runs/9125548571/job/25092535707. The first bad commit I can find is https://hud.pytorch.org/pytorch/pytorch/commit/538877d2046a492a1112101e2d5d88e5754d477b Pull Request resolved: pytorch#126571 Approved by: https://github.com/SS-JIA
Commit: a1245dd
[AOTI] config target platform (pytorch#126306)
Test Plan: AOTI compile stories15M for Android Differential Revision: D57392830 Pull Request resolved: pytorch#126306 Approved by: https://github.com/desertfire
Commit: 68a6cdd
Fix issue of lowering nn.linear ops with kwargs (pytorch#126331)
Summary: Support kwarg bias for nn.linear quantization Differential Revision: D57403190 Pull Request resolved: pytorch#126331 Approved by: https://github.com/ZhengkaiZ, https://github.com/huydhn
Commit: a6235d0
[inductor] Load python modules using importlib (pytorch#126454)
The `compile` + `exec` workflow is susceptible to behavior drifting from a "normal" import; use importlib instead to avoid this. In particular, annotations here were being stored as strings due to `from __future__ import annotations` in the scope calling `compile`. Triton cares about annotations on global variables, and this makes it much easier to reliably code-gen them. Pull Request resolved: pytorch#126454 Approved by: https://github.com/peterbell10
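A stdlib-only sketch of the import-based loading pattern (the helper name is illustrative, not inductor's actual API):
```python
import importlib.util

def load_generated_module(name: str, path: str):
    # Importing the generated .py file like a real module keeps annotation
    # and __future__ semantics identical to a normal import.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```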
Commit: 6e4ed6c
[dynamo] Sourceless builder - ordered dict and re.pattern (pytorch#126468)
Pull Request resolved: pytorch#126468 Approved by: https://github.com/Skylion007
Commit: edbd215
Added error checks for invalid inputs on thnn_conv2d (pytorch#121906)
Fixes pytorch#121188. Prevent segmentation fault in 'torch._C._nn.thnn_conv2d': previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using TORCH_CHECK). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format. Additionally, this commit includes tests to cover the three referenced cases. Pull Request resolved: pytorch#121906 Approved by: https://github.com/janeyx99
Commit: 6708519
Fix aarch64 debug build with GCC (pytorch#126290)
By working around GCC's quirks in instantiating templates that require immediate values. Provides an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`). Test plan (after the change was reverted): ssh into an aarch64 runner and rebuild the given file with `-O0`. Fixes pytorch#126283 Pull Request resolved: pytorch#126290 Approved by: https://github.com/atalman, https://github.com/seemethere
Commit: 38a85b2
Remove dist_ prefix from TORCH_LOGS shortcuts (pytorch#126499)
e.g. dist_ddp -> ddp; the 'distributed' shortcut remains unchanged. Feedback has been that it is not appealing to have the dist_ prefix, and the main reason for it was to keep the distributed shortcuts grouped together in the help menu. It's nice to have shorter shortcuts. Pull Request resolved: pytorch#126499 Approved by: https://github.com/XilunWu, https://github.com/kwen2501 ghstack dependencies: pytorch#126322
Commit: 8ab08f9
Tool for scouting exportability in one shot (pytorch#126471)
Summary: Tool for scouting exportability issues in one shot.
- Collect sample inputs for all submodules by running eager inference with forward_pre_hook.
- Start from the root module and recursively try exporting child modules if the current module's export fails.
Limitations:
- Only works for nn.Modules with a tree-like submodule structure; this doesn't work for flattened GraphModules.
TODO: support dynamic_dims
Sample output: https://docs.google.com/spreadsheets/d/1jnixrqBTYbWO_y6AaKA13XqOZmeB1MQAMuWL30dGoOg/edit?usp=sharing
```
exportability_report = {
    '': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
    'submod_1': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
    'submod_2': None
}
```
Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestExportTools
Differential Revision: D57466486 Pull Request resolved: pytorch#126471 Approved by: https://github.com/zhxchen17
Commit: bd786d8
[torch-distributed] Make log directory creation idempotent (pytorch#1…
…26496) Summary: https://docs.python.org/3/library/os.html#os.makedirs > If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists. Test Plan: Existing tests Differential Revision: D57471577 Pull Request resolved: pytorch#126496 Approved by: https://github.com/d4l3k
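A one-line sketch of the idempotent pattern from the linked docs (the directory path here is hypothetical):
```python
import os

log_dir = "/tmp/torchelastic_logs"  # hypothetical log directory
# Safe to call from multiple workers or retries: with exist_ok=True,
# no FileExistsError is raised if the directory already exists.
os.makedirs(log_dir, exist_ok=True)
```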
Commit: 4de26b7
[AOTI] Flag to include aoti sources when building lite interpreter (p…
…ytorch#126572) Summary: Added USE_LITE_AOTI cmake flag, which is turned OFF by default. When it is turned on, the AOTI sources (inductor_core_resources) are included when building the lite interpreter.
Test Plan:
```
ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON
```
Differential Revision: D57394078 Pull Request resolved: pytorch#126572 Approved by: https://github.com/malfet
Commit: fbf8018
[Pipelining] Fix 1f1b schedule (pytorch#126419)
This schedule was running fine locally but failing (hanging) on CI. After analysis (https://fburl.com/gdoc/xt80h1gd), it seems the schedule was not correct previously but could still work depending on the runtime. The fix bundles the fwd-recv(s->s+1) and bwd-send(s+1->s) together into one coalesced group so they do not block each other. Design drawing <img width="803" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/906a9a66-39ae-4a6a-bc1a-18b77eaaa784"> Flight recorder traces show the same coalescing pattern as designed <img width="1013" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/ab10646e-eaef-4191-83dd-73f448876c27"> Pull Request resolved: pytorch#126419 Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
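A minimal sketch of the coalescing idea using `torch.distributed` P2P ops; the buffer shapes and rank arithmetic are hypothetical, and the real schedule lives in the pipelining code:
```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already run on every rank.
prev_rank = (dist.get_rank() - 1) % dist.get_world_size()
act_buf = torch.empty(1024)   # fwd-recv target: activations from stage s
grad_buf = torch.randn(1024)  # bwd-send source: grads going back to stage s

# Issue both ops as one coalesced group so neither can block the other.
ops = [
    dist.P2POp(dist.irecv, act_buf, peer=prev_rank),
    dist.P2POp(dist.isend, grad_buf, peer=prev_rank),
]
for req in dist.batch_isend_irecv(ops):
    req.wait()
```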
Commit: 492ef49
[C10D] Add __repr__ to P2POp class (pytorch#126538)
Pull Request resolved: pytorch#126538 Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/c-p-i-o ghstack dependencies: pytorch#126419
Commit: b6caa15
gitmodules: switch cpp-httplib to https (pytorch#126580)
Fixes issue introduced in pytorch#126470 (comment) Test plan: CI Pull Request resolved: pytorch#126580 Approved by: https://github.com/PaliC, https://github.com/jeffdaily
Commit: d288e44
[pipelining] Follow improvements in export.unflatten (pytorch#126217)
Previously, we made a copy of `torch.export.unflatten` in pippy/_unflatten.py, but it turned out to be too hard to track bug fixes and improvements in the upstream version. For example, `torch.export.unflatten` recently added support for tied parameters, which is something pipelining needs. Now that we have moved into pytorch, we reference `torch.export.unflatten` instead of maintaining a copy. Pull Request resolved: pytorch#126217 Approved by: https://github.com/H-Huang
Commit: 68ff312
[Submodule] Remove third-party CUB (pytorch#126540)
Because it was last updated 4 years ago, and all supported CUDA versions now provide CUB. Pull Request resolved: pytorch#126540 Approved by: https://github.com/Skylion007
Commit: 743df86
[halide-backend] Refactor codegen/triton.py into codegen/simd.py (pyt…
…orch#126415) This PR is primarily just moving stuff around. It creates a new common baseclass for TritonCodegen and the (upcoming) HalideCodegen. Pull Request resolved: pytorch#126415 Approved by: https://github.com/shunting314
Commit: deb6f3f
Faster(?) FP16 gemv kernel (pytorch#126297)
Differential Revision: [D57369266](https://our.internmc.facebook.com/intern/diff/D57369266/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D57369266/)! Pull Request resolved: pytorch#126297 Approved by: https://github.com/malfet
Commit: 8a7f719
[2/N] Non-Tensor: Scalar Support: Add scalar to the cache for eager-t…
…hrough-torch.compile (pytorch#124070) Add scalar information to the kernel configuration. #### Additional Context Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b` and then `alpha`. The same order is used for cache loading. However, this orchestration mechanism does not support kwargs, because kwargs carry no fixed positional order. For example, the `out` of `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)` may come before `approximate`. We will support this in subsequent PRs. Pull Request resolved: pytorch#124070 Approved by: https://github.com/jansel, https://github.com/jgong5
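A toy sketch (not the actual cache code) of why a purely positional key works for `torch.add(a, b, alpha=2.0)` but breaks down once kwargs are involved:
```python
import torch

def positional_key(*args):
    # Works only when every input has a fixed position in the schema.
    return tuple(
        (a.dtype, tuple(a.shape)) if isinstance(a, torch.Tensor) else a
        for a in args
    )

a, b = torch.randn(2), torch.randn(2)
key = positional_key(a, b, 2.0)  # (tensor meta, tensor meta, alpha)

# With kwargs there is no inherent position: gelu.out may be called with
# (approximate=..., out=...) or (out=..., approximate=...), so a key built
# from argument order alone cannot identify the invocation.
```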
Commit: b51e6dd
Map float8 types to uint8 for allgather (pytorch#126556)
# Summary Different take on this one: pytorch#126338. We should probably not allow this mapping for 'compute' ops, e.g. reductions. ### Corresponding fp8 PR pytorch-labs/float8_experimental#263 Pull Request resolved: pytorch#126556 Approved by: https://github.com/wanchaol
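A minimal sketch of the byte-wise reinterpretation, assuming an initialized NCCL process group and float8 support in the build; the actual mapping happens inside the collective:
```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already run.
t = torch.randn(8, device="cuda").to(torch.float8_e4m3fn)
out = [torch.empty_like(t) for _ in range(dist.get_world_size())]

# All-gather only moves bytes, so the float8 payload can be viewed as
# uint8; this would be wrong for 'compute' collectives like all_reduce.
dist.all_gather([o.view(torch.uint8) for o in out], t.view(torch.uint8))
```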
Commit: 23b6ebd
[Traceable FSDP2] Change from register_multi_grad_hook to per-tensor …
…backward hook (pytorch#126350) As discussed with Andrew before, under compile we will register a per-tensor backward hook instead of a multi-grad hook, because it's difficult for Dynamo to support `register_multi_grad_hook` (or anything `.grad_fn`-related). We expect both to have the same underlying behavior. ~~We will add an integration test (in a subsequent PR) to show that compile and eager have the same numerics.~~ As discussed below, we will change the eager path to use per-tensor backward hooks as well. Pull Request resolved: pytorch#126350 Approved by: https://github.com/awgu
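A minimal sketch contrasting the two hook styles; the print callbacks are placeholders for the real post-backward work:
```python
import torch
from torch.autograd.graph import register_multi_grad_hook

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Eager style: one hook that fires once grads for all listed tensors exist.
register_multi_grad_hook((a, b), lambda grads: print("all grads ready"))

# Compile-friendly style: a plain hook per tensor, traceable by Dynamo.
def per_tensor_hook(grad):
    print("one grad ready")
    return grad

for t in (a, b):
    t.register_hook(per_tensor_hook)

(a * b).sum().backward()
```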
Commit: b4a2288
[Dynamo] Treat integers stored on nn.Modules as dynamic (pytorch#126466)
Fixes pytorch#115711 Pull Request resolved: pytorch#126466 Approved by: https://github.com/jansel
Commit: b10f3dd
Refactor variables / function names related to non-strict export (pyt…
…orch#126458) Improve variable and function naming for better clarity: `non strict` --> `aten`. Pull Request resolved: pytorch#126458 Approved by: https://github.com/angelayi
Commit: d82bbb0
Updated test_torch.py to use new OptimizerInfo infrastructure (pytorc…
…h#125538) Fixes pytorch#123451 (only addresses test_torch.py cases) This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure. I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Pull Request resolved: pytorch#125538 Approved by: https://github.com/janeyx99
Commit: 6ed6142
Forward fix the failed new test from D57474327 (pytorch#126596)
Summary: TSIA. The two look the same to me, but buck was failing with the following error when `with torch._inductor.utils.fresh_inductor_cache()` is used:
```
_________________________ ReproTests.test_issue126128 __________________________
self = <caffe2.test.dynamo.test_repros.ReproTests testMethod=test_issue126128>

    def test_issue126128(self):
        def fn():
            x = torch.randn(1, 10)
            y = torch.randn(10, 1)
            return torch.mm(x, y).sum()

        def fn2():
            x = torch.randn(10, 100)
            y = torch.randn(100, 10)
            return torch.mm(x, y).sum()

>       with torch._inductor.utils.fresh_inductor_cache():
E       AttributeError: module 'torch._inductor' has no attribute 'utils'
```
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_issue126128'`
Differential Revision: D57516676 Pull Request resolved: pytorch#126596 Approved by: https://github.com/xmfan
Commit: 0e59bd4
Cached required_fw_nodes creation (pytorch#126613)
Pull Request resolved: pytorch#126613 Approved by: https://github.com/anijain2305
Commit: 367a0c5
Revert "[Dynamo] Treat integers stored on nn.Modules as dynamic (pyto…
…rch#126466)" This reverts commit 6bb9d60. Reverted pytorch#126466 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the ONNX test failure looks legit, not flaky, as it starts failing in trunk https://hud.pytorch.org/pytorch/pytorch/commit/6bb9d6080d33c817fcbf9e5ae8a59b76812a53d2 ([comment](pytorch#126466 (comment)))
Commit: 0ac2cec
Remove unnecessary implementations from MockHandler (pytorch#126511)
Dead implementations are confusing and can cause bugs when people accidentally hit them. Better for them to be missing. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#126511 Approved by: https://github.com/peterbell10, https://github.com/lezcano
Commit: 197ebc5
UFMT torch.utils._sympy.functions (pytorch#126553)
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#126553 Approved by: https://github.com/lezcano, https://github.com/Skylion007 ghstack dependencies: pytorch#126511
Commit: 2d65795
Update hf_BirdBird periodic-dynamo-benchmarks results (pytorch#126414)
Can't repro this regression, and nothing in the faulty PR range would cause it for only one model. The job is still causing noise, so we should mute it. I think just updating the graph break count is better than skipping the model here, since it's still passing. Pull Request resolved: pytorch#126414 Approved by: https://github.com/ezyang
Commit: 0d1108c
Replace torch.library.impl_abstract with torch.library.register_fake (p…
…ytorch#126606) To remove the disruptive warning:
```
warnings.warn("torch.library.impl_abstract was renamed to "
              "torch.library.register_fake. Please use that instead; "
              "we will remove torch.library.impl_abstract in a future "
              "version of PyTorch.", DeprecationWarning, stacklevel=2)
```
Pull Request resolved: pytorch#126606 Approved by: https://github.com/ezyang
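A minimal sketch of the renamed API, assuming a custom op already defined under the hypothetical name `mylib::foo`:
```python
import torch

# Fake (meta) implementation used during tracing/export; this replaces
# the deprecated @torch.library.impl_abstract("mylib::foo") spelling.
@torch.library.register_fake("mylib::foo")
def _(x):
    return torch.empty_like(x)
```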
Commit: 454d0d4