Releases: pytorch/torchrec


25 Apr 01:39

No major features in this release


  • Expanding out ZCH/MCH
  • Increased support with Torch Dynamo/Export
  • Distributed Benchmarking introduced under torchrec/distributed/benchmarks for inference and training
  • VBE optimizations
  • TWRW support for VBE (this may have already shipped in the previous release)
  • Generalized train_pipeline for different pipeline stage overlapping
  • Autograd support for traceable collectives
  • Output dtype support for embeddings
  • Dynamo tracing for sharded embedding modules
  • Bug fixes


18 Mar 18:38
v0.7.0-rc1 Pre-release

Pre release for v0.7.0


30 Jan 23:40


TorchRec now natively supports VBE (variable batched embeddings) within the EmbeddingBagCollection module. This allows variable batch size per feature, unlocking sparse input data deduplication, which can greatly speed up embedding lookup and all-to-all time. To enable, simply initialize KeyedJaggedTensor with stride_per_key_per_rank and inverse_indices fields, which specify batch size per feature and inverse indices to reindex the embedding output respectively.
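
The deduplication idea behind VBE can be illustrated without TorchRec. The following is a minimal pure-Python sketch, not the library's implementation: the helper names (dedup_batch, toy_lookup) and the toy pooling are assumptions made for illustration. It looks up each distinct input only once, then uses inverse indices to expand the result back to the full batch.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, ...]


def dedup_batch(batch: List[Row]) -> Tuple[List[Row], List[int]]:
    """Return the unique rows and, for each original row, the index of its unique row."""
    unique_rows: List[Row] = []
    position: Dict[Row, int] = {}
    inverse_indices: List[int] = []
    for row in batch:
        if row not in position:
            position[row] = len(unique_rows)
            unique_rows.append(row)
        inverse_indices.append(position[row])
    return unique_rows, inverse_indices


def toy_lookup(row: Row) -> float:
    """Hypothetical stand-in for a pooled embedding lookup (sum of fake 1-d embeddings)."""
    return float(sum(i * 0.5 for i in row))


# One feature with a batch of 5 samples, only 2 of which are distinct.
batch = [(1, 2), (7,), (1, 2), (1, 2), (7,)]
unique_rows, inverse_indices = dedup_batch(batch)

# Only len(unique_rows) lookups are performed instead of len(batch).
unique_out = [toy_lookup(row) for row in unique_rows]

# Reindex the deduplicated output back to the original batch order.
full_out = [unique_out[i] for i in inverse_indices]
```

Because the number of unique rows can differ per feature, each feature ends up with its own effective batch size, which is the situation the stride_per_key_per_rank field describes.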

Embedding offloading

Embedding offloading combines UVM caching (i.e. storing embedding tables in host memory with a cache in HBM) with prefetching and optimal cache sizing. It allows running a larger model with fewer GPUs while maintaining competitive performance. To use it, run the prefetching pipeline (PrefetchTrainPipelineSparseDist) and pass the per-table cache load factor and the prefetch_pipeline flag through constraints in the planner.


The shard and shard_modules APIs replace embedding submodules with their sharded variants. The shard API applies to an individual embedding module, while the shard_modules API replaces all embedding modules and leaves non-embedding submodules untouched.
Embedding sharding behaves much like the prior TorchRec DistributedModelParallel, except that the ShardedModules are now composable: the modules are backed by TableBatchedEmbeddingSlices, which are views into the underlying TBE (including .grad). As a result, fused parameters are now returned by named_parameters(), including in DistributedModelParallel.


22 Jan 19:38
v0.6.0-rc2 Pre-release



18 Dec 18:05
v0.6.0-rc1 Pre-release

This release should support Python 3.8 through 3.11, plus 3.12 (experimental).

pip install torchrec --index-url


05 Oct 17:21

[Prototype] Zero Collision / Managed Collision Embedding Bags

A common constraint in recommender systems is that the sparse id input range is larger than the number of embeddings the model can learn for a given parameter size. The conventional solution is to hash sparse ids into the same size range as the embedding table. This inevitably leads to hash collisions, with multiple sparse ids sharing the same embedding space. We have developed a performant alternative algorithm that addresses this problem by tracking the N most common sparse ids and ensuring that each of them has a unique embedding representation. The module is defined here and an example can be found here.
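
To make the contrast concrete, here is a small pure-Python sketch of the two strategies. It is only an illustration under stated assumptions: the function names are hypothetical, and the real managed collision modules use their own tracking and eviction policies. The sketch reserves a dedicated slot for each of the N most frequent ids seen so far and hashes every other id into the remaining slots.

```python
from collections import Counter
from typing import Dict, Iterable


def hashed_slot(sparse_id: int, table_size: int) -> int:
    """Conventional approach: fold the id range into the table with a modulo hash."""
    return sparse_id % table_size


def build_managed_mapping(ids: Iterable[int], num_tracked: int) -> Dict[int, int]:
    """Give each of the num_tracked most common ids its own dedicated slot."""
    counts = Counter(ids)
    return {
        sparse_id: slot
        for slot, (sparse_id, _count) in enumerate(counts.most_common(num_tracked))
    }


def managed_slot(sparse_id: int, mapping: Dict[int, int], table_size: int) -> int:
    """Tracked ids get a unique slot; all other ids share the remaining slots.

    Assumes table_size > len(mapping), so at least one shared slot exists.
    """
    if sparse_id in mapping:
        return mapping[sparse_id]
    shared = table_size - len(mapping)
    return len(mapping) + sparse_id % shared


table_size = 4
stream = [10, 14, 10, 14, 10, 14, 3, 18, 10, 14]

# With plain hashing, the two hottest ids collide: 10 % 4 == 14 % 4 == 2.
collide = hashed_slot(10, table_size) == hashed_slot(14, table_size)

# With tracking, the two hottest ids each get their own slot.
mapping = build_managed_mapping(stream, num_tracked=2)
slots = {i: managed_slot(i, mapping, table_size) for i in set(stream)}
```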

[Prototype] UVM Caching - Prefetch Training Pipeline

For tables where on-device memory is insufficient to hold the entire embedding table, it is common to use a caching architecture in which part of the embedding table is cached on device and the full embedding table lives in host memory (typically DDR SDRAM). In practice, however, cache misses are common and hurt performance because of the relatively high latency of going to host memory. Building on TorchRec’s existing data pipelining, we developed a new Prefetch Training Pipeline that avoids these cache misses by prefetching the relevant embeddings for the upcoming batch from host memory, effectively eliminating cache misses in the forward path.
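
The effect of prefetching on forward-path misses can be shown with a toy simulation. This is a sketch under simplifying assumptions, not TorchRec code: ToyCache and run are hypothetical names, the cache is unbounded (no eviction), and the prefetch runs sequentially rather than overlapping with compute on a separate stream.

```python
from typing import Dict, List


class ToyCache:
    """Device-side cache in front of a host-memory table (both are plain dicts here)."""

    def __init__(self, host_table: Dict[int, float]) -> None:
        self.host_table = host_table
        self.cached: Dict[int, float] = {}
        self.forward_misses = 0

    def prefetch(self, ids: List[int]) -> None:
        """Copy the rows for an upcoming batch from host memory into the cache."""
        for i in ids:
            self.cached.setdefault(i, self.host_table[i])

    def forward(self, ids: List[int]) -> List[float]:
        """Look up rows; a miss falls back to the slow host table."""
        out = []
        for i in ids:
            if i not in self.cached:
                self.forward_misses += 1
                self.cached[i] = self.host_table[i]
            out.append(self.cached[i])
        return out


def run(batches: List[List[int]], use_prefetch: bool) -> int:
    """Run all batches and return the number of misses seen in the forward path."""
    cache = ToyCache({i: float(i) for i in range(100)})
    if use_prefetch and batches:
        cache.prefetch(batches[0])
    for step, batch in enumerate(batches):
        # In a real pipeline, prefetching batch step+1 would overlap with computing batch step.
        if use_prefetch and step + 1 < len(batches):
            cache.prefetch(batches[step + 1])
        cache.forward(batch)
    return cache.forward_misses


batches = [[1, 2, 3], [3, 4, 5], [6, 7, 1]]
misses_without = run(batches, use_prefetch=False)
misses_with = run(batches, use_prefetch=True)
```

In this toy setup every row needed by forward has already been staged by the time the batch runs, so the host-memory latency moves out of the forward path.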


03 Oct 22:48
v0.5.0-rc2 Pre-release

Install fbgemm via nova


13 Sep 01:15
v0.5.0-rc1 Pre-release
remove fbgemm-gpu-nightly instead


15 Mar 20:48

Train pipeline improvements

The train pipeline now lets the user specify whether all pipelined batches should be executed after the dataloader iterator is exhausted. Normally, when StopIteration is raised, the train pipeline halts with the last 2 pipelined batches yet to be executed.
Core train pipeline logic has been refactored for better readability and maintainability.
The memcpy and data_dist streams have been set to high priority. We had seen kernel launches scheduled late even when nothing on the GPU was blocking the kernel, which blocks the CPU unnecessarily; we see perf gains after making this change.
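
The difference between halting and draining when the dataloader is exhausted can be sketched with a plain generator. This is an illustrative model only: the pipelined function and its parameters are hypothetical and stand in for a pipeline that keeps 2 batches in flight.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def pipelined(batches: Iterable[int], depth: int, execute_all_batches: bool) -> Iterator[int]:
    """Yield batches for execution while keeping `depth` batches in flight."""
    in_flight: Deque[int] = deque()
    for batch in batches:
        in_flight.append(batch)
        if len(in_flight) > depth:
            yield in_flight.popleft()
    # The dataloader is exhausted here; up to `depth` batches are still queued.
    if execute_all_batches:
        while in_flight:
            yield in_flight.popleft()


data = [0, 1, 2, 3, 4]
halted = list(pipelined(data, depth=2, execute_all_batches=False))
drained = list(pipelined(data, depth=2, execute_all_batches=True))
```

With the flag off, the last two queued batches are never executed; with it on, the queue is drained before the pipeline stops.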

FX + Script Inference Module

Sharded quantized EmbeddingBagCollection and EmbeddingCollection are now torch.fx-traceable and TorchScript-compatible (via torch.jit.script applied to the FX-traced module), so they can be run with TorchScript.


Added an include_logloss option to the NE metric, which returns the log of cross entropy loss in addition to NE.
Added a grouped AUC metric option. To use it, set grouped_auc=True when instantiating the AUC metric, and pass an additional grouping_keys tensor to the update method to specify the group_id of each element along the batch dimension. The grouped AUC then computes one AUC per group and returns their average.

from torchrec.metrics.auc import AUCMetric

# Enable grouped_auc during metric instantiation.
auc = AUCMetric(world_size=4, my_rank=0, batch_size=64, tasks=["t1"], grouped_auc=True)
# Provide grouping keys during update.
auc.update(predictions=predictions, labels=labels, weights=weights, grouping_keys=grouping_keys)
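
What "grouped AUC" computes can be shown with a small standalone sketch. It is not the TorchRec implementation: the function names are hypothetical, weights are ignored, and a group containing only one class raises an error here, which may differ from how the library handles that case.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def pairwise_auc(predictions: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for p, y in zip(predictions, labels) if y == 1]
    neg = [p for p, y in zip(predictions, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def grouped_auc_sketch(
    predictions: Sequence[float], labels: Sequence[int], grouping_keys: Sequence[int]
) -> float:
    """Compute one AUC per group id, then return the unweighted mean over groups."""
    groups: Dict[int, Tuple[List[float], List[int]]] = defaultdict(lambda: ([], []))
    for p, y, g in zip(predictions, labels, grouping_keys):
        groups[g][0].append(p)
        groups[g][1].append(y)
    per_group = [pairwise_auc(ps, ys) for ps, ys in groups.values()]
    return sum(per_group) / len(per_group)


predictions = [0.9, 0.2, 0.6, 0.4, 0.5, 0.8]
labels = [1, 0, 0, 1, 0, 1]
grouping_keys = [0, 0, 0, 1, 1, 1]

overall = pairwise_auc(predictions, labels)
grouped = grouped_auc_sketch(predictions, labels, grouping_keys)
```

Here group 0 is ranked perfectly and group 1 is ranked half correctly, so the grouped value is the mean of the two per-group AUCs and differs from the AUC computed over the whole batch.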



15 Dec 02:27


We observed a performance regression due to a bottleneck in sparse data distribution for models that have multiple, large KJTs to redistribute.

To combat this, we altered the comms pattern so that the initial collective transports only the minimum data required to support the collective calls for the actual KJT tensor data. Sending this data, the ‘splits’, in an initial collective means more data is transmitted over the comms stream overall, but the CPU is blocked for significantly less time, leading to better overall QPS.

Furthermore, we altered the TorchRec train pipeline to group the initial collective calls for the splits together before launching the more expensive KJT tensor collective calls. This pseudo ‘fusing’ minimizes the CPU blocked time as launching each subsequent input distribution is no longer dependent on the previous input distribution.
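
The two-phase pattern described above can be modeled in a single process. The sketch below is a simplified simulation, not the actual collective code: ranks are list indices, the function names are hypothetical, and "launching" a collective is just a function call. It exchanges the splits (chunk lengths) first, then uses them to size the receive buffers for the data exchange, with all splits exchanges grouped ahead of the data exchanges.

```python
from typing import List

# send[src][dst] is the chunk that rank `src` sends to rank `dst`.
Chunks = List[List[List[int]]]


def all_to_all_splits(send: Chunks) -> List[List[int]]:
    """Phase 1: a small collective that exchanges only chunk lengths."""
    world = len(send)
    return [[len(send[src][dst]) for src in range(world)] for dst in range(world)]


def all_to_all_data(send: Chunks, recv_splits: List[List[int]]) -> List[List[int]]:
    """Phase 2: the large collective; receive buffers are sized from the splits."""
    world = len(send)
    received = []
    for dst in range(world):
        buffer = [0] * sum(recv_splits[dst])  # preallocated using phase 1 results
        offset = 0
        for src in range(world):
            chunk = send[src][dst]
            buffer[offset:offset + len(chunk)] = chunk
            offset += len(chunk)
        received.append(buffer)
    return received


# Two KJT-like inputs to redistribute across 2 ranks.
kjt_a: Chunks = [[[1], [2, 3]], [[4, 5, 6], []]]
kjt_b: Chunks = [[[7, 8], []], [[9], [10]]]

# Grouped ordering: issue all of the cheap splits exchanges first...
splits = [all_to_all_splits(k) for k in (kjt_a, kjt_b)]
# ...then the expensive data exchanges, none of which waits on another input's splits.
outputs = [all_to_all_data(k, s) for k, s in zip((kjt_a, kjt_b), splits)]
```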

We no longer pass in variable batch size in the sharder


On the planner side, we introduced a new “early stopping” feature in GreedyProposer. This brings a 4X speedup to the planner when there are many proposals (>1000) to evaluate. To use the feature, simply add “threshold=10” to GreedyProposer (10 is the suggested value, which means GreedyProposer will stop proposing after seeing 10 consecutive bad proposals). Secondly, we refactored the “deepcopy” logic in the planner code, which brings an 8X speedup in overall planning time. See PR #665 for the details.
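
The early stopping rule can be sketched independently of the planner. This is a hypothetical illustration rather than GreedyProposer's code: it assumes a "bad" proposal is one that fails to improve on the best score seen so far (lower is better), and the function name is invented for the example.

```python
from typing import Callable, Iterable, Optional, Tuple


def search_with_early_stopping(
    proposals: Iterable[int],
    score: Callable[[int], float],
    threshold: int,
) -> Tuple[Optional[int], int]:
    """Return the best proposal and how many proposals were evaluated.

    Stops after `threshold` consecutive proposals that do not improve the best score.
    """
    best: Optional[int] = None
    best_score = float("inf")
    consecutive_bad = 0
    evaluated = 0
    for proposal in proposals:
        evaluated += 1
        current = score(proposal)
        if current < best_score:
            best, best_score = proposal, current
            consecutive_bad = 0
        else:
            consecutive_bad += 1
            if consecutive_bad >= threshold:
                break
    return best, evaluated


# 1000 candidate proposals whose score is best at proposal 5.
best, evaluated = search_with_early_stopping(
    range(1000), score=lambda p: abs(p - 5), threshold=10
)
```

In this example the search stops after a small fraction of the 1000 candidates, which is where the planning-time savings come from when most later proposals are no better than the current best.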

Pinning requirements

We are also pinning requirements to provide more stability for TorchRec users.