
Update dependency torchaudio to v2 #116

Open

mend-for-github-com[bot] wants to merge 1 commit into master from whitesource-remediate/torchaudio-2.x

Conversation


@mend-for-github-com mend-for-github-com bot commented Mar 15, 2023

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| torchaudio | major | `==0.8.0` -> `==2.3.1` |
| torchaudio | major | `==0.7.2` -> `==2.3.1` |

Release Notes

pytorch/audio (torchaudio)

v2.3.1: TorchAudio 2.3.1 Release

Compare Source

This release is compatible with PyTorch 2.3.1 patch release. There are no new features added.

v2.3.0: TorchAudio 2.3.0 Release

Compare Source

This release is compatible with PyTorch 2.3.0 patch release. There are no new features added.

This release contains minor documentation and code quality improvements (#​3734, #​3748, #​3757, #​3759)

v2.2.2: TorchAudio 2.2.2 Release

Compare Source

This release is compatible with PyTorch 2.2.2 patch release. There are no new features added.

v2.2.1: TorchAudio 2.2.1 Release

Compare Source

This release is compatible with PyTorch 2.2.1 patch release. There are no new features added.

v2.2.0: TorchAudio 2.2.0 Release

Compare Source

New Features
Bug Fixes
Recipe Updates

v2.1.2: TorchAudio 2.1.2 Release

Compare Source

This is a patch release, which is compatible with PyTorch 2.1.2. There are no new features added.

v2.1.1

Compare Source

This is a minor release, which is compatible with PyTorch 2.1.1 and includes bug fixes, improvements and documentation updates.

Bug Fixes
  • Cherry-pick 2.1.1: Fix WavLM bundles (#​3665)
  • Cherry-pick 2.1.1: Add back compression level in i/o dispatcher backend by (#​3666)

v2.1.0: Torchaudio 2.1 Release Note

Compare Source

Highlights

TorchAudio v2.1 introduces the following new features and backward-incompatible changes:

  1. [BETA] A new API to apply filters, effects and codecs
    torchaudio.io.AudioEffector can apply filters, effects and encodings to waveforms in an online/offline fashion.
    You can use it as a form of augmentation.
    Please refer to https://pytorch.org/audio/2.1/tutorials/effector_tutorial.html for the examples.
  2. [BETA] Tools for forced alignment
    New functions and a pre-trained model for forced alignment were added.
    torchaudio.functional.forced_align computes alignment from an emission and torchaudio.pipelines.MMS_FA provides access to the model trained for multilingual forced alignment in MMS: Scaling Speech Technology to 1000+ languages project.
    Please refer to https://pytorch.org/audio/2.1/tutorials/ctc_forced_alignment_api_tutorial.html for the usage of forced_align function, and https://pytorch.org/audio/2.1/tutorials/forced_alignment_for_multilingual_data_tutorial.html for how one can use MMS_FA to align transcript in multiple languages.
  3. [BETA] TorchAudio-Squim : Models for reference-free speech assessment
    Model architectures and pre-trained models from the paper TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio were added.
    You can use torchaudio.pipelines.SQUIM_SUBJECTIVE and torchaudio.pipelines.SQUIM_OBJECTIVE models to estimate the various speech quality and intelligibility metrics. This is helpful when evaluating the quality of speech generation models, such as TTS.
    Please refer to https://pytorch.org/audio/2.1/tutorials/squim_tutorial.html for the detail.
  4. [BETA] CUDA-based CTC decoder
    torchaudio.models.decoder.CUCTCDecoder takes emission stored in CUDA memory and performs CTC beam search on it in CUDA device. The beam search is fast. It eliminates the need to move data from CUDA device to CPU when performing automatic speech recognition. With PyTorch's CUDA support, it is now possible to perform the entire speech recognition pipeline in CUDA.
    Please refer to https://pytorch.org/audio/2.1/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html for the detail.
  5. [Prototype] Utilities for AI music generation
    We are working to add utilities that are relevant to music AI. Since the last release, the following APIs were added to the prototype.
    Please refer to respective documentation for the usage.
    • torchaudio.prototype.chroma_filterbank
    • torchaudio.prototype.transforms.ChromaScale
    • torchaudio.prototype.transforms.ChromaSpectrogram
    • torchaudio.prototype.pipelines.VGGISH
  6. New recipes for training models.
    Recipes for Audio-visual ASR, multi-channel DNN beamforming and TCPGen context-biasing were added.
    Please refer to the recipes
  7. Update to FFmpeg support
    The version of supported FFmpeg libraries was updated.
    TorchAudio v2.1 works with FFmpeg 6, 5 and 4.4. Support for 4.3, 4.2 and 4.1 is dropped.
    Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail of the new FFmpeg integration mechanism.
  8. Update to libsox integration
    TorchAudio now depends on libsox installed separately from torchaudio. The sox I/O backend no longer supports file-like objects. (These are supported by the FFmpeg and soundfile backends.)
    Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail.
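To make the emission-to-labels step behind the CUDA CTC decoder in item 4 concrete, here is a toy greedy (argmax) CTC collapse in pure Python. This is a sketch only: it is plain greedy decoding, not the beam search that torchaudio.models.decoder.CUCTCDecoder performs, and none of these names are torchaudio APIs.

```python
# Toy greedy CTC collapse: pick the best label per frame, merge repeats,
# then drop blanks. `emissions` is a list of per-frame score rows.

def greedy_ctc_decode(emissions, blank=0):
    best = [max(range(len(row)), key=row.__getitem__) for row in emissions]
    out = []
    prev = None
    for label in best:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frames spell [blank, 'A', 'A', blank, 'B'] with classes (0=blank, 1='A', 2='B').
emissions = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
print(greedy_ctc_decode(emissions))  # [1, 2]
```

A real decoder scores many candidate prefixes per frame instead of one; the collapse rule (merge repeats, remove blanks) is the same.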
New Features
I/O
  • Support overwriting PTS in torchaudio.io.StreamWriter (#​3135)
  • Include format information after filter torchaudio.io.StreamReader.get_out_stream_info (#​3155)
  • Support CUDA frame in torchaudio.io.StreamReader filter graph (#​3183, #​3479)
  • Support YUV444P in GPU decoder (#​3199)
  • Add additional filter graph processing to torchaudio.io.StreamWriter (#​3194)
  • Cache and reuse HW device context in GPU decoder (#​3178)
  • Cache and reuse HW device context in GPU encoder (#​3215)
  • Support changing the number of channels in torchaudio.io.StreamReader (#​3216)
  • Support encode spec change in torchaudio.io.StreamWriter (#​3207)
  • Support encode options such as compression rate and bit rate (#​3179, #​3203, #​3224)
  • Add 420p10le support to torchaudio.io.StreamReader CPU decoder (#​3332)
  • Support multiple FFmpeg versions (#​3464, #​3476)
  • Support writing opus and mp3 with soundfile (#​3554)
  • Add switch to disable sox integration and ffmpeg integration at runtime (#​3500)
Ops
Models
  • Add torchaudio.models.SquimObjective for speech enhancement (#​3042, 3087, #​3512)
  • Add torchaudio.models.SquimSubjective for speech enhancement (#​3189)
  • Add torchaudio.models.decoder.CUCTCDecoder (#​3096)
Pipelines
  • Add torchaudio.pipelines.SquimObjectiveBundle for speech enhancement (#​3103)
  • Add torchaudio.pipelines.SquimSubjectiveBundle for speech enhancement (#​3197)
  • Add torchaudio.pipelines.MMS_FA Bundle for forced alignment (#​3521, #​3538)
Tutorials
Recipe
Backward-incompatible changes
Third-party libraries

In this release, the following third-party libraries are removed from TorchAudio binary distributions. TorchAudio now searches for and links these libraries at runtime. Please install them to use the corresponding APIs.

SoX

libsox is used for various audio I/O, filtering operations.

Pre-built binaries are available via package managers such as conda, apt and brew. Please refer to the respective documentation.

The APIs affected include:

  • torchaudio.load ("sox" backend)
  • torchaudio.info ("sox" backend)
  • torchaudio.save ("sox" backend)
  • torchaudio.sox_effects.apply_effects_tensor
  • torchaudio.sox_effects.apply_effects_file
  • torchaudio.functional.apply_codec (also deprecated, see below)

Changes related to the removal: #​3232, #​3246, #​3497, #​3035

Flashlight Text

flashlight-text is the core of CTC decoder.

Pre-built packages are available on PyPI. Please refer to https://github.com/flashlight/text for the detail.

The APIs affected include:

  • torchaudio.models.decoder.CTCDecoder

Changes related to the removal: #​3232, #​3246, #​3236, #​3339

Kaldi

A custom-built libkaldi was used to implement torchaudio.functional.compute_kaldi_pitch. This function, along with the libkaldi integration, is removed in this release. There is no replacement.

Changes related to the removal: #​3368, #​3403

I/O
  • Switch to the backend dispatcher (#​3241)

To make I/O operations more flexible, TorchAudio introduced the backend dispatcher in v2.0, and users could opt-in to use the dispatcher.
In this release, the backend dispatcher becomes the default mechanism for selecting the I/O backend.

You can pass the backend argument to the torchaudio.info, torchaudio.load and torchaudio.save functions to select the I/O backend library on a per-call basis. (If it is omitted, an available backend is automatically selected.)

If you want to use the global backend mechanism, you can set the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=0.
Please note, however, that the global backend mechanism is deprecated and is going to be removed in the next release.

Please see #​2950 for the detail of migration work.

torchaudio.io.StreamReader accepted a byte string wrapped in a 1D torch.Tensor object. This is no longer supported.
Please wrap the underlying data with io.BytesIO instead.
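A minimal sketch of the migration. The bytes here are fake placeholder data, not real media, and constructing StreamReader is omitted; the point is only that the buffer becomes a file-like object.

```python
import io

# Hypothetical in-memory media bytes (a real use would hold e.g. WAV/MP4 data).
data = b"RIFF\x24\x00\x00\x00WAVEfmt "

# Old (removed): src = torch.frombuffer(data, dtype=torch.uint8); StreamReader(src)
# New: wrap the bytes in a file-like object exposing read()/seek().
src = io.BytesIO(data)

print(src.read(4))  # b'RIFF'
src.seek(0)         # rewind before handing the object to StreamReader
```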

The optional arguments of add_[audio|video]_stream methods of torchaudio.io.StreamReader and torchaudio.io.StreamWriter are now keyword-only arguments.

  • Drop support for FFmpeg < 4.4 (#​3561, #​3557)

Previously, TorchAudio supported FFmpeg 4 (>=4.1, <=4.4). In this release, TorchAudio supports FFmpeg 4, 5 and 6 (>=4.4, <7). With this change, support for FFmpeg 4.1, 4.2 and 4.3 is dropped.
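The new support window can be sketched as a small version check (a hypothetical helper for illustration, not a torchaudio API):

```python
# TorchAudio v2.1 supports FFmpeg >=4.4 and <7, per the note above.
SUPPORTED_MIN = (4, 4)
SUPPORTED_MAX_EXCLUSIVE = (7, 0)

def ffmpeg_supported(major, minor):
    """True if the (major, minor) FFmpeg version falls in the supported range."""
    return SUPPORTED_MIN <= (major, minor) < SUPPORTED_MAX_EXCLUSIVE

print(ffmpeg_supported(4, 4))  # True
print(ffmpeg_supported(4, 3))  # False (dropped in v2.1)
print(ffmpeg_supported(6, 0))  # True
print(ffmpeg_supported(7, 0))  # False
```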

Ops
  • Use named file in torchaudio.functional.apply_codec (#​3397)

In previous versions, TorchAudio shipped a custom-built libsox so that it could perform in-memory decoding and encoding.
Now, in-memory decoding and encoding are handled by the FFmpeg binding, and with the switch to dynamic libsox linking, torchaudio.functional.apply_codec no longer processes audio in memory. Instead, it writes to a temporary file.
For in-memory processing, please use torchaudio.io.AudioEffector.

  • Switch to lstsq when solving InverseMelScale (#​3280)

Previously, torchaudio.transforms.InverseMelScale ran an SGD optimizer to find the inverse of the mel-scale transform. This approach has a number of issues, as listed in #​2643.

This release switches to using torch.linalg.lstsq.

Models

The infer method of torchaudio.models.RNNTBeamSearch has been updated to accept a series of previous hypotheses.

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
decoder: RNNTBeamSearch = bundle.get_decoder()

hypothesis = None
while streaming:
    ...
    hypo, state = decoder.infer(
        features,
        length,
        beam_width,
        state=state,
        hypothesis=hypothesis,
    )
    ...
    hypothesis = hypo
    # Previously this had to be hypothesis = hypo[0]
Deprecations
Ops
  • Update and deprecate torchaudio.functional.apply_codec function (#​3386)

Due to the removal of the custom libsox binding, torchaudio.functional.apply_codec no longer supports in-memory processing. Please migrate to torchaudio.io.AudioEffector.

Please refer to the documentation for the detailed usage of torchaudio.io.AudioEffector.

Bug Fixes
Models
  • Fix the negative sampling in ConformerWav2Vec2PretrainModel (#​3085)
  • Fix extract_features method for WavLM models (#​3350)
Tutorials
  • Fix backtracking in forced alignment tutorial (#​3440)
  • Fix initialization of get_trellis in forced alignment tutorial (#​3172)
Build
  • Fix MKL issue on Intel mac build (#​3307)
I/O
  • Suppress warning when saving vorbis with sox backend (#​3359)
  • Fix g722 encoding in torchaudio.io.StreamWriter (#​3373)
  • Refactor arg mapping in ffmpeg save function (#​3387)
  • Fix save INT16 sox backend (#​3524)
  • Fix SoundfileBackend method decorators (#​3550)
  • Fix PTS initialization when using NVIDIA encoder (#​3312)
Ops
  • Add non-default CUDA device support to lfilter (#​3432)
Improvements
I/O
Ops
  • Add arbitrary dim Tensor support to mask_along_axis{,_iid} (#​3289)
  • Fix resampling to support dynamic input lengths for onnx exports. (#​3473)
  • Optimize Torchaudio Vad (#​3382)
Documentation
  • Build and use GPU-enabled FFmpeg in doc CI (#​3045)
  • Misc tutorial update (#​3449)
  • Update notes on FFmpeg version (#​3480)
  • Update documentation about dependencies (#​3517)
  • Update I/O and backend docs (#​3555)
Tutorials
  • Update data augmentation tutorial (#​3375)
  • Add more explanation about n_fft (#​3442)
Build
Recipe
  • Fix Adam and AdamW initializers in wav2letter example (#​3145)
  • Update LibriSpeech RNNT recipe to support Lightning 2.0 (#​3336)
  • Update HuBERT/SSL training recipes to support Lightning 2.x (#​3396)
  • Add wav2vec2 loss function in self_supervised_learning training recipe (#​3090)
  • Add Wav2Vec2DataModule in self_supervised_learning training recipe (#​3081)
Other
  • Use FFmpeg6 in build doc (#​3475)
  • Use FFmpeg6 in unit test (#​3570)
  • Migrate torch.norm to torch.linalg.vector_norm (#​3522)
  • Migrate torch.nn.utils.weight_norm to nn.utils.parametrizations.weight_norm (#​3523)

v2.0.2

Compare Source

TorchAudio 2.0.2 Release Note

This is a minor release, which is compatible with PyTorch 2.0.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.

Bug fix

Full Changelog: pytorch/audio@v2.0.1...v2.0.2

v2.0.1: Torchaudio 2.0 Release Note

Highlights

TorchAudio 2.0 release includes:

  • Data augmentation operators, e.g. convolution, additive noise, speed perturbation
  • WavLM and XLS-R models and pre-trained pipelines
  • Backend dispatcher powering revised info, load, save functions
  • Dropped support of Python 3.7
  • Added Python 3.11 support
[Beta] Data augmentation operators

The release adds several data augmentation operators under torchaudio.functional and torchaudio.transforms:

  • torchaudio.functional.add_noise
  • torchaudio.functional.convolve
  • torchaudio.functional.deemphasis
  • torchaudio.functional.fftconvolve
  • torchaudio.functional.preemphasis
  • torchaudio.functional.speed
  • torchaudio.transforms.AddNoise
  • torchaudio.transforms.Convolve
  • torchaudio.transforms.Deemphasis
  • torchaudio.transforms.FFTConvolve
  • torchaudio.transforms.Preemphasis
  • torchaudio.transforms.Speed
  • torchaudio.transforms.SpeedPerturbation

The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.

For usage details, please refer to the documentation for torchaudio.functional and torchaudio.transforms, and tutorial “Audio Data Augmentation”.
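The noise-addition operators mix noise into a signal at a target signal-to-noise ratio. The power-ratio arithmetic behind that idea can be sketched in pure Python (hypothetical helper names; torchaudio.functional.add_noise itself operates on tensors with a different signature):

```python
import math

def noise_gain_for_snr(signal_power, noise_power, snr_db):
    """Gain applied to the noise so the signal/noise power ratio hits snr_db."""
    target_ratio = 10.0 ** (snr_db / 10.0)  # dB -> linear power ratio
    return math.sqrt(signal_power / (noise_power * target_ratio))

# At 0 dB SNR with equal powers, no scaling is needed.
g = noise_gain_for_snr(1.0, 1.0, 0.0)
print(round(g, 6))  # 1.0

# Achieved SNR after scaling the noise by g:
achieved_db = 10 * math.log10(1.0 / (1.0 * g ** 2))
print(round(achieved_db, 6))  # 0.0
```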

[Beta] WavLM and XLS-R models and pre-trained pipelines

The release adds two self-supervised learning models for speech and audio.

  • WavLM that is robust to noise and reverberation.
  • XLS-R that is trained on cross-lingual datasets.

Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:

  • torchaudio.pipelines.WAVLM_BASE
  • torchaudio.pipelines.WAVLM_BASE_PLUS
  • torchaudio.pipelines.WAVLM_LARGE
  • torchaudio.pipelines.WAV2VEC_XLSR_300M
  • torchaudio.pipelines.WAV2VEC_XLSR_1B
  • torchaudio.pipelines.WAV2VEC_XLSR_2B

For usage details, please refer to factory function and pre-trained pipelines documentation.

Backend dispatcher

Release 2.0 introduces new versions of I/O functions torchaudio.info, torchaudio.load and torchaudio.save, backed by a dispatcher that allows for selecting one of backends FFmpeg, SoX, and SoundFile to use, subject to library availability. Users can enable the new logic in Release 2.0 by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1; the new logic will be enabled by default in Release 2.1.

# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")

# Load audio (with no backend specified, the function prioritizes FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")

# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")

Please see the documentation for torchaudio for more details.

Backward-incompatible changes
  • Dropped Python 3.7 support (#​3020)
    Following the upstream PyTorch (https://github.com/pytorch/pytorch/pull/93155), the support for Python 3.7 has been dropped.

  • Default to "precise" seek in torchaudio.io.StreamReader.seek (#​2737, #​2841, #​2915, #​2916, #​2970)
    Previously, the StreamReader.seek method sought the key frame closest to the given timestamp. A new option, mode, has been added, which can switch the behavior to seeking into any type of frame, including non-key frames, closest to the given timestamp; this behavior is now the default.

  • Removed deprecated/unused/undocumented functions from datasets.utils (#​2926, #​2927)
    The following functions are removed from datasets.utils:

    • stream_url
    • download_url
    • validate_file
    • extract_archive.
Deprecations
Ops
  • Deprecated 'onesided' init param for MelSpectrogram (#​2797, #​2799)
    torchaudio.transforms.MelSpectrogram assumes the onesided argument to always be True. The forward path fails if its value is False; therefore, this argument is deprecated. Users who specify this argument should stop specifying it.

  • Deprecated "sinc_interpolation" and "kaiser_window" option value in favor of "sinc_interp_hann" and "sinc_interp_kaiser" (#​2922)
    The valid values of the resampling_method argument of the resampling operations (torchaudio.transforms.Resample and torchaudio.functional.resample) have changed: "kaiser_window" is now "sinc_interp_kaiser" and "sinc_interpolation" is now "sinc_interp_hann". The old values will continue to work, but users are encouraged to update their code.
    For the reasoning behind this change, please refer to #​2891.
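The renaming can be expressed as a small mapping (a hypothetical shim for illustration; torchaudio itself still accepts the old names):

```python
# Deprecated resampling_method values and their replacements.
RENAMES = {
    "sinc_interpolation": "sinc_interp_hann",
    "kaiser_window": "sinc_interp_kaiser",
}

def canonical_resampling_method(name):
    """Map a possibly-deprecated value to its new name; pass new names through."""
    return RENAMES.get(name, name)

print(canonical_resampling_method("kaiser_window"))     # sinc_interp_kaiser
print(canonical_resampling_method("sinc_interp_hann"))  # sinc_interp_hann
```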

  • Deprecated sox initialization/shutdown public API functions (#​3010)
    torchaudio.sox_effects.init_sox_effects and torchaudio.sox_effects.shutdown_sox_effects are deprecated. They were required to use libsox-related features, but have been called automatically since v0.6, and the initialization/shutdown mechanism has been moved elsewhere. These functions are now no-ops. Users can simply remove calls to these functions.

Models
  • Deprecated static binding of Flashlight-text based CTC decoder (#​3055, #​3089)
    Since v0.12, TorchAudio binary distributions included the CTC decoder based on flashlight-text project. In a future release, TorchAudio will switch to dynamic binding of underlying CTC decoder implementation, and stop shipping the core CTC decoder implementations. Users who would like to use the CTC decoder need to separately install the CTC decoder from the upstream flashlight-text project. Other functionalities of TorchAudio will continue to work without flashlight-text.
    Note: The API and numerical behavior does not change.
    For more detail, please refer #​3088.
I/O
  • Deprecated file-like object support in sox_io (#​3033)
    As preparation for switching to a dynamically bound libsox, file-like object support in the sox_io backend has been deprecated. It will be removed in the 2.1 release in favor of the dispatcher. This deprecation affects the following functionalities.
    • I/O: torchaudio.load, torchaudio.info and torchaudio.save.
    • Effects: torchaudio.sox_effects.apply_effects_file and torchaudio.functional.apply_codec.
      For I/O, to continue using file-like objects, please use the new dispatcher mechanism.
      For effects, replacement functions will be added in the next release.
  • Deprecated the use of Tensor as a container for byte string in StreamReader (#​3086)
    torchaudio.io.StreamReader supports decoding media from byte strings contained in 1D tensors of torch.uint8 type. Using the torch.Tensor type as a container for a byte string is now deprecated. To pass a byte string, please wrap it with io.BytesIO.

    Deprecated:
        data = b"..."
        src = torch.frombuffer(data, dtype=torch.uint8)
        StreamReader(src)

    Migration:
        data = b"..."
        src = io.BytesIO(data)
        StreamReader(src)
Bug Fixes
Ops
  • Fixed contiguous error when backpropagating through torchaudio.functional.lfilter (#​3080)
Pipelines
  • Added layer normalization to wav2vec2 large+ pretrained models (#​2873)
    In self-supervised learning models such as Wav2Vec 2.0, HuBERT, or WavLM, layer normalization should be applied to waveforms if the convolutional feature extraction module uses layer normalization and is trained on a large-scale dataset. After adding layer normalization to those affected models, the Word Error Rate is significantly reduced.

Without the change in #​2873, the WER results are:

| Model | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| WAV2VEC2_ASR_LARGE_LV60K_10M | 10.59 | 15.62 | 9.58 | 16.33 |
| WAV2VEC2_ASR_LARGE_LV60K_100H | 2.80 | 6.01 | 2.82 | 6.34 |
| WAV2VEC2_ASR_LARGE_LV60K_960H | 2.36 | 4.43 | 2.41 | 4.96 |
| HUBERT_ASR_LARGE | 1.85 | 3.46 | 2.09 | 3.89 |
| HUBERT_ASR_XLARGE | 2.21 | 3.40 | 2.26 | 4.05 |

After applying layer normalization, the updated WER results are:

| Model | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| WAV2VEC2_ASR_LARGE_LV60K_10M | 6.77 | 10.03 | 6.87 | 10.51 |
| WAV2VEC2_ASR_LARGE_LV60K_100H | 2.19 | 4.55 | 2.32 | 4.64 |
| WAV2VEC2_ASR_LARGE_LV60K_960H | 1.78 | 3.51 | 2.03 | 3.68 |
| HUBERT_ASR_LARGE | 1.77 | 3.32 | 2.03 | 3.68 |
| HUBERT_ASR_XLARGE | 1.73 | 2.72 | 1.90 | 3.16 |
Recipe
  • Fixed DDP training in HuBERT recipes (#​3068)
    If shuffle is set to True in BucketizeBatchSampler, the seed is the same only for the first epoch. In later epochs, each BucketizeBatchSampler object generates a different shuffled iteration list, which may cause DDP training to hang forever if the lengths of the iteration lists differ across nodes. In the 2.0.0 release, the issue is fixed by using the same seed for the RNG on all nodes.
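The idea of the fix, deriving the shuffle order from a seed shared by all nodes plus the epoch number, can be sketched as follows (hypothetical helper, not the BucketizeBatchSampler implementation):

```python
import random

def epoch_shuffled_indices(num_batches, epoch, base_seed=0):
    """Same (base_seed, epoch) on every DDP node -> identical order everywhere."""
    rng = random.Random(base_seed + epoch)
    indices = list(range(num_batches))
    rng.shuffle(indices)
    return indices

# Any two "nodes" computing the same epoch agree on the order:
print(epoch_shuffled_indices(8, epoch=3) == epoch_shuffled_indices(8, epoch=3))  # True
```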
IO
  • Fixed signature mismatch on _fail_info_fileobj (#​3032)
  • Remove unnecessary AVFrame allocation (#​3021)
    This fixes the memory leak reported in torchaudio.io.StreamReader.
New Features
Ops
Models
Pipelines
I/O
  • Added rgb48le and CUDA p010 support (HDR/10bit) to StreamReader (#​3023)
  • Added fill_buffer method to torchaudio.io.StreamReader (#​2954, #​2971)
  • Added buffer_chunk_size=-1 option to torchaudio.io.StreamReader (#​2969)
    When buffer_chunk_size=-1, StreamReader does not drop any buffered frame. Together with the fill_buffer method, this is a recommended way to load the entire media.
    reader = StreamReader("video.mp4")
    reader.add_basic_audio_stream(buffer_chunk_size=-1)
    reader.add_basic_video_stream(buffer_chunk_size=-1)
    reader.fill_buffer()
    audio, video = reader.pop_chunks()
  • Added PTS support to torchaudio.io.StreamReader (#​2975)
    torchaudio.io.StreamReader now gives the PTS (presentation time stamp) of the media chunk it is returning. To maintain backward compatibility, the timestamp information is attached to the returned media chunk.
    reader = StreamReader(...)
    reader.add_basic_audio_stream(...)
    reader.add_basic_video_stream(...)
    for audio_chunk, video_chunk in reader.stream():
        # Fetch the timestamp
        print(audio_chunk.pts)
        print(video_chunk.pts)
        # Chunks behave the same as torch.Tensor
        audio_chunk.mean(dim=1)
Other
  • Add utility functions to check information about FFmpeg (#​2958, #​3014)
    The following functions are added to torchaudio.utils.ffmpeg_utils, which can be used to query the dynamically linked FFmpeg libraries.
    • get_demuxers()
    • get_muxers()
    • get_audio_decoders()
    • get_audio_encoders()
    • get_video_decoders()
    • get_video_encoders()
    • get_input_devices()
    • get_output_devices()
    • get_input_protocols()
    • get_output_protocols()
    • get_build_config()
Recipes
  • Add modularized SSL training recipe (#​2876)
Improvements
I/O
  • Refactor StreamReader/Writer implementation

  • Added logging to torchaudio.io.StreamReader/Writer (#​2878)

  • Fixed the #threads used by FilterGraph to 1 (#​2985)

  • Fixed the default #threads used by decoder to 1 in torchaudio.io.StreamReader (#​2949)

  • Moved libsox integration from libtorchaudio to libtorchaudio_sox (#​2929)

  • Added query methods to FilterGraph (#​2976)

Ops
  • Added logging to MelSpectrogram and Spectrogram (#​2861)
  • Fixed filtering function fallback mechanism (#​2953)
  • Enabled log probs input for RNN-T loss (#​2798)
  • Refactored extension modules initialization (#​2968)
  • Updated the guard mechanism for FFmpeg-related features (#​3028)
  • Updated the guard mechanism for cuda_version (#​2952)
Models
  • Renamed generator to vocoder in HiFiGAN model and factory functions (#​2955)
  • Enforces contiguous tensor in CTC decoder (#​3074)
Datasets
  • Validates the input path in LibriMix dataset (#​2944)
Documentation
  • Fixed docs warnings for conformer w2v2 (#​2900)
  • Updated model documentation structure (#​2902)
  • Fixed document for MelScale and InverseMelScale (#​2967)
  • Updated highlighting in doc (#​3000)
  • Added installation / build instruction to doc (#​3038)
  • Redirect build instruction to official doc (#​3053)
  • Tweak docs around IO (#​3064)
  • Improved docstring about input path to LibriMix (#​2937)
Recipes
  • Simplify train step in Conformer RNN-T LibriSpeech recipe (#​2981)
  • Update WER results for CTC n-gram decoding (#​3070)
  • Update ssl example (#​3060)
  • fix import bug in global_stats.py (#​2858)
  • Fixes examples/source_separation for WSJ0_2mix dataset (#​2987)
Tutorials
  • Added mel spectrogram visualization to Streaming ASR tutorial (#​2974)
  • Fixed mel spectrogram visualization in TTS tutorial (#​2989)
  • Updated data augmentation tutorial to use new operators (#​3062)
  • Fixed hybrid demucs tutorial for CUDA (#​3017)
  • Updated hardware accelerated video processing tutorial (#​3050)
Builds
Tests
  • Fix integration test for WAV2VEC2_ASR_LARGE_LV60K_10M (#​2910)
  • Fix CI tests on gpu machines (#​2982)
  • Remove function input parameters from data aug functional tests (#​3011)
  • Reduce the sample rate of some tests (#​2963)
Style
  • Fix type of arguments in torchaudio.io classes (#​2913)

v0.13.1: TorchAudio 0.13.1 Release Note

Compare Source

This is a minor release, which is compatible with PyTorch 1.13.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.

Bug Fix

IO

  • Make buffer size configurable in ffmpeg file object operations and set size in backend (#​2810)
  • Fix issue with the missing video frame in StreamWriter (#​2789)
  • Fix decimal FPS handling StreamWriter (#​2831)
  • Fix wrong frame allocation in StreamWriter (#​2905)
  • Fix duplicated memory allocation in StreamWriter (#​2906)

Model

Recipe

  • Fix issues in HuBERT fine-tuning recipe (#​2851)
  • Fix automatic mixed precision in HuBERT pre-training recipe (#​2854)

v0.13.0: torchaudio 0.13.0 Release Note

Compare Source

Highlights

TorchAudio 0.13.0 release includes:

  • Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
  • New datasets and metadata mode for the SUPERB benchmark
  • Custom language model support for CTC beam search decoding
  • StreamWriter for audio and video encoding
[Beta] Source Separation Models and Bundles

Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)

The TorchAudio v0.13 release includes the following features:

  • MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
  • Hybrid Demucs model architecture (docs)
  • Three factory functions suitable for different sample rate ranges
  • Pre-trained pipelines (docs) and tutorial

SDR Results of pre-trained pipelines on MUSDB-HQ test set

| Pipeline | All | Drums | Bass | Other | Vocals |
|---|---|---|---|---|---|
| HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
| HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |

* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.

Special thanks to @​adefossez for the guidance.

ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds the pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6dB SDR improvement and 15.3dB Si-SNR improvement on the Libri2Mix test set.

[Beta] Datasets and Metadata Mode for SUPERB Benchmarks

With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
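The metadata-mode pattern can be sketched with a toy dataset (a hypothetical class, not a torchaudio dataset): get_metadata(n) mirrors __getitem__(n) but returns the relative path in place of the decoded waveform, so preprocessing never touches the audio bytes.

```python
class ToySpeechDataset:
    def __init__(self, entries):
        # entries: list of (relpath, sample_rate, transcript)
        self._entries = entries

    def get_metadata(self, n):
        return self._entries[n]  # cheap: no I/O, no decoding

    def __getitem__(self, n):
        relpath, sr, transcript = self._entries[n]
        waveform = self._load(relpath)  # expensive decode happens only here
        return waveform, sr, transcript

    def _load(self, relpath):
        raise NotImplementedError("audio decoding elided in this sketch")

ds = ToySpeechDataset([("spk1/utt1.flac", 16000, "hello world")])
print(ds.get_metadata(0))  # ('spk1/utt1.flac', 16000, 'hello world')
```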

Datasets with metadata functionality:


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.


  • If you want to rebase/retry this PR, check this box

@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from df26c57 to 0fafc5b Compare May 8, 2023 20:50
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from 0fafc5b to ec9ae0a Compare October 5, 2023 04:44
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from ec9ae0a to 5eb98c2 Compare November 16, 2023 04:09
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from 5eb98c2 to e496bee Compare December 15, 2023 04:23
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from e496bee to f2584b6 Compare January 31, 2024 04:39
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from f2584b6 to cb502a7 Compare February 23, 2024 04:08
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from cb502a7 to 17dd78e Compare March 28, 2024 03:50
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from 17dd78e to 5243975 Compare April 25, 2024 04:00
@mend-for-github-com mend-for-github-com bot force-pushed the whitesource-remediate/torchaudio-2.x branch from 5243975 to 943a695 Compare June 6, 2024 03:51