
Can distil-whisper load a local speech dataset? #50

Open
shuaijiang opened this issue Dec 5, 2023 · 7 comments

Comments

@shuaijiang

shuaijiang commented Dec 5, 2023

distil-whisper loads datasets such as common_voice that can be accessed on the Hugging Face Hub, but loading a private, local speech dataset is not supported.

I implemented one method to load a local speech dataset (from a JSON file). It works, though it isn't perfect:
https://github.com/shuaijiang/distil-whisper/blob/main/training/run_distillation_local_datasets.py
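
For reference, here is a minimal sketch of the same idea using only the 🤗 Datasets JSON loader (an illustration, not the linked script); the manifest path and field names below are assumptions:

```python
# Minimal sketch: load a local JSON-lines manifest with 🤗 Datasets.
# Assumed manifest format (one object per line):
#   {"audio": "/data/wavs/utt_0001.wav", "text": "transcript"}
from datasets import load_dataset, Audio

raw = load_dataset("json", data_files={"train": "manifests/train.jsonl"})

# Cast the path column to the Audio feature so files are decoded
# (and resampled to 16 kHz) lazily when each example is accessed.
raw = raw.cast_column("audio", Audio(sampling_rate=16000))

print(raw["train"][0]["audio"]["array"].shape)
```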

@sanchit-gandhi
Collaborator

Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset

So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!
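
To make that concrete, a minimal sketch following the audio_dataset guide (the file names and Hub repo id below are hypothetical):

```python
# Minimal sketch: build a Hugging Face Dataset from in-memory Python lists,
# then attach the Audio feature as described in the audio_dataset guide.
from datasets import Dataset, Audio

# Hypothetical local files and transcripts collected from a private corpus.
audio_paths = ["clips/utt_0001.wav", "clips/utt_0002.wav"]
transcripts = ["第一句话的文本", "第二句话的文本"]

ds = Dataset.from_dict({"audio": audio_paths, "text": transcripts})
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Optionally push it to the Hub as a private repo so training scripts
# can load it by name with load_dataset().
# ds.push_to_hub("your-username/private-asr-corpus", private=True)
```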

@wntg

wntg commented Dec 12, 2023

Hey @sanchit-gandhi! If I want to train on another language such as Chinese, how much data do I need to prepare?

@sanchit-gandhi
Collaborator

sanchit-gandhi commented Dec 12, 2023

Hey @wntg - there's some detailed information about the amount of data you need for each training method at the end of this README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

@shuaijiang
Author

shuaijiang commented Dec 14, 2023

> If I want to train on another language such as Chinese, how much data do I need to prepare?

In my experiments, 1000~2000 hours of high-quality Chinese speech data improves things a lot, roughly taking the CER from 20 down to 10.
10,000 hours of speech data also helps, perhaps from a CER of 10 down to 5.
Additionally, fine-tuning all parameters seems to work better than LoRA; you can refer to https://github.com/shuaijiang/Whisper-Finetune/blob/master/finetune_all.py
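
Not the script from the linked repo, just a minimal sketch of the difference between full-parameter fine-tuning and a LoRA setup (the LoRA hyperparameters here are illustrative):

```python
# Minimal sketch: full fine-tuning vs. LoRA adapters for Whisper.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Full fine-tuning: all parameters receive gradients (this is the default).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"full fine-tuning trains {trainable / 1e6:.0f}M parameters")

# LoRA alternative: freeze the base model and train low-rank adapters only.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()  # a small fraction of the full model
```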

@shuaijiang
Author

> Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset
>
> So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!

Thanks, I will try it.

@xingchensong

wenet enables (full-parameter) fine-tuning of the whisper-large model in approximately 10 hours on the aishell-1 dataset, using 40 epochs and 8 × RTX 3090 GPUs.

For more information, refer to the aishell-1 recipe available at https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/whisper

I believe that using wenet will simplify the creation of local speech datasets.

Furthermore, it is significantly easier to make Whisper streaming-capable by fine-tuning it under wenet's U2++ framework. Simply treat Whisper as a large transformer model and leverage all of wenet's existing functionality (such as chunk_mask, the CTC-AED hybrid loss, and so on). Please see wenet-e2e/wenet#2141 for more details.

@sanchit-gandhi
Collaborator

Definitely more data will help here! I left some recommendations in the README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

Really cool to see that you've been working on Chinese - excited to see the model you train 🚀 Let me know how you get on @shuaijiang!
