
Can distil-whisper load a local speech dataset? #50

Open
shuaijiang opened this issue Dec 5, 2023 · 7 comments

Comments

@shuaijiang

shuaijiang commented Dec 5, 2023

distil-whisper loads datasets such as common_voice that can be accessed on the Hugging Face Hub, but loading a private, local speech dataset is not supported.

I implemented one method to load a local speech dataset (from a JSON file). It works, though it isn't perfect:
https://github.com/shuaijiang/distil-whisper/blob/main/training/run_distillation_local_datasets.py
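
For reference, here is a minimal sketch of the same idea using only the 🤗 Datasets JSON loader (an illustration, not the linked script); the manifest path and field names below are assumptions:

```python
# Minimal sketch: load a local JSON-lines manifest with 🤗 Datasets.
# Assumed manifest format (one object per line):
#   {"audio": "/data/wavs/utt_0001.wav", "text": "transcript"}
from datasets import load_dataset, Audio

raw = load_dataset("json", data_files={"train": "manifests/train.jsonl"})

# Cast the path column to the Audio feature so files are decoded
# (and resampled to 16 kHz) lazily when each example is accessed.
raw = raw.cast_column("audio", Audio(sampling_rate=16000))

print(raw["train"][0]["audio"]["array"].shape)
```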

@sanchit-gandhi
Collaborator

Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset

So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!
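
To make that concrete, a minimal sketch following the audio_dataset guide (the file names and Hub repo id below are hypothetical):

```python
# Minimal sketch: build a Hugging Face Dataset from in-memory Python lists,
# then attach the Audio feature as described in the audio_dataset guide.
from datasets import Dataset, Audio

# Hypothetical local files and transcripts collected from a private corpus.
audio_paths = ["clips/utt_0001.wav", "clips/utt_0002.wav"]
transcripts = ["第一句话的文本", "第二句话的文本"]

ds = Dataset.from_dict({"audio": audio_paths, "text": transcripts})
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Optionally push it to the Hub as a private repo so training scripts
# can load it by name with load_dataset().
# ds.push_to_hub("your-username/private-asr-corpus", private=True)
```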

@wntg

wntg commented Dec 12, 2023

Hey @sanchit-gandhi! If I want to train on another language such as Chinese, how much data do I need to prepare?

@sanchit-gandhi
Collaborator

sanchit-gandhi commented Dec 12, 2023

Hey @wntg - there's some detailed information about the amount of data you need for each training method at the end of this README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

@shuaijiang
Author

shuaijiang commented Dec 14, 2023

> If I want to train on another language such as Chinese, how much data do I need to prepare?

In my experiments, 1000~2000 hours of high-quality Chinese speech data improves things a lot, roughly taking the CER from 20 down to 10.
10,000 hours of speech data also helps, perhaps from a CER of 10 down to 5.
Additionally, fine-tuning all parameters seems to work better than LoRA; you can refer to https://github.com/shuaijiang/Whisper-Finetune/blob/master/finetune_all.py
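
Not the script from the linked repo, just a minimal sketch of the difference between full-parameter fine-tuning and a LoRA setup (the LoRA hyperparameters here are illustrative):

```python
# Minimal sketch: full fine-tuning vs. LoRA adapters for Whisper.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Full fine-tuning: all parameters receive gradients (this is the default).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"full fine-tuning trains {trainable / 1e6:.0f}M parameters")

# LoRA alternative: freeze the base model and train low-rank adapters only.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()  # a small fraction of the full model
```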

@shuaijiang
Author

> Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset
>
> So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!

Thanks, I will try it.

@xingchensong

wenet enables (full-parameter) fine-tuning of the whisper-large model in approximately 10 hours on the aishell-1 dataset, using 40 epochs and 8 × RTX 3090 GPUs.

For more information, refer to the aishell-1 recipe available at https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/whisper

I believe that using wenet will simplify the creation of local speech datasets.

Furthermore, it is significantly easier to make Whisper streaming-capable by fine-tuning it under wenet's U2++ framework. Simply treat Whisper as a large transformer model and leverage all of wenet's existing functionality (such as chunk_mask, the CTC-AED hybrid loss, and so on). Please see wenet-e2e/wenet#2141 for more details.

@sanchit-gandhi
Collaborator

Definitely more data will help here! I left some recommendations in the README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods

Really cool to see that you've been working on Chinese - excited to see the model you train 🚀 Let me know how you get on @shuaijiang!
