Could distil-whisper load local speech dataset? #50
Comments
Hey @shuaijiang! For training Distil-Whisper, you can convert any custom dataset to Hugging Face Datasets' format using this guide: https://huggingface.co/docs/datasets/audio_dataset. So long as you can load your dataset as a Python object, you can convert it to Hugging Face Datasets!
Hey @sanchit-gandhi! If I want to train on another language, such as Chinese, how much data do I need to prepare?
Hey @wntg - there's some detailed information about the amount of data you need for each training method at the end of this README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods
In my experiments, 1000~2000 hours of high-quality Chinese speech data improved results a lot, roughly reducing CER from 20 to 10.
Thanks, I will try it.
wenet enables (full-parameter) fine-tuning of the whisper-large model in approximately 10 hours on the aishell-1 dataset, with 40 epochs and 8 * 3090 GPUs. For more information, refer to the aishell-1 recipe available at https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/whisper. I believe that using wenet will simplify the creation of local speech datasets. Furthermore, it is significantly easier to make Whisper streaming-capable by fine-tuning it under wenet's U2++ framework: simply treat Whisper as a large transformer model and leverage all existing wenet functionality (such as chunk_mask, the hybrid CTC-AED loss, and so on). Please see wenet-e2e/wenet#2141 for more details.
Definitely more data will help here! I left some recommendations in the README: https://github.com/huggingface/distil-whisper/tree/main/training#overview-of-training-methods Really cool to see that you've been working on Chinese - excited to see the model you train 🚀 Let me know how you get on @shuaijiang!
distil-whisper can load datasets such as common_voice that are hosted on the Hugging Face Hub, but loading a private, local speech dataset is not supported out of the box.
I implemented a method to load a local speech dataset (from a JSON file). It works, though it is not perfect:
https://github.com/shuaijiang/distil-whisper/blob/main/training/run_distillation_local_datasets.py