
Adding a Wav2Vec2ForSpeechClassification class #12730

Closed
ehcalabres opened this issue Jul 15, 2021 · 2 comments
ehcalabres commented Jul 15, 2021

Adding a Wav2Vec2ForSpeechClassification class 🚀

Right now, using any of the Wav2Vec 2.0 models available on the 🤗 hub for fine-tuning on a speech classification task implies creating a new class that inherits its behaviour from the Wav2Vec2PreTrainedModel class. Although creating these types of models can be done with a bit of research, I find it too complicated to simply use a fine-tuned model shared on the 🤗 hub, because you need access to the code of the model class in order to instantiate it and retrieve the weights with the from_pretrained() method (and that code may or may not be available at the time).

I think that adding a class like Wav2Vec2ForSpeechClassification to the 🤗 transformers library (i.e. working the same way as BertForSequenceClassification and similar models) would be a very nice feature: it would not only make it possible to fine-tune Wav2Vec 2.0 for classification tasks out of the box, but would also simplify and accelerate the way one can use a shared model.
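For reference, the class I have in mind follows the same pattern as the existing *ForSequenceClassification models: a speech encoder followed by a pooled classification head. Below is a minimal, self-contained PyTorch sketch of that pattern; the real Wav2Vec2 encoder is replaced by a dummy convolutional module so the example runs standalone, and all names here are illustrative assumptions rather than the final API:

```python
import torch
import torch.nn as nn

class DummySpeechEncoder(nn.Module):
    """Stand-in for the Wav2Vec2 feature encoder + transformer stack."""
    def __init__(self, hidden_size=32):
        super().__init__()
        # A strided conv mimics Wav2Vec2's waveform downsampling.
        self.conv = nn.Conv1d(1, hidden_size, kernel_size=10, stride=5)

    def forward(self, waveform):                 # (batch, samples)
        x = self.conv(waveform.unsqueeze(1))     # (batch, hidden, frames)
        return x.transpose(1, 2)                 # (batch, frames, hidden)

class SpeechClassificationSketch(nn.Module):
    """Encoder + mean-pooled linear head, as in *ForSequenceClassification models."""
    def __init__(self, hidden_size=32, num_labels=4):
        super().__init__()
        self.encoder = DummySpeechEncoder(hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, waveform):
        hidden = self.encoder(waveform)          # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)              # pool over the time axis
        return self.classifier(pooled)           # (batch, num_labels)

model = SpeechClassificationSketch()
logits = model(torch.randn(2, 16000))            # two 1-second 16 kHz clips
print(logits.shape)                              # torch.Size([2, 4])
```

The mean-pooling-then-linear head is just one common choice; the real implementation could equally use an attention-based or projection head on top of the encoder's hidden states.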

Motivation

Speech has always been a fascinating field of research, both in the way a user interacts with a physical system and vice versa. Taking this into account, and with the great news of having the new Wav2Vec 2.0 model integrated into the 🤗 transformers library 🎉, I started a research project on Speech Emotion Recognition (SER) with the idea of fine-tuning a Wav2Vec 2.0 model on this type of emotional dataset. The results I've obtained are very promising and the model seems to work extremely well, so I decided to put the fine-tuned model on the 🤗 hub (wip). Additionally, I saw a topic on the 🤗 discussion forums about this same SER task, with its corresponding model on the 🤗 hub, which has the same issue when importing it.

With all this, I think the number of use cases of the Wav2Vec2 model for speech classification tasks is huge, and having a feature like this implemented would greatly simplify the way other developers and researchers can work with this type of pretrained model.

Your contribution

I can start working on a new PR to address this by implementing the Wav2Vec2ForSpeechClassification class mentioned above. I already have the code working, and in fact it's pretty similar to the other NLP models that include the SequenceClassification feature.

The idea behind this is to have a much more simplified and generalized way to use and train these models, with the end result being this snippet for straightforward use:

from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSpeechClassification

processor = Wav2Vec2FeatureExtractor.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
model = Wav2Vec2ForSpeechClassification.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
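As a side note on what the feature extractor in this snippet does: for the XLSR-style checkpoints, Wav2Vec2FeatureExtractor is configured with do_normalize=True, so it essentially applies zero-mean, unit-variance normalization to the raw waveform before the model sees it. A minimal sketch of that step (normalize_waveform is a hypothetical helper, not a transformers function):

```python
import numpy as np

def normalize_waveform(x: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization of a raw waveform.

    A small epsilon guards against division by zero on silent
    (all-zero) clips, mirroring what the feature extractor does.
    """
    return (x - x.mean()) / np.sqrt(x.var() + 1e-7)

wave = 0.1 * np.random.randn(16000).astype(np.float32)  # fake 1 s clip at 16 kHz
norm = normalize_waveform(wave)
# After normalization the clip has ~zero mean and ~unit variance.
```

This is only the preprocessing half; the processor also handles batching and padding, which the sketch leaves out.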

Let me know if this feature fits the needs of the library in terms of simplicity and integration, and I will start a new PR with these changes. Also let me know whether it is useful and covers an adequate number of use cases, making it worth implementing.

Thank you all for your amazing work 🤗

patrickvonplaten (Contributor) commented:

Hey @ehcalabres,

I'm only seeing your issue now, sadly :-/ Super sorry not to have answered sooner. @anton-l is working on official Wav2Vec2- and HubertForSequenceClassification models at the moment in #13153, which should serve your needs :-)

It would be great if you could take a look at #13153 to see whether this design/architecture fits your needs.

ehcalabres (Author) commented:

Hey @patrickvonplaten, @anton-l,

Thanks a lot for your answer! From what I can see in issue #13153, it's pretty much the same as what I was proposing here, so I think it'll do the job for this kind of audio classification task. I'll try it when it comes out, but it looks fine for the moment. Great!

Just one thing: I've worked mostly in PyTorch, but while checking the code I've seen that there's no TensorFlow version of these models (neither for Hubert nor for Wav2Vec2). Do you think it's relevant to implement them? If so, maybe I can help with that, but I don't know if it's something critical.

Anyway, is there anything else I can do to help you with this? Just let me know.

Thanks again!
