
Increasing model coverage of Tensor MP and Inference Engine. #1248

Closed
hyunwoongko opened this issue Jul 23, 2021 · 19 comments
Comments

@hyunwoongko
Contributor

hyunwoongko commented Jul 23, 2021

@RezaYazdaniAminabadi
Let's continue the discussion we had in the offline chat here.

First of all, I want to refactor the module_injection.py and inference/engine.py code. The current code has too many nested functions, with all of the logic contained in a single function, so it is a bit hard to follow. What do you think?

hyunwoongko changed the title from "Increasing model scalability of Tensor MP and Inference Engine." to "Increasing model coverage of Tensor MP and Inference Engine." on Jul 23, 2021
@hyunwoongko
Contributor Author

hyunwoongko commented Aug 2, 2021

@stas00 I have confirmed that training can be implemented without the CUDA kernel. parallelformers can train almost all models with tensor parallelism. However, I need a little more time to support the CUDA kernel. Which would you prefer?

  1. Implement tensor parallel training & inference without the CUDA kernel now.
  2. Implement tensor parallel training & inference with the CUDA kernel later.

See https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/mappings.py#L76 and https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/mappings.py#L92, which define the backward functions (sketched below).
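
For reference, here is a rough sketch of the pattern behind those two mappings: one function is an identity in the forward pass and an all-reduce in the backward pass, and the other is the reverse. This is what makes TP trainable in pure Python; the tensor-parallel process-group argument is omitted for brevity.

```python
import torch
import torch.distributed as dist


class _CopyToModelParallelRegion(torch.autograd.Function):
    """Identity in the forward pass, all-reduce of the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, input_):
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        # sum gradients coming back from all tensor-parallel ranks
        dist.all_reduce(grad_output)
        return grad_output


class _ReduceFromModelParallelRegion(torch.autograd.Function):
    """All-reduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, input_):
        # sum the partial outputs produced by each tensor-parallel rank
        dist.all_reduce(input_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```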

@stas00
Contributor

stas00 commented Aug 3, 2021

I'd say that since you already have it implemented, let's make it available to users now.

We just need to make sure that once CUDA kernels become available, the user has little to no change in how they deploy the models: ideally just a flag to choose python vs. cuda, with all the requirements changes sorted out behind the scenes.

So if we have a pure-python class, we would just swap in a CUDA version of the same. We will also need a mechanism to build the extensions, for example in DeepSpeed or some other separate repo/package, so that transformers remains pure python. A sketch of that idea follows.
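
As a purely hypothetical sketch of that flag idea (none of these names exist in transformers or DeepSpeed, they are only illustrative), the user-facing entry point would stay the same while the backend is chosen behind the scenes:

```python
class PythonTensorParallel:
    """Pure-Python tensor-parallel wrapper: always available, no compiled extensions."""

    def __init__(self, model, tp_size):
        self.model = model
        self.tp_size = tp_size


def parallelize(model, tp_size, backend="python"):
    """Single entry point; `backend` selects pure Python vs. fused CUDA kernels."""
    if backend == "cuda":
        try:
            # hypothetical compiled extension built outside of transformers
            from ds_tp_kernels import CudaTensorParallel
            return CudaTensorParallel(model, tp_size)
        except ImportError:
            pass  # extension not installed: fall back to the pure-Python path
    return PythonTensorParallel(model, tp_size)
```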

Please let me know if you have any questions.

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 4, 2021

@stas00 Ok. I will proceed with the two tasks in parallel.

First, on the HF transformers side, I will implement the tensor parallel feature without CUDA. At the same time, on the DeepSpeed side, I will refactor the code and find a way to use the CUDA kernels. By keeping the interfaces on both sides the same, HF transformers can automatically use the DeepSpeed code once the CUDA kernels become available.

Thanks.

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 4, 2021

@RezaYazdaniAminabadi
You haven't answered yet (it's OK, I think you are busy right now 😂), but I'll go ahead with some refactoring. It will be easier to support more models with cleaner code.

@stas00
Contributor

stas00 commented Aug 4, 2021

Sounds like a great plan to me, @hyunwoongko! Thank you!

If you need me please don't hesitate to tag me.

@RezaYazdaniAminabadi
Contributor

Hi @hyunwoongko

Sorry for the late reply, I was OOF for the last couple of days. I agree with you that the API needs some refactoring to help make it more readable. Please go ahead and submit a PR with your current changes and I will review and add changes as needed.
Thanks a lot.

Btw, most of the BERT- and GPT-like models can easily benefit from the kernels to run faster inference. We can add that support orthogonally to the model parallelization.

Best,
Reza

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 24, 2021

Currently, our team is developing a large-scale model for an Asian language, and most of my time is going into this work, so I haven't been able to invest much time in this collaboration until now. Sorry. I plan to resume work this weekend.

@RezaYazdaniAminabadi Have you considered having your own implementation of an MPU and tensor MP inside DeepSpeed? I want to implement an MPU and tensor MP similar to those used in Megatron-LM. That would also include training, so it would no longer be limited to DeepSpeed-Inference. How about making both inference and training possible?

I think there is some misalignment between the two of you.

@stas00 wants a training feature based on tensor MP.
@RezaYazdaniAminabadi wants to improve DeepSpeed-Inference.
How would you like to proceed?

@stas00
Contributor

stas00 commented Aug 25, 2021

A small correction: HF transformers wants both training and inference. It's just that we have a lot more users using the library for training. So there is definitely no misalignment between the two.

Remember that DeepSpeed already has TP and PP, so they are just missing inference. (correction: no TP!)

HF Transformers doesn't have those yet, hence the difference.

@hyunwoongko
Contributor Author

@stas00 Where is DeepSpeed's TP? Do you mean the Megatron-LM integration?
I don't know where the TP is, because I haven't tried all of DeepSpeed's features.

@stas00
Contributor

stas00 commented Aug 25, 2021

You know what - you're absolutely correct, @hyunwoongko!

DeepSpeed doesn't have it! I have just never tried to use it, so I assumed that since many DeepSpeed docs talk about 3D it included tensor parallelism, but it doesn't! I just found this comment by Shaden:

Tensor slicing is a model-specific strategy, which is why DeepSpeed supports, but does not provide it.

(source: #673 (comment))

Thank you!

I will update this doc to remove DeepSpeed, since it's incorrect: https://huggingface.co/transformers/parallelism.html#tensor-parallelism
See huggingface/transformers#13248

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 25, 2021

@stas00 That's why we should use Megatron-LM for 3D parallelization (nowadays I am working with Megatron-LM every day, like you 😅). But to support all kinds of parallelization for Hugging Face transformers, we need TP and an MPU like Megatron's, but without Megatron itself, because Megatron is too complex and I don't think it's good design for covering the various model architectures in HF transformers. In any case, we need to implement TP and an MPU somewhere (DS or HF).

I can implement TP and an MPU on the Hugging Face side. I have already tested the training feature of parallelformers and it works, but an MPU and a TP training feature are still needed. However, you told me you want to integrate through DeepSpeed.

That's why I said we should align on this: it is not clear whether the DeepSpeed team also wants to implement a TP training feature and an MPU inside DeepSpeed. Currently the DeepSpeed team supports inference-only TP.
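
To make the MPU part concrete, here is a rough, minimal sketch of what such a module would provide, in the spirit of megatron/mpu but independent of Megatron. The group layout and function names are only illustrative:

```python
import torch.distributed as dist

_TENSOR_MODEL_PARALLEL_GROUP = None


def initialize_model_parallel(tp_size):
    """Split the world into contiguous tensor-parallel groups of size tp_size."""
    global _TENSOR_MODEL_PARALLEL_GROUP
    world_size, rank = dist.get_world_size(), dist.get_rank()
    assert world_size % tp_size == 0
    for start in range(0, world_size, tp_size):
        ranks = list(range(start, start + tp_size))
        group = dist.new_group(ranks)  # every rank must call new_group for every group
        if rank in ranks:
            _TENSOR_MODEL_PARALLEL_GROUP = group


def get_tensor_model_parallel_group():
    return _TENSOR_MODEL_PARALLEL_GROUP


def get_tensor_model_parallel_rank():
    return dist.get_rank(group=_TENSOR_MODEL_PARALLEL_GROUP)


def get_tensor_model_parallel_world_size():
    return dist.get_world_size(group=_TENSOR_MODEL_PARALLEL_GROUP)
```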

@stas00
Contributor

stas00 commented Aug 25, 2021

> I can implement TP and an MPU on the Hugging Face side. I have already tested the training feature of parallelformers and it works, but an MPU and a TP training feature are still needed. However, you told me you want to integrate through DeepSpeed.

I think we have a miscommunication here. I was trying to communicate that, in the process of implementing a pure-Python TP in HF transformers, it would be great to keep DeepSpeed's CUDA kernels in mind so that we could swap those in when they become available. I have not suggested implementing TP through DeepSpeed. We want HF transformers to be independent for the core 3D parallelism functionality, and optionally sped up by various extensions.

@hyunwoongko, would it help for us to have a call so that we could flesh out the details and hopefully gain more clarity on this? I remember you were concerned about language issues. Since my Korean consists of one word, I wonder if @jaketae would be interested in participating and helping with translation. I'm in PST.

If we do that, perhaps we should invite @siddk as well, so that we can create a shared plan to integrate your and his projects, since you are both working on overlapping solutions.

@jaketae

jaketae commented Aug 25, 2021

I'd be more than happy to assist with translation, though I doubt @hyunwoongko will have issues. Plus I know you're secretly hiding your Korean skills, @stas00.

Shall we coordinate a date and time for further discussion and proceed from there? Perhaps a when2meet?

@hyunwoongko
Contributor Author

I think it would be better to discuss this on the Hugging Face side; this is the DeepSpeed repo.

@stas00
Contributor

stas00 commented Aug 25, 2021

No secret Korean yet. I'm good with languages, but I find Asian ones too hard for me, since my native tongue isn't tonal.

I'm arranging a Slack channel for everyone involved, including Reza, and we can then sort things out, including a meeting. Everyone should be added/invited in the next day or two, once I find an admin who is not on vacation.

@hyunwoongko
Contributor Author

Good. Could you invite me to the Slack channel?

@RezaYazdaniAminabadi
Contributor

@hyunwoongko, it is definitely exciting to bring tensor slicing into DeepSpeed for both training and inference. Looking forward to talking with you all :)

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 25, 2021

@RezaYazdaniAminabadi Great. First, is it OK to bring tensor MP and an MPU without fused kernels into DeepSpeed? I hope DeepSpeed can support tensor MP without Megatron-LM. Let's implement the pure Python version first, and then we can discuss applying kernel fusion.
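
As a rough illustration of what the pure Python version would mean in practice (simplified: no bias handling and no process-group plumbing), the column- and row-parallel linear layers need nothing beyond torch.nn and a collective:

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class _AllReduce(torch.autograd.Function):
    """All-reduce in forward, identity in backward (same pattern as the mappings sketch above)."""

    @staticmethod
    def forward(ctx, input_):
        dist.all_reduce(input_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output dimension; no fused kernels required."""

    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        assert out_features % tp_size == 0
        self.linear = nn.Linear(in_features, out_features // tp_size, bias=False)

    def forward(self, x):
        # produces this rank's slice of the output columns
        return self.linear(x)


class RowParallelLinear(nn.Module):
    """Each rank holds a slice of the input dimension; partial outputs are summed."""

    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        assert in_features % tp_size == 0
        self.linear = nn.Linear(in_features // tp_size, out_features, bias=False)

    def forward(self, x):
        # x is assumed to be this rank's slice of the input dimension,
        # e.g. the output of a ColumnParallelLinear
        return _AllReduce.apply(self.linear(x))
```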

@stas00
Contributor

stas00 commented Aug 25, 2021

> Good. Could you invite me to the Slack channel?

The invite was sent to the naver.com email address you emailed me from. Please email me if you haven't received it.
