
Increasing model coverage of Tensor MP and Inference Engine. #1248

Closed
hyunwoongko opened this issue Jul 23, 2021 · 19 comments
Comments

@hyunwoongko
Contributor

hyunwoongko commented Jul 23, 2021

@RezaYazdaniAminabadi
Let's continue the discussion we had in the offline chat here.

First of all, I want to refactor the module_injection.py and inference/engine.py code. The current code has too many nested functions, with all of the logic contained in a single function, so it is a bit hard to follow. What do you think?

hyunwoongko changed the title from "Increasing model scalability of Tensor MP and Inference Engine." to "Increasing model coverage of Tensor MP and Inference Engine." on Jul 23, 2021
@hyunwoongko
Contributor Author

hyunwoongko commented Aug 2, 2021

@stas00 I have confirmed that training can be implemented without the CUDA kernel. parallelformers can train almost all models with tensor parallelism. However, I need a little more time to support the CUDA kernel. Which would you prefer?

  1. Implement tensor parallel training & inference without the CUDA kernel now.
  2. Implement tensor parallel training & inference with the CUDA kernel later.

See https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/mappings.py#L76 and https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/mappings.py#L92, which define the backward functions (sketched below).
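
For reference, here is a rough sketch of the pattern behind those two mappings: one function is an identity in the forward pass and an all-reduce in the backward pass, and the other is the reverse. This is what makes TP trainable in pure Python; the tensor-parallel process-group argument is omitted for brevity.

```python
import torch
import torch.distributed as dist


class _CopyToModelParallelRegion(torch.autograd.Function):
    """Identity in the forward pass, all-reduce of the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, input_):
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        # sum gradients coming back from all tensor-parallel ranks
        dist.all_reduce(grad_output)
        return grad_output


class _ReduceFromModelParallelRegion(torch.autograd.Function):
    """All-reduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, input_):
        # sum the partial outputs produced by each tensor-parallel rank
        dist.all_reduce(input_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```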

@stas00
Contributor

stas00 commented Aug 3, 2021

I'd say that since you already have it implemented, let's make it available to users now.

We just need to make sure that once CUDA kernels become available, the user has little to no change in how they deploy the models: ideally just a flag to choose python vs. cuda, with all the requirements changes sorted out behind the scenes.

So if we have a pure-python class, we would just swap in a CUDA version of the same. We will also need a mechanism to build the extensions, for example in DeepSpeed or some other separate repo/package, so that transformers remains pure python. A sketch of that idea follows.
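
As a purely hypothetical sketch of that flag idea (none of these names exist in transformers or DeepSpeed, they are only illustrative), the user-facing entry point would stay the same while the backend is chosen behind the scenes:

```python
class PythonTensorParallel:
    """Pure-Python tensor-parallel wrapper: always available, no compiled extensions."""

    def __init__(self, model, tp_size):
        self.model = model
        self.tp_size = tp_size


def parallelize(model, tp_size, backend="python"):
    """Single entry point; `backend` selects pure Python vs. fused CUDA kernels."""
    if backend == "cuda":
        try:
            # hypothetical compiled extension built outside of transformers
            from ds_tp_kernels import CudaTensorParallel
            return CudaTensorParallel(model, tp_size)
        except ImportError:
            pass  # extension not installed: fall back to the pure-Python path
    return PythonTensorParallel(model, tp_size)
```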

Please let me know if you have any questions.

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 4, 2021

@stas00 Ok. I will proceed with the two tasks in parallel.

First, on the HF transformers side, I will implement the tensor parallel feature without CUDA. At the same time, on the DeepSpeed side, I will refactor the code and find a way to use the CUDA kernels. By keeping the interfaces on both sides the same, HF transformers can automatically use the DeepSpeed code once the CUDA kernels become available.

Thanks.

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 4, 2021

@RezaYazdaniAminabadi
You haven't answered yet (it's OK, I think you are busy right now 😂), but I'll go ahead with some refactoring. It will be easier to support more models with cleaner code.

@stas00
Contributor

stas00 commented Aug 4, 2021

Sounds like a great plan to me, @hyunwoongko! Thank you!

If you need me please don't hesitate to tag me.

@RezaYazdaniAminabadi
Contributor

Hi @hyunwoongko

Sorry for the late reply, I was OOF for the last couple of days. I agree with you that the API needs some refactoring to help make it more readable. Please go ahead and submit a PR with your current changes and I will review and add changes as needed.
Thanks a lot.

Btw, most of the BERT- and GPT-like models can easily benefit from the kernels to run faster inference. We can add that support orthogonally to the model parallelization.

Best,
Reza

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 24, 2021

Currently, our team is developing a large-scale model for an Asian language, and most of my time is going into this work, so I haven't been able to invest much time in this collaboration until now. Sorry. I plan to resume work this weekend.

@RezaYazdaniAminabadi Have you considered having your own implementation of an MPU and tensor MP inside DeepSpeed? I want to implement an MPU and tensor MP similar to those used in Megatron-LM. That would also include training, so it would no longer be limited to DeepSpeed-Inference. How about making both inference and training possible?

I think there is some misalignment between the two of you.

@stas00 wants a training feature based on tensor MP.
@RezaYazdaniAminabadi wants to improve DeepSpeed-Inference.
How would you like to proceed?

@stas00
Contributor

stas00 commented Aug 25, 2021

A small correction: HF transformers wants both training and inference. It's just that we have a lot more users using the library for training. So there is definitely no misalignment between the two.

Remember that DeepSpeed already has TP and PP, so they are just missing inference. (correction: no TP!)

HF Transformers doesn't have those yet, hence the difference.

@hyunwoongko
Contributor Author

@stas00 Where is DeepSpeed's TP? Do you mean the Megatron-LM integration?
I don't know where the TP is, because I haven't tried all of DeepSpeed's features.

@stas00
Contributor

stas00 commented Aug 25, 2021

You know what - you're absolutely correct, @hyunwoongko!

DeepSpeed doesn't have it! I have just never tried to use it, so I assumed that since many DeepSpeed docs talk about 3D it included tensor parallelism, but it doesn't! I just found this comment by Shaden:

Tensor slicing is a model-specific strategy, which is why DeepSpeed supports, but does not provide it.

(source: #673 (comment))

Thank you!

I will update this doc to remove DeepSpeed, since it's incorrect: https://huggingface.co/transformers/parallelism.html#tensor-parallelism
See huggingface/transformers#13248

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 25, 2021

@stas00 That's why we should use Megatron-LM for 3D parallelization (nowadays I am working with Megatron-LM every day, like you 😅). But to support all kinds of parallelization for Hugging Face transformers, we need TP and an MPU like Megatron's, but without Megatron itself, because Megatron is too complex and I don't think it's good design for covering the various model architectures in HF transformers. In any case, we need to implement TP and an MPU somewhere (DS or HF).

I can implement TP and an MPU on the Hugging Face side. I have already tested the training feature of parallelformers and it works, but an MPU and a TP training feature are still needed. However, you told me you want to integrate through DeepSpeed.

That's why I said we should align on this: it is not clear whether the DeepSpeed team also wants to implement a TP training feature and an MPU inside DeepSpeed. Currently the DeepSpeed team supports inference-only TP.
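
To make the MPU part concrete, here is a rough, minimal sketch of what such a module would provide, in the spirit of megatron/mpu but independent of Megatron. The group layout and function names are only illustrative:

```python
import torch.distributed as dist

_TENSOR_MODEL_PARALLEL_GROUP = None


def initialize_model_parallel(tp_size):
    """Split the world into contiguous tensor-parallel groups of size tp_size."""
    global _TENSOR_MODEL_PARALLEL_GROUP
    world_size, rank = dist.get_world_size(), dist.get_rank()
    assert world_size % tp_size == 0
    for start in range(0, world_size, tp_size):
        ranks = list(range(start, start + tp_size))
        group = dist.new_group(ranks)  # every rank must call new_group for every group
        if rank in ranks:
            _TENSOR_MODEL_PARALLEL_GROUP = group


def get_tensor_model_parallel_group():
    return _TENSOR_MODEL_PARALLEL_GROUP


def get_tensor_model_parallel_rank():
    return dist.get_rank(group=_TENSOR_MODEL_PARALLEL_GROUP)


def get_tensor_model_parallel_world_size():
    return dist.get_world_size(group=_TENSOR_MODEL_PARALLEL_GROUP)
```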

@stas00
Contributor

stas00 commented Aug 25, 2021

> I can implement TP and an MPU on the Hugging Face side. I have already tested the training feature of parallelformers and it works, but an MPU and a TP training feature are still needed. However, you told me you want to integrate through DeepSpeed.

I think we have a miscommunication here. I was trying to communicate that, in the process of implementing a pure-Python TP in HF transformers, it would be great to keep DeepSpeed's CUDA kernels in mind so that we could swap those in when they become available. I have not suggested implementing TP through DeepSpeed. We want HF transformers to be independent for the core 3D parallelism functionality, and optionally sped up by various extensions.

@hyunwoongko, would it help for us to have a call so that we could flesh out the details and hopefully gain more clarity on this? I remember you were concerned about language issues. Since my Korean consists of one word, I wonder if @jaketae would be interested in participating and helping with translation. I'm in PST.

If we do that, perhaps we should invite @siddk as well, so that we can create a shared plan to integrate your and his projects, since you are both working on overlapping solutions.

@jaketae

jaketae commented Aug 25, 2021

I'd be more than happy to assist with translation, though I doubt @hyunwoongko will have issues. Plus I know you're secretly hiding your Korean skills, @stas00.

Shall we coordinate a date and time for further discussion and proceed from there? Perhaps a when2meet?

@hyunwoongko
Contributor Author

I think it would be better to discuss this on the Hugging Face side; this is the DeepSpeed repo.

@stas00
Contributor

stas00 commented Aug 25, 2021

No secret Korean yet. I'm good with languages, but I find Asian ones too hard for me, since my native tongue isn't tonal.

I'm arranging a Slack channel for everyone involved, including Reza, and we can then sort things out, including a meeting. Everyone should be added/invited in the next day or two, once I find an admin who is not on vacation.

@hyunwoongko
Contributor Author

Good. Could you invite me to the Slack channel?

@RezaYazdaniAminabadi
Contributor

@hyunwoongko, it is definitely exciting to bring tensor slicing into DeepSpeed for both training and inference. Looking forward to talking with you all :)

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 25, 2021

@RezaYazdaniAminabadi Great. First, is it OK to bring tensor MP and an MPU without fused kernels into DeepSpeed? I hope DeepSpeed can support tensor MP without Megatron-LM. Let's implement the pure Python version first, and then we can discuss applying kernel fusion.
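
As a rough illustration of what the pure Python version would mean in practice (simplified: no bias handling and no process-group plumbing), the column- and row-parallel linear layers need nothing beyond torch.nn and a collective:

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class _AllReduce(torch.autograd.Function):
    """All-reduce in forward, identity in backward (same pattern as the mappings sketch above)."""

    @staticmethod
    def forward(ctx, input_):
        dist.all_reduce(input_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output dimension; no fused kernels required."""

    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        assert out_features % tp_size == 0
        self.linear = nn.Linear(in_features, out_features // tp_size, bias=False)

    def forward(self, x):
        # produces this rank's slice of the output columns
        return self.linear(x)


class RowParallelLinear(nn.Module):
    """Each rank holds a slice of the input dimension; partial outputs are summed."""

    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        assert in_features % tp_size == 0
        self.linear = nn.Linear(in_features // tp_size, out_features, bias=False)

    def forward(self, x):
        # x is assumed to be this rank's slice of the input dimension,
        # e.g. the output of a ColumnParallelLinear
        return _AllReduce.apply(self.linear(x))
```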

@stas00
Contributor

stas00 commented Aug 25, 2021

> Good. Could you invite me to the Slack channel?

The invite was sent to the naver.com email address you emailed me from. Please email me if you haven't received it.
