[Core] Pipeline Parallel Support #4412
base: main
Conversation
@andoorve - Exciting!!!
@andoorve thanks for the effort! Can you write an RFC to describe the overall design so that people can easily understand it? Example RFCs: https://github.com/vllm-project/vllm/issues?q=label%3ARFC+sort%3Aupdated-desc
@youkaichao Yes for sure, it is one of the TODO items above.
vllm/worker/model_runner.py
Outdated
@@ -746,7 +763,8 @@ def execute_model(
         logits = self.model.compute_logits(hidden_states, sampling_metadata)

         # Only perform sampling in the driver worker.
-        if not self.is_driver_worker:
+        if (not (is_pipeline_model_parallel_last_rank()
so for tp, the first rank (driver) performs sampling, and for pp, the last rank (the last worker in the last pp's tp group) performs sampling, is this correct?
It's the first worker of the last PP's TP group
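To make the rank logic above concrete, here is a minimal sketch of the sampling gate being discussed. The helper name and flag are assumptions for illustration (mirroring is_pipeline_model_parallel_last_rank() and is_driver_worker from this PR); this is not the exact model_runner.py code.

def should_sample(is_driver_worker: bool, is_pp_last_rank: bool) -> bool:
    # TP alone: the driver (first) worker of the TP group performs sampling.
    # With PP: only the last stage holds the final hidden states, so sampling
    # happens on the driver worker of the last pipeline stage's TP group.
    return is_driver_worker and is_pp_last_rank

# Example: a middle pipeline stage never samples, even on its driver worker.
assert should_sample(is_driver_worker=True, is_pp_last_rank=False) is False
assert should_sample(is_driver_worker=True, is_pp_last_rank=True) is True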
Updated the RFC here: #4461 @youkaichao Let me know if anything needs further elaboration.
FYI, pretty sure PyTorch has a bug, filed here: pytorch/pytorch#125079. Worked around this last week by making the sending and receiving phase for each model atomic by concatenating residuals and hidden states.
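For illustration, a rough sketch of that workaround (assumed helper names, not the exact PR code): instead of two separate send/recv pairs for hidden states and residuals, each pipeline boundary becomes a single concatenated send matched by a single recv that is chunked apart on the other side.

import torch
import torch.distributed as dist

def send_hidden_and_residual(hidden_states: torch.Tensor,
                             residual: torch.Tensor, dst: int) -> None:
    # One atomic point-to-point transfer instead of two.
    combined = torch.cat([hidden_states, residual], dim=0)
    dist.send(combined, dst=dst)

def recv_hidden_and_residual(shape: torch.Size, dtype: torch.dtype,
                             device: torch.device, src: int):
    combined = torch.empty([shape[0] * 2] + list(shape[1:]),
                           dtype=dtype, device=device)
    dist.recv(combined, src=src)
    # Split back into (hidden_states, residual).
    return torch.chunk(combined, 2, dim=0)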
Sounds good @youkaichao, I can update mine once that's merged. Will you also include the change to create the multiple CPU TP groups or should I create a separate PR?
Yes, that's also in my plan. I will break #4460 down into small pieces to be merged, ETA this week.
Sounds good - I'll revert the PyNCCL changes on this PR and wait for that to be merged to add in.
Hey @andoorve - This is super exciting! I'm trying to run a simple example with:
- llm = LLM(model="facebook/opt-125m", load_format="dummy")
+ llm = LLM(model="facebook/opt-2.7b", pipeline_parallel_size=2, load_format="dummy")
This is the error I hit: error.txt. It seems like it's complaining that the kv_caches list index is out of range:
ERROR 05-01 20:45:18 worker_base.py:147] Traceback (most recent call last):
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/worker_base.py", line 139, in execute_method
ERROR 05-01 20:45:18 worker_base.py:147] return executor(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147] return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/worker.py", line 140, in determine_num_available_blocks
ERROR 05-01 20:45:18 worker_base.py:147] self.model_runner.profile_run()
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147] return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/model_runner.py", line 844, in profile_run
ERROR 05-01 20:45:18 worker_base.py:147] self.execute_model(seqs, kv_caches)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-01 20:45:18 worker_base.py:147] return func(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/worker/model_runner.py", line 763, in execute_model
ERROR 05-01 20:45:18 worker_base.py:147] hidden_states = model_executable(**execute_model_kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/model_executor/models/opt.py", line 300, in forward
ERROR 05-01 20:45:18 worker_base.py:147] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/model_executor/models/opt.py", line 275, in forward
ERROR 05-01 20:45:18 worker_base.py:147] return self.decoder(input_ids, positions, kv_caches, attn_metadata)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return self._call_impl(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm-pp-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR 05-01 20:45:18 worker_base.py:147] return forward_call(*args, **kwargs)
ERROR 05-01 20:45:18 worker_base.py:147] File "/workspace/vllm/vllm/model_executor/models/opt.py", line 249, in forward
ERROR 05-01 20:45:18 worker_base.py:147] hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
ERROR 05-01 20:45:18 worker_base.py:147] IndexError: list index out of range
I haven't dug into the code deeply enough, and am curious what the best way is to test and play around with it. If you can point me to a potential starting point, that would be awesome. Thanks!
Hey @GindaChen, there are a couple of things here: we haven't supported OPT yet, and the LLMEngine entry point won't work. We're only supporting AsyncLLMEngine right now.
The way I would recommend is to try the online serving entrypoint with the LLaMa model. That'd be the best way to start playing around with it.
LGTM - I guess one thing we can add is the PP PyNCCL group.
That's in my plan. Which operation do you need for PP? allreduce? gather? Or anything else?
We only need point-to-point: blocking send and blocking recv only. It's not critical though unless
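As an illustration of that requirement (a sketch using plain torch.distributed, not the PyNCCL wrapper itself): the only operations the PP path needs are a blocking send to the next stage and a blocking recv from the previous one, issued inside a dedicated pipeline-parallel process group.

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) was already called and pp_group was
# created with dist.new_group([...]) over the ranks of one pipeline.
def pp_send_next(tensor: torch.Tensor, next_rank: int, pp_group) -> None:
    dist.send(tensor, dst=next_rank, group=pp_group)

def pp_recv_prev(buffer: torch.Tensor, prev_rank: int, pp_group) -> None:
    dist.recv(buffer, src=prev_rank, group=pp_group)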
Hi @andoorve,
While benchmarking using your PR, I've consistently encountered engine timeouts with smaller models on setups far below total VRAM capacity, which might relate to the issues you've linked (e.g., [Bug]: Engine iteration timed out #4293, #4430, #4135). I'm using commit 9d698fa.
Setup and Reproduction:

python -m vllm.entrypoints.openai.api_server --model JackFram/llama-160m \
    --swap-space 16 \
    --disable-log-requests \
    --pipeline-parallel-size 2

python benchmarks/benchmark_serving.py --backend vllm --model JackFram/llama-160m \
    --dataset-name sharegpt \
    --dataset-path /workspace/sharegpt.json \
    --num-prompts 3

Observation:
Proposed Solution: I traced the issue to
Branch with fix: https://github.com/SolitaryThinker/vllm/tree/pipeline-parallel-fix
I noticed a new commit from you regarding the TP+PP fix, but it didn't resolve the issue in my environment. Could it be due to missing the latest pynccl changes with groups #4512? This is my first time handling vLLM and Ray, so any insights or corrections on my understanding or approach would be greatly appreciated.
Additional technical details: the done, _ = await asyncio.wait(requests_in_progress, return_when=asyncio.FIRST_COMPLETED) call still could have workers running when a new engine_step task for the VE is created. I'm not sure of the exact interaction that causes the hanging, but inserting a
Thanks
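For context, here is a simplified sketch of the loop pattern being described above (names like engine_step and num_virtual_engines are assumptions; this is not the actual async_llm_engine.py code): one in-flight step task per virtual engine (VE), re-armed as soon as any task finishes, which is the point where that VE's workers may still be busy.

import asyncio

async def run_engine_loop(engine, num_virtual_engines: int) -> None:
    # One outstanding engine_step task per virtual engine (VE).
    in_progress = {
        asyncio.ensure_future(engine.engine_step(ve)): ve
        for ve in range(num_virtual_engines)
    }
    while in_progress:
        done, _ = await asyncio.wait(in_progress,
                                     return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            ve = in_progress.pop(task)
            task.result()  # re-raise any exception from the finished step
            # Immediately schedule the next step for this VE; the concern
            # raised above is that its workers may not have fully finished.
            in_progress[asyncio.ensure_future(engine.engine_step(ve))] = ve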
Thanks for the thorough investigation and the fix! It's indeed true that there are existing issues with hanging on the current vLLM mainline, and I have not rebased on the latest PyNCCL changes yet. I am also unable to reproduce this issue easily with GPT2 when I try with my own testing. For these reasons I haven't investigated as deeply yet. I'll give your setup and fix a try once I check if multi-node is functional. I wonder if this is a similar reason as to why the TP-only cases are hanging in the issues mentioned above, since there is no such
FYI: I recently found the clean-up logic is prone to hang, and this is "fixed" in #4508.
@SolitaryThinker I tried the model/commands above that are giving you issues. I was unable to reproduce on my setup.
My Setup
Started a fresh instance with the following: GCP g2-standard-48 (4 x NVIDIA L4)
Experiments
Started vLLM with:
Ran the below 3 times:
Killed vLLM server then repeated the above experiment 2 more times for a total of 3 separate serving instances, 9 benchmark tries, and 27 total requests sent. See expected benchmark results each time:
I wonder if it might only be reproducible on other instances... needs further investigation though.
A very meaningful feature. Here is the command:
And here is error stack:
@zhengxingmao Thanks for reporting this! Does this happen without PP? If not, I think it could be some interaction with the following flags with PP. Can you try without these flags and use a model directly from HF? (LLaMa)
I did some investigation into what you were saying. I think there are real hangs that appear. I tried LLaMa 3 8B with effectively infinite request rate on 2 L4s and saw hangs - not sure if this is the same situation that you found yourself in. Strangely, if I did a warm up request first, the hang went away. The
Also from here, asyncio methods such as I resolved a hang on my end with: Maybe this helps for you?
Hi, @andoorve. I tried the codes on the I run the server on 2 x 4090:
The server runs successfully. Then I run the client as:
And the error occurs:
And the info shows that the running hangs. I am not sure if this is the same problem @SolitaryThinker met.
By the way, if changing |
Hey @darrenglow Thanks for trying it out. I tried as well and am not able to reproduce on 2xL4. Can you try it with Python 3.10 if possible?
Hi @andoorve, thanks for the feedback. I was also having the same issue as @darrenglow, and using Python 3.10 fixed both the hanging and crashing I was experiencing. However, vLLM hangs during CudaGraph capture when enabling PP+TP together without using the
Example command and output:
Output:
Thanks
Thanks for your reply. Switching to Python 3.10 did help solve the problem. Now I have also hit the same problem as @SolitaryThinker pointed out.
Hi @SolitaryThinker, @darrenglow Thanks for your comments. My best guess right now is it's probably related to the fact that this PR hasn't been rebased on the latest distributed/PyNCCL changes which we would need. However, CUDAGraph is a tricky thing to work with in general so we can't be sure. Currently I'm waiting on reviews before rebasing, at which point we can try again. We may still merge without CUDAGraph support though for TP + PP, especially since chunked prefill is eager. This is so that we have something functional in the mainline as soon as possible - this is TBA though.
coros.append(
    worker.execute_method.remote(
        method, *args, **kwargs))
all_outputs = await asyncio.gather(*coros)
I am a bit confused by this loop. Please help with my understanding, thanks! @andoorve
Say if we have pp_size = 2 and tp_size = 1, we'll iterate over the two ranks from 0 to 1 (outer loop, inner loop diminishes).
On pp_rank=0, we will launch the execution of the first stage on its corresponding GPU (L355) and await its completion (L368). Once L368 returns, we will proceed to rank 1. The part I am confused about: if in rank 0's execution we launch a NCCL send, this send won't return unless we have rank=1 launch its corresponding recv, right? In that case, the L368 await will never return and hence block the loop from launching the corresponding recv?
Please correct me if I am wrong, appreciate your help! @andoorve!
followup:
I debugged a bit and did a simple experiment:
I filed a single request to a model served with pp=2:
- On rank=0, I do not change any model code and just launch the NCCL send as your code did
- while on rank=1, I changed your code and do not launch a recv (but using fake values of hidden_states).
I added a few prints in your code and found that the send on rank=0 can still pass! Given that your send_to_next_rank and recv_from_prev_rank are indeed using the synchronous send/recv API, this is extremely suspicious that the current code might have messed up some send/recv pairings...
Hi @zhisbug
Thanks for your thorough investigation! I was also initially puzzled by this and did not expect this code to work - send/recv should be blocking and thus we should hang while waiting at the L368 await. At the time, I thought this might be due to a ray quirk - i.e. returning early somehow.
At the time I dismissed it because:
a) The output we get is correct
b) When I printed the sent/recv'd tensors those appeared to be correct
c) When I check the trace with nsys I see send and recv matching up there.
However, what you are saying is true - I do see the send ending before the recv begins when I print timestamps. It needs a more in-depth look.
I tried your debugging method but I do see hangs when I do it this way, which we would expect:
- On rank=0, I do not change any model code and just launch the NCCL send as your code did
- while on rank=1, I changed your code and do not launch a recv (but using fake values of hidden_states).
May I know your exact modifications?
If you change your llama forward function to the following, in which I let the first profile_run complete a full send/recv by making a special case based on shape (in fact, I also found the profile run won't go through your raygpuexecutor code), but the following send/recv to be only partial -- only send, no recv:
def forward(
    self,
    input_ids: Optional[torch.Tensor],
    positions: torch.Tensor,
    kv_caches: List[torch.Tensor],
    attn_metadata: AttentionMetadata,
    inputs_embeds: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if is_pipeline_model_parallel_first_rank():
        if inputs_embeds is not None:
            hidden_states = inputs_embeds
        else:
            hidden_states = self.get_input_embeddings(input_ids)
        residual = None
    else:
        if inputs_embeds is not None:
            sizes = list(inputs_embeds.size())
        else:
            sizes = list(input_ids.size()) + [self.config.hidden_size]
        print(f"{sizes}")
        if sizes[0] == 2048:
            hidden_states, residual = recv_prev_rank(
                2, torch.Size(sizes), self.embed_tokens.weight.dtype,
                self.embed_tokens.weight.device)
        else:
            if inputs_embeds is not None:
                hidden_states = inputs_embeds
            else:
                hidden_states = self.get_input_embeddings(input_ids)
            residual = None
    for i in range(self.start_layer, self.end_layer):
        layer = self.layers[i]
        hidden_states, residual = layer(
            positions,
            hidden_states,
            kv_caches[i - self.start_layer],
            attn_metadata,
            residual,
        )
Then you can launch the openai server and send one prompt using the following two commands:
CUDA_LAUNCH_BLOCKING=1 python -m vllm.entrypoints.openai.api_server --model JackFram/llama-160m --swap-space 16 --pipeline-parallel-size 2 --enforce-eager
python benchmark_serving.py --backend vllm --model JackFram/llama-160m --dataset-name sharegpt --dataset-path ~/sharegpt.json --num-prompts 1 --sharegpt-output-len 5
If you add a few prints at the end of the send API:
def send_next_rank(tensors: List[torch.Tensor]) -> None:
    """Send the tensors to the next pipeline model parallel rank."""
    print(f"global rank {torch.distributed.get_rank()} sending...")
    combined_tensor = torch.cat(tensors, dim=0)
    torch.distributed.send(combined_tensor,
                           get_pipeline_model_parallel_next_rank(),
                           get_pipeline_model_parallel_group())
    print(f"global rank {torch.distributed.get_rank()} sent done")
You will observe that the first send passes even though I do not launch a recv for it! After the first send/recv, the server will hang at the line where I print "global rank {torch.distributed.get_rank()} sending...".
This is extremely strange because we do not launch a corresponding recv for the first send - how could it pass?
Ok, I think I have a good idea of what is going on here, and why we are still seeing the correct output even though send and recv here are not operating as we expect - but it's harder to figure out why that's happening.
My hypothesis is that torch.distributed.send is enqueuing the correct send operation on the relevant CUDA stream. However, instead of blocking waiting for that NCCL operation to complete, it is instead returning immediately from the Python function. I.e. it is operating similarly to the non-blocking isend function, which is not what we expect.
This is why you see the first send "go through" (it is not actually sending, just returning from the Python function) and blocking on the second send.
I modified communication_op.py to include some print statements and time.sleeps.
def send_next_rank(tensors: List[torch.Tensor]) -> None:
    """Send the tensors to the next pipeline model parallel rank."""
    combined_tensor = torch.cat(tensors, dim=0)
    torch.cat(tensors, dim=0)
    print(f'SEND STARTING {time.time()}', flush=True)
    torch.distributed.send(combined_tensor,
                           get_pipeline_model_parallel_next_rank(),
                           get_pipeline_model_parallel_group())
    print(f'SEND SUM: {combined_tensor.sum()}', flush=True)
    print(f'SEND COMPLETED {time.time()}', flush=True)
    time.sleep(5)

def recv_prev_rank(num_tensors: int, sizes: torch.Size, dtype: torch.dtype,
                   device: torch.device) -> List[torch.Tensor]:
    sizes = list(sizes)
    """Receive tensors from the previous pipeline model parallel rank."""
    combined_tensor = torch.empty([sizes[0] * num_tensors] + sizes[1:],
                                  dtype=dtype,
                                  device=device)
    time.sleep(5)
    print(f'RECV STARTING {time.time()}', flush=True)
    torch.distributed.recv(combined_tensor,
                           get_pipeline_model_parallel_prev_rank(),
                           get_pipeline_model_parallel_group())
    print(f'RECV SUM: {combined_tensor.sum()}', flush=True)
    print(f'RECV COMPLETED {time.time()}', flush=True)
    return torch.chunk(combined_tensor, num_tensors, dim=0)
This gives the following output (I also include print statements for entering and exiting an outer loop iteration in ray_gpu_executor.py).
PP RANK 0 STARTED! at 1715983628.996715
SEND STARTING 1715983629.1642857
SEND SUM: -248.0
SEND COMPLETED 1715983629.1664162
PP RANK 0 DONE! at 1715983634.1697052
PP RANK 1 STARTED! at 1715983634.169776
(RayWorkerWrapper pid=2335407) RECV STARTING 1715983639.196885
(RayWorkerWrapper pid=2335407) RECV SUM: -248.0
(RayWorkerWrapper pid=2335407) RECV COMPLETED 1715983639.1983254
PP RANK 1 DONE! at 1715983639.2434692
Here, the time in the Python portion of the send function is completely disjoint from the time spent in the recv portion (and PP rank 0 is done before the subsequent RECV is started). However, the checksums are consistent. This is consistent with what you saw in your experiment where the second send was blocked by the first (since there was no matching recv for the first send) even though the Python function completed.
I tried to gather an nsys trace here for one more piece of evidence, but unfortunately was not able to get both the GPU traces for some reason.
Of course, although in this case this behaviour seems to be working in our favour, it is not at all consistent with what we expect of torch.distributed.send and torch.distributed.recv, which is to block the Python thread.
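One way to probe that hypothesis (a hedged diagnostic sketch, not part of the PR): if torch.distributed.send only enqueues the NCCL kernel on the CUDA stream and returns, then forcing a device synchronization right after it should block until the peer actually posts the matching recv.

import time
import torch
import torch.distributed as dist

def send_and_wait(tensor: torch.Tensor, dst: int) -> None:
    t0 = time.time()
    dist.send(tensor, dst=dst)
    print(f"send() returned after {time.time() - t0:.3f}s", flush=True)
    # If send() merely enqueued work, this synchronize will not return until
    # the NCCL send kernel completes (i.e. the peer's recv is posted).
    torch.cuda.synchronize()
    print(f"stream drained after {time.time() - t0:.3f}s", flush=True)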
To continue, I also tried a small unit test, and here the semantics are respected. That is, send blocks its process until the recv is completed.
import time
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    """ Distributed function to be implemented later. """
    device = torch.device(f"cuda:{rank}")
    if rank == 0:
        tens = torch.ones([33, 4096], dtype=torch.bfloat16, device=device)
        dist.send(tens, dst=1)
        print(f'SEND COMPLETED {time.ctime()}', flush=True)
    else:
        tens = torch.empty([33, 4096], dtype=torch.bfloat16, device=device)
        time.sleep(10)
        print(f'ABOUT TO BEGIN RECV {time.ctime()}', flush=True)
        dist.recv(tens, src=0)
        print(f'{tens}', flush=True)

def init_process(rank, size, fn, backend='nccl'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run, 'nccl'))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
I have been trying to see if there are some env variables that perhaps Ray or vLLM might have set to change PyTorch's behaviour, but have not been successful so far.
Force-pushed from 39c6019 to 8513174.
This is rebased now, you can try this.
@andoorve thank you for rebasing! I am still seeing the same error from CudaGraph when using both PP and PP+TP. What is your pytorch version? I did a clean install using
Please see the following error (output from other ranks omitted, but they all fail at cudagraph)
This is due to PyTorch changes in 2.3.0, see: pytorch/pytorch#120270. You can work around this with a quick fix by changing
@andoorve Great, thanks for the quick fix. Your pynccl changes are identical to mine so that's reassuring.
Force-pushed from cf2d32f to 4435dc3.
Commit history for this push (all commits authored and signed off by Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>):
commit 921bb1a014d435089db634fea9451b8c9f945459 - Wed May 22 2024 - Add back driver worker arg
commit 39c6019865192737ce3cd09c50d13db2a32e1ca5 - Thu May 9 2024 - Test fix
commit b60f7ea8779ae5e35c68868f327569df2167b88f - Wed May 8 2024 - Refactoring and test fixes
commit 7e993601f47e68afe31b30ac66f9252956ce58c9 - Wed May 8 2024 - Formatting
commit 2091dd91d06070d1db0f82670e82120d5f7ad5f4 - Tue May 7 2024 - Basic PP tests
commit 016e25664434dc6f63eed9526e5982048757d7a2 - Tue May 7 2024 - Formatting
commit ee86cd204666eab815e42be703c5f434c41af255 - Tue May 7 2024 - Fix condition for PP support
commit df9b0c45cee14395b2b2dff9c4e3343ab2a019a1 - Tue May 7 2024 - Fix hangs
commit 2180531ed5592d49cfa7492cebc92269693094ee - Tue May 7 2024 - Fix typo
commit a17fcfe02c820f7b83bdcc3704059fcb35a231b8 - Tue May 7 2024 - Assert out model architectures that are unsupported
commit f784fda224144f82065c19c643912390ab29b849 - Tue May 7 2024 - More test fixes
commit 04b5fe903ac4598b5337d457afd684426e384690 - Sun May 5 2024 - Change condition for prepare_input_tensors to broadcast
commit 526bade032dbeba73f6523009701f8a5f4b222f9 - Sun May 5 2024 - Fixed bug with TP + PP execution
commit 9d698fa - Thu May 2 2024 - Format and test changes
commit 16a5aac - Thu May 2 2024 - Format and test changes
commit 65a5300 - Wed May 1 2024 - Simplify weight loading logic
commit daddc19 - Wed May 1 2024 - Formatting
commit 1be32c8 - Wed May 1 2024 - Revert "PyNCCL changes" (reverts commit 99bb187)
commit 99bb187 - Wed May 1 2024 - PyNCCL changes
commit bd12e70 - Tue Apr 30 2024 - Fixed testing errors
commit fbb2b2e - Sat Apr 27 2024 - Formatting
commit 06609d9 - Sat Apr 27 2024 - Pipeline Parallel
Force-pushed from ae0d1ae to 69166a1.
Not sure if you are aware, but it seems that one of your pushes today broke PP. I am seeing empty responses from the api_server. With or without cudagraph for both tp-only and tp+pp. The last time that your branch worked for me was when you sent this message above.
Below is a small script for the openai client I have been using to check for correctness. Hopefully it will be of use.

from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id

ps = ["Hello, my name is",
      "The president of the United States is",
      "The capital of France is",
      "The future of AI is"] * 10

# Completion API
stream = False
completion = client.completions.create(
    model=model,
    prompt=ps,
    echo=False,
    temperature=0,
    stream=stream)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

for idx, c in enumerate(completion.choices):
    print('Prompt:', ps[idx])
    print('Decode:', c.text)

Output on latest commit (meta-llama/Llama-2-13b-hf):
expected output (meta-llama/Llama-2-13b-hf):
@SolitaryThinker thanks for letting me know - let me take a look.
It was passing for GPT2 but not LLaMa. I missed a line when rebasing a LLaMa change, which I added back. It should pass your script now @SolitaryThinker
Adds initial pipeline parallelism support to vLLM.
ToDo:
Milestone 1: POC Prototype
- worker.py, llm_engine.py, async_llm_engine.py and block managers
- ray_gpu_executor.py, worker.py and model_runner.py to support multiple driver workers
Milestone 2: Mergeable
FIX #4461
Goals for this PR:
Non-goals for this PR (To be covered in future PRs)
cc: @zhuohan123 @WoosukKwon @simon-mo @youkaichao