
Adds batched inference with left-padding #886

Open
wants to merge 24 commits into main
Conversation


@FlimFlamm FlimFlamm commented Jan 17, 2024

Adds a left-padding batched inference strategy by modifying generate/base.py and model.py

  • The only API change was renaming prompt to prompts in generate/base.py's main() function; it's still compatible with a single string, and now also with a list of strings.
  • Under the right conditions (e.g., batch size 64 and max_tokens=200, which won't overflow my 4090 at that batch size), I can get just about 2,000 tokens/second on Mistral 7B. On StableLM 3B I can get 6,000 tokens/second (with batch sizes in the hundreds).
  • I added some ultra-light formatting and color to the decoding and printing of the generated results. Color/light is lit! (shown in the video)
2024-01-17.19-35-56.mp4

EDIT: currently triaging the test failures.

Constructive feedback is very welcome. If something about this commit would adversely affect other parts of the repo that I have overlooked, I'll do my best to address it.
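
For context, a minimal sketch of the left-padding step (assuming already-tokenized prompts; pad_left and pad_id are illustrative names, not the exact code in this PR):

    import torch
    from typing import List, Tuple

    def pad_left(prompts: List[torch.Tensor], pad_id: int = 0) -> Tuple[torch.Tensor, torch.Tensor]:
        """Left-pad a list of 1D token tensors into a (batch, max_len) tensor plus a padding mask."""
        max_len = max(p.size(0) for p in prompts)
        batch = torch.full((len(prompts), max_len), pad_id, dtype=torch.long)
        mask = torch.zeros(len(prompts), max_len, dtype=torch.bool)  # True = real token
        for i, p in enumerate(prompts):
            batch[i, max_len - p.size(0):] = p
            mask[i, max_len - p.size(0):] = True
        return batch, mask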

@FlimFlamm
Author

I think the tests might need some tweaking (otherwise I might have broken them with a few too many pushes :D)

Will leave this as is for review, and will be back ASAP to carry out any fixes that might be desirable or necessary.

@carmocca
Contributor

Hi @FlimFlamm! Thanks for working on this.

I had this partially implemented but never pushed it and I might have lost it because I cannot find it in my stashes 💀.

I'll sit on this for a bit and perhaps merge what you have with what I had. This will need tests and some performance benchmarking before landing.

@FlimFlamm
Author

FlimFlamm commented Jan 19, 2024

> Hi @FlimFlamm! Thanks for working on this.
>
> I had this partially implemented but never pushed it and I might have lost it because I cannot find it in my stashes 💀.
>
> I'll sit on this for a bit and perhaps merge what you have with what I had. This will need tests and some performance benchmarking before landing.

Happy to be of help!

For a relatively small change, it affects computation in a lot of scripts (anything that touches generate, including generate/lora.py, generate/adapter.py, etc.), so this is definitely one for careful review. Here are some notes/considerations I have so far in hindsight:

  • It's important to ensure that the RoPE application doesn't need to be modified for the left-padding case; left padding changes how RoPE applies by default (with left padding, the first RoPE position won't land on the BOS token of the sequence). Outputs look good despite this, but they might be getting harmed by the difference, and some models might be more brittle in this situation. (See the sketch at the end of this comment for one way the positions could be realigned.)

  • A second important consideration is whether any extra masking needs to occur (masking the left-padding tokens). My implementation pads with 0s; originally I was masking them out inside model.py's forward pass (modifying the mask loaded from the cache according to the current left padding), but it didn't seem to make any difference in the outputs, so I removed it in the interest of not modifying model.py. Possibly the models I was testing with effectively ignore 0 tokens due to training/config dynamics, but other models might have issues. (Since not all models have an explicit padding token, I'm not sure how this applies to all cases.)

  • Right padding might be a better alternative if the RoPE issue with left padding leads to complications for model.py. I originally tried implementing right padding but settled on left padding because it seemed to require fewer alterations, though this might not hold if there is a RoPE issue.

  • I did some testing with generate/lora/adapter/sequentially, and the same left-padding logic seems to work without issue (it just requires the modifications found in generate's main()). I was able to do multi-GPU batched inference with Mixtral!

I'm going to tinker and test some more (hopefully to see whether right padding can be more easily cinched in, in case that turns out to be important for model performance).
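
As a sketch of the RoPE realignment mentioned in the first bullet (purely illustrative, not part of this PR): per-sequence position indices can be derived from the padding mask so that position 0 lands on each sequence's first real token, and those indices can then be used to index the RoPE cache.

    import torch

    def positions_for_left_padding(padding_mask: torch.Tensor) -> torch.Tensor:
        """padding_mask: (batch, T) bool, True = real token.
        Returns (batch, T) position ids starting at 0 on each sequence's first real token."""
        # cumulative count of real tokens, shifted so the first real token gets position 0;
        # positions on pad tokens are clamped to 0 and should be masked out anyway
        return (padding_mask.long().cumsum(dim=-1) - 1).clamp(min=0)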

@FlimFlamm
Author

FlimFlamm commented Jan 21, 2024

Pushed some additions and changes that seemed sensible or cleaner. Made a simple padding function for utils (it can do left and right padding), and set up the mask cache to optionally take a padding mask (which can be passed when the KV cache is being set, or directly to build_mask_cache).

Also set up the same logic in sequentially.py for testing (seems to work great).

Finally, I also added optional attention masking to the forward pass of model.py's GPT class (which isn't required, but seems like it would be useful for anyone using special masking).

NOTE: the masking strategy that bakes a batch's left/right padding into the mask cache increases the mask cache by a factor of batch_size (since we need unique padding inside each sequence's mask), but by doing so we don't have to do any tensor work during generation. In theory, if the max sequence length explodes, this strategy loses its edge (because the auto-regressive mask itself scales quadratically), and the batch_size factor might start to hurt. A sketch of the idea is below.
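
To illustrate baking the padding into the mask cache (names and shapes here are my assumptions, not necessarily what the PR does):

    import torch
    from typing import Optional

    def build_mask_cache(max_seq_length: int, padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Causal mask of shape (1, 1, T, T), or (B, 1, T, T) if a (B, T) padding mask is given
        (True = real token, covering the full max_seq_length)."""
        causal = torch.ones(max_seq_length, max_seq_length, dtype=torch.bool).tril()
        mask = causal.unsqueeze(0).unsqueeze(0)  # (1, 1, T, T)
        if padding_mask is not None:
            # broadcast each sequence's key padding over the query dimension;
            # this is where the batch_size memory factor comes from
            mask = mask & padding_mask[:, None, None, :]  # (B, 1, T, T)
        return mask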

@WilliamGazeley

Thanks for working on this @FlimFlamm, I was working on this functionality on my fork as well, but the KV cache issue is a tricky one.

I cloned your repo and tried to run generation on StableLM and TinyLlama, but both produced outputs that were gibberish. I didn't make any changes to your code; any idea what could be going on?

@FlimFlamm
Author

> Thanks for working on this @FlimFlamm, I was working on this functionality on my fork as well, but the KV cache issue is a tricky one.
>
> I cloned your repo and tried to run generation on StableLM and TinyLlama, but both produced outputs that were gibberish. I didn't make any changes to your code; any idea what could be going on?

Can I ask exactly what method or CLI args you used to test? I'll try to reproduce and see if I can find the issue.

@WilliamGazeley

WilliamGazeley commented Jan 25, 2024

I just did the following:

    python scripts/download.py --repo_id 'TinyLlama/TinyLlama-1.1B-Chat-v1.0' --from_safetensors 1
    python scripts/convert_hf_checkpoint.py --checkpoint_dir 'checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'
    python generate/base.py --checkpoint_dir 'checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'

I set prompts = ["what food do llamas eat?"] and I get outputs that keep repeating words.

@FlimFlamm
Author

FlimFlamm commented Jan 25, 2024

> I just did the following:
>
>     python scripts/download.py --repo_id 'TinyLlama/TinyLlama-1.1B-Chat-v1.0' --from_safetensors 1
>     python scripts/convert_hf_checkpoint.py --checkpoint_dir 'checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'
>     python generate/base.py --checkpoint_dir 'checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'
>
> I set prompts = ["what food do llamas eat?"] and I get outputs that keep repeating words.

Awesome, thanks for the details.

Editing...

@FlimFlamm
Author

FlimFlamm commented Jan 26, 2024

> I just did the following:
>
>     python scripts/download.py --repo_id 'TinyLlama/TinyLlama-1.1B-Chat-v1.0' --from_safetensors 1
>     python scripts/convert_hf_checkpoint.py --checkpoint_dir 'checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'
>     python generate/base.py --checkpoint_dir 'checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'
>
> I set prompts = ["what food do llamas eat?"] and I get outputs that keep repeating words.

So I found the problem: I was building my mask incorrectly. The recent push should have the replacement build_mask_cache() function. The only other necessary change was to use right padding instead of left padding, because having corrected the mask, I started re-encountering the NaN issue described in pytorch/pytorch#103749.

The exact cause of the problem is cases where an entire row of the causal attention mask is False, which breaks the scaled dot-product attention. The fix is apparently common in a lot of repos. Ours would be something like:

    def scaled_dot_product_attention(
        self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        scale = 1.0 / math.sqrt(self.config.head_size)
        # convert the boolean mask (True = attend) into a large-negative additive mask
        y = torch.nn.functional.scaled_dot_product_attention(
            q, k, v, attn_mask=(1.0 - mask.to(dtype=q.dtype)) * -10000.0, scale=scale, is_causal=mask is None
        )
        return y.transpose(1, 2)

Instead of doing that, I just switched base.py to right padding, but the more I test, the more it looks like the above is a correct way to address the all-False problem.

Another legitimate fix for this particular model seems to be just using the padding token it assigns, and not using a padding mask at all. Whether or not the model itself defines a padding token might be an indicator that no extra masking is required for the padding...

Do let me know if the last push makes batched inference work on your end!

@FlimFlamm
Author

FlimFlamm commented Jan 26, 2024

Applying the large-negative-number fix seems to have done the trick; left and right padding are now equivalent in terms of output for the TinyLlama chat model.

@WilliamGazeley

WilliamGazeley commented Jan 26, 2024

Running benchmarks on TinyLlama with the original generate code vs. your batch implementation yields (almost) identical scores now. Well done!

Update
Spoke too soon... when batching 10 prompts at a time, the score is close to the original code; however, batching 25 results in a 10% drop in HellaSwag performance.

@FlimFlamm
Author

FlimFlamm commented Jan 26, 2024

> Running benchmarks on TinyLlama with the original generate code vs. your batch implementation yields (almost) identical scores now. Well done!
>
> Update: Spoke too soon... when batching 10 prompts at a time, the score is close to the original code; however, batching 25 results in a 10% drop in HellaSwag performance.

Very interesting. A few questions/requests that might help me replicate and track this issue down:

  1. Are all your prompts in a given batch the same? (If so, I can narrow down some potential causes.)

    1.1) If not, try turning off the padding mask (just don't pass it into set_kv); EOS is 0 according to the generation config.

    1.2) Try both left and right padding (change the parameter from "left" to "right" in base.py where pad_batched_tokens is called).

  2. Are you using a particular prompt format, like <|user|> and <|assistant|>? (AFAIK this is what TinyLlama 1B Chat is trained for.)

  3. Do the results at a batch's index 0 change for you compared to non-batched inference or the old generate code? (I don't think they should, but I'm wondering whether performance degrades uniformly for all sequences in a batch.)

I also wonder if the large-negative-number fix might not be ideally implemented here; I'm only using negative 10k (most implementations use torch.finfo(dtype).min, but that still resulted in NaNs for me, although maybe I omitted an additional related change); a sketch of that variant is at the end of this comment...

Possibly this performance hit is a consequence of batched inference in and of itself? I can't find much about it, but maybe?
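
For reference, a minimal sketch of the boolean-to-additive mask conversion using torch.finfo(dtype).min instead of a hard-coded -10000 (illustrative only, not a claim about what the final code should do):

    import torch

    def to_additive_mask(mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
        """mask: boolean, True = attend. Returns a float mask suitable for attn_mask."""
        additive = torch.zeros(mask.shape, dtype=dtype, device=mask.device)
        # rows that are entirely False become all finfo.min, which is exactly the
        # situation that produced NaNs in the linked PyTorch issue
        return additive.masked_fill(~mask, torch.finfo(dtype).min)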

@WilliamGazeley

WilliamGazeley commented Jan 27, 2024

  1. Prompts within a batch are different, but the prompts passed to both runs are the same (i.e., single and batched receive the same prompts).
    1.1) Turning off the padding mask destroys the benchmark performance.
    1.2) Right padding also results in bad performance.
    1.3) Right padding + no mask results in bad performance (expected, but being thorough).

  2. You're right, I'm not using the correct format, but this shouldn't matter because neither single nor batch generation uses the format, so the scores should be the same.

  3. The inputs for single (batch_size = 1) and batched (batch_size > 1) are identical, but oddly the outputs are not. Temperature is 0 and the random seed was set.

This branch and the upstream are starting to diverge; I'm going to copy your changes into my up-to-date fork and continue to dig around.

What are you getting on your end? Is doing 10 prompts in a batch the same as running the same 10 prompts one at a time?

@FlimFlamm
Author

@WilliamGazeley Thanks for the effort on this!

> What are you getting on your end? Is doing 10 prompts in a batch the same as running the same 10 prompts one at a time?

Batched inference does give different outputs for each sequence, which I think is by design. The good news is that the first sequence in the batch is the same as in our single unbatched case and with the original generate/mask code.

> You're right, I'm not using the correct format, but this shouldn't matter because neither single nor batch generation uses the format, so the scores should be the same.

I agree, although assuming there is some small unavoidable performance loss in batched inference cases, I was thinking that an input being further out of the training distribution could amplify the degradation. (Since at 1B this model is relatively brittle, perhaps that also magnifies the issues we're seeing with performance.)

I'll keep poking at it as well to see what I can come up with (I'll fire up HellaSwag as soon as I can to replicate your findings and start hunting from there).

@WilliamGazeley

WilliamGazeley commented Jan 29, 2024

Playing around further, I noticed that there's a huge difference in outputs if you switch between bf16 and 16-true. This is somewhat expected, I guess, but batched 16-true is closer to single bf16 than batched bf16 is to single bf16 (this is only on my benchmark, though).

Also, I think your implementation of scaled_dot_product_attention() breaks training scripts that do not pass input_pos when generating logits. I've updated the function to:

    def scaled_dot_product_attention(
        self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        scale = 1.0 / math.sqrt(self.config.head_size)
        y = torch.nn.functional.scaled_dot_product_attention(
            q,
            k,
            v,
            # only build the additive mask when a boolean mask is actually provided
            attn_mask=None if mask is None else (1.0 - mask.to(dtype=q.dtype)) * -10000.0,
            dropout_p=0.0,
            scale=scale,
            is_causal=mask is None
        )
        return y.transpose(1, 2)
