Generate with KV-cache enabled vs. not enabled gives different results #959

joecummings · 2024-05-10T19:14:51Z

We would expect the only difference between enabling and disabling the KV-cache for a model during generation to be the speed of decoding. However, in experiments where the with device: model.setup_caches() call is commented out in our generate.py recipe, the output is garbage.

Needs more investigation.
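The expectation above can be checked on a toy example. This is a minimal sketch, not torchtune code: it assumes a single attention head with identity projections (query = key = value = the input vector), and the names full_attention, cached_attention, and max_abs_diff are hypothetical helpers made up for illustration. It compares recomputing causal attention over the whole sequence against decoding one token at a time from a growing cache.

```python
import math


def softmax(scores):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, values):
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def full_attention(xs):
    """No cache: score every pair, mask out future positions, then softmax."""
    n = len(xs)
    out = []
    for i in range(n):
        scores = [dot(xs[i], xs[j]) if j <= i else float("-inf") for j in range(n)]
        out.append(weighted_sum(softmax(scores), xs))
    return out


def cached_attention(xs):
    """Incremental decoding: each new token attends only to the cache so far."""
    cache = []  # toy assumption: keys and values are the inputs themselves
    out = []
    for x in xs:
        cache.append(x)
        scores = [dot(x, k) for k in cache]
        out.append(weighted_sum(softmax(scores), cache))
    return out


def max_abs_diff(a, b):
    return max(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
print(max_abs_diff(full_attention(xs), cached_attention(xs)))
```

In this toy setting the two paths agree up to floating-point error, which is the property a cache vs. no-cache unit test for the real model would assert with a tolerance.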


rohan-varma · 2024-05-10T21:05:12Z

You might need to change the incremental_decode in the generation function?

calvinpelletier · 2024-05-15T19:07:19Z

@joecummings I'm guessing this is because the causal mask is created in setup_caches() here, so without calling this function we're attending to all tokens, resulting in garbage outputs. Maybe we should move this mask initialization into __init__?

Never mind, this line takes care of the causal mask if it's missing.
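For reference, the failure mode hypothesized above (attending to all tokens when no causal mask is applied) looks like this on a toy score matrix. This is only an illustrative sketch: causal_mask and attention_weights are hypothetical helpers, and the scores are made-up numbers rather than anything produced by the model.

```python
import math


def softmax(scores):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def attention_weights(scores, mask=None):
    """Row-wise softmax; masked-out entries are set to -inf beforehand."""
    rows = []
    for i, row in enumerate(scores):
        if mask is not None:
            row = [s if mask[i][j] else float("-inf") for j, s in enumerate(row)]
        rows.append(softmax(row))
    return rows


# Made-up pre-softmax scores for a 3-token sequence.
scores = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]

masked = attention_weights(scores, causal_mask(3))
unmasked = attention_weights(scores)

print(masked[0])    # the first token can only see itself
print(unmasked[0])  # without a mask, weight leaks onto future tokens
```

Only the last row is unaffected by the mask, since the final position is allowed to see every token either way.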

joecummings self-assigned this May 10, 2024

joecummings linked a pull request May 13, 2024 that will close this issue

Add support for non-incremental decoding + unit test #973

Open
