
Why memory increases during training #2602

Open
miloskovacevic68 opened this issue Apr 19, 2024 · 5 comments

Comments

@miloskovacevic68

Hello,
I have an unlabeled (anchor, positive) dataset with around 250,000 examples.
Here is the code I use to fine-tune the sentence-transformers/multi-qa-mpnet-base-cos-v1 model on a subset of the MS MARCO dataset:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses


def create_anchor_positive_unlabeled_set(marco_csv):
    # Build (anchor, positive) pairs from a tab-separated MS MARCO export.
    training_set = []
    with open(marco_csv) as f:
        for l in f:
            try:
                qaid, q, pos, neg, ans = l.split("\t")
            except ValueError:
                # Skip lines that do not have exactly five tab-separated columns.
                continue
            training_set.append(InputExample(texts=[q, pos]))
    return training_set


def finetune_with_mnr_loss(training_set, model, output_dir, batch_size, epochs, max_seq_len):
    model.max_seq_length = max_seq_len
    train_dataloader = DataLoader(
        training_set,
        shuffle=True,
        batch_size=batch_size
    )

    train_loss = losses.MultipleNegativesRankingLoss(model=model)
    model.fit(
        [(train_dataloader, train_loss)],
        epochs=epochs,
        output_path=output_dir,
        show_progress_bar=False,
        use_amp=True
    )


finetune_with_mnr_loss(
    training_set=create_anchor_positive_unlabeled_set(
        "datasets/ms_marco_sr.csv"
    ),
    model=SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1", device="cuda"),
    output_dir="models/multi_qa_cos_big",
    batch_size=100,
    epochs=10,
    max_seq_len=256
)

When training starts, 22 of my 24 GB of VRAM are consumed. Memory consumption keeps increasing over the iterations, and at the very end of the first epoch I get an out-of-memory error.

I then tried using a DataLoader with an IterableDataset, but the result is the same. Why does memory increase towards the end of the epoch, and how can I fine-tune this model?

Regards, Milos

@ir2718
Contributor

ir2718 commented Apr 22, 2024

Hi,

I'm not sure why the memory consumption increases, but you can lower the batch size (e.g. to 64), which should reduce memory consumption as well.

@tomaarsen
Collaborator

Hello!

I'm also not quite sure, but I have noticed that memory usage can sometimes increase. The reason is that during tokenization we pad to the longest sample in the batch, up to the maximum sequence length. So, every time you encounter a batch containing a text that is longer than any text in any previous batch, memory usage goes up: more values for that batch have to be placed on the GPU.

So, if you reach a particularly long text near the end of the training loop, this can result in a memory usage spike.

In short, once a text in one of your batches reaches the maximum sequence length, that batch is padded to be as big as a batch can possibly be, and that should be the maximum memory usage that training requires.
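To make that concrete, here is a small sketch of that padding behaviour (the texts and the printed shapes are illustrative, not from this thread): each tokenized batch is padded to its longest member, capped at max_seq_length.

# Illustration: the tokenized batch grows with its longest member,
# up to model.max_seq_length.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")
model.max_seq_length = 256

short_batch = ["what is a gpu", "what is vram"]
long_batch = ["what is a gpu", "a very long passage " + "word " * 400]

print(model.tokenize(short_batch)["input_ids"].shape)  # roughly (2, 6): padded to the longest short text
print(model.tokenize(long_batch)["input_ids"].shape)   # (2, 256): truncated and padded to max_seq_length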

  • Tom Aarsen

@miloskovacevic68
Author

> I'm not sure why the memory consumption increases, but you can lower the batch size (e.g. to 64), which should reduce memory consumption as well.

I would like to keep a larger batch size, since the models seem to perform better in that case.
I'll try CachedMultipleNegativesRankingLoss, which allows for larger batches by processing smaller mini-batches; see the sketch below.
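A minimal sketch of how that could slot into finetune_with_mnr_loss above (mini_batch_size=32 is just an assumed starting point; it is what bounds GPU memory while the DataLoader batch size stays large):

# Sketch: same fit() call as above, but with the cached loss so GPU memory
# is bounded by mini_batch_size rather than by the full batch size.
train_loss = losses.CachedMultipleNegativesRankingLoss(
    model=model,
    mini_batch_size=32,  # assumed value; lower it if memory is still tight
)
model.fit(
    [(train_dataloader, train_loss)],
    epochs=epochs,
    output_path=output_dir,
    show_progress_bar=False,
    use_amp=True
)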

Is it possible to train the model on two GPUs?

@miloskovacevic68
Author

> So, if you reach a particularly long text near the end of the training loop, this can result in a memory usage spike.

It makes sense. Thanks.

@tomaarsen
Collaborator

> Is it possible to train the model on two GPUs?

Only via #2449 at this point. This PR will be merged and released as Sentence Transformers v3 soon.
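For reference, a rough sketch of what multi-GPU training might look like with the Trainer-based API that PR introduces (assuming sentence-transformers >= 3.0; the dataset rows and argument values are placeholders, not from this thread). It would be launched with something like torchrun --nproc_per_node=2 train.py:

# Rough sketch, assuming sentence-transformers >= 3.0.
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")

# Placeholder (anchor, positive) rows; in practice, load the MS MARCO pairs here.
train_dataset = Dataset.from_dict({
    "anchor": ["example query"],
    "positive": ["example positive passage"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="models/multi_qa_cos_big",
    num_train_epochs=10,
    per_device_train_batch_size=50,  # per GPU, so two GPUs process 100 samples per step
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.MultipleNegativesRankingLoss(model),
)
trainer.train()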
