
Why memory increases during training #2602

Open
miloskovacevic68 opened this issue Apr 19, 2024 · 5 comments

Comments

@miloskovacevic68

Hello,
I have an unlabeled (anchor, positive) dataset with around 250,000 examples.
Here is the code I use to fine-tune the sentence-transformers/multi-qa-mpnet-base-cos-v1 model on a subset of the MS MARCO dataset:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses


def create_anchor_positive_unlabeled_set(marco_csv):
    # Build (anchor, positive) pairs from a tab-separated MS MARCO export.
    training_set = []
    with open(marco_csv) as f:
        for l in f:
            try:
                qaid, q, pos, neg, ans = l.split("\t")
            except ValueError:
                # Skip lines that do not have exactly five tab-separated columns.
                continue
            training_set.append(InputExample(texts=[q, pos]))
    return training_set


def finetune_with_mnr_loss(training_set, model, output_dir, batch_size, epochs, max_seq_len):
    model.max_seq_length = max_seq_len
    train_dataloader = DataLoader(
        training_set,
        shuffle=True,
        batch_size=batch_size
    )

    train_loss = losses.MultipleNegativesRankingLoss(model=model)
    model.fit(
        [(train_dataloader, train_loss)],
        epochs=epochs,
        output_path=output_dir,
        show_progress_bar=False,
        use_amp=True
    )


finetune_with_mnr_loss(
    training_set=create_anchor_positive_unlabeled_set(
        "datasets/ms_marco_sr.csv"
    ),
    model=SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1", device="cuda"),
    output_dir="models/multi_qa_cos_big",
    batch_size=100,
    epochs=10,
    max_seq_len=256
)

When training starts, 22 of my 24 GB of VRAM are consumed. Memory consumption keeps increasing over the iterations, and at the very end of the first epoch I get an out-of-memory error.

I then tried using a DataLoader with an IterableDataset, but the result is the same. Why does memory increase towards the end of the epoch, and how can I fine-tune this model?

Regards, Milos

@ir2718
Contributor

ir2718 commented Apr 22, 2024

Hi,

I'm not sure why the memory consumption increases, but you can lower the batch size (e.g. to 64), which should reduce memory consumption as well.

@tomaarsen
Collaborator

Hello!

I'm also not quite sure, but I have noticed that memory usage can sometimes increase. The reason is that during tokenization we pad to the longest sample in the batch, up to the maximum sequence length. So, every time you encounter a batch containing a text that is longer than any text in any previous batch, memory usage goes up: more values for that batch have to be placed on the GPU.

So, if you reach a particularly long text near the end of the training loop, this can result in a memory usage spike.

In short, once a text in one of your batches reaches the maximum sequence length, that batch is padded to be as big as a batch can possibly be, and that should be the maximum memory usage that training requires.
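To make that concrete, here is a small sketch of that padding behaviour (the texts and the printed shapes are illustrative, not from this thread): each tokenized batch is padded to its longest member, capped at max_seq_length.

# Illustration: the tokenized batch grows with its longest member,
# up to model.max_seq_length.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")
model.max_seq_length = 256

short_batch = ["what is a gpu", "what is vram"]
long_batch = ["what is a gpu", "a very long passage " + "word " * 400]

print(model.tokenize(short_batch)["input_ids"].shape)  # roughly (2, 6): padded to the longest short text
print(model.tokenize(long_batch)["input_ids"].shape)   # (2, 256): truncated and padded to max_seq_length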

  • Tom Aarsen

@miloskovacevic68
Author

> I'm not sure why the memory consumption increases, but you can lower the batch size (e.g. to 64), which should reduce memory consumption as well.

I would like to keep a larger batch size, since the models seem to perform better in that case.
I'll try CachedMultipleNegativesRankingLoss, which allows for larger batches by processing smaller mini-batches; see the sketch below.
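A minimal sketch of how that could slot into finetune_with_mnr_loss above (mini_batch_size=32 is just an assumed starting point; it is what bounds GPU memory while the DataLoader batch size stays large):

# Sketch: same fit() call as above, but with the cached loss so GPU memory
# is bounded by mini_batch_size rather than by the full batch size.
train_loss = losses.CachedMultipleNegativesRankingLoss(
    model=model,
    mini_batch_size=32,  # assumed value; lower it if memory is still tight
)
model.fit(
    [(train_dataloader, train_loss)],
    epochs=epochs,
    output_path=output_dir,
    show_progress_bar=False,
    use_amp=True
)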

Is it possible to train the model on two GPUs?

@miloskovacevic68
Author

> So, if you reach a particularly long text near the end of the training loop, this can result in a memory usage spike.

It makes sense. Thanks.

@tomaarsen
Collaborator

> Is it possible to train the model on two GPUs?

Only via #2449 at this point. This PR will be merged and released as Sentence Transformers v3 soon.
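For reference, a rough sketch of what multi-GPU training might look like with the Trainer-based API that PR introduces (assuming sentence-transformers >= 3.0; the dataset rows and argument values are placeholders, not from this thread). It would be launched with something like torchrun --nproc_per_node=2 train.py:

# Rough sketch, assuming sentence-transformers >= 3.0.
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")

# Placeholder (anchor, positive) rows; in practice, load the MS MARCO pairs here.
train_dataset = Dataset.from_dict({
    "anchor": ["example query"],
    "positive": ["example positive passage"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="models/multi_qa_cos_big",
    num_train_epochs=10,
    per_device_train_batch_size=50,  # per GPU, so two GPUs process 100 samples per step
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.MultipleNegativesRankingLoss(model),
)
trainer.train()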
