How are the samples picked when limit_train_batches < 1.0?
#10672
Unanswered
miccio-dk
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 7 replies
-
It doesn't subset the samples. Internally it just does this:

```python
for batch_idx, batch in enumerate(dataloader):
    if batch_idx >= limit_train_batches:
        break
    # do something with `batch`
```

so which samples end up being used depends on your dataloader's sampler (e.g. whether you set `shuffle=True`).
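The loop above can be sketched in plain Python (names like `train_batches` are illustrative, not Lightning API): iteration simply stops after the batch limit, so with `shuffle=False` the same leading batches are seen every epoch, while with `shuffle=True` the subset generally changes per epoch.

```python
import random

def train_batches(dataset, batch_size, limit_train_batches, shuffle=True):
    """Mimic how a batch limit truncates one training epoch:
    batches are drawn from the (possibly shuffled) index order, and
    iteration simply stops once `limit_train_batches` batches are used."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.shuffle(indices)  # fresh shuffle each epoch, like DataLoader(shuffle=True)
    batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    seen = []
    for batch_idx, batch in enumerate(batches):
        if batch_idx >= limit_train_batches:  # remaining batches are never touched
            break
        seen.append(batch)
    return seen

# With shuffle=False the same leading batches are used every epoch:
epoch_a = train_batches(list(range(100)), batch_size=10, limit_train_batches=3, shuffle=False)
epoch_b = train_batches(list(range(100)), batch_size=10, limit_train_batches=3, shuffle=False)
assert epoch_a == epoch_b  # deterministic: the first 3 batches each time

# With shuffle=True the selected subset generally differs between epochs.
```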
-
Hi! I'm trying to understand the behavior of `limit_train_batches` (and its `val` and `test` analogues). In particular, I'm curious whether the subset of batches is computed only once at the beginning of training or re-sampled at every epoch.

Say I have a dataset composed of N samples, and I want to train my model on a subset comprising only M samples (it doesn't matter how this subset is chosen, as long as it stays the same across epochs). Is it sufficient to set `limit_train_batches=(M // batch_size)`, or do I have to manually limit the size of my dataset somehow?

Thanks in advance!