
Fix iter_batches #5115

Merged: 11 commits merged into main on Oct 14, 2022

Conversation

@lhoestq (Member) commented Oct 14, 2022

The pa.Table.to_reader() method, available in pyarrow>=8.0.0, may return chunks smaller than max_chunksize: it splits chunks that are too large but does not coalesce small ones. So iter_batches could return batches smaller than the batch_size specified by the user.
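
For illustration, a minimal sketch of the underlying pyarrow behavior (assuming pyarrow>=8.0.0; the table and column name are made up for this example):

import pyarrow as pa

# A table backed by ten 1-row record batches
table = pa.Table.from_batches([pa.RecordBatch.from_pydict({"a": [i]}) for i in range(10)])

# max_chunksize only caps the size of each returned batch, it does not
# coalesce small chunks, so every batch here has a single row
print([len(batch) for batch in table.to_reader(max_chunksize=5)])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]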

As a result, batched map couldn't always use batches of the right size, e.g. this fails because map runs on only one batch of one element:

from datasets import Dataset, concatenate_datasets

ds = concatenate_datasets([Dataset.from_dict({"a": [i]}) for i in range(10)])

ds2 = ds.map(lambda _: {}, batched=True)
assert list(ds2) == list(ds)

This was introduced in #5030

Close #5111

This will require a patch release along with #5113

TODO:

  • fix tests
  • add more tests

@HuggingFaceDocBuilderDev commented Oct 14, 2022

The documentation is not available anymore as the PR was closed or merged.

lhoestq marked this pull request as ready for review October 14, 2022 13:39
@lhoestq (Member, Author) commented Oct 14, 2022

I also ran the code in #5111 and it works fine now :)

@lhoestq (Member, Author) commented Oct 14, 2022

This is ready for review :)

@albertvillanova (Member) left a comment

Thanks for the fix.

Just a few comments below.

src/datasets/table.py (review comment resolved)
Comment on lines +2162 to +2180
chunks_buffer_size = 0
for chunk in pa_table.to_reader(max_chunksize=batch_size):
    if len(chunk) == 0:
        continue
    elif chunks_buffer_size + len(chunk) < batch_size:
        chunks_buffer.append(chunk)
        chunks_buffer_size += len(chunk)
        continue
    elif chunks_buffer_size + len(chunk) == batch_size:
        chunks_buffer.append(chunk)
        yield pa.Table.from_batches(chunks_buffer)
        chunks_buffer = []
        chunks_buffer_size = 0
    else:
        cropped_chunk_length = batch_size - chunks_buffer_size
        chunks_buffer.append(chunk.slice(0, cropped_chunk_length))
        yield pa.Table.from_batches(chunks_buffer)
        chunks_buffer = [chunk.slice(cropped_chunk_length, len(chunk) - cropped_chunk_length)]
        chunks_buffer_size = len(chunk) - cropped_chunk_length
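
To see end to end what this buffering achieves, here is a self-contained, runnable sketch (the wrapper function _iter_batches_sketch is hypothetical; the real logic lives in src/datasets/table.py, including the drop_last_batch flush shown further below):

import pyarrow as pa

def _iter_batches_sketch(pa_table, batch_size, drop_last_batch=False):
    # to_reader guarantees each chunk has at most batch_size rows, so the
    # buffer never holds more than batch_size rows in total
    chunks_buffer = []
    chunks_buffer_size = 0
    for chunk in pa_table.to_reader(max_chunksize=batch_size):
        if len(chunk) == 0:
            continue
        elif chunks_buffer_size + len(chunk) < batch_size:
            # Not enough rows yet: keep buffering
            chunks_buffer.append(chunk)
            chunks_buffer_size += len(chunk)
        elif chunks_buffer_size + len(chunk) == batch_size:
            # Exactly full: emit a batch of batch_size rows
            chunks_buffer.append(chunk)
            yield pa.Table.from_batches(chunks_buffer)
            chunks_buffer = []
            chunks_buffer_size = 0
        else:
            # Overfull: split the chunk, emit a full batch, keep the rest
            cropped_chunk_length = batch_size - chunks_buffer_size
            chunks_buffer.append(chunk.slice(0, cropped_chunk_length))
            yield pa.Table.from_batches(chunks_buffer)
            chunks_buffer = [chunk.slice(cropped_chunk_length, len(chunk) - cropped_chunk_length)]
            chunks_buffer_size = len(chunk) - cropped_chunk_length
    if not drop_last_batch and chunks_buffer:
        yield pa.Table.from_batches(chunks_buffer)

table = pa.Table.from_batches([pa.RecordBatch.from_pydict({"a": [i]}) for i in range(10)])
print([len(b) for b in _iter_batches_sketch(table, batch_size=3)])
# [3, 3, 3, 1]: ten 1-row chunks regrouped into full batches plus a remainder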
@albertvillanova (Member)

Maybe we can remove the variable chunks_buffer_size:

Suggested change:

for chunk in pa_table.to_reader(max_chunksize=batch_size):
    if len(chunk) == 0:
        continue
    elif len(chunks_buffer) + len(chunk) < batch_size:
        chunks_buffer.append(chunk)
        continue
    elif len(chunks_buffer) + len(chunk) == batch_size:
        chunks_buffer.append(chunk)
        yield pa.Table.from_batches(chunks_buffer)
        chunks_buffer = []
    else:
        cropped_chunk_length = batch_size - len(chunks_buffer)
        chunks_buffer.append(chunk.slice(0, cropped_chunk_length))
        yield pa.Table.from_batches(chunks_buffer)
        chunks_buffer = [chunk.slice(cropped_chunk_length, len(chunk) - cropped_chunk_length)]

@lhoestq (Member, Author) commented Oct 14, 2022

chunks_buffer_size is the sum of the lengths of all the chunks in the buffer, not the number of chunks in the buffer (which is what len(chunks_buffer) gives).
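
For example (an illustrative snippet, not code from the PR):

import pyarrow as pa

chunks_buffer = [pa.RecordBatch.from_pydict({"a": [0, 1, 2]})]  # one 3-row chunk
print(len(chunks_buffer))                  # 1: the number of chunks
print(sum(len(c) for c in chunks_buffer))  # 3: what chunks_buffer_size tracks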

chunks_buffer = [chunk.slice(cropped_chunk_length, len(chunk) - cropped_chunk_length)]
chunks_buffer_size = len(chunk) - cropped_chunk_length
if not drop_last_batch and chunks_buffer:
    yield pa.Table.from_batches(chunks_buffer)
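
For context, this excerpt is the trailing flush: when drop_last_batch is False, any rows left in the buffer are yielded as a final, possibly smaller batch. Reusing the hypothetical sketch above:

print([len(b) for b in _iter_batches_sketch(table, batch_size=3, drop_last_batch=True)])
# [3, 3, 3]: the final 1-row remainder is dropped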
@albertvillanova (Member)

I'm just wondering whether this function may have a performance impact, compared with simply calling for batch in self.data.to_reader(max_chunksize=batch_size) as before.

If so, we should measure how much, so that we do not lose the performance gain introduced by #5030.

@lhoestq (Member, Author) commented Oct 14, 2022

The code is roughly the same as in #5030.

Also note that the worst-case scenario for this implementation is a dataset made of chunks of length 1, but even then it is faster than calling __getitem__ for each item:

from datasets import Dataset, concatenate_datasets

ds = concatenate_datasets([Dataset.from_dict({"a": [i]}) for i in range(100)])
%time list(ds._iter_batches(batch_size=10))
# <1ms
%time [ds[i:i+10] for i in range(0, len(ds), 10)]
# 1ms
%time list(ds)
# 3ms
%time [ds[i] for i in range(len(ds))]
# 5ms

It's even better for big datasets, since __getitem__ is not O(1): it uses an interpolation search over the table's chunks to locate the requested rows. Here, getting the next item is O(1).
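
For intuition, a simplified sketch of why random access costs more on a chunked table (binary search stands in here for the interpolation search used in datasets, and the names are made up for this example; sequential iteration just walks the chunks in order, one step at a time):

import bisect

def locate_row(chunk_offsets, i):
    # chunk_offsets[k] is the number of rows before chunk k; finding the
    # chunk that holds row i requires a search over these offsets
    k = bisect.bisect_right(chunk_offsets, i) - 1
    return k, i - chunk_offsets[k]

offsets = [0, 1, 2, 3]  # four 1-row chunks
print(locate_row(offsets, 2))
# (2, 0): row 2 lives in chunk 2 at local position 0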

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
@mariosasko (Contributor) left a comment

Thanks, LGTM!

lhoestq merged commit eadc79a into main Oct 14, 2022
lhoestq deleted the fix-iter_batches branch October 14, 2022 14:59
Successfully merging this pull request may close these issues:

  • map and filter not working properly in multiprocessing with the new release 2.6.0