Docs
mariosasko committed Sep 8, 2022
1 parent 0e241dd commit 3aa43e7
Showing 2 changed files with 17 additions and 0 deletions.
16 changes: 16 additions & 0 deletions docs/source/loading.mdx
@@ -220,6 +220,22 @@ Load a list of Python dictionaries with [`~Dataset.from_list`]:
>>> dataset = Dataset.from_list(my_list)
```

### Python generator

Create a dataset from a Python generator with [`~Dataset.from_generator`]:

```py
>>> from datasets import Dataset
>>> def my_gen():
...     yield {"a": 1}
...     yield {"a": 2}
...     yield {"a": 3}
...
>>> dataset = Dataset.from_generator(my_gen)
```

This approach supports loading data larger than available memory, because the dataset is generated on disk progressively and then memory-mapped.

### Pandas DataFrame

Load Pandas DataFrames with [`~Dataset.from_pandas`]:
1 change: 1 addition & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -16,6 +16,7 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.
- from_buffer
- from_pandas
- from_dict
- from_generator
- data
- cache_files
- num_columns
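The `from_generator` entry documented above takes a generator function, so a larger-than-memory source can be read one example at a time. The following is a minimal sketch, not part of this commit: `read_rows`, `build_dataset`, and the `part-*.jsonl` file names are hypothetical and chosen for illustration, and it assumes the `gen_kwargs` argument of `Dataset.from_generator` is used to pass the file list to the generator.

```py
import json
import tempfile
from pathlib import Path


def read_rows(paths):
    """Yield one example (a dict) at a time from JSON Lines files.

    Hypothetical helper: only one line is held in memory at any moment.
    """
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


def build_dataset(paths):
    """Build a Dataset from the generator (requires the `datasets` package)."""
    from datasets import Dataset

    # Assumption: gen_kwargs is forwarded to read_rows as keyword arguments.
    return Dataset.from_generator(read_rows, gen_kwargs={"paths": paths})


# Small demo files so the generator can be exercised without `datasets`.
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(2):
    part = tmp_dir / f"part-{i}.jsonl"
    lines = [json.dumps({"a": 2 * i + j}) + "\n" for j in range(2)]
    part.write_text("".join(lines), encoding="utf-8")
    paths.append(str(part))

rows = list(read_rows(paths))
```

With `datasets` installed, `build_dataset(paths)` should return a `Dataset` holding the same four rows. Passing the paths through `gen_kwargs` rather than closing over them keeps the generator function reusable across different inputs.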

1 comment on commit 3aa43e7

@github-actions


PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.008138 / 0.011353 (-0.003215) | 0.003820 / 0.011008 (-0.007188) | 0.029197 / 0.038508 (-0.009311) | 0.033473 / 0.023109 (0.010364) | 0.300923 / 0.275898 (0.025024) | 0.375922 / 0.323480 (0.052443) | 0.005773 / 0.007986 (-0.002212) | 0.003220 / 0.004328 (-0.001108) | 0.006895 / 0.004250 (0.002645) | 0.048218 / 0.037052 (0.011166) | 0.315910 / 0.258489 (0.057421) | 0.362973 / 0.293841 (0.069132) | 0.029654 / 0.128546 (-0.098892) | 0.009558 / 0.075646 (-0.066089) | 0.249173 / 0.419271 (-0.170098) | 0.047084 / 0.043533 (0.003552) | 0.305075 / 0.255139 (0.049936) | 0.332702 / 0.283200 (0.049503) | 0.094657 / 0.141683 (-0.047026) | 1.540047 / 1.452155 (0.087893) | 1.570114 / 1.492716 (0.077398) |

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.206476 / 0.018006 (0.188470) | 0.449727 / 0.000490 (0.449237) | 0.002867 / 0.000200 (0.002667) | 0.000076 / 0.000054 (0.000022) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.022052 / 0.037411 (-0.015359) | 0.097033 / 0.014526 (0.082507) | 0.106182 / 0.176557 (-0.070374) | 0.150973 / 0.737135 (-0.586162) | 0.107538 / 0.296338 (-0.188800) |

Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.406543 / 0.215209 (0.191334) | 4.065212 / 2.077655 (1.987557) | 1.836395 / 1.504120 (0.332275) | 1.643568 / 1.541195 (0.102373) | 1.688485 / 1.468490 (0.219995) | 0.442105 / 4.584777 (-4.142672) | 3.286627 / 3.745712 (-0.459085) | 3.159039 / 5.269862 (-2.110823) | 1.550732 / 4.565676 (-3.014944) | 0.052554 / 0.424275 (-0.371721) | 0.010707 / 0.007607 (0.003099) | 0.515135 / 0.226044 (0.289091) | 5.166752 / 2.268929 (2.897824) | 2.277682 / 55.444624 (-53.166943) | 1.964927 / 6.876477 (-4.911549) | 2.096129 / 2.142072 (-0.045943) | 0.560461 / 4.805227 (-4.244766) | 0.119634 / 6.500664 (-6.381030) | 0.066144 / 0.075469 (-0.009325) |

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.544329 / 1.841788 (-0.297459) | 12.659310 / 8.074308 (4.585002) | 26.088120 / 10.191392 (15.896728) | 0.867383 / 0.680424 (0.186959) | 0.601313 / 0.534201 (0.067112) | 0.345065 / 0.579283 (-0.234218) | 0.395007 / 0.434364 (-0.039357) | 0.230394 / 0.540337 (-0.309943) | 0.235669 / 1.386936 (-1.151268) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005694 / 0.011353 (-0.005658) | 0.003534 / 0.011008 (-0.007474) | 0.026757 / 0.038508 (-0.011751) | 0.027937 / 0.023109 (0.004828) | 0.334309 / 0.275898 (0.058411) | 0.396603 / 0.323480 (0.073123) | 0.003360 / 0.007986 (-0.004625) | 0.002870 / 0.004328 (-0.001458) | 0.004544 / 0.004250 (0.000294) | 0.035934 / 0.037052 (-0.001119) | 0.340780 / 0.258489 (0.082291) | 0.385589 / 0.293841 (0.091748) | 0.026946 / 0.128546 (-0.101600) | 0.009238 / 0.075646 (-0.066408) | 0.248086 / 0.419271 (-0.171186) | 0.045956 / 0.043533 (0.002423) | 0.336976 / 0.255139 (0.081837) | 0.364536 / 0.283200 (0.081336) | 0.089306 / 0.141683 (-0.052377) | 1.523780 / 1.452155 (0.071625) | 1.523722 / 1.492716 (0.031006) |

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.225473 / 0.018006 (0.207467) | 0.427614 / 0.000490 (0.427124) | 0.000925 / 0.000200 (0.000725) | 0.000073 / 0.000054 (0.000018) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.022342 / 0.037411 (-0.015070) | 0.098192 / 0.014526 (0.083667) | 0.107087 / 0.176557 (-0.069469) | 0.146463 / 0.737135 (-0.590673) | 0.109987 / 0.296338 (-0.186352) |

Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.431732 / 0.215209 (0.216523) | 4.298915 / 2.077655 (2.221260) | 2.060714 / 1.504120 (0.556594) | 1.851957 / 1.541195 (0.310762) | 1.904305 / 1.468490 (0.435815) | 0.443901 / 4.584777 (-4.140876) | 3.353349 / 3.745712 (-0.392363) | 1.926214 / 5.269862 (-3.343648) | 1.115986 / 4.565676 (-3.449691) | 0.053189 / 0.424275 (-0.371086) | 0.011367 / 0.007607 (0.003760) | 0.537817 / 0.226044 (0.311772) | 5.390857 / 2.268929 (3.121928) | 2.501415 / 55.444624 (-52.943209) | 2.183929 / 6.876477 (-4.692548) | 2.304741 / 2.142072 (0.162668) | 0.556909 / 4.805227 (-4.248318) | 0.119451 / 6.500664 (-6.381213) | 0.064388 / 0.075469 (-0.011081) |

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.562372 / 1.841788 (-0.279416) | 13.158001 / 8.074308 (5.083693) | 26.426858 / 10.191392 (16.235466) | 0.900347 / 0.680424 (0.219923) | 0.632322 / 0.534201 (0.098121) | 0.351256 / 0.579283 (-0.228027) | 0.404936 / 0.434364 (-0.029428) | 0.239645 / 0.540337 (-0.300693) | 0.246650 / 1.386936 (-1.140286) |
