Update doc
mariosasko committed Aug 11, 2022
1 parent 0e9b25f commit ce3b184
Showing 1 changed file with 7 additions and 4 deletions.
11 changes: 7 additions & 4 deletions docs/source/image_load.mdx
@@ -98,6 +98,8 @@ Your `metadata.jsonl` file must have a `file_name` column which links image file
 {"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
 ```
 
+It may be more convenient to specify metadata in CSV for simple tasks. In that case, use `metadata.csv` as the name of the metadata file.
+
 <Tip>
 
 If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set `drop_labels=False` in `load_dataset`.
@@ -106,12 +108,13 @@ If metadata files are present, the inferred labels based on the directory name a
 
 ### Image captioning
 
-Image captioning datasets have text describing an image. An example `metadata.jsonl` may look like:
+Image captioning datasets have text describing an image. An example `metadata.csv` may look like:
 
 ```jsonl
-{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
-{"file_name": "0002.png", "text": "A german shepherd"}
-{"file_name": "0003.png", "text": "One chihuahua"}
+file_name,text
+0001.png,This is a golden retriever playing with a ball
+0002.png,A german shepherd
+0003.png,One chihuahua
 ```
 
 Load the dataset with `ImageFolder`, and it will create a `text` column for the image captions:
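For readers trying out the documented change, here is a minimal sketch (not part of the commit) of producing a `metadata.csv` with the `file_name,text` layout shown in the diff. The helper names `write_metadata_csv` and `load_captioned_images` and the folder path are hypothetical. Using Python's `csv` module is an assumption made here so that captions containing commas are quoted correctly. The loader assumes the `datasets` library is installed and uses its `imagefolder` builder.

```python
import csv
from pathlib import Path


def write_metadata_csv(image_dir, captions):
    """Write `metadata.csv` (columns: file_name, text) into `image_dir`.

    `captions` maps an image file name to its caption. Hypothetical helper,
    shown only to illustrate the file layout from the diff above.
    """
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    metadata_path = image_dir / "metadata.csv"
    # newline="" lets the csv module control line endings itself.
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
        writer.writeheader()
        for file_name, text in captions.items():
            # Captions containing commas or quotes are escaped automatically.
            writer.writerow({"file_name": file_name, "text": text})
    return metadata_path


def load_captioned_images(image_dir):
    """Load the folder with the `imagefolder` builder (requires `datasets`)."""
    # Imported lazily so the writer above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_dir=str(image_dir))
```

A typical call would be `write_metadata_csv("path/to/folder", {"0001.png": "A german shepherd"})` followed by `load_captioned_images("path/to/folder")`, with the image files themselves already placed in that folder.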

1 comment on commit ce3b184

@github-actions
Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007343 | 0.011353 | -0.004010 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004154 | 0.011008 | -0.006854 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.035698 | 0.038508 | -0.002810 |
| read_batch_unformated after write_array2d | 0.034616 | 0.023109 | 0.011507 |
| read_batch_unformated after write_flattened_sequence | 0.320490 | 0.275898 | 0.044592 |
| read_batch_unformated after write_nested_sequence | 0.409961 | 0.323480 | 0.086481 |
| read_col_formatted_as_numpy after write_array2d | 0.006535 | 0.007986 | -0.001450 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003402 | 0.004328 | -0.000926 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008045 | 0.004250 | 0.003795 |
| read_col_unformated after write_array2d | 0.052416 | 0.037052 | 0.015363 |
| read_col_unformated after write_flattened_sequence | 0.326947 | 0.258489 | 0.068458 |
| read_col_unformated after write_nested_sequence | 0.415785 | 0.293841 | 0.121944 |
| read_formatted_as_numpy after write_array2d | 0.030401 | 0.128546 | -0.098145 |
| read_formatted_as_numpy after write_flattened_sequence | 0.010432 | 0.075646 | -0.065214 |
| read_formatted_as_numpy after write_nested_sequence | 0.301438 | 0.419271 | -0.117834 |
| read_unformated after write_array2d | 0.053029 | 0.043533 | 0.009496 |
| read_unformated after write_flattened_sequence | 0.320090 | 0.255139 | 0.064952 |
| read_unformated after write_nested_sequence | 0.353720 | 0.283200 | 0.070521 |
| write_array2d | 0.092477 | 0.141683 | -0.049206 |
| write_flattened_sequence | 1.626034 | 1.452155 | 0.173879 |
| write_nested_sequence | 1.719382 | 1.492716 | 0.226666 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.224012 | 0.018006 | 0.206006 |
| get_batch_of_1024_rows | 0.498788 | 0.000490 | 0.498298 |
| get_first_row | 0.001502 | 0.000200 | 0.001302 |
| get_last_row | 0.000177 | 0.000054 | 0.000122 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.022758 | 0.037411 | -0.014653 |
| shard | 0.114333 | 0.014526 | 0.099807 |
| shuffle | 0.121705 | 0.176557 | -0.054851 |
| sort | 0.160319 | 0.737135 | -0.576816 |
| train_test_split | 0.127431 | 0.296338 | -0.168907 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.421254 | 0.215209 | 0.206045 |
| read 50000 | 4.006206 | 2.077655 | 1.928551 |
| read_batch 50000 10 | 1.868078 | 1.504120 | 0.363958 |
| read_batch 50000 100 | 1.673892 | 1.541195 | 0.132698 |
| read_batch 50000 1000 | 1.649365 | 1.468490 | 0.180875 |
| read_formatted numpy 5000 | 0.420465 | 4.584777 | -4.164312 |
| read_formatted pandas 5000 | 3.834948 | 3.745712 | 0.089236 |
| read_formatted tensorflow 5000 | 3.809732 | 5.269862 | -1.460129 |
| read_formatted torch 5000 | 1.859445 | 4.565676 | -2.706232 |
| read_formatted_batch numpy 5000 10 | 0.060174 | 0.424275 | -0.364101 |
| read_formatted_batch numpy 5000 1000 | 0.012886 | 0.007607 | 0.005279 |
| shuffled read 5000 | 0.483399 | 0.226044 | 0.257355 |
| shuffled read 50000 | 5.485051 | 2.268929 | 3.216122 |
| shuffled read_batch 50000 10 | 2.383410 | 55.444624 | -53.061214 |
| shuffled read_batch 50000 100 | 2.060461 | 6.876477 | -4.816015 |
| shuffled read_batch 50000 1000 | 2.173534 | 2.142072 | 0.031462 |
| shuffled read_formatted numpy 5000 | 0.610406 | 4.805227 | -4.194821 |
| shuffled read_formatted_batch numpy 5000 10 | 0.125149 | 6.500664 | -6.375516 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.061033 | 0.075469 | -0.014436 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.672220 | 1.841788 | -0.169568 |
| map fast-tokenizer batched | 15.577971 | 8.074308 | 7.503662 |
| map identity | 26.825447 | 10.191392 | 16.634055 |
| map identity batched | 0.960882 | 0.680424 | 0.280458 |
| map no-op batched | 0.681123 | 0.534201 | 0.146922 |
| map no-op batched numpy | 0.422719 | 0.579283 | -0.156564 |
| map no-op batched pandas | 0.472552 | 0.434364 | 0.038188 |
| map no-op batched pytorch | 0.318867 | 0.540337 | -0.221470 |
| map no-op batched tensorflow | 0.305635 | 1.386936 | -1.081301 |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005857 | 0.011353 | -0.005496 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004448 | 0.011008 | -0.006560 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.029509 | 0.038508 | -0.008999 |
| read_batch_unformated after write_array2d | 0.033499 | 0.023109 | 0.010390 |
| read_batch_unformated after write_flattened_sequence | 0.376013 | 0.275898 | 0.100115 |
| read_batch_unformated after write_nested_sequence | 0.474039 | 0.323480 | 0.150559 |
| read_col_formatted_as_numpy after write_array2d | 0.003661 | 0.007986 | -0.004325 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003512 | 0.004328 | -0.000816 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.005481 | 0.004250 | 0.001230 |
| read_col_unformated after write_array2d | 0.045817 | 0.037052 | 0.008764 |
| read_col_unformated after write_flattened_sequence | 0.372515 | 0.258489 | 0.114026 |
| read_col_unformated after write_nested_sequence | 0.417629 | 0.293841 | 0.123788 |
| read_formatted_as_numpy after write_array2d | 0.028984 | 0.128546 | -0.099563 |
| read_formatted_as_numpy after write_flattened_sequence | 0.009859 | 0.075646 | -0.065787 |
| read_formatted_as_numpy after write_nested_sequence | 0.297560 | 0.419271 | -0.121711 |
| read_unformated after write_array2d | 0.051971 | 0.043533 | 0.008438 |
| read_unformated after write_flattened_sequence | 0.363124 | 0.255139 | 0.107985 |
| read_unformated after write_nested_sequence | 0.429858 | 0.283200 | 0.146658 |
| write_array2d | 0.106393 | 0.141683 | -0.035290 |
| write_flattened_sequence | 1.576457 | 1.452155 | 0.124302 |
| write_nested_sequence | 1.562549 | 1.492716 | 0.069833 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.228113 | 0.018006 | 0.210107 |
| get_batch_of_1024_rows | 0.444544 | 0.000490 | 0.444055 |
| get_first_row | 0.004383 | 0.000200 | 0.004183 |
| get_last_row | 0.000091 | 0.000054 | 0.000037 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.025219 | 0.037411 | -0.012192 |
| shard | 0.109254 | 0.014526 | 0.094729 |
| shuffle | 0.123464 | 0.176557 | -0.053092 |
| sort | 0.165531 | 0.737135 | -0.571605 |
| train_test_split | 0.129262 | 0.296338 | -0.167076 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.409119 | 0.215209 | 0.193910 |
| read 50000 | 4.148159 | 2.077655 | 2.070505 |
| read_batch 50000 10 | 2.299628 | 1.504120 | 0.795508 |
| read_batch 50000 100 | 2.213007 | 1.541195 | 0.671812 |
| read_batch 50000 1000 | 2.224325 | 1.468490 | 0.755835 |
| read_formatted numpy 5000 | 0.452836 | 4.584777 | -4.131941 |
| read_formatted pandas 5000 | 4.497996 | 3.745712 | 0.752284 |
| read_formatted tensorflow 5000 | 1.986707 | 5.269862 | -3.283155 |
| read_formatted torch 5000 | 1.277571 | 4.565676 | -3.288105 |
| read_formatted_batch numpy 5000 10 | 0.055291 | 0.424275 | -0.368984 |
| read_formatted_batch numpy 5000 1000 | 0.011272 | 0.007607 | 0.003665 |
| shuffled read 5000 | 0.512770 | 0.226044 | 0.286726 |
| shuffled read 50000 | 5.326868 | 2.268929 | 3.057940 |
| shuffled read_batch 50000 10 | 2.534348 | 55.444624 | -52.910276 |
| shuffled read_batch 50000 100 | 2.246590 | 6.876477 | -4.629887 |
| shuffled read_batch 50000 1000 | 2.435008 | 2.142072 | 0.292936 |
| shuffled read_formatted numpy 5000 | 0.501187 | 4.805227 | -4.304040 |
| shuffled read_formatted_batch numpy 5000 10 | 0.114828 | 6.500664 | -6.385836 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.057682 | 0.075469 | -0.017788 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.556895 | 1.841788 | -0.284893 |
| map fast-tokenizer batched | 15.040751 | 8.074308 | 6.966443 |
| map identity | 25.423246 | 10.191392 | 15.231854 |
| map identity batched | 0.925636 | 0.680424 | 0.245212 |
| map no-op batched | 0.602724 | 0.534201 | 0.068523 |
| map no-op batched numpy | 0.362824 | 0.579283 | -0.216459 |
| map no-op batched pandas | 0.463228 | 0.434364 | 0.028864 |
| map no-op batched pytorch | 0.278355 | 0.540337 | -0.261982 |
| map no-op batched tensorflow | 0.295371 | 1.386936 | -1.091565 |
