Add support for CSV metadata files to ImageFolder #4837

Merged
merged 9 commits on Aug 31, 2022
21 changes: 12 additions & 9 deletions docs/source/audio_load.mdx
@@ -61,23 +61,26 @@ You can also load a dataset with an `AudioFolder` dataset builder. It does not r

## AudioFolder with metadata

To link your audio files with metadata information, make sure your dataset has a `metadata.jsonl` file. Your dataset structure might look like:
To link your audio files with metadata information, make sure your dataset has a `metadata.csv` file. Your dataset structure might look like:

```
folder/train/metadata.jsonl
folder/train/metadata.csv
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3
```

Your `metadata.jsonl` file must have a `file_name` column which links audio files with their metadata. An example `metadata.jsonl` file might look like:
Your `metadata.csv` file must have a `file_name` column which links audio files with their metadata. An example `metadata.csv` file might look like:

```python
{"file_name": "first_audio_file.mp3", "transcription": "znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi"}
{"file_name": "second_audio_file.mp3", "transcription": "już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko"}
{"file_name": "third_audio_file.mp3", "transcription": "pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali"}
```text
file_name,transcription
first_audio_file.mp3,znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi
second_audio_file.mp3,już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko
third_audio_file.mp3,pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali
```

Metadata can also be specified as JSON Lines, in which case use `metadata.jsonl` as the name of the metadata file. This format is helpful when one of the columns is complex, e.g. a list of floats, as it avoids parsing errors and keeps such values from being read as strings.
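For example, a `metadata.jsonl` carrying a list-of-floats column (the `timestamps` field here is invented purely for illustration) might look like:

```
{"file_name": "first_audio_file.mp3", "timestamps": [0.0, 1.25, 2.5]}
{"file_name": "second_audio_file.mp3", "timestamps": [0.0, 0.75]}
```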

Load your audio dataset by specifying `audiofolder` and the directory containing your data in `data_dir`:

```py
@@ -86,7 +89,7 @@ Load your audio dataset by specifying `audiofolder` and the directory containing
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```

`AudioFolder` will load audio data and create a `transcription` column containing texts from `metadata.jsonl`:
`AudioFolder` will load audio data and create a `transcription` column containing texts from `metadata.csv`:

```py
>>> dataset["train"][0]
@@ -146,7 +149,7 @@ If you have metadata files inside your data directory, but you still want to inf

<Tip>

Alternatively, you can add `label` column to your `metadata.jsonl` file.
Alternatively, you can add `label` column to your `metadata.csv` file.

</Tip>

28 changes: 16 additions & 12 deletions docs/source/image_load.mdx
@@ -82,22 +82,25 @@ Load remote datasets from their URLs with the `data_files` parameter:

## ImageFolder with metadata

Metadata associated with your dataset can also be loaded, extending the utility of `ImageFolder` to additional vision tasks like image captioning and object detection. Make sure your dataset has a `metadata.jsonl` file:
Metadata associated with your dataset can also be loaded, extending the utility of `ImageFolder` to additional vision tasks like image captioning and object detection. Make sure your dataset has a `metadata.csv` file:

```
folder/train/metadata.jsonl
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
```
Your `metadata.jsonl` file must have a `file_name` column which links image files with their metadata:
Your `metadata.csv` file must have a `file_name` column which links image files with their metadata:

```jsonl
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
```text
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
```

For complex value types, e.g. a list of floats, it may be more convenient to specify metadata as JSON Lines to avoid parsing errors or reading them as strings. In that case, use `metadata.jsonl` as the name of the metadata file.
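For example, a hypothetical object detection `metadata.jsonl` (the `objects` structure below is illustrative, not a required schema) could store per-image bounding boxes as nested lists:

```
{"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
```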

<Tip>

If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set `drop_labels=False` in `load_dataset`.
@@ -106,12 +109,13 @@ If metadata files are present, the inferred labels based on the directory name a

### Image captioning

Image captioning datasets have text describing an image. An example `metadata.jsonl` may look like:
Image captioning datasets have text describing an image. An example `metadata.csv` may look like:

```jsonl
{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}
```text
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
```

Load the dataset with `ImageFolder`, and it will create a `text` column for the image captions:
7 changes: 6 additions & 1 deletion src/datasets/data_files.py
@@ -79,7 +79,12 @@ class Url(str):
DEFAULT_PATTERNS_SPLIT_IN_DIR_NAME,
DEFAULT_PATTERNS_ALL,
]
METADATA_PATTERNS = ["metadata.jsonl", "**/metadata.jsonl"] # metadata file for ImageFolder and AudioFolder
METADATA_PATTERNS = [
"metadata.csv",
"**/metadata.csv",
"metadata.jsonl",
"**/metadata.jsonl",
] # metadata file for ImageFolder and AudioFolder
WILDCARD_CHARACTERS = "*[]"
FILES_TO_IGNORE = ["README.md", "config.json", "dataset_infos.json", "dummy_data.zip", "dataset_dict.json"]

@@ -4,6 +4,8 @@
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.json as paj

@@ -68,7 +70,7 @@ class FolderBasedBuilder(datasets.GeneratorBasedBuilder):
EXTENSIONS: List[str]

SKIP_CHECKSUM_COMPUTATION_BY_DEFAULT: bool = True
METADATA_FILENAME: str = "metadata.jsonl"
METADATA_FILENAMES: List[str] = ["metadata.csv", "metadata.jsonl"]

def _info(self):
return datasets.DatasetInfo(features=self.config.features)
@@ -97,12 +99,12 @@ def analyze(files_or_archives, downloaded_files_or_dirs, split):
if original_file_ext.lower() in self.EXTENSIONS:
if not self.config.drop_labels:
labels.add(os.path.basename(os.path.dirname(original_file)))
elif os.path.basename(original_file) == self.METADATA_FILENAME:
elif os.path.basename(original_file) in self.METADATA_FILENAMES:
metadata_files[split].add((original_file, downloaded_file))
else:
original_file_name = os.path.basename(original_file)
logger.debug(
f"The file '{original_file_name}' was ignored: it is not an {self.BASE_COLUMN_NAME}, and is not {self.METADATA_FILENAME} either."
f"The file '{original_file_name}' was ignored: it is not an {self.BASE_COLUMN_NAME}, and is not {self.METADATA_FILENAMES} either."
)
else:
archives, downloaded_dirs = files_or_archives, downloaded_files_or_dirs
@@ -113,13 +115,13 @@ def analyze(files_or_archives, downloaded_files_or_dirs, split):
if downloaded_dir_file_ext in self.EXTENSIONS:
if not self.config.drop_labels:
labels.add(os.path.basename(os.path.dirname(downloaded_dir_file)))
elif os.path.basename(downloaded_dir_file) == self.METADATA_FILENAME:
elif os.path.basename(downloaded_dir_file) in self.METADATA_FILENAMES:
metadata_files[split].add((None, downloaded_dir_file))
else:
archive_file_name = os.path.basename(archive)
original_file_name = os.path.basename(downloaded_dir_file)
logger.debug(
f"The file '{original_file_name}' from the archive '{archive_file_name}' was ignored: it is not an {self.BASE_COLUMN_NAME}, and is not {self.METADATA_FILENAME} either."
f"The file '{original_file_name}' from the archive '{archive_file_name}' was ignored: it is not an {self.BASE_COLUMN_NAME}, and is not {self.METADATA_FILENAMES} either."
)

data_files = self.config.data_files
@@ -173,9 +175,18 @@ def analyze(files_or_archives, downloaded_files_or_dirs, split):
# * all metadata files have the same set of features
# * the `file_name` key is one of the metadata keys and is of type string
features_per_metadata_file: List[Tuple[str, datasets.Features]] = []

# Check that all metadata files share the same format
metadata_ext = set(
os.path.splitext(downloaded_metadata_file)[1][1:]
for _, downloaded_metadata_file in itertools.chain.from_iterable(metadata_files.values())
)
if len(metadata_ext) > 1:
raise ValueError(f"Found metadata files with different extensions: {list(metadata_ext)}")
metadata_ext = metadata_ext.pop()
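The extension-uniqueness check can be sketched in isolation with the standard library (the `metadata_files` mapping below is an invented stand-in for the builder's split-to-files dict):

```python
import itertools
import os

# Hypothetical mapping of split -> [(original_file, downloaded_file), ...]
metadata_files = {
    "train": [("train/metadata.csv", "/cache/train/metadata.csv")],
    "test": [("test/metadata.csv", "/cache/test/metadata.csv")],
}

# Collect the extension (without the leading dot) of every metadata file
metadata_ext = {
    os.path.splitext(downloaded)[1][1:]
    for _, downloaded in itertools.chain.from_iterable(metadata_files.values())
}
if len(metadata_ext) > 1:
    raise ValueError(f"Found metadata files with different extensions: {list(metadata_ext)}")
metadata_ext = metadata_ext.pop()
```

Mixing a `metadata.csv` in one split with a `metadata.jsonl` in another would make the set contain two extensions and trigger the `ValueError`.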

for _, downloaded_metadata_file in itertools.chain.from_iterable(metadata_files.values()):
with open(downloaded_metadata_file, "rb") as f:
pa_metadata_table = paj.read_json(f)
pa_metadata_table = self._read_metadata(downloaded_metadata_file)
features_per_metadata_file.append(
(downloaded_metadata_file, datasets.Features.from_arrow_schema(pa_metadata_table.schema))
)
@@ -232,12 +243,21 @@ def _split_files_and_archives(self, data_files):
_, data_file_ext = os.path.splitext(data_file)
if data_file_ext.lower() in self.EXTENSIONS:
files.append(data_file)
elif os.path.basename(data_file) == self.METADATA_FILENAME:
elif os.path.basename(data_file) in self.METADATA_FILENAMES:
files.append(data_file)
else:
archives.append(data_file)
return files, archives

def _read_metadata(self, metadata_file):
metadata_file_ext = os.path.splitext(metadata_file)[1][1:]
if metadata_file_ext == "csv":
# Use `pd.read_csv` (although slower) instead of `pyarrow.csv.read_csv` for reading CSV files for consistency with the CSV packaged module
return pa.Table.from_pandas(pd.read_csv(metadata_file))
else:
with open(metadata_file, "rb") as f:
return paj.read_json(f)
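The same extension-based dispatch can be sketched with the standard library alone (a simplified stand-in returning plain dicts, not the builder's actual pandas/pyarrow path; `read_metadata_rows` is an invented helper):

```python
import csv
import io
import json

def read_metadata_rows(text: str, ext: str) -> list:
    """Parse raw metadata text, dispatching on the file extension."""
    if ext == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    # Otherwise treat it as JSON Lines: one JSON object per non-empty line
    return [json.loads(line) for line in text.splitlines() if line.strip()]

csv_rows = read_metadata_rows("file_name,text\n0001.png,A german shepherd\n", "csv")
jsonl_rows = read_metadata_rows('{"file_name": "0001.png", "text": "A german shepherd"}\n', "jsonl")
# Both formats yield the same logical records
```

This mirrors the design choice above: both formats end up as the same tabular structure, so the rest of the builder never needs to know which one was on disk.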

def _generate_examples(self, files, metadata_files, split_name, add_metadata, add_labels):
split_metadata_files = metadata_files.get(split_name, [])
sample_empty_metadata = (
@@ -248,6 +268,13 @@ def _generate_examples(self, files, metadata_files, split_name, add_metadata, ad
metadata_dict = None
downloaded_metadata_file = None

if split_metadata_files:
metadata_ext = set(
os.path.splitext(downloaded_metadata_file)[1][1:]
for _, downloaded_metadata_file in split_metadata_files
)
metadata_ext = metadata_ext.pop()

file_idx = 0
for original_file, downloaded_file_or_dir in files:
if original_file is not None:
@@ -276,8 +303,7 @@ def _generate_examples(self, files, metadata_files, split_name, add_metadata, ad
_, metadata_file, downloaded_metadata_file = min(
metadata_file_candidates, key=lambda x: count_path_segments(x[0])
)
with open(downloaded_metadata_file, "rb") as f:
pa_metadata_table = paj.read_json(f)
pa_metadata_table = self._read_metadata(downloaded_metadata_file)
pa_file_name_array = pa_metadata_table["file_name"]
pa_file_name_array = pc.replace_substring(
pa_file_name_array, pattern="\\", replacement="/"
@@ -292,7 +318,7 @@ def _generate_examples(self, files, metadata_files, split_name, add_metadata, ad
}
else:
raise ValueError(
f"One or several metadata.jsonl were found, but not in the same directory or in a parent directory of {downloaded_file_or_dir}."
f"One or several metadata.{metadata_ext} were found, but not in the same directory or in a parent directory of {downloaded_file_or_dir}."
)
if metadata_dir is not None and downloaded_metadata_file is not None:
file_relpath = os.path.relpath(original_file, metadata_dir)
@@ -304,7 +330,7 @@ def _generate_examples(self, files, metadata_files, split_name, add_metadata, ad
sample_metadata = metadata_dict[file_relpath]
else:
raise ValueError(
f"One or several metadata.jsonl were found, but not in the same directory or in a parent directory of {downloaded_file_or_dir}."
f"One or several metadata.{metadata_ext} were found, but not in the same directory or in a parent directory of {downloaded_file_or_dir}."
)
else:
sample_metadata = {}
@@ -346,8 +372,7 @@ def _generate_examples(self, files, metadata_files, split_name, add_metadata, ad
_, metadata_file, downloaded_metadata_file = min(
metadata_file_candidates, key=lambda x: count_path_segments(x[0])
)
with open(downloaded_metadata_file, "rb") as f:
pa_metadata_table = paj.read_json(f)
pa_metadata_table = self._read_metadata(downloaded_metadata_file)
pa_file_name_array = pa_metadata_table["file_name"]
pa_file_name_array = pc.replace_substring(
pa_file_name_array, pattern="\\", replacement="/"
@@ -362,7 +387,7 @@ def _generate_examples(self, files, metadata_files, split_name, add_metadata, ad
}
else:
raise ValueError(
f"One or several metadata.jsonl were found, but not in the same directory or in a parent directory of {downloaded_dir_file}."
f"One or several metadata.{metadata_ext} were found, but not in the same directory or in a parent directory of {downloaded_dir_file}."
)
if metadata_dir is not None and downloaded_metadata_file is not None:
downloaded_dir_file_relpath = os.path.relpath(downloaded_dir_file, metadata_dir)
@@ -374,7 +399,7 @@ def _generate_examples(self, files, metadata_files, split_name, add_metadata, ad
sample_metadata = metadata_dict[downloaded_dir_file_relpath]
else:
raise ValueError(
f"One or several metadata.jsonl were found, but not in the same directory or in a parent directory of {downloaded_dir_file}."
f"One or several metadata.{metadata_ext} were found, but not in the same directory or in a parent directory of {downloaded_dir_file}."
)
else:
sample_metadata = {}