Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datasets doesn't support # in data paths #5099

Closed
loubnabnl opened this issue Oct 11, 2022 · 9 comments
Closed

datasets doesn't support # in data paths #5099

loubnabnl opened this issue Oct 11, 2022 · 9 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers hacktoberfest

Comments

@loubnabnl
Copy link

loubnabnl commented Oct 11, 2022

Describe the bug

dataset files with # symbol their paths aren't read correctly.

Steps to reproduce the bug

The data in folder c#of this dataset can't be loaded. While the folder c_sharp with the same data is loaded properly

ds = load_dataset('loubnabnl/bigcode_csharp', split="train", data_files=["data/c#/*"])
FileNotFoundError: Couldn't find file at https://huggingface.co/datasets/loubnabnl/bigcode_csharp/resolve/27a3166cff4bb18e11919cafa6f169c0f57483de/data/c#/data_0003.jsonl

Environment info

  • datasets version: 2.5.2
  • Platform: macOS-12.2.1-arm64-arm-64bit
  • Python version: 3.9.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.4.3

cc @lhoestq

@loubnabnl loubnabnl added the bug Something isn't working label Oct 11, 2022
@lhoestq
Copy link
Member

lhoestq commented Oct 11, 2022

datasets doesn't seem to urlencode the directory names here

def hf_hub_url(repo_id: str, path: str, revision: Optional[str] = None) -> str:
revision = revision or config.HUB_DEFAULT_VERSION
return config.HUB_DATASETS_URL.format(repo_id=repo_id, path=path, revision=revision)

for example we should have

from datasets.utils.file_utils import hf_hub_url

url = hf_hub_url("loubnabnl/bigcode_csharp", "data/c#/data_0003.jsonl")
print(url)
# Currently returns
# https://huggingface.co/datasets/loubnabnl/bigcode_csharp/resolve/main/data/c#/data_0003.jsonl
# while it should be 
# https://huggingface.co/datasets/loubnabnl/bigcode_csharp/resolve/main/data/c%23/data_0003.jsonl

@riccardobucco
Copy link
Contributor

riccardobucco commented Oct 11, 2022

I'll work on this :)

@riccardobucco
Copy link
Contributor

@loubnabnl The dataset you linked in the description of the bug does not work and returns a 404. Where can I find the dataset to reproduce the bug?

@lhoestq
Copy link
Member

lhoestq commented Oct 11, 2022

I think you can create a dataset repository on the Hub with a dummy file containing a #

@loubnabnl
Copy link
Author

Ah sorry it was private I just made it public, I can also help with this if needed

@riccardobucco
Copy link
Contributor

@lhoestq Should I url encode also repo_id and revision parameters? I'm not sure what are the valid characters there.

Personally, I would be cautious and only url encode the path parameter.

@riccardobucco
Copy link
Contributor

riccardobucco commented Oct 11, 2022

These are possible solutions (assuming from urllib.parse import quote):

  1. url encode only the path parameter:
# src/datasets/utils/file_utils.py
def hf_hub_url(repo_id: str, path: str, revision: Optional[str] = None) -> str:
    revision = revision or config.HUB_DEFAULT_VERSION
    return config.HUB_DATASETS_URL.format(repo_id=repo_id, path=quote(path), revision=revision)
  1. url encode all parameters:
# src/datasets/utils/file_utils.py
def hf_hub_url(repo_id: str, path: str, revision: Optional[str] = None) -> str:
    revision = revision or config.HUB_DEFAULT_VERSION
    return config.HUB_DATASETS_URL.format(repo_id=quote(repo_id), path=quote(path), revision=quote(revision))
  1. url encode the whole url:
# src/datasets/config.py
HUB_DATASETS_PATH = "/datasets/{repo_id}/resolve/{revision}/{path}"
HUB_DATASETS_URL = HF_ENDPOINT + HUB_DATASETS_PATH
# src/datasets/utils/file_utils.py
def hf_hub_url(repo_id: str, path: str, revision: Optional[str] = None) -> str:
    revision = revision or config.HUB_DEFAULT_VERSION
    return config.HF_ENDPOINT + quote(config.HUB_DATASETS_PATH.format(repo_id=repo_id, path=path, revision=revision))

@lhoestq
Copy link
Member

lhoestq commented Oct 12, 2022

repo_id can only contain alphanumeric characters and _- so it doesn't need to be encoded.

However I agree it's a good idea to also apply quote to the revision as well as in 2. !

riccardobucco added a commit to riccardobucco/datasets that referenced this issue Oct 12, 2022
lhoestq pushed a commit that referenced this issue Oct 12, 2022
@lhoestq
Copy link
Member

lhoestq commented Oct 12, 2022

Should be fixed by #5099 - we'll do a release later today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers hacktoberfest
Projects
None yet
Development

No branches or pull requests

3 participants