Fix CONTRIBUTING once dataset scripts transferred to Hub (#5067)
* Update CONTRIBUTING guide

* Fix line separator
albertvillanova committed Oct 6, 2022
1 parent f8eacbd commit 3ee5aed
Showing 1 changed file: CONTRIBUTING.md (117 additions, 117 deletions)
# How to contribute to Datasets?
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](CODE_OF_CONDUCT.md)

Datasets is an open source project, so all contributions and suggestions are welcome.

You can contribute in many different ways: giving ideas, answering questions, reporting bugs, proposing enhancements,
improving the documentation, fixing bugs,...

Many thanks in advance to every contributor.

In order to facilitate healthy, constructive behavior in an open and inclusive community, we all respect and abide by
our [code of conduct](CODE_OF_CONDUCT.md).

## How to work on an open Issue?
The list of open Issues is available at: https://github.com/huggingface/datasets/issues

Some of them may have the label `help wanted`: that means any contributor is welcome to work on it!

If you would like to work on any of the open Issues:

1. Make sure it is not already assigned to someone else. The assignee (if any) is shown at the top of the right column of the Issue page.

2. You can self-assign it by commenting on the Issue page with the keyword: `#self-assign`.

3. Work on your self-assigned issue and eventually create a Pull Request.

## How to create a Pull Request?
If you want to add a dataset see specific instructions in the section [*How to add a dataset*](#how-to-add-a-dataset).

1. Fork the [repository](https://github.com/huggingface/datasets) by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

2. Clone your fork to your local disk, and add the base repository as a remote:

```bash
git clone git@github.com:<your Github handle>/datasets.git
cd datasets
git remote add upstream https://github.com/huggingface/datasets.git
```

3. Create a new branch to hold your development changes:

```bash
git checkout -b a-descriptive-name-for-my-changes
```

**Do not** work on the `main` branch.

4. Set up a development environment by running the following command in a virtual environment:

```bash
pip install -e ".[dev]"
```

(If `datasets` was already installed in the virtual environment, remove
it with `pip uninstall datasets` before reinstalling it in editable
mode with the `-e` flag.)
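The environment setup above can be sketched end to end. This assumes Python 3 on a Unix-like shell and the arbitrary environment name `.venv`; the guard around the editable install only exists so the sketch does not fail when run outside a checkout of the repository:

```shell
# Create and activate a fresh virtual environment (the name ".venv" is arbitrary)
python3 -m venv .venv
source .venv/bin/activate

# Remove any previously installed release before the editable install
pip uninstall -y datasets

# Install the repository in editable mode with the dev extras
# (guarded so this sketch only runs inside a project checkout)
if [ -f setup.py ] || [ -f pyproject.toml ]; then
  pip install -e ".[dev]"
fi

# Confirm the shell is now using the virtual environment's interpreter
python -c "import sys; print(sys.prefix)"
```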

5. Develop the features on your branch.

6. Format your code. Run `black` and `isort` with the following command so that your newly added files are formatted consistently:

```bash
make style
```

7. Once you're happy with your contribution, add your changed files and make a commit to record your changes locally:

```bash
git add -u
git commit
```

It is a good idea to sync your copy of the code with the original
repository regularly. This way you can quickly account for changes:

```bash
git fetch upstream
git rebase upstream/main
```

8. Once you are satisfied, push the changes to your fork repo using:

```bash
git push -u origin a-descriptive-name-for-my-changes
```

Go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.

## How to add a dataset

You can share your dataset on https://huggingface.co/datasets directly using your account; see the documentation:

* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset)
* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share)
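As a minimal sketch of the file-upload route, the commands below prepare a small CSV locally; the upload calls are left commented out because they require authentication. The repo id `my-username/my-dataset` is a placeholder, and the `huggingface-cli upload` invocation assumes a recent `huggingface_hub` is installed:

```shell
# Prepare a tiny dataset file locally
cat > train.csv <<'EOF'
text,label
"a great movie",1
"a terrible plot",0
EOF

# Then upload it to a dataset repo on the Hub (requires a login token;
# the repo id below is a placeholder):
# huggingface-cli login
# huggingface-cli upload my-username/my-dataset train.csv --repo-type dataset
```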

## How to contribute to the dataset cards

Improving the documentation of datasets is an ever-increasing effort, and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset.

If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do so, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:

* a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
* a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
* and if you need inspiration, we recommend looking through a [completed example](https://github.com/huggingface/datasets/blob/main/datasets/eli5/README.md)
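For orientation, a dataset card is a `README.md` that opens with YAML metadata before the prose sections. Below is a minimal sketch with illustrative values; the template and guide linked above are authoritative for which fields to use:

```yaml
# YAML front matter at the top of the dataset card (values are illustrative)
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
pretty_name: My Example Dataset
```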

If you are a **dataset author**... you know what to do, it is your dataset after all ;)! We would especially appreciate it if you could help us fill in information about the process of creating the dataset, and take a moment to reflect on its social impact and possible limitations, if you haven't already done so in the dataset paper or in another data statement.

If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.

Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).

Thank you for your contribution!

## Code of conduct

This project adheres to the HuggingFace [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to abide by this code.

1 comment on commit 3ee5aed

@github-actions posted automated benchmark results for this commit (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating and benchmark_map_filter, comparing new vs. old timings under PyArrow==6.0.0 and PyArrow==latest).