Roadmap

Albert Villanova del Moral edited this page Apr 22, 2021 · 1 revision

April 2021: short/mid term roadmap for Datasets

Topics

  • Datasets Hub
  • Datasets Viewer
  • AutoNLP
  • External integrations
  • Tasks + Evaluations
  • Datasets Streaming
  • Image/Audio support
  • Researchers usage
  • GitHub repository
  • Community/Contributors

Datasets Hub

  • Make the dataset script optional
  • Load processed datasets
  • Use cold storage (parquet)
  • More documentation + concrete tutorials
  • Integrate a validation tool in the CI for the YAML tags and the dataset card

Datasets Viewer

  • Fix the viewer running out of disk space
  • Update the dependencies

AutoNLP

  • Fix methods that have memory issues: cast (WIP), filter, concatenate_datasets
  • Add audio type
  • How to download a processed dataset from the Hub
  • How to implement a universal dataset loader

External integrations

  • Improve error messages per file
  • Test using big JSON files
  • Allow getting dataset metadata without loading the datasets
  • Allow using the dataset builders as iterators

Tasks + Evaluations

  • Add task-specific preparation
  • Define task-specific feature templates
  • Add task argument in load_dataset
  • Automatic post-processing based on the supervised_keys passed in the info and the requested task
  • User-defined post-processing to cover cases that automatic post-processing can't handle (maybe using the post_process method of the builder)
  • Sync with AutoNLP
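
The task items above can be illustrated with a minimal sketch, assuming a template simply maps the columns named by supervised_keys onto canonical column names for the requested task. TaskTemplate, TEMPLATES, prepare_for_task, the task name and the canonical column names are all hypothetical and are not part of the library's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TaskTemplate:
    """Hypothetical template: the canonical (input, target) columns of a task."""
    task: str
    canonical_columns: Tuple[str, str]


# Hypothetical registry; the task name and column names are assumptions.
TEMPLATES: Dict[str, TaskTemplate] = {
    "text-classification": TaskTemplate("text-classification", ("text", "labels")),
}


def prepare_for_task(
    rows: List[dict], supervised_keys: Tuple[str, str], task: str
) -> List[dict]:
    """Rename the supervised (input, target) columns to the task's canonical names.

    Columns that the task does not need are dropped.
    """
    template = TEMPLATES[task]
    input_column, target_column = supervised_keys
    canonical_input, canonical_target = template.canonical_columns
    return [
        {canonical_input: row[input_column], canonical_target: row[target_column]}
        for row in rows
    ]


rows = [{"sentence": "great movie", "sentiment": 1, "id": 7}]
prepared = prepare_for_task(rows, ("sentence", "sentiment"), "text-classification")
print(prepared)  # [{'text': 'great movie', 'labels': 1}]
```

Datasets whose columns cannot be mapped this way are the cases that the user-defined post-processing item is meant to cover.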

Datasets Streaming

  • Use fsspec
  • Create a new class StreamingDataset
  • Enable the streaming of csv/text/json data
  • Set the format of a streaming dataset
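
As a rough illustration of the streaming items above, here is a minimal sketch that uses only the standard library. StreamingDatasetSketch and its with_format method are hypothetical names, not the planned StreamingDataset class. The open_fn callable stands in for whatever opens the remote file, for example an fsspec-backed opener.

```python
import csv
import io
from typing import Callable, Dict, Iterator, Optional, TextIO

Row = Dict[str, str]


class StreamingDatasetSketch:
    """Hypothetical illustration: yields CSV rows lazily, one at a time."""

    def __init__(
        self,
        open_fn: Callable[[], TextIO],
        formatter: Optional[Callable[[Row], object]] = None,
    ):
        # open_fn returns a fresh text stream on every call, so the dataset
        # can be iterated several times without keeping rows in memory.
        self._open_fn = open_fn
        self._formatter = formatter

    def with_format(self, formatter: Callable[[Row], object]) -> "StreamingDatasetSketch":
        # Return a new view that applies `formatter` to each row while iterating.
        return StreamingDatasetSketch(self._open_fn, formatter)

    def __iter__(self) -> Iterator[object]:
        with self._open_fn() as stream:
            for row in csv.DictReader(stream):
                yield self._formatter(row) if self._formatter else row


# An in-memory stream stands in for a remote file in this example.
CSV_TEXT = "text,label\nhello,0\nworld,1\n"
ds = StreamingDatasetSketch(lambda: io.StringIO(CSV_TEXT))

rows = list(ds)
as_tuples = list(ds.with_format(lambda r: (r["text"], int(r["label"]))))
print(as_tuples)  # [('hello', 0), ('world', 1)]
```

Because rows are produced only when the iterator is consumed, memory use is independent of the file size, which is the property the streaming items aim for.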

Image/Audio support

  • Implement new feature types Image and Audio
    • Implement a decoding step
    • Either keep storing the path in the arrow data, or write the encoded bytes in the arrow data
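
The two storage options above can share a single decoding step. The sketch below assumes each example is a dict with a "path" key and an optional "bytes" key, and uses WAV audio through the standard-library wave module. The dict layout and the function names are hypothetical and are not the design of the planned Audio feature.

```python
import io
import wave
from typing import Dict, Optional


def encode_wav(samples: bytes, sample_rate: int = 8000) -> bytes:
    """Pack raw 16-bit mono PCM samples into WAV-encoded bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(samples)
    return buffer.getvalue()


def decode_audio(example: Dict[str, Optional[object]]) -> Dict[str, object]:
    """Decoding step: prefer the embedded bytes, else fall back to the stored path."""
    if example.get("bytes") is not None:
        source = io.BytesIO(example["bytes"])
    else:
        source = open(example["path"], "rb")
    with source, wave.open(source, "rb") as wav:
        return {
            "sampling_rate": wav.getframerate(),
            "num_frames": wav.getnframes(),
            "pcm": wav.readframes(wav.getnframes()),
        }


# Option 2 from the list above: the encoded bytes are stored with the example.
pcm = b"\x00\x00\x01\x00" * 4  # 8 frames of 16-bit mono audio
example = {"path": None, "bytes": encode_wav(pcm)}
decoded = decode_audio(example)
print(decoded["sampling_rate"], decoded["num_frames"])  # 8000 8
```

Storing only the path keeps the arrow data small but requires the file to remain available, while storing the encoded bytes makes the dataset self-contained at the cost of size.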

Researchers usage

  • Keep small datasets in memory and without caching
  • Load one split without downloading and processing the others
  • Update Wikipedia
    • Complete the dataset card with usage examples that show how to use a dump from a specific date
    • Preprocess recent Wikipedia dumps (en, fr, es, de...)
    • Optimize Beam pipelines
    • Process Wikipedia systematically
  • Add FAQs in the documentation or as a markdown file in the repo

GitHub repository

  • Try git lfs for dummy data
  • Fix conda build

Community/Contributors

  • Share Roadmap
  • Add all the tasks on the Roadmap as GitHub Issues
  • Create GitHub Projects:
    • Core library
    • Addition of new datasets
  • Improve the docs on how to contribute to the core library
  • Refactor the code to make it simpler