
Releases: pytorch/data

TorchData 0.7.1

15 Nov 23:23


Current status
⚠️ As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you have suggestions or comments (please use #1196 for feedback).

This is a patch release, which is compatible with PyTorch 2.1.1. There are no new features added.

TorchData 0.7.0

12 Oct 20:24
c5f2204

Current status

⚠️ As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you have suggestions or comments (please use #1196 for feedback).

Bug Fixes

  • MPRS request/response cycle for workers (40dd648)
  • Sequential reading service checkpointing (8d452cf)
  • Cancel future object and always run callback in FullSync during shutdown (#1171)
  • DataPipe, Ensures Prefetcher shuts down properly (#1166)
  • DataPipe, Fix FullSync shutdown hanging issue while paused (#1153)
  • DataPipe, Fix a word in WebDS DataPipe (#1156)
  • DataPipe, Add handler argument to iopath DataPipes (#1154)
  • Prevent in_memory_cache from yielding from source_dp when it's fully cached (#1160)
  • Fix pin_memory to support single-element batch (#1158)
  • DataLoader2, Removing delegation for 'pause', 'limit', and 'resume' (#1067)
  • DataLoader2, Handle MapDataPipe by converting to IterDataPipe internally by default (#1146)

New Features

  • Implement InProcessReadingService (#1139)
  • Enable miniepoch for MultiProcessingReadingService (#1170)
  • DataPipe, Implement pause/resume for FullSync (#1130)
  • DataLoader2, Saving and restoring initial seed generator (#998)
  • Add ThreadPoolMapper (#1052)
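The idea behind a thread-pool mapper can be sketched in plain Python with concurrent.futures. This is a conceptual analogue only; the names below are illustrative and not the actual ThreadPoolMapper API:

```python
from concurrent.futures import ThreadPoolExecutor

def threadpool_map(iterable, fn, max_workers=4):
    # Apply fn to each element on a pool of threads while preserving
    # input order -- the core behavior a thread-pool mapper provides.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        yield from pool.map(fn, iterable)

print(list(threadpool_map(range(5), lambda x: x * x)))
# [0, 1, 4, 9, 16]
```

This pattern is most useful when fn is I/O-bound (e.g. fetching URLs), where threads overlap the waiting time.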

TorchData 0.6.1 Release Notes

08 May 20:15

TorchData 0.6.1 Beta Release Notes

Highlights

This minor release is aligned with PyTorch 2.0.1 and primarily fixes bugs introduced in the 0.6.0 release. We sincerely thank our users and contributors for spotting various bugs and helping us to fix them.

Bug Fixes

DataLoader2

  • Properly clean up processes and queues for MPRS and Fix pause for prefetch (#1096)
  • Fix DataLoader2 seed = 0 bug (#1098)
    • Previously, if seed = 0 was passed into DataLoader2, the seed value in DataLoader2 would not be set and the seed would go unused. This change fixes that and allows seed = 0 to be used normally.
  • Fix worker_init_fn to update DataPipe graph and move worker prefetch to the end of Worker pipeline (#1100)
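The seed = 0 bug above is an instance of a classic Python pitfall: a truthiness check that conflates 0 with "not provided". The sketch below uses hypothetical helper names to illustrate the bug class, not the actual DataLoader2 internals:

```python
import random

def resolve_seed_buggy(seed=None):
    # Illustrative of the bug class fixed in #1098: `seed if seed else ...`
    # treats seed=0 the same as "no seed given" and silently drops it.
    return seed if seed else random.randint(1, 2**63)

def resolve_seed_fixed(seed=None):
    # The fix: compare against None explicitly so seed=0 is honored.
    return seed if seed is not None else random.randint(1, 2**63)

print(resolve_seed_fixed(0))   # 0 -- seed=0 is now used as given
print(resolve_seed_buggy(0))   # a random seed; the explicit 0 was dropped
```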

DataPipe

  • Fix pin_memory_fn to support namedtuple (#1086)
  • Fix typo for portalocker at import time (#1099)

Improvements

DataPipe

  • Skip FullSync operation when world_size == 1 (#1065)

Docs

  • Add long project description to setup.py for display on PyPI (#1094)

Beta Usage Note

This library is currently in the Beta stage and does not yet have a fully stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback. As always, we welcome new contributors to our repo.

TorchData 0.6.0 Release Notes

15 Mar 19:38

TorchData 0.6.0 Beta Release Notes

Highlights

We are excited to announce the release of TorchData 0.6.0. This release is composed of about 130 commits since 0.5.0, made by 27 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.6.0 updates are primarily focused on DataLoader2. We graduate some of its APIs from the prototype stage and introduce additional features. Highlights include:

  • Graduation of MultiProcessingReadingService from prototype to beta
    • This is the default ReadingService that we expect most users to use; it closely aligns with the functionality of the old DataLoader, with improvements
    • With this graduation, we expect the APIs and behaviors to be mostly stable going forward. We will continue to add new features as they become ready.
  • Introduction of Sequential ReadingService
    • Enables the usage of multiple ReadingServices at the same time
  • Adding comprehensive tutorial of DataLoader2 and its subcomponents

Backwards Incompatible Change

DataLoader2

  • Officially graduate PrototypeMultiProcessingReadingService to MultiProcessingReadingService (#1009)
    • The APIs of MultiProcessingReadingService as well as the internal implementation have changed. Overall, this should provide a better user experience.
    • Please refer to our documentation for details.

0.5.0 → 0.6.0
It previously took the following arguments:
MultiProcessingReadingService(
    num_workers: int = 0,
    pin_memory: bool = False,
    timeout: float = 0,
    worker_init_fn: Optional[Callable[[int], None]] = None,
    multiprocessing_context=None,
    prefetch_factor: Optional[int] = None,
    persistent_workers: bool = False,
)
      
The new version takes these arguments:
MultiProcessingReadingService(
    num_workers: int = 0,
    multiprocessing_context: Optional[str] = None,
    worker_prefetch_cnt: int = 10,
    main_prefetch_cnt: int = 10,
    worker_init_fn: Optional[Callable[[DataPipe, WorkerInfo], DataPipe]] = None,
    worker_reset_fn: Optional[Callable[[DataPipe, WorkerInfo, SeedGenerator], DataPipe]] = None,
)
      

  • Deep copy ReadingService during DataLoader2 initialization (#746)
    • Within DataLoader2, a deep copy of the passed-in ReadingService object is created during initialization and will be subsequently used.
    • This prevents multiple DataLoader2s from accidentally sharing states when the same ReadingService object is passed into them.

0.5.0 → 0.6.0
Previously, a ReadingService object used in multiple DataLoader2 instances shared state among them.
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 4
      
DataLoader2 now deep copies the ReadingService object during initialization and the ReadingService state is no longer shared.
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 2
      

Deprecations

DataPipe

In PyTorch Core

  • Remove previously deprecated FileLoaderDataPipe (#89794)
  • Mark imports from torch.utils.data.datapipes.iter.grouping as deprecated (#94527)

TorchData

  • Remove certain deprecated functional APIs as previously scheduled (#890)

Releng

  • Drop support for Python 3.7 as aligned with PyTorch core library (#974)

New Features

DataLoader2

  • Add graph function to list DataPipes from DataPipe graphs (#888)
  • Add functions to set seeds to DataPipe graphs (#894)
  • Add worker_init_fn and worker_reset_fn to MultiProcessingReadingService (#907)
  • Add round robin sharding to support non-replicable DataPipe for MultiProcessing (#919)
    • Guarantee that DataPipes execute reset_iterator when all loops have received reset request in the dispatching process (#994)
  • Add initial support for randomness control within DataLoader2 (#801)
  • Add support for Sequential ReadingService (commit)
  • Enable SequentialReadingService to support MultiProcessing + Distributed (#985)
  • Add limit, pause, resume operations to halt DataPipes in DataLoader2 (#879)

DataPipe

  • Add ShardExpander IterDataPipe (#405)
  • Add RoundRobinDemux IterDataPipe (#903)
  • Implement PinMemory IterDataPipe (#1014)

Releng

  • Add conda Python 3.11 builds (#1010)
  • Enable Python 3.11 conda builds for Mac/Windows (#1026)
  • Update C++ standard to 17 (#1051)

Improvements

DataLoader2

In PyTorch Core

  • Fix apply_sharding to accept one sharding_filter per branch (#90769)

TorchData

  • Consolidate checkpoint contract with checkpoint component (#867)
  • Update load_state_dict() signature to align with TorchSnapshot (#887)
  • Apply sharding based on priority and combine DistInfo and ExtraInfo (used to store distributed metadata) (#916)
  • Prevent reset iteration message from being sent to workers twice (#917)
  • Add support to keep non-replicable DataPipe in the main process (#950)
  • Safeguard DataLoader2Iterator's __getattr__ method (#1004)
  • Forward worker exceptions and have DataLoader2 exit with them (#1003)
  • Attach traceback to Exception and test dispatching process (#1036)

DataPipe

In PyTorch Core

  • Add auto-completion to DataPipes in REPLs (e.g. Jupyter notebook) (#86960)
  • Add group support to sharding_filter (#88424)
  • Add keep_key option to Grouper (#92532)

TorchData

  • Add a masks option to filter files in S3 DataPipe (#880)
  • Make HeaderIterDataPipe with limit=None a no-op (#908)
  • Update fsspec DataPipe to be compatible with the latest version of fsspec (#957)
  • Expand the possible input options for HuggingFace DataPipe (#952)
  • Improve exception handling/skipping in online DataPipes (#968)
  • Allow the option to place key in output in MapKeyZipper (#1042)
  • Allow single key option for Slicer (#1041)

Releng

  • Add pure Python platform-agnostic wheel (#988)

Bug Fixes

DataLoader2

In PyTorch Core

  • Change serialization wrapper implementation to be an iterator (#87459)

DataPipe

In PyTorch Core

  • Fix type checking to accept both Iter and Map DataPipe (#87285)
  • Fix: Make __len__ of datapipes dynamic (#88302)
  • Properly cleanup unclosed files within generator function (#89973)
    ...

TorchData 0.5.1 Beta Release, small bug fix release

16 Dec 15:19

This is a minor release updating the PyTorch dependency from 1.13.0 to 1.13.1. Please check the release notes of the TorchData 0.5.0 major release for more details.

TorchData 0.5.0 Release Notes

27 Oct 17:08

TorchData 0.5.0 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation
  • Future Plans
  • Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.5.0 updates are focused on consolidating the DataLoader2 and ReadingService APIs and benchmarking. Highlights include:

  • Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. Detailed tutorial can be found here
  • Consolidated API for DataLoader2 and provided a few ReadingServices, with detailed documentation now available here
  • Provided more comprehensive DataPipe operations, e.g., random_split, repeat, set_length, and prefetch.
  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon
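As a rough illustration of one of the new operations, a weighted random split can be sketched in plain Python. This is a conceptual analogue only, not the actual RandomSplitter implementation:

```python
import random

def weighted_random_split(items, weights, seed=0):
    # Assign each element to exactly one named split, with probability
    # proportional to that split's weight; a fixed seed makes it deterministic.
    rng = random.Random(seed)
    names = list(weights)
    total = sum(weights.values())
    splits = {name: [] for name in names}
    for item in items:
        r = rng.random() * total
        acc = 0.0
        for name in names:
            acc += weights[name]
            if r < acc:
                splits[name].append(item)
                break
        else:
            splits[names[-1]].append(item)  # guard against float rounding
    return splits

splits = weighted_random_split(range(100), {"train": 0.8, "valid": 0.2})
# Every element lands in exactly one split; sizes are ~80/20 in expectation.
```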

Backwards Incompatible Change

DataPipe

Changed the returned value of MapDataPipe.shuffle to an IterDataPipe (pytorch/pytorch#83202)

An IterDataPipe is returned to preserve data order

MapDataPipe.shuffle
0.4.1 → 0.5.0
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
      
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
      

on_disk_cache no longer accepts generator functions for the filepath_fn argument (#810)

on_disk_cache
0.4.1 → 0.5.0
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
      
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
# AssertionError
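To migrate, return a materialized sequence instead of yielding paths. For example, the generator above can be rewritten as:

```python
def filepath_fn(url):
    # Return a list of cache paths rather than yielding them; generator
    # functions are rejected by on_disk_cache as of 0.5.0.
    return [url + f"/{i}" for i in range(3)]

print(filepath_fn("https://path/to/filename"))
# ['https://path/to/filename/0', 'https://path/to/filename/1', 'https://path/to/filename/2']
```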
      

DataLoader2

Imposed single iterator constraint on DataLoader2 (#700)

DataLoader2 with a single iterator
0.4.1 → 0.5.0
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2
      
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid
      

Deep copy DataPipe during DataLoader2 initialization or restoration (#786, #833)

Previously, if a DataPipe was passed to multiple DataLoaders, its state could be altered by any of them. In some cases, that raised an exception due to the single iterator constraint; in other cases, behavior could change due to the adapters (e.g. shuffling) of another DataLoader.

Deep copy DataPipe during DataLoader2 constructor
0.4.1 → 0.5.0
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
…     print(x, y)
# RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...
      
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
…     print(x, y)
0 0
1 1
2 2
3 3
4 4
      

Deprecations

DataLoader2

Deprecated traverse function and only_datapipe argument (pytorch/pytorch#85667)

Please use traverse_dps, whose behavior matches traverse(only_datapipe=True). (#793)

DataPipe traverse function
0.4.1 → 0.5.0
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
      
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
      

New Features

DataPipe

  • Added AIStore DataPipe (#545, #667)
  • Added support for IterDataPipe to trace DataFrames operations (pytorch/pytorch#71931)
  • Added support for DataFrameMakerIterDataPipe to accept dtype_generator to solve unserializable dtype (#537)
  • Added graph snapshotting by counting number of successful yields for IterDataPipe (pytorch/pytorch#79479, pytorch/pytorch#79657)
  • Implemented drop operation for IterDataPipe to drop column(s) (#725)
  • Implemented FullSyncIterDataPipe to synchronize distributed shards (#713)
  • Implemented slice and flatten operations for IterDataPipe (#730)
  • Implemented repeat operation for IterDataPipe (#748)
  • Added LengthSetterIterDataPipe (#747)
  • Added RandomSplitter (without buffer) (#724)
  • Added padden_tokens to max_token_bucketize to bucketize samples based on total padded token length (#789)
  • Implemented thread based PrefetcherIterDataPipe (#770, #818, #826, #842)
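The mechanism behind the last item can be sketched in plain Python: a background thread fills a bounded queue ahead of the consumer. The names and structure below are illustrative, not the real PrefetcherIterDataPipe API:

```python
import queue
import threading

def prefetch(iterable, buffer_size=10):
    # A producer thread pulls up to buffer_size items ahead of the consumer;
    # a sentinel object signals exhaustion of the source.
    q = queue.Queue(maxsize=buffer_size)
    _SENTINEL = object()

    def producer():
        for item in iterable:
            q.put(item)
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _SENTINEL:
        yield item

print(list(prefetch(range(5))))
# [0, 1, 2, 3, 4]
```

The bounded queue gives backpressure: the producer blocks once the buffer is full, so memory use stays proportional to buffer_size.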

DataLoader2

  • Added CacheTimeout Adapter to redefine cache timeout of the DataPipe graph (#571)
  • Added DistributedReadingService to support uneven data sharding (#727)
  • Added PrototypeMultiProcessingReadingService
    • Added prefetching (#826)
    • Fixed process termination (#837)
    • Enabled deterministic training in distributed/non-distributed environment (#827)
    • Handled empty queue exception properly (#785)

Releng

  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)

Improvements

DataPipe

  • Fixed error message coming from single iterator constraint (pytorch/pytorch#79547)
  • Enabled profiler record context in __next__ for IterDataPipe (pytorch/pytorch#79757)
  • Raised warning for unpicklable local functions (pytorch/pytorch#80232, #547)
  • Cleaned up opened streams on the best effort basis (#560, pytorch/pytorch#78952)
  • Used streaming reading mode for unseekable streams in TarArchiveLoader (#653)
  • Improved GDrive 'content-disposition' error message (#654)
  • Added as_tuple argument for CSVParserIterDataPipe to convert output from list to tuple (#646)
  • Raised an Error when HTTPReader gets a 404 Response (#160) (#569)
  • Added default no-op behavior for flatmap (https://...

TorchData 0.4.1 Beta Release, small bug fix release

05 Aug 20:47

TorchData 0.4.1 Release Notes

Bug fixes

Documentation

Releng

  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)
    • Python [3.8~3.10]

TorchData 0.4.0 Beta Release

28 Jun 18:30

TorchData 0.4.0 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Performance
  • Documentation
  • Future Plans
  • Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.4.0. This release is composed of about 120 commits since 0.3.0, made by 23 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.4.0 updates are focused on consolidating the DataPipe APIs and supporting more remote file systems. Highlights include:

  • DataPipe graph is now backward compatible with DataLoader regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial here.
  • AWSSDK is integrated to support listing/loading files from AWS S3.
  • Adding support to read from TFRecord and Hugging Face Hub.
  • DataLoader2 became available in prototype mode. For more details, please check our future plans.

Backwards Incompatible Change

DataPipe

Updated Multiplexer (functional API mux) to stop merging multiple DataPipes when the shortest one is exhausted (pytorch/pytorch#77145)

Please use MultiplexerLongest (functional API mux_longest) to achieve the previous functionality.

0.3.0 → 0.4.0
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 3, 13, 23, 4, 14, 24]
>>> len(output_dp)
13
      
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> len(output_dp)
9
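The new shortest-exhausted semantics can be modeled in plain Python. This is a conceptual sketch, not the actual Multiplexer implementation:

```python
def mux_shortest(*iterables):
    # Interleave the inputs round-robin and stop as soon as the shortest
    # one is exhausted, discarding the partially collected final round --
    # mirroring the 0.4.0 mux behavior shown above.
    iterators = [iter(it) for it in iterables]
    out = []
    while True:
        buffered = []
        for it in iterators:
            try:
                buffered.append(next(it))
            except StopIteration:
                return out
        out.extend(buffered)

print(mux_shortest(range(3), range(10, 15), range(20, 25)))
# [0, 10, 20, 1, 11, 21, 2, 12, 22]
```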
      

Enforcing a single valid iterator for IterDataPipes with and without multiple outputs (pytorch/pytorch#70479, pytorch/pytorch#75995)

If you need to reference the same IterDataPipe multiple times, please apply .fork() on the IterDataPipe instance.

IterDataPipe with a single output
0.3.0 → 0.4.0
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)
0
>>> next(it1)
1
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]
      
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)  # This doesn't raise any warning or error
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)  # Invalidates `it1`
0
>>> next(it1)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
      

IterDataPipe with multiple outputs
0.3.0 → 0.4.0
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)
# Basically share the same reference as `it1`
# doesn't reset because `cdp1` hasn't been read since reset
>>> next(it1)
0
>>> next(it2)
0
>>> next(it3)
1
# The next line resets all ChildDataPipe
# because `cdp2` has started reading
>>> it4 = iter(cdp2)
>>> next(it3)
0
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
      
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)  # This invalidates `it1` and `it2`
>>> next(it1)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it2)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it3)
0
# The next line should not invalidate anything, as there was no new iterator created
# for `cdp2` after `it2` was invalidated
>>> it4 = iter(cdp2)
>>> next(it3)
1
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
      

Deprecations

DataPipe

Deprecated functional APIs of open_file_by_fsspec and open_file_by_iopath for IterDataPipe (pytorch/pytorch#78970, pytorch/pytorch#79302)

Please use open_files_by_fsspec and open_files_by_iopath

0.3.0 → 0.4.0
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()  # No Warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()  # No Warning
      
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()
FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_fsspec()` instead.
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()
FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_iopath()` instead.
      

Argument drop_empty_batches of Filter (functional API filter) is deprecated and will be removed in a future release (pytorch/pytorch#76060)

0.3.0 → 0.4.0
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
      
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
      

New Features

DataPipe

  • Added utility to visualize DataPipe graphs (#330)

IterDataPipe

  • Added Bz2FileLoader with functional API of load_from_bz2 (#312)
  • Added BatchMapper (functional API: map_batches) and FlatMapper (functional API: flat_map) (#359)
  • Added support for WebDataset-style archives (#367)
  • Added MultiplexerLongest with functional API of mux_longest (#372)
  • Added ZipperLongest with functional API of zip_longest (#373)
  • Added MaxTokenBucketizer with functional API of max_token_bucketize (#283)
  • Added S3FileLister (functional API: list_files_by_s3) and `S...

TorchData 0.3.0 Beta Release

10 Mar 18:43

0.3.0 Release Notes

We are delighted to present the Beta release of TorchData. This is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. Based on community feedback, we have found that the existing DataLoader bundled too many features together and can be difficult to extend. Moreover, different use cases often have to rewrite the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called “DataPipes” that work well out of the box with PyTorch's DataLoader.

  • Highlights
    • What are DataPipes?
    • Usage Example
  • New Features
  • Documentation
  • Usage in Domain Libraries
  • Future Plans
  • Beta Usage Note

Highlights

We are releasing DataPipes - there are Iterable-style DataPipe (IterDataPipe) and Map-style DataPipe (MapDataPipe).

What are DataPipes?

Early on, we observed widespread confusion between the PyTorch DataSets which represented reusable loading tooling (e.g. TorchVision's ImageFolder), and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch DataSet for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:

import json

class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        self.source_datapipe = source_datapipe
        self.kwargs = kwargs

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data)

    def __len__(self):
        return len(self.source_datapipe) 

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
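For instance, the parser above can be exercised with in-memory streams. The IterDataPipe base class is dropped here so the sketch runs without torch installed; the logic is otherwise the same:

```python
import io
import json

class JsonParser:
    # Same logic as JsonParserIterDataPipe above, minus the IterDataPipe base
    # class, so this demo has no torch dependency.
    def __init__(self, source_datapipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            yield file_name, json.loads(stream.read())

# Any iterable of (name, stream) pairs works as a stand-in source:
source = [("a.json", io.StringIO('{"x": 1}')), ("b.json", io.StringIO("[1, 2]"))]
print(list(JsonParser(source)))
# [('a.json', {'x': 1}), ('b.json', [1, 2])]
```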

Under this naming convention, DataSet simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes.

Usage Example

In this example, we have a compressed TAR archive file stored in Google Drive and accessible via an URL. We demonstrate how you can use DataPipes to download the archive, cache the result, decompress the archive, filter for specific files, parse and return the CSV content. The full example with detailed explanation is included in the example folder.

url_dp = IterableWrapper([URL])
# cache_compressed_dp = ...  # caching step elided; see the example folder for the full code
cache_compressed_dp = GDriveReader(cache_compressed_dp)
# cache_decompressed_dp = ...  # see the example folder for the full code
# Opens and loads the content of the TAR archive file.
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
# Filters for specific files based on the file name.
cache_decompressed_dp = cache_decompressed_dp.filter(
    lambda fname_and_stream: _EXTRACTED_FILES[split] in fname_and_stream[0]
)
# Saves the decompressed file onto disk.
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
data_dp = FileOpener(cache_decompressed_dp, mode="b")
# Parses the content of the decompressed CSV file and returns the result line by line.
return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), " ".join(t[1:])))

New Features

[Beta] IterDataPipe

We have implemented over 50 Iterable-style DataPipes across 10 different categories. They cover different functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the fsspec and iopath DataPipes will allow you to do so. The documentation provides detailed explanations and usage examples of each IterDataPipe.

[Beta] MapDataPipe

Similar to IterDataPipe, we have a variety of MapDataPipes available, though a more limited number, covering different functionalities. More MapDataPipe support will come later. If the existing ones do not meet your needs, you can write a custom DataPipe.

Documentation

The documentation for TorchData is now live. It contains a tutorial that covers how to use DataPipes, use them with DataLoader, and implement custom ones.

Usage in Domain Libraries

In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the popular datasets provided by the library are implemented using DataPipes and a section of its SST-2 binary text classification tutorial demonstrates how you can use DataPipes to preprocess data for your model. There also are other prototype implementations of datasets with DataPipes in TorchVision (available in nightly releases) and in TorchRec. You can find more specific examples here.

Future Plans

There will be a new version of DataLoader in the next release. At the high level, the plan is that DataLoader V2 will only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as the shuffling and batching, will be moved out of DataLoader to DataPipe. At the same time, the current/old version of DataLoader should still be available and you can use DataPipes with that as well.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.