Merge branch 'master' into enh-codespell
christian-monch committed Feb 28, 2023
2 parents a109d62 + e4fe1b9 commit 7e8666f
Showing 8 changed files with 129 additions and 71 deletions.
9 changes: 5 additions & 4 deletions .appveyor.yml
@@ -190,8 +190,8 @@ test_script:
- sh: mkdir __testhome__
- cd __testhome__
# run test selection
- cmd: python -m pytest -c ../tox.ini -s -v --cov=datalad_metalad ..\%DTS%
- sh: python -m pytest -c ../tox.ini -s -v --cov=datalad_metalad ../${DTS}
- cmd: python -m pytest -c ../tox.ini -s -v --cov=datalad_metalad --pyargs ..\%DTS%
- sh: python -m pytest -c ../tox.ini -s -v --cov=datalad_metalad --pyargs ../${DTS}


after_test:
@@ -201,18 +201,19 @@ after_test:
- cd ..
- python -m coverage xml
# Import public codecov key to verify uploader signatures

- sh: pwd
- sh: ls
- sh: gpg --no-default-keyring --keyring trustedkeys.gpg --import .codecov.pgp_keys.asc
- cmd: gpg --no-default-keyring --keyring trustedkeys.gpg --import .codecov.pgp_keys.asc
# Fetch latest uploader for linux and macos
# Fetch the latest uploader for linux and macos
- sh: curl -Os https://uploader.codecov.io/latest/${CODECOV_PLATFORM}/codecov
- sh: curl -Os https://uploader.codecov.io/latest/${CODECOV_PLATFORM}/codecov.SHA256SUM
- sh: curl -Os https://uploader.codecov.io/latest/${CODECOV_PLATFORM}/codecov.SHA256SUM.sig
- sh: gpgv codecov.SHA256SUM.sig codecov.SHA256SUM
- sh: chmod +x codecov
- sh: ./codecov -f "coverage.xml"
# Fetch latest windows codecov.exe
# Fetch the latest windows codecov.exe
- ps: Invoke-WebRequest -Uri https://uploader.codecov.io/latest/windows/codecov.exe -Outfile codecov.exe
- ps: Invoke-WebRequest -Uri https://uploader.codecov.io/latest/windows/codecov.exe.SHA256SUM -Outfile codecov.exe.SHA256SUM
- ps: Invoke-WebRequest -Uri https://uploader.codecov.io/latest/windows/codecov.exe.SHA256SUM.sig -Outfile codecov.exe.SHA256SUM.sig
41 changes: 5 additions & 36 deletions README.md
@@ -9,36 +9,8 @@ This software is a [DataLad](http://datalad.org) extension that equips DataLad
with an alternative command suite for metadata handling (extraction, aggregation,
filtering, and reporting).

Please note that the metadata storage format introduced in release 0.3.0 is incompatible
with the metadata storage format in previous versions, i.e. `0.2.x`, and in DataLad
proper. They both happily coexist on storage, but this version of metalad will not
be able to read metadata that was stored by the previous version and vice versa.
Eventually there will be an importer that will pull old-version metadata into
the new metadata storage. It is planned for release 0.3.1.

Here is an overview of the changes in 0.3.0 (the new system is quite
different from the previous release in a few ways):

1. Leaner commands with unix-style behavior, i.e. one command for one operation, and commands are chainable (use results from one command as input for another command, e.g. meta-extract|meta-add).

2. MetadataRecord modifications do not alter the state of the datalad dataset. In previous releases, changes to metadata altered the version (commit-hash) of the repository although the primary data did not change. This is not the case in the new system. The new system does, however, provide information about the primary data version, i.e. the commit-hash, from which the individual metadata elements were created.

3. The ability to support a wide range of metadata storage backends in the future. This is facilitated by the [datalad-metadata-model](https://github.com/datalad/metadata-model), which is developed alongside metalad and separates the logical metadata model used in metalad from the storage backends by abstracting the storage backend. Currently, git-repository storage is supported.

4. The ability to transport metadata independently of the data in the dataset. The new system introduces the concept of a *metadata-store* which is usually the git-repository of the datalad dataset that is described by the metadata. But this is not a mandatory configuration, metadata can be stored in almost any git-repository.

5. The ability to report a subset of metadata from a remote metadata store without downloading the complete remote metadata. In fact, only the minimally necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, for example filtering.

6. A new simplified extractor model that distinguishes between two extractor-types: dataset-level extractors and file-extractors. The former are executed with a view on a dataset, the latter are executed with specific information about a single file-path in the dataset. The previous extractors (datalad, and datalad-metalad<=0.2.1) are still supported.

7. A built-in pipeline mechanism that allows parallel execution of metadata operations like metadata extraction and metadata filtering. (Still at an early stage.)

8. A new set of commands that allow operations that map metadata to metadata. Those operations are called filtering and are implemented by MetadataFilter-classes. Filters are dynamically loaded and custom filters are supported, much like extractors. (Still at an early stage.)

9. Backward compatibility supporting an import from previous metadata storage (planned for 0.3.1).


Command(s) currently provided by this extension
#### Command(s) currently provided by this extension

- `meta-extract` -- run an extractor on a file or dataset and emit the
resulting metadata (stdout).
@@ -63,7 +35,7 @@ such as metadata-extraction and metadata-adding.Processors
are usually executed in parallel. A few pipeline definitions are provided
with the release.

Commands currently under development:
#### Commands currently under development:

- `meta-export` -- write a flat representation of metadata to a file-system. For now you
can export your metadata to a JSON-lines file named `metadata-dump.jsonl`:
@@ -77,13 +49,10 @@ Commands currently under development:
datalad meta-add -d <dataset-path> --json-lines -i metadata-dump.jsonl
```
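
  For orientation, each line in such a JSON-lines file holds one self-contained metadata record.
  A single record might look roughly like the sketch below (wrapped across lines here for
  readability); the field names follow the metadata-record layout that `meta-add` checks for,
  and all values are illustrative placeholders rather than output from a real dataset:

   ```json
   {"type": "dataset",
    "extractor_name": "metalad_core",
    "extractor_version": "1",
    "extraction_parameter": {},
    "extraction_time": 1677589200.0,
    "agent_name": "Jane Doe",
    "agent_email": "jane@example.com",
    "dataset_id": "<dataset-uuid>",
    "dataset_version": "<commit-hash>",
    "extracted_metadata": {"some": "metadata"}}
   ```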

- `meta-ingest-previous` -- ingest metadata from `metalady<=0.2.1`.


*A word of caution: documentation is still lacking and will be addressed with release 0.3.1.*
- `meta-ingest-previous` -- ingest metadata from `metalad<=0.2.1`.


Additional metadata extractor implementations
#### Additional metadata extractor implementations

- Compatible with the previous families of extractors provided by datalad
and by metalad, i.e. `metalad_core`, `metalad_annex`, `metalad_custom`, `metalad_runprov`
@@ -102,7 +71,7 @@ data in the input file



Indexers
#### Indexers

- Provides indexers for the new datalad indexer-plugin interface. These indexers
convert metadata in proprietary formats into a set of key-value pairs that can
29 changes: 16 additions & 13 deletions datalad_metalad/add.py
@@ -472,16 +472,17 @@ def add_finite_set(metadata_objects: List[JSONType],
tvl_us_cache=tvl_us_cache,
mrr_cache=mrr_cache)

error_result = check_dataset_ids(
metadata_store,
UUID(dataset_id),
add_parameter)

if error_result:
if not allow_id_mismatch:
yield error_result
continue
lgr.warning(error_result["message"])
if not un_versioned_path:
error_result = check_dataset_ids(
metadata_store,
UUID(dataset_id),
add_parameter)

if error_result:
if not allow_id_mismatch:
yield error_result
continue
lgr.warning(error_result["message"])

# If the key "path" is present in the metadata
# dictionary, we assume that the metadata-dictionary describes
@@ -673,6 +674,8 @@ def _get_top_nodes(realm: Path,
# path element in the version list (which confusingly is also called
# "path".
assert ap.dataset_path in (top_level_dataset_tree_path, None)

# We leave the creation of the respective nodes to auto_create
return get_top_nodes_and_metadata_root_record(
mapper_family=default_mapper_family,
realm=str(realm),
@@ -684,10 +687,10 @@
sub_dataset_version=None,
auto_create=True)

# This is an aggregated add. The inter-dataset path (aka. dataset-tree-path)
# This is an aggregated add. The inter-dataset path (aka: dataset-tree-path)
# must not be "", and the un-versioned path must be "".
assert ap.dataset_path != MetadataPath("")
assert ap.unversioned_path == MetadataPath("")
assert ap.dataset_path != top_level_dataset_tree_path
assert ap.unversioned_path == top_level_dataset_tree_path

# We get the dataset tree for the root version. From this we have to load
# or create a metadata root record for the sub-dataset id and sub-dataset
6 changes: 6 additions & 0 deletions datalad_metalad/dump.py
@@ -146,6 +146,9 @@ def show_dataset_metadata(mapper: str,
metadata_root_record: MetadataRootRecord
) -> Generator[dict, None, None]:

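# Nothing to report if no metadata root record exists for this dataset version.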
if metadata_root_record is None:
return

with ensure_mapped(metadata_root_record):
dataset_level_metadata = metadata_root_record.dataset_level_metadata.read_in()

@@ -194,6 +197,9 @@ def show_file_tree_metadata(mapper: str,
recursive: bool
) -> Generator[dict, None, None]:

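# Without a metadata root record there is no file-tree metadata to show.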
if metadata_root_record is None:
return

with ensure_mapped(metadata_root_record):

dataset_level_metadata = metadata_root_record.dataset_level_metadata
33 changes: 17 additions & 16 deletions datalad_metalad/tests/test_add.py
@@ -9,10 +9,8 @@
# ## ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ##
"""Test metadata adding"""
import json
import os
import tempfile
import time
from pathlib import Path
from typing import (
List,
Union,
@@ -79,13 +77,6 @@
}


additional_keys_unknown_template = {
"root_dataset_id": "<unknown>",
"root_dataset_version": "aaaaaaa0000000000000000222222222",
"dataset_path": "sub_0/sub_0.0/dataset_0.0.0"
}


def _assert_raise_mke_with_keys(exception_keys: List[str],
*args,
**kwargs):
@@ -403,7 +394,9 @@ def test_override_key_allowed(file_name=None):
def _get_top_nodes(git_repo,
dataset_id,
dataset_version,
dataset_tree_path=""):
dataset_tree_path="",
sub_dataset_id=None,
sub_dataset_version=None):

# Ensure that metadata was created
tree_version_list, uuid_set, mrr = \
@@ -414,8 +407,8 @@ def _get_top_nodes(git_repo,
primary_data_version=dataset_version,
prefix_path=MetadataPath(""),
dataset_tree_path=MetadataPath(dataset_tree_path),
sub_dataset_id=None,
sub_dataset_version=None)
sub_dataset_id=sub_dataset_id,
sub_dataset_version=sub_dataset_version)

assert_is_not_none(tree_version_list)
assert_is_not_none(uuid_set)
@@ -534,6 +527,8 @@ def test_subdataset_add_dataset_end_to_end(file_name=None):
assert_result_count(res, 0, type='file')

# Verify dataset level metadata was added
dataset_id = UUID(metadata_template["dataset_id"])
dataset_version = metadata_template["dataset_version"]
root_dataset_id = UUID(additional_keys_template["root_dataset_id"])
root_dataset_version = additional_keys_template["root_dataset_version"]
dataset_tree_path = MetadataPath(
@@ -543,7 +538,9 @@
git_repo,
root_dataset_id,
root_dataset_version,
additional_keys_template["dataset_path"])
additional_keys_template["dataset_path"],
dataset_id,
dataset_version)

_, _, dataset_tree = tree_version_list.get_dataset_tree(
root_dataset_version,
@@ -585,6 +582,8 @@ def test_subdataset_add_file_end_to_end(file_name=None):
assert_result_count(res, 0, type='dataset')

# Verify dataset level metadata was added
dataset_id = UUID(metadata_template["dataset_id"])
dataset_version = metadata_template["dataset_version"]
root_dataset_id = UUID(additional_keys_template["root_dataset_id"])
root_dataset_version = additional_keys_template["root_dataset_version"]
dataset_tree_path = MetadataPath(
@@ -594,7 +593,9 @@
git_repo,
root_dataset_id,
root_dataset_version,
additional_keys_template["dataset_path"])
additional_keys_template["dataset_path"],
dataset_id,
dataset_version)

_, _, dataset_tree = tree_version_list.get_dataset_tree(
root_dataset_version,
@@ -644,7 +645,7 @@ def test_current_dir_add_end_to_end(file_name=None):

expected = {
**metadata_template,
**additional_keys_unknown_template,
**additional_keys_template,
"type": "dataset",
"dataset_id": str(another_id),
}
@@ -703,7 +704,7 @@ def test_add_file_dump_end_to_end(file_name=None):

expected = {
**metadata_template,
**additional_keys_unknown_template,
**additional_keys_template,
"type": "file",
"path": test_path,
"dataset_id": str(another_id)
77 changes: 77 additions & 0 deletions docs/source/design/history.rst
@@ -0,0 +1,77 @@
.. _history:

******************************************************
MetaLad development history and backward compatibility
******************************************************

Functionality related to metadata has been a part of the DataLad ecosystem from the very start.
However, it underwent several evolutions, and this extension is the most recent stage of that development.
If you have been an early adopter of the metadata functionality of DataLad or MetaLad, this section provides an overview of past systems and notable changes, so that you can assess upgrades and backward compatibility with legacy metadata.

First-generation metadata
-------------------------

The first generation of metadata commands was implemented in the main ``datalad`` Python package, but barely saw the light of day.
Very early users of DataLad might have caught a glimpse of it.

In the 1st-gen metadata implementation, the metadata of a dataset had two levels.
The first level contained metadata about the actual content of a dataset (generated by DataLad or other processes); the second level contained metadata about the dataset itself (generated by DataLad).
The metadata was represented in `RDF <https://en.wikipedia.org/wiki/Resource_Description_Framework>`_.

Second-generation metadata
--------------------------

The second generation of metadata commands came to life when the main ``datalad`` package was a few years old already.
It brought the concept of dedicated *extractors*, including the legacy extractors that are supported to this day.
It also provided a range of dedicated metadata subcommands of a ``datalad metadata`` command such as ``aggregate`` and ``extract``, as well as a dedicated ``datalad search`` command.
Extracted metadata was stored in a dataset in (compressed) files using a JSON
stream format, separately for metadata describing a dataset as a whole, and
metadata describing individual files in a dataset.

The 2nd-gen metadata implementation was moved into the `datalad-deprecated <http://docs.datalad.org/projects/deprecated>`_ extension in 2022.


Third-generation metadata
-------------------------

The third generation of metadata commands was developed as the DataLad extension MetaLad.
Initially, up to version ``0.2.1``, it continued the development of the 2nd-gen metadata functionality.
Afterwards, beginning with the ``0.3.x`` series, the metadata model and command set were revised once more into the current, 3rd-gen metadata implementation.
This implementation came with an entirely new metadata model.

Gen 2 versus gen 3 metadata
---------------------------

This section is important if you have used ``datalad-metalad`` prior to the ``0.3.0`` release.

Overview of changes
^^^^^^^^^^^^^^^^^^^

The new system in ``0.3.0`` is quite different from the previous release in a few ways:

1. Leaner commands with Unix-style behavior, i.e. one command for one operation, and commands are chainable (use results from one command as input for another command, e.g. ``meta-extract | meta-add``); see the usage sketch after this list.

2. MetadataRecord modifications do not alter the state of the datalad dataset. In previous releases, changes to metadata altered the version (commit-hash) of the repository although the primary data did not change. This is not the case in the new system. The new system does, however, provide information about the primary data version, i.e. the commit-hash, from which the individual metadata elements were created.

3. The ability to support a wide range of metadata storage backends in the future. This is facilitated by the `datalad-metadata-model <https://github.com/datalad/metadata-model>`_, which is developed alongside metalad and separates the logical metadata model used in metalad from the storage backends by abstracting the storage backend. Currently, git-repository storage is supported.

4. The ability to transport metadata independently of the data in the dataset. The new system introduces the concept of a *metadata-store* which is usually the git-repository of the datalad dataset that is described by the metadata. But this is not a mandatory configuration, metadata can be stored in almost any git-repository.

5. The ability to report a subset of metadata from a remote metadata store without downloading the complete remote metadata. In fact, only the minimally necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, for example filtering.

6. A new simplified extractor model that distinguishes between two extractor-types: dataset-level extractors and file-extractors. The former are executed with a view on a dataset, the latter are executed with specific information about a single file-path in the dataset. The previous extractors (datalad, and datalad-metalad<=0.2.1) are still supported.

7. A built-in pipeline mechanism that allows parallel execution of metadata operations like metadata extraction and metadata filtering. (Still at an early stage.)

8. A new set of commands that allow operations that map metadata to metadata. Those operations are called filtering and are implemented by MetadataFilter-classes. Filters are dynamically loaded and custom filters are supported, much like extractors. (Still at an early stage.)
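
As an illustration of the chainable behavior mentioned in point 1, a minimal invocation might look roughly like the sketch below. ``metalad_core`` is one of the stock extractors shipped with MetaLad, and the trailing ``-`` tells ``meta-add`` to read metadata records from standard input; adjust the dataset path to your setup.

.. code-block:: bash

   # Extract dataset-level metadata and feed it directly into the metadata store
   datalad meta-extract -d . metalad_core | datalad meta-add -d . -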

Backward-compatibility
^^^^^^^^^^^^^^^^^^^^^^

The metadata storage formats used by certain MetaLad versions are, for the time being, mutually incompatible.

.. note:: Incompatibility of 0.3.0 and 0.2.x

Please note that the metadata storage format introduced in release ``0.3.0`` is incompatible with the metadata storage format in previous versions, i.e. ``0.2.x``, and with the format used in ``datalad-deprecated``.
Both storage formats can coexist in storage, but version ``0.3.0`` of MetaLad will not be able to read metadata that was stored by the previous version and vice versa.
Eventually there will be an importer that will pull old-version metadata into the new metadata storage.
3 changes: 2 additions & 1 deletion docs/source/design/index.rst
@@ -13,4 +13,5 @@ The chapter describes the design of particular subsystems in DataLad.
:maxdepth: 2

conduct
datatypes
datatypes
history
2 changes: 1 addition & 1 deletion setup.cfg
@@ -16,7 +16,7 @@ python_requires = >= 3.7
install_requires =
six
datalad >= 0.18
datalad-metadata-model >=0.3.6
datalad-metadata-model >=0.3.10
pytest
pyyaml
test_requires =