Feature request: dandi filenames with +acquisition (#1265)
@colleenjg - any reason not to put them in the same nwb file? ps. we are currently debating whether we create a special folder for derivatives in the dandiset, or allow for separate dandisets, one containing raw data, and another containing data after processing (cell extractions, spikes, etc.).
@satra yeah - it's really been my on-going debate about whether to have multiple versions of the files or just one. A lot of the compute I use is offline (clusters with offline compute nodes), so streaming isn't always an option. And in general, I haven't moved to fully remote compute - I still do a lot locally, as do a lot of my colleagues. My full dataset is 2.2 TB with the imaging stacks, which is just a massive amount of data to download and store. It's a very manageable 15 GB without the imaging stacks, which is why I'm keen to keep the "basic" file versions on Dandi.
in that case, let's do an experiment: would you be willing to put the raw data into a separate dandiset? then we can link the derived data to it. i can show you how at the asset level. on the dandiset level, it's going to be via resource links.
yeah! Happy to give it a try!
@CodyCBakerPhD, @bendichter - any suggestions? there doesn't seem to be a neural datatype distinction we can make. perhaps something to consider in the schema. this has come up a few times. it's similar to how cameras store raw, raw+jpeg, jpeg.
@satra there actually are neurodata type distinctions here. Raw ophys data has the neurodata type TwoPhotonSeries, while processed ophys data is stored in types like RoiResponseSeries. By the way, the same is true for extracellular electrophysiology. Raw has ElectricalSeries in acquisition, while processed results live in types like LFP and the Units table.
thanks @bendichter. @yarikoptic - seems like we should do a remapping. this was done very early in dandi's lifecycle. i'll open a separate issue for tackling this as well as techniques and modalities together.
I think that would be really great for my use case, and hopefully many others.
One thing I will mention while endeavoring upon this experiment - please be sure that there is no difference between the 'acquisition-only' files and any derivatives in terms of the metadata structure; the only difference should be the absence of the series containing the bulk data in downstream derivatives. In particular, please ensure that all of the hierarchical metadata belonging to that series (which is very lightweight) remains in any processed files; that is, imaging planes, optic channel info, devices (for ecephys: the electrode table, electrode groupings, devices). That metadata remains very informative and valuable for interpreting any derived data, even in the absence of the actual raw series.
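That invariant can be sketched as a toy check, with plain dicts standing in for NWB file contents (the helper and key names here are illustrative, not dandi or pynwb API):

```python
# toy sketch: verify a derivative file keeps the same lightweight metadata
# groups as the acquisition file, differing only by the bulk data series
REQUIRED_METADATA = ("devices", "imaging_planes", "electrode_groups")

def metadata_matches(acq_file: dict, deriv_file: dict) -> bool:
    # every lightweight metadata group must be identical (or absent in both)
    return all(acq_file.get(k) == deriv_file.get(k) for k in REQUIRED_METADATA)

acq = {"devices": ["2p-scope"], "imaging_planes": ["plane0"],
       "acquisition": ["TwoPhotonSeries"]}
deriv = {"devices": ["2p-scope"], "imaging_planes": ["plane0"],
         "processing": ["DfOverF"]}
metadata_matches(acq, deriv)  # True: only the bulk series differs
```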
@jwodder here's the conversation we were discussing. Could you point me to the place where the mapping is made between neurodata types and filenames?
@satra we also need to converge on a naming convention. I proposed "ophys_raw", "ophys_processed" and "ophys_raw_and_processed", but I'm not sure that's the best.
I don't understand the question. Can you give an example of the mapping you're referring to?
dataset contains ElectricalSeries -> suffix = "ecephys"
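In miniature, the kind of mapping being described could look like this sketch (the table entries and helper are illustrative; the real table in dandi is far more complete):

```python
# illustrative neurodata-type -> modality table, not dandi's actual one
NDTYPE_TO_MODALITY = {
    "ElectricalSeries": "ecephys",
    "TwoPhotonSeries": "ophys",
    "SpatialSeries": "behavior",
}

def suffix_for(nd_types):
    # collect the modality for every recognized neurodata type in a file,
    # then join the sorted, de-duplicated set into a filename suffix
    mods = sorted({NDTYPE_TO_MODALITY[t] for t in nd_types if t in NDTYPE_TO_MODALITY})
    return "+".join(mods)

suffix_for(["ElectricalSeries", "SpatialSeries"])  # "behavior+ecephys"
```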
@bendichter That's not defined by dandi anywhere. I think those strings may come from the pynwb metadata.
@jwodder - somewhere in …
I think this dictionary is used: dandi/metadata.py, lines 748 to 935 at cfb39be.
@satra If by "suffixes" you mean the filename substring of the form …
@bendichter That dict is only used when converting NWB metadata to Dandi Schema asset metadata. It plays no role in file organization.
thank you @jwodder. @bendichter the _populate_modalities function and its underlying call to get at the modalities from pynwb is what you are looking for at the moment.
oh, ok I get it. In …
@bendichter - you can also add more info to that dictionary you pointed to in metadata.py if that helps and use it in that function. we do want to keep that suffix space somewhat compact though, especially given nwb can store a lot of different types of info.
even better if a dictionary like that comes from pynwb itself!! :)
I think what we are looking for here is a distinction between raw and processed data. One way to do this would be to look at neurodata types. PyNWB does not make this distinction, so we would need to do it ourselves. Another way, which is probably more accurate, is to look at the location of the neurodata object within the file, i.e. whether the data is in acquisition or in a processing module.
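In miniature, the location-based rule might look like this (dict stand-ins for pynwb objects; the helper is a sketch, not the proposed implementation):

```python
# sketch: classify each modality as acquisition and/or processing based on
# where its neurodata types appear in a (dict stand-in for an) NWB file
def classify(file: dict, modality_of: dict) -> dict:
    out = {}
    for loc in ("acquisition", "processing"):
        for ndtype in file.get(loc, []):
            mod = modality_of.get(ndtype)
            if mod:
                out.setdefault(mod, set()).add(loc)
    return out

classify(
    {"acquisition": ["TwoPhotonSeries"], "processing": ["DfOverF"]},
    {"TwoPhotonSeries": "ophys", "DfOverF": "ophys"},
)  # {"ophys": {"acquisition", "processing"}}
```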
@bendichter - i think your knowledge of nwb will help in creating that function. i think i was saying that we can start with putting it in the CLI, but perhaps some of these nwb things could go into pynwb itself, and that way others could use it too.
@bendichter - are you going to send a PR for this? just checking so that @colleenjg can have a version before she needs to upload the data.
@satra yes, I'm working on it. To do this right we would need to base the name creation on the NWB file rather than the metadata extracted from the NWB file, so it's going to take some time. I can try to have this done by end of week.
@bendichter @satra End of week would be great for me - that should give me a few days to upload the large files with the raw imaging data.
How about this for a naming convention: 'ecephys(acq)'
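For comparison, both candidate spellings floated in this thread can be rendered from the same (modality, flags) inputs — a throwaway sketch, not dandi code:

```python
def name(modality: str, acq: bool, proc: bool, style: str = "parens") -> str:
    # render a modality plus acq/proc flags in either proposed spelling
    parts = [s for s, on in (("acq", acq), ("proc", proc)) if on]
    if style == "parens":
        return f"{modality}({'+'.join(parts)})"
    return "-".join([modality] + parts)

name("ecephys", True, False)           # "ecephys(acq)"
name("ecephys", True, True, "dashes")  # "ecephys-acq-proc"
```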
After thinking more about it, @satra, you were right. In @colleenjg's particular case we could have looked solely at neurodata types, but if we want to solve the problem more generally we will need to look not only at what neurodata types are present but also at where those types are in the file. Specifically, acquisition data can be found in the acquisition group, while processed data lives in processing modules. Here's the code:

```python
import importlib
import inspect

import pynwb

module_names = ("ecephys", "ophys", "icephys", "image", "retinotopy", "ogen", "behavior")
module_map = dict()
for module_name in module_names:
    module = importlib.import_module("pynwb." + module_name)
    for x in module.__dict__.values():
        if (
            inspect.isclass(x)
            and issubclass(x, pynwb.core.NWBMixin)
            and x.__module__ == "pynwb." + module_name
        ):
            module_map[x] = module_name


def modalities_string(nwbfile: pynwb.file.NWBFile):
    def get_processing_neurodata_objects(nwbfile: pynwb.file.NWBFile):
        for mod in nwbfile.processing.values():
            for data_interface in mod.data_interfaces.values():
                yield data_interface

    def check_for_module(list_of_ndobjects: list, module_name: str):
        for ndobj in list_of_ndobjects:
            # walk the MRO so that subclasses of mapped types also match
            for cls in type(ndobj).mro():
                if module_map.get(cls, None) == module_name:
                    return True
        return False

    processing_neurodata_objects = list(get_processing_neurodata_objects(nwbfile))
    module_strs = []
    for module_name in module_names:
        acq = check_for_module(nwbfile.acquisition.values(), module_name)
        proc = check_for_module(processing_neurodata_objects, module_name)
        # handle the Units table, which is an exception to the rule
        if module_name == "ecephys" and nwbfile.units is not None:
            proc = True
        if not (acq or proc):
            continue
        elif acq and not proc:
            module_strs.append(f"{module_name}(acq)")
        elif not acq and proc:
            module_strs.append(f"{module_name}(proc)")
        else:
            module_strs.append(f"{module_name}(acq+proc)")
    return "+".join(module_strs)
```

here are some tests:

```python
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.ophys import mock_TwoPhotonSeries, mock_OnePhotonSeries, mock_DfOverF
from pynwb.testing.mock.behavior import mock_SpatialSeries

# ecephys acquisition
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_ElectricalSeries())
assert modalities_string(nwbfile) == 'ecephys(acq)'

# ecephys processing
nwbfile2 = mock_NWBFile()
proc_mod = nwbfile2.create_processing_module("ecephys", "ecephys")
# proc_mod.add(mock_ElectricalSeries())
nwbfile2.add_unit(spike_times=[1.0, 2.0, 3.0])
assert modalities_string(nwbfile2) == 'ecephys(proc)'

# ecephys acquisition and processing
nwbfile3 = mock_NWBFile()
nwbfile3.add_acquisition(mock_ElectricalSeries())
proc_mod = nwbfile3.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile3) == 'ecephys(acq+proc)'

# ecephys acquisition and processing and ophys acq
nwbfile4 = mock_NWBFile()
nwbfile4.add_acquisition(mock_ElectricalSeries())
nwbfile4.add_acquisition(mock_TwoPhotonSeries())
proc_mod = nwbfile4.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile4) == 'ecephys(acq+proc)+ophys(acq)'

# ophys processing and behavior processing
nwbfile5 = mock_NWBFile()
proc_mod = nwbfile5.create_processing_module("ophys", "ophys")
proc_mod.add(mock_DfOverF())
proc_mod2 = nwbfile5.create_processing_module("behavior", "behavior")
proc_mod2.add(mock_SpatialSeries())
assert modalities_string(nwbfile5) == 'ophys(proc)+behavior(proc)'
```

@jwodder do you have ideas about what might be the best way to incorporate this into dandi organize?
@bendichter - nice! perhaps replace the parens?
also in terms of the code, you could extract those strings when extracting metadata, and forgo the mapping function. there is a specific raw nwb metadata extractor function.
Here's a version that uses dashes instead of parens:

```python
import importlib
import inspect

import pynwb

module_names = ("ecephys", "ophys", "icephys", "image", "retinotopy", "ogen", "behavior")
module_map = dict()
for module_name in module_names:
    module = importlib.import_module("pynwb." + module_name)
    for x in module.__dict__.values():
        if (
            inspect.isclass(x)
            and issubclass(x, pynwb.core.NWBMixin)
            and x.__module__ == "pynwb." + module_name
        ):
            module_map[x] = module_name


def modalities_string(nwbfile):
    def get_processing_neurodata_objects(nwbfile):
        for mod in nwbfile.processing.values():
            for data_interface in mod.data_interfaces.values():
                yield data_interface

    def check_for_module(list_of_ndobjects: list, module_name: str):
        for ndobj in list_of_ndobjects:
            # walk the MRO so that subclasses of mapped types also match
            for cls in type(ndobj).mro():
                if module_map.get(cls, None) == module_name:
                    return True
        return False

    processing_neurodata_objects = list(get_processing_neurodata_objects(nwbfile))
    module_strs = []
    for module_name in module_names:
        acq = check_for_module(nwbfile.acquisition.values(), module_name)
        proc = check_for_module(processing_neurodata_objects, module_name)
        # handle the Units table, which is an exception to the rule
        if module_name == "ecephys" and nwbfile.units is not None:
            proc = True
        if not (acq or proc):
            continue
        elif acq and not proc:
            module_strs.append(f"{module_name}-acq")
        elif not acq and proc:
            module_strs.append(f"{module_name}-proc")
        else:
            module_strs.append(f"{module_name}-acq-proc")
    return "+".join(module_strs)
```

tests:

```python
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.ophys import mock_TwoPhotonSeries, mock_OnePhotonSeries, mock_DfOverF
from pynwb.testing.mock.behavior import mock_SpatialSeries

# ecephys acquisition
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_ElectricalSeries())
assert modalities_string(nwbfile) == 'ecephys-acq'

# ecephys processing
nwbfile2 = mock_NWBFile()
proc_mod = nwbfile2.create_processing_module("ecephys", "ecephys")
# proc_mod.add(mock_ElectricalSeries())
nwbfile2.add_unit(spike_times=[1.0, 2.0, 3.0])
assert modalities_string(nwbfile2) == 'ecephys-proc'

# ecephys acquisition and processing
nwbfile3 = mock_NWBFile()
nwbfile3.add_acquisition(mock_ElectricalSeries())
proc_mod = nwbfile3.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile3) == 'ecephys-acq-proc'

# ecephys acquisition and processing and ophys acq
nwbfile4 = mock_NWBFile()
nwbfile4.add_acquisition(mock_ElectricalSeries())
nwbfile4.add_acquisition(mock_TwoPhotonSeries())
proc_mod = nwbfile4.create_processing_module("ecephys", "ecephys")
proc_mod.add(mock_ElectricalSeries())
assert modalities_string(nwbfile4) == 'ecephys-acq-proc+ophys-acq'

# ophys processing and behavior processing
nwbfile5 = mock_NWBFile()
proc_mod = nwbfile5.create_processing_module("ophys", "ophys")
proc_mod.add(mock_DfOverF())
proc_mod2 = nwbfile5.create_processing_module("behavior", "behavior")
proc_mod2.add(mock_SpatialSeries())
assert modalities_string(nwbfile5) == 'ophys-proc+behavior-proc'
```
@bendichter Is this code supposed to completely replace the current modalities calculation or just augment it somehow? Specifically, is the goal to append …
This is meant to replace a section of …
Yes, that's the goal.
@bendichter I would start updating the code as follows:

```diff
diff --git a/dandi/metadata.py b/dandi/metadata.py
index 2597a32..2b126ef 100644
--- a/dandi/metadata.py
+++ b/dandi/metadata.py
@@ -53,6 +53,7 @@ lgr = get_logger()
 def get_metadata(
     path: str | Path | Readable,
     digest: Optional[Digest] = None,
+    modalities: bool = False,
 ) -> Optional[dict]:
     """
     Get "flatdata" from a .nwb file
@@ -127,7 +128,7 @@ def get_metadata(
     tried_imports = set()
     while True:
         try:
-            meta.update(_get_pynwb_metadata(r))
+            meta.update(_get_pynwb_metadata(r, modalities=modalities))
             break
         except KeyError as exc:  # ATM there is
             lgr.debug("Failed to read %s: %s", r, exc)
diff --git a/dandi/organize.py b/dandi/organize.py
index 32dcadc..8894d14 100644
--- a/dandi/organize.py
+++ b/dandi/organize.py
@@ -108,9 +108,6 @@ def create_unique_filenames_from_metadata(
     # Additional fields
     #
-    # Add "modalities" composed from the ones we could deduce
-    _populate_modalities(metadata)
-
     # handle cases where session_id was not provided
     # In some of those we could have session_start_time, so we could produce
     # session_id based on those
@@ -401,28 +398,6 @@ def _sanitize_value(value, field):
     return value
-def _populate_modalities(metadata):
-    ndtypes_to_modalities = get_neurodata_types_to_modalities_map()
-    ndtypes_unassigned = set()
-    for r in metadata:
-        mods = set()
-        nd_types = r.get("nd_types", [])
-        if isinstance(nd_types, str):
-            nd_types = nd_types.split(",")
-        for nd_rec in nd_types:
-            # split away the count
-            ndtype = nd_rec.split()[0]
-            mod = ndtypes_to_modalities.get(ndtype, None)
-            if mod:
-                if mod not in ("base", "device", "file", "misc"):
-                    # skip some trivial/generic ones
-                    mods.add(mod)
-            else:
-                ndtypes_unassigned.add(ndtype)
-        # tuple so we could easier figure out "unique" values below
-        r["modalities"] = tuple(sorted(mods.union(set(r.get("modalities", {})))))
-
-
 def _populate_session_ids_from_time(metadata):
     ses_times = [m.get("session_start_time", None) for m in metadata]
     if not all(ses_times):
@@ -799,7 +774,7 @@ def organize(
     def _get_metadata(path):
         try:
-            meta = get_metadata(path)
+            meta = get_metadata(path, modalities=True)
         except Exception as exc:
             meta = {}
             failed.append(path)
diff --git a/dandi/pynwb_utils.py b/dandi/pynwb_utils.py
index 5e1afff..d843db5 100644
--- a/dandi/pynwb_utils.py
+++ b/dandi/pynwb_utils.py
@@ -202,7 +202,7 @@ def _scan_neurodata_types(grp: h5py.File) -> List[Tuple[Any, Any]]:
     return out
-def _get_pynwb_metadata(path: str | Path | Readable) -> dict[str, Any]:
+def _get_pynwb_metadata(path: str | Path | Readable, modalities: bool = False) -> dict[str, Any]:
     out = {}
     with open_readable(path) as fp, h5py.File(fp) as h5, NWBHDF5IO(
         file=h5, load_namespaces=True
@@ -255,6 +255,9 @@ def _get_pynwb_metadata(path: str | Path | Readable) -> dict[str, Any]:
     # get external_file data:
     out["external_file_objects"] = _get_image_series(nwb)
+    if modalities:
+        out["modalities"] = modalities_string(nwb)
+
     return out
@@ -554,3 +557,7 @@ def open_readable(r: str | Path | Readable) -> IO[bytes]:
         return r.open()
     else:
         return open(r, "rb")
+
+
+def modalities_string(nwbfile: pynwb.NWBFile) -> str:
+    ### Your code here ###
```
Sorry for the delay in joining this conversation -- was traveling etc. My thinking: …

If we agree that we should just add …
@yarikoptic Note that the code for validating that asset paths conform to …
@yarikoptic - i think the challenge here that is different from bids is that the same file may have multiple types of data, each in raw + proc form, so there is still a need for modality-specific information, which i don't think can easily be added in … what happens with datasets that only share spike times? are you going to disallow it?
@jwodder - if that's the case dandi cli has changed, as we currently should have support for …
@satra The …
why? The point IMHO is that if we really want to add per-modality entities, we need to come up with some generic approach generalizing to other entities etc, not just limit it to …
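As an aside, a generic BIDS-style "key-value" entity scheme is easy to sketch; the entity names below (sub, ses, desc) are illustrative, not settled dandi entities:

```python
# sketch: parse BIDS-style "key-value" entities out of a filename stem;
# the final underscore-separated part is treated as the modality suffix
def parse_entities(stem: str) -> dict:
    entities = {}
    for part in stem.split("_"):
        if "-" in part:
            key, _, value = part.partition("-")
            entities[key] = value
        else:
            entities["suffix"] = part  # e.g. "ecephys"
    return entities

parse_entities("sub-01_ses-02_desc-proc_ecephys")
# → {"sub": "01", "ses": "02", "desc": "proc", "suffix": "ecephys"}
```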
nope -- are those spikes for ecephys? then …
Thanks! Happy to hear that we made it so specific -- it is easier to relax than to make more restrictive.
@satra just so I understand, are you talking here about the file naming convention, or are you saying that you think the NWB file itself should not mix acquisition and processing data between modalities?

@yarikoptic, let me see if I understand some of your points:

I don't see a problem with that on my end.

I don't see a problem with that from my end either. In fact, this alone could solve @colleenjg's problem, as she would be able to add …

```python
# extract_desc.py
def extract_desc(nwbfile):
    if ...:
        return "acq"
    elif ...:
        return "proc"
    else:
        return ""
```

but I'm not sure if this would really be better than simply creating a custom script that iterates over and renames the files. Something like:

```python
import os
from glob import glob

from pynwb import NWBHDF5IO

desc_dict = dict()
for nwb_filepath in glob("data/*/**.nwb", recursive=True):
    with NWBHDF5IO(nwb_filepath, mode="r", load_namespaces=True) as io:
        nwbfile = io.read()
        desc_dict[nwb_filepath] = extract_desc(nwbfile)

for src, desc in desc_dict.items():
    os.rename(src, src[:-4] + desc + ".nwb")
```

Is this along the lines of what you were thinking?
@satra and @yarikoptic Given the time crunch I'm under, I think I'll go ahead and tag the files with the raw ophys data with something like … Do you have a preference for what string I might use? I aim to do this tomorrow (Tuesday) at the latest.
quick one for @colleenjg: if you just use …
I am adding a new version for each session in my Dandiset. This version has the full optical physiology video added as an acquisition object. The resulting files are about 40 GB bigger than the original versions, which is why I'd like them both to be available on the archive.

Running dandi organize, however, results in this message: "1 out of 2 paths are not unique. We will try adding _obj- based on crc32 of object_id".

Given that the acquisition module is typically a very major component of a file, is there any way to have the file names reflect whether it is present, e.g. +acquisition (just like there is currently +behavior, +ophys and +image)? In my case, this would make interpreting the file versions in my dandiset much more straightforward than the more arbitrary object_id-related identifier.