Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IPython formatters to provide more convinient interface for browsing HDF5 files interactively #1188

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

anthonytw
Copy link

This commit attempts to enable pretty printing of some HDF5 objects.

An example of the changes in use:

In [1]: import h5py

In [2]: f = h5py.File('output.h5', 'r')

In [3]: f
Out[3]: <HDF5 file "output.h5" (mode r)>

In [4]: h5py.enable_ipython_formatter()

In [5]: f
Out[5]:
<HDF5 file "output.h5" (mode r)>
<HDF5 group "/" (2 members)>
<KeysViewHDF5>:
  + numeric
  + waveform

In [6]: f['waveform']
Out[6]:
<HDF5 group "/waveform" (9 members)>
<KeysViewHDF5>:
  + ECG.I
  + ECG.II
  + ECG.III
  + ECG.MCL
  + ECG.V
  + ECG.aVR
  + Pleth
  + PlethT
  + Resp

#566

@codecov
Copy link

codecov bot commented Mar 20, 2019

Codecov Report

Merging #1188 into master will decrease coverage by 2.05%.
The diff coverage is 5.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1188      +/-   ##
==========================================
- Coverage   83.76%   81.71%   -2.06%     
==========================================
  Files          18       21       +3     
  Lines        2156     2215      +59     
==========================================
+ Hits         1806     1810       +4     
- Misses        350      405      +55
Impacted Files Coverage Δ
h5py/_ipython/utils.py 0% <0%> (ø)
h5py/_ipython/completer.py 0% <0%> (ø)
h5py/_ipython/formatter.py 0% <0%> (ø)
h5py/__init__.py 80.95% <100%> (+21.3%) ⬆️
h5py/_ipython/__init__.py 25% <25%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b6ec06...f38d093. Read the comment docs.

#
# Contributed by Anthony Wertz
#
# http://h5py.alfven.org
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is copy-pasted from files, but the URL doesn't actually work.

Copy link
Member

@takluyver takluyver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think doing it as optional IPython integration is preferable to doing it in the default repr. I still wonder whether it would make more sense to put this functionality in h5glance, which is all about producing nice visual displays of HDF5 objects, and keep that as a separate layer from h5py, which does the much bigger task of wrapping libhdf5.

That said, if it was going into h5glance, where the focus is display and I feel more ownership, I'd definitely get picky about what information is there and how it's formatted. For instance:

  • You list all the keys of a group, but give no indication what they point to, e.g. dataset or subgroup.
  • What happens if the group has a million keys? Should we be limiting how many are shown by default? Admittedly I don't do that in h5glance yet either, but the interfaces h5glance offers don't have the same risk of messing up your terminal buffer.
  • I don't really like that the <HDF5 ...> bit appears to be closed and then extra information appears outside it. If it uses any kind of brackets, I think they should surround the whole representation so that it's clearly grouped together. E.g. it could use the sequence support in IPython's pretty-printer to do something like this:
<HDF5 group "blah" with 10 keys: [
    'apple',
    'banana',
    ...
]>

@@ -44,20 +44,11 @@
import posixpath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load_ipython_extension() function in this module means that people could do %load_ext h5py.ipy_completer, or something similar in IPython config. I don't know how many people do, but renaming the module, as you have done here, would break that.

@aragilar
Copy link
Member

aragilar commented Aug 9, 2020

Is this something we're planing on having, or can we put this (or equivalent functionality, if it doesn't already exist) in h5glance? I'm not going to look at rebasing this as there's no tests, and the ipython/jupyter functionally in h5py has tended to bitrot due to lack of interest (and tests).

@takluyver
Copy link
Member

I was actually working just the other day on integrating h5glance's text-based output into IPython: European-XFEL/h5glance#21 .

I still have a slight preference for putting fancy formatting features in a separate package, rather than h5py itself. I've mentioned this elsewhere, but there are two main arguments:

  • Visualisation can be surprisingly complex, and there's no obvious natural stopping point where we can say "this is in scope, that isn't". So if h5py starts doing fancy visualisations, it could become a large bulk of extra code which is never exactly done.
  • Different visualisations may make sense in different cases, e.g. depending on how many nested groups your files have, or what kind of information is stored in attributes. This is easier with space for specialised visualisation tools than if h5py includes an official standard visualisation.

Of course, the counterargument is that there's value in having a single shared visualisation rather than everyone going off and doing their own thing. And I have a conflict of interest because h5py doing this would make h5glance less valuable - so I shouldn't be the one to decide.

@anthonytw
Copy link
Author

I agree with @takluyver on complicated visualizations, but I think it's reasonable (if not expected) to include a minimal set of data structure "visualization" capability in the package itself. In fact, it shouldn't even be relegated to some tucked-away portion of the code like this implementation that has to be manually loaded. And it definitely shouldn't be stuffed in a completely separate package. Both are non-intuitive (and cumbersome) workflows.

Imagine how much less useful pandas or numpy would be if viewing a DataFrame or recarray just told you there were N rows and M columns, or a numpy array that told you there were N elements, but none showed you any data, types, column names, etc. There are much more complex and useful visualizations out there that can't be accomplished with the stock pandas or numpy library directly and those get their own packages (various plotting libraries being the obvious examples), but a simple rendering of a complex (or not so complex) data type is almost universally expected to get a quick look at what you're dealing with. Telling me that an HDF5 group has 2 members (groups? datasets? attributes are not included? no names?) is not terribly useful. And the root file object (which is also a group but prints differently) doesn't tell me anything useful at all, just the file name and mode!

There's no reason listing out the member groups, attributes, and datasets shouldn't be the default behavior, I would go as far as to suggest it's what most people would expect to see when viewing any reasonably presentable data structure in Python. And the features and scope can remain pretty minimal: just print out something that's helpful and descriptive. For an HDF5 file, that's incredibly easy to define. h5glance looks neat and adds a lot of cool features (I like the command line tool with bash integration and notebook integration) and definitely counts as out-of-scope visualization. h5py should be much simpler, but more useful than it currently is.

Just my two cents! 👍

@vasole
Copy link
Contributor

vasole commented Aug 10, 2020

In my opinion h5py itself should not go beyond providing text-based representation(s) of the file structure. There are many tools around capable of complex visualizations and trying to go beyond text-based representations within h5py might withdraw some resources from h5py itself. At most, I would consider the possibility to simplify registering user-specified formatters overwriting a default one provided by h5py. If people jump into it and there is a clear "winner" then why not to adopt it at a later stage.

For instance, for browsing HDF5 files interactively one comfortable thing is to use an external file-browser-like tool allowing to inspect and to cut&paste or drag items into the ipython session. There are several solutions already available and it does not consume any resources from the h5py project.

@aragilar
Copy link
Member

@anthonytw The big issue is tests. Without tests the formatters will break (we've seen this with the ipython completers).

As an aside, a better comparison library would be astropy.io.fits, it wraps the FITS file format (which is substantially simpler than HDF5, in many cases you would load the whole file into memory), rather than either numpy or pandas, which are in-memory objects. astropy.io.fits also provides very little information by default, instead providing a small amount of pretty printing behind some helper functions and scripts (at the level of h5dump).

@anthonytw
Copy link
Author

@vasole I agree on most points, and I think a simple list of attributes showing the structure fits into that mold. However, the issue I wanted to bring up with this (and an earlier PR) is that h5py does not give a text-based representation of the file structure. I think doing so in a simple fashion (e.g., using repr of a file or group to list members) would be incredibly helpful, while being simple to implement and test.

@aragilar Agreed about tests. I initially initiated a PR with the IPython implementation to get feedback after some concerns about a repr implementation were raised.

Also, I think the comparisons with pandas and numpy are still relevant. For any single object (file, group, dataset) for which the text representation would be relevant, its immediate children (groups, datasets, attributes) or structure (e.g., dataset size and columns) should be available in memory, and that's all I propose rendering for an object view. So it's not a view of the actual data, which may or may not be loaded in memory, but of the data's organization within that file.

I think it's a bit cumbersome to figure out (interactively) what's present at an address in the HDF5 file system (first list the members and determine which are groups and which are datasets by type-checking the values, then separately list the attributes, and now you have the full picture). Granted this PR doesn't implement exactly that (I didn't list attributes or distinguish between groups and datasets as @takluyver pointed out) but I wanted to propose something to get feedback (which I see now I missed from last year) and make sure the group approved of my implementation of the ipython injection since I was not particularly familiar with that mechanism.

However, if it's not going to sit in the repr, it might be best to just drop the request. I'm not sure it even makes sense in h5glance which, at a ... glance (-_-') wouldn't affect the default views of h5py objects anyway. Another option might be to implement a method like "members" which returns a list of subgroups, datasets, and attribute as a dictionary without affecting the repr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants