
select multiple rows by 1-d array of indices #317

Open · wants to merge 1 commit into base: master
Conversation

@bbudescu (Contributor) commented Jan 9, 2014

Not sure if this is the best solution, but it makes a[i] work like a[i, ...]. Without this patch, a[i] crashes if a has more than one dimension.

Commit by @bbudescu: make a[i] work as a[i, ...]; a[i] crashed if a had more than 1 dimension
@bbudescu (Contributor, Author) commented Jan 9, 2014

@scopatz
@scopatz (Member) commented Jan 9, 2014

I think that it would be better to appropriately implement fancy indexing.

@andreabedini (Contributor) commented

Yeah, thanks for the work @bbudescu, but this doesn't sound quite right to me. I need to understand better why this is failing.

@rossant commented Feb 5, 2014

This is related to #310. The workaround mentioned there (tiling the boolean mask to match the full array shape) is extremely slow. Also, a[ind] fails if a has more than one dimension. So currently, what is the fastest way to select arbitrary rows in a 2D array?

Even a dirty hack would do. I'd even be willing to ship a hacked version of PyTables with my software. This issue is critical for me because my whole program relies on having an efficient way to select arbitrary rows in an array...

import numpy as np
import tables as tb

f = tb.open_file("test", "w")
shape = (10000, 100)
a = f.create_earray('/', 'test', obj=np.random.rand(*shape),
                    chunkshape=(10, 100))
ind = np.ones(shape, dtype=bool)  # full-shape boolean mask (the #310 workaround)
%timeit -r1 -n1 a[:]
%timeit -r1 -n1 a[ind].reshape(shape)
assert np.array_equal(a[:], a[ind].reshape(shape))
f.close()
--
1 loops, best of 1: 33.4 ms per loop
1 loops, best of 1: 3.75 s per loop

@bbudescu (Contributor, Author) commented Feb 5, 2014

I ran your script on my machine and got similar results. I modified it to use an array of row indices (the hack in this PR allows that without crashing), but the timing appears worse. I added the following lines:
ind_arr = list(range(len(a)))  # indices of all rows
%timeit -r1 -n1 a[ind_arr]

and got the output:
1 loops, best of 1: 26.8 ms per loop
1 loops, best of 1: 2.22 s per loop
1 loops, best of 1: 166 ms per loop

The last timing, the longest of the three, is for indexing with an array of row indices. Still, you might pull this patch into your local copy of PyTables and give it a go as well, just to be sure.

That said, if your dataset fits wholly into memory, then it's probably faster to read everything from disk with a[:] and select the rows you want from the resulting NumPy array in RAM. I presume that's not your case, but I thought I'd suggest circumventing the broken indexing in PyTables altogether and using NumPy's implementation where possible; a rough sketch follows.
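A minimal sketch of that in-memory fallback (the helper name is my own, and it assumes the dataset fits in RAM and that you already have an open PyTables file handle):

import numpy as np

def select_rows_in_memory(h5file, path, row_indices):
    # Read the whole array from disk in one bulk HDF5 read, then let
    # NumPy do the fancy indexing entirely in memory.
    node = h5file.get_node(path)
    data = node[:]
    return data[np.asarray(row_indices)]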

@rossant commented Feb 5, 2014

Thanks, your hack seems to make a big difference! Shipping a hacked version of PyTables with my software (supposing your PR isn't merged or this issue isn't otherwise solved) would be a small price to pay.

In practice, I cannot really afford to load the whole array into memory. A typical size is 2,000,000 x 500 x 2 float32 values, i.e. several gigabytes of data...

BTW, do you know why a[ind_arr] is still several times slower than a[:]? Do you think better performance is achievable? In any case, your solution is vastly better than what PyTables normally allows... :)

Also, I should try the same benchmark with h5py to see if things are better.

@bbudescu (Contributor, Author) commented Feb 5, 2014

I'm not exactly sure why this is slower; I'm not a regular PyTables developer. I just needed this functionality in one of my own projects, which uses a library that assumes NumPy-like indexing behaviour.

However, I've had performance issues due to another bug, and I've seen that PyTables uses some sort of extra row buffer that is supposed to speed things up; perhaps there's another bug in there somewhere that makes the mechanism do the exact opposite (the bug I encountered caused memory overflows, for instance).

Anyway, as Anthony Scopatz pointed out in a discussion, there are also performance issues in the low-level indexing code that calls into the HDF5 library, such as reading one row at a time instead of larger chunks in some situations; perhaps this is one of them. I also expected some kind of optimization to happen (for instance, sorting the index list and grouping accesses to consecutive rows, as in the rough sketch below), but it appears that it doesn't.

I believe optimizing is definitely possible, but, as Anthony pointed out, that requires a rewrite of the whole indexing engine (which would serve other purposes as well, like making the interface more pythonic through 'fancy' iterators and more NumPy-like indexing). PyTables is, as the name suggests, mostly concerned with... tables, and most operations are defined in terms of rows, whereas, to harness the full performance potential of HDF5, I believe the more adequate abstraction is the multidimensional chunk.
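To illustrate the grouping idea (my own rough sketch, not PyTables' actual code): sort the requested row indices, serve each run of consecutive rows with a single slice read, and restore the caller's order at the end.

import numpy as np

def read_rows_grouped(arr, row_indices):
    idx = np.asarray(row_indices)
    order = np.argsort(idx)              # remember how to undo the sort
    sorted_idx = idx[order]
    out = np.empty((len(idx),) + arr.shape[1:], dtype=arr.dtype)
    start = 0
    while start < len(sorted_idx):
        stop = start
        # extend the run while the indices stay consecutive
        while stop + 1 < len(sorted_idx) and sorted_idx[stop + 1] == sorted_idx[stop] + 1:
            stop += 1
        # one slice read covers the whole consecutive run
        out[order[start:stop + 1]] = arr[sorted_idx[start]:sorted_idx[stop] + 1]
        start = stop + 1
    return out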

Regarding h5py, I have no experience with it, so I don't know whether it performs better. I'm also considering giving it a try, though (I initially chose PyTables primarily for its Blosc filter and numexpr compatibility, but the latter only works for table operands of the same size, so it doesn't help me as much as I expected). If you do try it, please tell me what you find. Perhaps it will even help improve PyTables.

@scopatz (Member) commented Feb 6, 2014

I very seriously welcome anyone who wants to take a stab at rewriting iteration. This needs to happen at some point.

@bbudescu (Contributor, Author) commented Feb 6, 2014

I'd really like to help with this, but what I know about disk-access performance (blocks, segments, caching), about the HDF5 format (caching, block alignment), and about the extra caching mechanisms in PyTables isn't sufficient at the moment. I'll try to extend my knowledge in this area, but I expect it will take a while before I can produce an algorithm that, given a list of elements to retrieve, builds an access scheme that is optimal in both access time and memory footprint. It's quite a thorny problem (but an intriguing and fun one, nonetheless).

@rossant commented Feb 6, 2014

Quick question: is there a cache when accessing an array with indices?

@rossant commented Feb 6, 2014

I've done some quick benchmarks: h5py does not appear to perform better than PyTables. In any case, using indexing in my application the way I would with NumPy is atrociously slow. However, loading everything into memory with e.g. a[:, 0][ind] is much faster than a[ind, 0] (several hundred times faster!). Peak memory usage is higher, but that should be fine for my particular use case.

I'll probably end up writing my own wrapper around EArray that implements this trick transparently (maybe with a cache too, unless PyTables already has one), roughly along the lines of the sketch below.
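Something like this hypothetical wrapper (the class name and the caching policy are my own invention, not an existing API):

import numpy as np

class CachedColumnArray:
    def __init__(self, earray):
        self._earray = earray
        self._cache = {}                 # column index -> in-memory copy

    def column(self, col):
        # Load a full column once with a single bulk read, then reuse it.
        if col not in self._cache:
            self._cache[col] = self._earray[:, col]
        return self._cache[col]

    def select(self, ind, col):
        # Equivalent to earray[ind, col], served from the cached column.
        return self.column(col)[np.asarray(ind)]

The trade-off is the one mentioned above: each cached column costs a full column's worth of RAM, in exchange for NumPy-speed fancy indexing.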

@FrancescAlted FrancescAlted modified the milestones: Next Tasks, 3.2 Apr 27, 2015
@andreabedini andreabedini removed this from the Next Tasks milestone Sep 7, 2015
@jsancho-gpl jsancho-gpl changed the base branch from develop to master June 12, 2018 14:02