select multiple rows by 1-d array of indices #317
base: master
Conversation
Makes a[i] work as a[i, ...]. Previously, a[i] crashed if a had more than one dimension.
I think it would be better to implement fancy indexing properly.
Yeah, thanks for the work @bbudescu, but this doesn't sound quite right to me. I need to understand better why this is failing.
This is related to #310. The work-around mentioned there (tiling the boolean mask to match the full array shape) is extremely slow. Even a dirty hack would do; I'd even be willing to ship a hacked version of PyTables with my software. This issue is currently critical for me, because my whole program relies on having an efficient way to select arbitrary rows in an array:

```python
import tables as tb
import numpy as np

f = tb.open_file("test", "w")
shape = (10000, 100)
a = f.create_earray('/', 'test', obj=np.random.rand(*shape),
                    chunkshape=(10, 100))
ind = np.ones(shape, dtype=bool)  # np.bool is deprecated; plain bool works
%timeit -r1 -n1 a[:]
%timeit -r1 -n1 a[ind].reshape(shape)
assert np.array_equal(a[:], a[ind].reshape(shape))
f.close()
```

Output:

```
1 loops, best of 1: 33.4 ms per loop
1 loops, best of 1: 3.75 s per loop
```
I ran your script on my machine and got similar results. I modified it to use an array of row indices (the hack in this PR allows that without crashing), but the timing appears to be worse. I added the following lines: and got the output: The last one (the longest) is the time for using arrays of row indices. Still, you might want to pull this patch onto your local copy of PyTables and give it a go as well, just to be sure. From the looks of it, if your dataset fits wholly into memory, then it's perhaps faster to just read the whole array from disk with a[:] and then select the rows you want from the NumPy array in RAM. I presume that's not an option in your case, but I thought I'd suggest circumventing the erroneous indexing in PyTables altogether and using NumPy's implementation where possible.
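The two access patterns being compared can be sketched in plain NumPy (a NumPy array stands in for the on-disk EArray; the names are illustrative). Besides speed, there is a semantic difference worth noting: the boolean-mask work-around returns rows sorted and de-duplicated, while integer-array indexing preserves the requested order and any repeats.

```python
import numpy as np

a = np.random.rand(10000, 100)     # stands in for the on-disk array
rows = np.array([3, 42, 7, 42])    # arbitrary rows; order and repeats matter

# Work-around from #310: tile a 1-d row mask up to the full array shape.
mask = np.zeros(a.shape[0], dtype=bool)
mask[rows] = True
tiled = np.broadcast_to(mask[:, None], a.shape)
via_mask = a[tiled].reshape(-1, a.shape[1])

# What this PR enables directly on the PyTables array:
via_index = a[rows]

# The mask route yields rows sorted and de-duplicated ...
assert np.array_equal(via_mask, a[np.unique(rows)])
# ... while integer indexing keeps order and duplicates.
assert via_index.shape == (4, 100)
```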
Thanks, your hack seems to make a big difference! Shipping a hacked version of PyTables with my software (supposing your PR isn't merged and this issue isn't solved) would be a small price to pay. In practice, I cannot really afford to load the whole array into memory: a typical size is 2,000,000 x 500 x 2 float32 values, i.e. several gigabytes of data. BTW, do you know why it is so much slower? Also, I should try the same benchmark with h5py to see if things are better.
I'm not exactly sure why this is slower; I'm not a regular developer of PyTables. I just needed this functionality in one of my own projects, which uses a library that assumes NumPy-like indexing behaviour. However, I've had performance issues due to another bug before, and I've seen that PyTables uses some sort of extra row buffer that is supposed to speed things up. Perhaps there's another bug in there somewhere that makes the mechanism do the exact opposite (the bug I encountered caused memory overflows, for instance).

As Anthony Scopatz pointed out in a discussion, there are also performance issues in the low-level indexing operations that call into the HDF5 library, such as reading one row at a time instead of larger chunks in some situations; perhaps this is one of them. I also expected some optimization to happen (for instance, sorting the index list and grouping accesses to consecutive rows together), but it's possible that it doesn't. I believe optimizing is definitely possible, but, as Anthony pointed out, it probably requires a rewrite of the whole indexing engine (which would serve other purposes as well, like making the interface more Pythonic through 'fancy' iterators and more NumPy-like indexing). PyTables is mostly concerned with... tables, and most operations are defined in terms of row operations, whereas to harness the full performance potential of HDF5 it might be more appropriate to work with the abstraction of multidimensional chunks.

Regarding h5py, I have no experience with it, so I don't know whether it performs better. I'm also considering giving it a try (I initially chose PyTables primarily for its Blosc filter and numexpr compatibility, but the latter only works for table operands of the same size, so it doesn't help me as much as I expected). So, if you do try it, please tell me what you find. Perhaps it will even help improve PyTables.
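The sort-and-group optimization mentioned above could look roughly like this sketch (not PyTables code; `read_slice` is a hypothetical primitive that performs one contiguous read): sort the requested row indices, split them into runs of consecutive values, issue one contiguous read per run, then restore the originally requested order.

```python
import numpy as np

def read_rows_grouped(read_slice, indices):
    """Fetch arbitrary rows using as few contiguous reads as possible.

    read_slice(start, stop) is assumed to return rows [start:stop] in a
    single I/O operation. Rows come back in the originally requested
    order, with duplicates preserved.
    """
    idx = np.asarray(indices)
    order = np.argsort(idx, kind="stable")
    sorted_idx = idx[order]
    # Split the sorted indices into runs of consecutive values.
    breaks = np.where(np.diff(sorted_idx) > 1)[0] + 1
    runs = np.split(sorted_idx, breaks)
    # One contiguous read per run; re-select within the run to honour
    # duplicated indices.
    chunks = [read_slice(int(r[0]), int(r[-1]) + 1)[r - r[0]] for r in runs]
    gathered = np.concatenate(chunks)   # rows in sorted order
    inv = np.empty_like(order)
    inv[order] = np.arange(len(order))  # inverse permutation: undo the sort
    return gathered[inv]

# Demo with a plain NumPy array standing in for the on-disk store:
a = np.arange(200).reshape(100, 2)
reads = []
def read_slice(start, stop):
    reads.append((start, stop))
    return a[start:stop]

result = read_rows_grouped(read_slice, [7, 2, 3, 7])
```

In this demo the four requested rows are served by only two slice reads, (2, 4) and (7, 8), yet the result matches `a[[7, 2, 3, 7]]` exactly. A real implementation would of course also have to weigh run length against chunk boundaries and cache behaviour.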
I very seriously welcome anyone who wants to take a stab at rewriting iteration. This needs to happen at some point.
I'd really like to help with this, but what I know about disk-access performance (blocks, segments, caches), about the HDF5 format (caching and block alignment), and about the extra caching mechanisms employed by PyTables isn't sufficient at the moment. I'll try to extend my knowledge in this area, but I expect it will take a while before I can produce an algorithm that, given a list of elements to retrieve, builds an access scheme that is optimal both in access time and in memory footprint. It's quite a thorny problem (but an intriguing and fun one, nonetheless).
Quick question: is there a cache when accessing an array with indexes? |
I've done some quick benchmarks: h5py does not appear to perform better than PyTables. In any case, using indexing in my application as I would with NumPy is atrociously slow, whereas loading everything into memory with e.g. a[:] and selecting rows in NumPy is much faster. I'll probably end up writing my own wrapper to do that.
Not sure if this is the best solution, but it makes a[i] work as a[i, ...]. Without this patch, a[i] crashes if a has more than one dimension.
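The behaviour the patch is after can be illustrated with a minimal wrapper sketch (`IndexShim` is hypothetical, not the PR's actual code): when the key is a 1-d integer array and the wrapped array is multi-dimensional, rewrite `a[i]` as `a[i, ...]` before delegating.

```python
import numpy as np

class IndexShim:
    """Hypothetical wrapper: route a[i] through a[i, ...] when i is a
    1-d integer array and the wrapped array has more than one dimension."""
    def __init__(self, arr):
        self.arr = arr

    def __getitem__(self, key):
        if (isinstance(key, np.ndarray) and key.ndim == 1
                and np.issubdtype(key.dtype, np.integer)
                and self.arr.ndim > 1):
            key = (key, Ellipsis)   # a[i] -> a[i, ...]
        return self.arr[key]

a = np.random.rand(6, 4, 2)
s = IndexShim(a)
rows = np.array([0, 3, 5])
selected = s[rows]   # selects whole rows along the first axis
```

For plain NumPy arrays `a[i]` and `a[i, ...]` are already equivalent, so the shim changes nothing there; the point is that rewriting the key this way sidesteps the code path that crashes in PyTables.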