
select multiple rows by 1-d array of indices #317

Open · wants to merge 1 commit into base: master
Conversation

@bbudescu (Contributor) commented Jan 9, 2014

Not sure if this is the best solution, but it makes a[i] work like a[i, ...]. Without this patch, a[i] crashes if a has more than one dimension.

Commit by @bbudescu: make a[i] work as a[i, ...]; a[i] crashed if a had more than 1 dimension
@bbudescu (Contributor, Author) commented Jan 9, 2014

@scopatz
@scopatz (Member) commented Jan 9, 2014

I think that it would be better to appropriately implement fancy indexing.

@andreabedini (Contributor) commented

Yeah, thanks for the work @bbudescu, but this doesn't sound quite right to me. I need to understand better why this is failing.

@rossant commented Feb 5, 2014

This is related to #310. The workaround mentioned there (tiling the boolean mask to match the full array shape) is extremely slow. Also, a[ind] fails if a has more than one dimension. So currently, what is the fastest way to select arbitrary rows in a 2D array?

Even a dirty hack would do. I'd even be willing to ship a hacked version of PyTables with my software. This issue is critical for me because my whole program relies on having an efficient way to select arbitrary rows in an array...

import numpy as np
import tables as tb

f = tb.open_file("test", "w")
shape = (10000, 100)
a = f.create_earray('/', 'test', obj=np.random.rand(*shape),
                    chunkshape=(10, 100))
ind = np.ones(shape, dtype=bool)  # full-shape boolean mask (the #310 workaround)
%timeit -r1 -n1 a[:]
%timeit -r1 -n1 a[ind].reshape(shape)
assert np.array_equal(a[:], a[ind].reshape(shape))
f.close()
--
1 loops, best of 1: 33.4 ms per loop
1 loops, best of 1: 3.75 s per loop

@bbudescu (Contributor, Author) commented Feb 5, 2014

I ran your script on my machine and got similar results. I modified it to use an array of row indices (the hack in this PR allows that without crashing), but the timing appears worse. I added the following lines:
ind_arr = list(range(len(a)))  # indices of all rows
%timeit -r1 -n1 a[ind_arr]

and got the output:
1 loops, best of 1: 26.8 ms per loop
1 loops, best of 1: 2.22 s per loop
1 loops, best of 1: 166 ms per loop

The last timing, the longest of the three, is for indexing with an array of row indices. Still, you might pull this patch into your local copy of PyTables and give it a go as well, just to be sure.

That said, if your dataset fits wholly into memory, then it's probably faster to read everything from disk with a[:] and select the rows you want from the resulting NumPy array in RAM. I presume that's not your case, but I thought I'd suggest circumventing the broken indexing in PyTables altogether and using NumPy's implementation where possible; a rough sketch follows.
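A minimal sketch of that in-memory fallback (the helper name is my own, and it assumes the dataset fits in RAM and that you already have an open PyTables file handle):

import numpy as np

def select_rows_in_memory(h5file, path, row_indices):
    # Read the whole array from disk in one bulk HDF5 read, then let
    # NumPy do the fancy indexing entirely in memory.
    node = h5file.get_node(path)
    data = node[:]
    return data[np.asarray(row_indices)]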

@rossant commented Feb 5, 2014

Thanks, your hack seems to make a big difference! Shipping a hacked version of PyTables with my software (supposing your PR isn't merged or this issue isn't otherwise solved) would be a small price to pay.

In practice, I cannot really afford to load the whole array into memory. A typical size is 2,000,000 x 500 x 2 float32 values, i.e. several gigabytes of data...

BTW, do you know why a[ind_arr] is still several times slower than a[:]? Do you think better performance is achievable? In any case, your solution is vastly better than what PyTables normally allows... :)

Also, I should try the same benchmark with h5py to see if things are better.

@bbudescu (Contributor, Author) commented Feb 5, 2014

I'm not exactly sure why this is slower; I'm not a regular PyTables developer. I just needed this functionality in one of my own projects, which uses a library that assumes NumPy-like indexing behaviour.

However, I've had performance issues due to another bug, and I've seen that PyTables uses some sort of extra row buffer that is supposed to speed things up; perhaps there's another bug in there somewhere that makes the mechanism do the exact opposite (the bug I encountered caused memory overflows, for instance).

Anyway, as Anthony Scopatz pointed out in a discussion, there are also performance issues in the low-level indexing code that calls into the HDF5 library, such as reading one row at a time instead of larger chunks in some situations; perhaps this is one of them. I also expected some kind of optimization to happen (for instance, sorting the index list and grouping accesses to consecutive rows, as in the rough sketch below), but it appears that it doesn't.

I believe optimizing is definitely possible, but, as Anthony pointed out, that requires a rewrite of the whole indexing engine (which would serve other purposes as well, like making the interface more pythonic through 'fancy' iterators and more NumPy-like indexing). PyTables is, as the name suggests, mostly concerned with... tables, and most operations are defined in terms of rows, whereas, to harness the full performance potential of HDF5, I believe the more adequate abstraction is the multidimensional chunk.
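To illustrate the grouping idea (my own rough sketch, not PyTables' actual code): sort the requested row indices, serve each run of consecutive rows with a single slice read, and restore the caller's order at the end.

import numpy as np

def read_rows_grouped(arr, row_indices):
    idx = np.asarray(row_indices)
    order = np.argsort(idx)              # remember how to undo the sort
    sorted_idx = idx[order]
    out = np.empty((len(idx),) + arr.shape[1:], dtype=arr.dtype)
    start = 0
    while start < len(sorted_idx):
        stop = start
        # extend the run while the indices stay consecutive
        while stop + 1 < len(sorted_idx) and sorted_idx[stop + 1] == sorted_idx[stop] + 1:
            stop += 1
        # one slice read covers the whole consecutive run
        out[order[start:stop + 1]] = arr[sorted_idx[start]:sorted_idx[stop] + 1]
        start = stop + 1
    return out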

Regarding h5py, I have no experience with it, so I don't know whether it performs better. I'm also considering giving it a try, though (I initially chose PyTables primarily for its Blosc filter and numexpr compatibility, but the latter only works for table operands of the same size, so it doesn't help me as much as I expected). If you do try it, please tell me what you find. Perhaps it will even help improve PyTables.

@scopatz (Member) commented Feb 6, 2014

I very seriously welcome anyone who wants to take a stab at rewriting iteration. This needs to happen at some point.

@bbudescu (Contributor, Author) commented Feb 6, 2014

I'd really like to help with this, but what I know about disk-access performance (blocks, segments, caching), about the HDF5 format (caching, block alignment), and about the extra caching mechanisms in PyTables isn't sufficient at the moment. I'll try to extend my knowledge in this area, but I expect it will take a while before I can produce an algorithm that, given a list of elements to retrieve, builds an access scheme that is optimal in both access time and memory footprint. It's quite a thorny problem (but an intriguing and fun one, nonetheless).

@rossant commented Feb 6, 2014

Quick question: is there a cache when accessing an array with indices?

@rossant commented Feb 6, 2014

I've done some quick benchmarks: h5py does not appear to perform better than PyTables. In any case, using indexing in my application the way I would with NumPy is atrociously slow. However, loading everything into memory with e.g. a[:, 0][ind] is much faster than a[ind, 0] (several hundred times faster!). Peak memory usage is higher, but that should be fine for my particular use case.

I'll probably end up writing my own wrapper around EArray that implements this trick transparently (maybe with a cache too, unless PyTables already has one), roughly along the lines of the sketch below.
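Something like this hypothetical wrapper (the class name and the caching policy are my own invention, not an existing API):

import numpy as np

class CachedColumnArray:
    def __init__(self, earray):
        self._earray = earray
        self._cache = {}                 # column index -> in-memory copy

    def column(self, col):
        # Load a full column once with a single bulk read, then reuse it.
        if col not in self._cache:
            self._cache[col] = self._earray[:, col]
        return self._cache[col]

    def select(self, ind, col):
        # Equivalent to earray[ind, col], served from the cached column.
        return self.column(col)[np.asarray(ind)]

The trade-off is the one mentioned above: each cached column costs a full column's worth of RAM, in exchange for NumPy-speed fancy indexing.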

@FrancescAlted FrancescAlted modified the milestones: Next Tasks, 3.2 Apr 27, 2015
@andreabedini andreabedini removed this from the Next Tasks milestone Sep 7, 2015
@jsancho-gpl jsancho-gpl changed the base branch from develop to master June 12, 2018 14:02