
Fix VLArray to host more than 32-bit long rows #550

Open
FrancescAlted wants to merge 1 commit into master

Conversation

FrancescAlted (Member)

This is still preliminary because it segfaults when the number of elements is > 2**32. Any volunteer to look into this?

FrancescAlted (Member Author)

You can reproduce the segfault by running this test. Note that you will need a machine with at least 16 GB of RAM to run it comfortably.

tomkooij (Contributor) commented Jul 18, 2016

I had a go at this.

~~Appending 2**32 + 1 rows to the vlarray is no problem.~~ Appending a single row of 2**32 + 1 bytes to the vlarray is no problem. Appending it several times is also no problem.
Reading the row back with vlarray[0] causes the segfault.

I created a big vlarray (as in test05 in this PR) and closed the file.
Restarted IPython and opened the file in read-only mode.
vlarray[0] segfaults.
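
A minimal sketch of that write-then-read sequence (the file path, node name, and exact row size are assumptions based on this thread, not the actual test05):

import numpy as np
import tables

N = 2**32 + 1  # a single row slightly longer than a 32-bit index can address

# Write phase: append one huge row, then close the file.
h5 = tables.open_file('/tmp/big_vlarray.h5', 'w')
vlarray = h5.create_vlarray(h5.root, 'vlarray', atom=tables.Int8Atom(), filters=None)
vlarray.append(np.zeros(N, dtype='i1'))  # appending works fine
h5.close()

# Read phase: reopen read-only (e.g. in a fresh interpreter) and read the row back.
h5 = tables.open_file('/tmp/big_vlarray.h5', 'r')
row = h5.root.vlarray[0]  # segfaults here, down in H5Dread()
h5.close()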

I narrowed it down to the call to H5Dread() in hdf5extension.pyx:_read_array(), line 2071.
The calls before H5Dread() seem okay, at least as far as I can debug.

I tested on HDF5-1.8.16.

andreabedini (Contributor)

For future reference, the line is

ret = H5Dread(self.dataset_id, self.type_id, mem_space_id, space_id,
(By default GitHub links to line 2071 in the current version of the file, which is subject to change; press y to reload the page with the canonical, commit-pinned address.)

andreabedini (Contributor)

After another look, I can add a piece of the puzzle. h5dump segfaults on the temporary file created by

import numpy
import tables
h5 = tables.open_file('/tmp/tmp.h5', 'w')
N = int(2**32 + 1)  # > 2**32
vlarray = h5.create_vlarray(h5.root, 'vlarray2', atom=tables.Int8Atom(), filters=None, expectedrows=N)
x = numpy.zeros(N, dtype="i1")
vlarray.append(x)
h5.close()

so the problem might not be in hdf5extension.pyx:_read_array() but elsewhere, since we are somehow creating a corrupted file.

tomkooij (Contributor) commented Apr 11, 2017

I think (guess) h5dump segfaults inside the HDF5 library in the same way _read_array() does (I traced that last one into the library).

That could of course be caused by a corrupted file we create, but it could just as well be a library problem. It's probably a 32-bit overflow somewhere, but where?

FrancescAlted (Member Author)

Yeah, it would be nice if we could create a minimal C program that creates the dataset, verify that h5dump crashes on it, and send a report back to the HDF Group. But if the problem is on the HDF5 side, then I'd merge this PR as-is.
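
Until someone writes that minimal C program, a stop-gap check in Python can at least confirm that h5dump itself crashes on the file from the snippet above, independently of PyTables' read path (the file path and the assumption that h5dump is on the PATH are mine):

import subprocess

# Run h5dump on the file produced by the earlier snippet; discard stdout
# (it would try to print gigabytes of data) and only look at how the process exits.
result = subprocess.run(['h5dump', '/tmp/tmp.h5'],
                        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
# On POSIX a negative return code means the process died on a signal,
# e.g. -11 for SIGSEGV.
print('h5dump exit code:', result.returncode)
if result.stderr:
    print(result.stderr)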

andreabedini self-assigned this Apr 12, 2017
ClimbsRocks

Sounds like this might fix an issue I've been running into for a while!

In my case, the issue is caused by having too many string columns, which at some point hit some kind of a cumulative limit.

Any updates on this PR?

andreabedini removed their assignment Jun 21, 2017
jsancho-gpl changed the base branch from develop to master on Jun 12, 2018
louis925

Can someone merge this pull request? This problem makes pandas unable to save some types of dataframes into an HDF5 file. Thanks!

FrancescAlted (Member Author)

I revisited this, and I can still reproduce the segfaults described here, even with HDF5 1.10.4, so I am not merging this until more investigation has been done.
