
Fix VLArray to host more than 32-bit long rows #550

Open
FrancescAlted wants to merge 1 commit into master

Conversation

FrancescAlted (Member)

This is still preliminary because it segfaults when the number of elements is > 2**32. Any volunteer to look into this?

FrancescAlted (Member Author)

You can reproduce the segfault by running this test. Note that you will need a machine with at least 16 GB of RAM to run it comfortably.

tomkooij (Contributor) commented Jul 18, 2016

I had a go at this.

~~Appending 2**32 + 1 rows to the vlarray is no problem.~~ Appending a single row of 2**32 + 1 bytes to the vlarray is no problem. Appending it several times is also no problem.
Reading the row back with vlarray[0] causes the segfault.

I created a big vlarray (as in test05 in this PR) and closed the file.
Restarted IPython and opened the file in read-only mode.
vlarray[0] segfaults.
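
A minimal sketch of that write-then-read sequence (the file path, node name, and exact row size are assumptions based on this thread, not the actual test05):

import numpy as np
import tables

N = 2**32 + 1  # a single row slightly longer than a 32-bit index can address

# Write phase: append one huge row, then close the file.
h5 = tables.open_file('/tmp/big_vlarray.h5', 'w')
vlarray = h5.create_vlarray(h5.root, 'vlarray', atom=tables.Int8Atom(), filters=None)
vlarray.append(np.zeros(N, dtype='i1'))  # appending works fine
h5.close()

# Read phase: reopen read-only (e.g. in a fresh interpreter) and read the row back.
h5 = tables.open_file('/tmp/big_vlarray.h5', 'r')
row = h5.root.vlarray[0]  # segfaults here, down in H5Dread()
h5.close()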

I narrowed it down to the call to H5Dread() in hdf5extension.pyx:_read_array(), line 2071.
The calls before H5Dread() seem okay, at least as far as I can debug.

I tested on HDF5-1.8.16.

andreabedini (Contributor)

For future reference, the line is

ret = H5Dread(self.dataset_id, self.type_id, mem_space_id, space_id,
(By default GitHub links to line 2071 in the current version of the file, which is subject to change; press y to reload the page with the canonical, commit-pinned address.)

andreabedini (Contributor)

After another look, I can add a piece of the puzzle. h5dump segfaults on the temporary file created by

import numpy
import tables
h5 = tables.open_file('/tmp/tmp.h5', 'w')
N = int(2**32 + 1)  # > 2**32
vlarray = h5.create_vlarray(h5.root, 'vlarray2', atom=tables.Int8Atom(), filters=None, expectedrows=N)
x = numpy.zeros(N, dtype="i1")
vlarray.append(x)
h5.close()

so the problem might not be in hdf5extension.pyx:_read_array() but elsewhere, since we are somehow creating a corrupted file.

tomkooij (Contributor) commented Apr 11, 2017

I think (guess) h5dump segfaults inside the HDF5 library in the same way _read_array() does (I traced that last one into the library).

That could of course be caused by a corrupted file we create, but it could just as well be a library problem. It's probably a 32-bit overflow somewhere, but where?

FrancescAlted (Member Author)

Yeah, it would be nice if we could create a minimal C program that creates the dataset, verify that h5dump crashes on it, and send a report back to the HDF Group. But if the problem is on the HDF5 side, then I'd merge this PR as-is.
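
Until someone writes that minimal C program, a stop-gap check in Python can at least confirm that h5dump itself crashes on the file from the snippet above, independently of PyTables' read path (the file path and the assumption that h5dump is on the PATH are mine):

import subprocess

# Run h5dump on the file produced by the earlier snippet; discard stdout
# (it would try to print gigabytes of data) and only look at how the process exits.
result = subprocess.run(['h5dump', '/tmp/tmp.h5'],
                        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
# On POSIX a negative return code means the process died on a signal,
# e.g. -11 for SIGSEGV.
print('h5dump exit code:', result.returncode)
if result.stderr:
    print(result.stderr)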

andreabedini self-assigned this Apr 12, 2017
ClimbsRocks

Sounds like this might fix an issue I've been running into for a while!

In my case, the issue is caused by having too many string columns, which at some point hit some kind of a cumulative limit.

Any updates on this PR?

andreabedini removed their assignment Jun 21, 2017
jsancho-gpl changed the base branch from develop to master on Jun 12, 2018
louis925

Can someone merge this pull request? This problem makes pandas unable to save some types of dataframes into an HDF5 file. Thanks!

FrancescAlted (Member Author)

I revisited this, and I can still reproduce the segfaults described here, even with HDF5 1.10.4, so I am not merging this until more investigation has been done.
