Large JSON and GEOMETRY objects #3016

Closed

katzyn opened this issue Jan 24, 2021 · 54 comments
@katzyn
Contributor

katzyn commented Jan 24, 2021

In current H2, the sizes of JSON and GEOMETRY objects are limited to 1,048,576 bytes. Sometimes that isn't enough (orbisgis/h2gis#1179).

Hypothetically we could make this limit configurable, but from my point of view all really large objects must be kept off-heap on the server side (excluding possible usage in user-defined functions, which we can't control). That means they should be stored in the same way as LOBs.

There are two possible approaches: we can allocate a couple of new data types (JSON LARGE OBJECT and GEOMETRY LARGE OBJECT), or we can add storage specifiers (similar to FORMAT JSON) to the existing CHARACTER LARGE OBJECT and BINARY LARGE OBJECT data types. I think new data types are less intrusive. We could also convert these data types to LOB data types, but with a large in-place limit.

@grandinj
@ebocher

@ebocher
Contributor

ebocher commented Jan 24, 2021

Thanks. What do you mean by "allocate a couple of new data types (JSON LARGE OBJECT and GEOMETRY LARGE OBJECT)"?

Does this mean that we will have to use this type in the SQL syntax? Or is it an internal type that lets H2 adapt its behaviour when the geometry is too big?

@nicolas-f @SPalominos

@katzyn
Contributor Author

katzyn commented Jan 24, 2021

The first option is about new data types that allow larger values but may not be usable by some of H2's functions and indexes. Spatial indexes can still be supported, because they use only the 2D bounding box. Regular indexes on such values aren't really usable.

PostgreSQL is far from the Standard and other database systems here. It doesn't have standard binary string and large object data types, but has various non-standard features. PostGIS simply uses these features. But H2 is not PostgreSQL.

The third option is to convert these data types to LOB data types under the same names (JSON and GEOMETRY).

@ebocher
Contributor

ebocher commented Jan 24, 2021

Even if PostgreSQL is far from the Standard, geometry is a standardised data type. I'm not in favour of a specific data type: it would break compatibility with other spatial extensions like PostGIS and be confusing for the user. So I'm more comfortable with the third option. If I remember correctly, the first versions of H2GIS used the LOB format.

@SPalominos

I'm also in favor of keeping the GEOMETRY type as it is defined in the standard, in order to stay compatible with other spatial databases and libraries and to avoid confusion between a custom H2 GEOMETRY type and the standardized one.
@katzyn What will be the impact of converting the GEOMETRY and JSON types into LOBs on their indexing?

@grandinj
Contributor

This:

We could also convert these data types to LOB data types, but with a large in-place limit.

is probably the best option.

@nicolas-f
Contributor

Using LOBs is a great idea: it will reduce cases of multiple in-memory copies of the geometry. We may still have to make some modifications in the H2GIS code to accommodate the streaming nature of this data type.

@katzyn
Contributor Author

katzyn commented Jan 25, 2021

What will be the impact of converting the GEOMETRY and JSON types into LOBs on their indexing?

Spatial indexes on GEOMETRY columns will work; other indexes will be disallowed.

as it is defined in the standard

I don't see this data type in the SQL Standard, and the Standard actually doesn't cover data types even for standard functionality, such as JSON.

Some DBMSs, however, have it, but its parameters are different. In SQL Server this data type doesn't have any parameters. The PostGIS extension has some parameters. Recent versions of H2 have parameters copied from PostGIS, with a minor intentional difference in SRID handling.

It looks like in SQL Server this data type uses external storage too.

BTW, maybe add an additional optional maximum encoded length parameter to GEOMETRY? Something like
GEOMETRY [([length[lengthUnit],] geometryType [, srid])]:

GEOMETRY
GEOMETRY(10M)
GEOMETRY(POINT Z)
GEOMETRY(POINT Z, 4326)
GEOMETRY(10M, POINT Z)
GEOMETRY(10M, POINT Z, 4326)

H2 had it in the past, but it was removed some time ago. LOBs in H2 aren't limited to 2 GiB, they can be much larger, but in some cases developers may want to disallow too-large values.

OK, let's convert them into specialized LOBs. Our own validators / normalizers of JSON and geometry values should be able to handle streamed data larger than the amount of available memory after minor modifications; they don't construct whole objects in memory.

@katzyn
Contributor Author

katzyn commented Jan 25, 2021

In H2 with default settings, BLOBs not larger than 256 bytes are stored directly in the row. CLOBs have the same limit, but I don't remember whether it is set in bytes (in UTF-8 encoding) or in characters. This limit is configurable.

  1. Should JSON use the same default limit and configuration option?
  2. Should GEOMETRY use the same default limit and configuration option?

If not, we need to choose limits for them.

@ebocher
Contributor

ebocher commented Jan 25, 2021

I don't see this data type in the SQL Standard, and the Standard actually doesn't cover data types even for standard functionality, such as JSON.

You are right. PostGIS uses the geometry type names as defined by SQL/MM (Figure: SQL Geometry Type hierarchy). Oracle uses SDO_GEOMETRY, SpatiaLite also has its own definition, and MySQL is close to the PostGIS types.

@ebocher
Contributor

ebocher commented Jan 25, 2021

In H2 with default settings, BLOBs not larger than 256 bytes are stored directly in the row. CLOBs have the same limit, but I don't remember whether it is set in bytes (in UTF-8 encoding) or in characters. This limit is configurable.

1. Should `JSON` use the same default limit and configuration option?

2. Should `GEOMETRY` use the same default limit and configuration option?

If not, we need to choose limits for them.

It seems that for PostGIS it is the maximum value of a 32-bit signed integer (2,147,483,647), defined in the GSERIALIZED structure.

@ebocher
Contributor

ebocher commented Jan 25, 2021

@nicolas-f @SPalominos

@katzyn
Contributor Author

katzyn commented Jan 26, 2021

I'm not asking about the maximum possible length; LOBs in H2 use 64-bit lengths, and that's not going to change for these specialized types.

My question is about the maximum length of inlined objects; it should be small enough. Objects larger than this size are stored in the LOB storage. Smaller objects are stored in the row like non-LOB values; access to them is faster, but table scans are slower, because these smaller objects are read together with the whole row.

@grandinj
Contributor

Yeah, the default INPLACE length is a little small for MVStore; it was probably configured with PageStore in mind.

@nicolas-f
Contributor

Could it be defined with a DbSettings constant, with the default value being appropriate for MVStore?

@katzyn
Contributor Author

katzyn commented Jan 27, 2021

https://h2database.com/html/commands.html#set_max_length_inplace_lob
(The published documentation is not correct; the default is 256.)
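
For reference, a minimal JDBC sketch of changing that setting (the database URL and the value 4096 are illustrative, not recommendations):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class InplaceLobLimit {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./test", "sa", "");
             Statement stat = conn.createStatement()) {
            // Values up to this many bytes stay inline in the row;
            // larger values go to the separate LOB storage.
            stat.execute("SET MAX_LENGTH_INPLACE_LOB 4096");
        }
    }
}
```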

@nicolas-f
Contributor

Nice, so we could use this existing setting?

@ebocher
Contributor

ebocher commented Jan 27, 2021

But this setting is for all LOB types, not reserved to geometry?

@katzyn
Contributor Author

katzyn commented Jan 27, 2021

It is for all types of LOBs. GEOMETRY is not yet a LOB, but it will be, because your project needs to store large values.

We can add a separate setting, but who would really need to set different in-place limits for CHARACTER LARGE OBJECT, BINARY LARGE OBJECT, GEOMETRY, and JSON? I don't see a good case for that.

On the other hand, we can define different defaults for PageStore and MVStore. MVStore, with its copy-on-write storage, may work better with a higher value, but that needs to be tested.

@katzyn
Contributor Author

katzyn commented Jan 27, 2021

For these specialized LOBs we may want to store some additional properties in the LOB reference to improve the performance of some operations. For JSON, the type of its content may be needed for the IS JSON predicate. For geometry values, the type, dimension system, bounding box, and possibly SRID are needed. Bounding boxes are especially needed for the && operator; it would be nice to store them in-place even for large objects to avoid reads from the LOB backend during row filtration.
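
For illustration, how such a bounding-box filter is used (schema and data are made up; && is H2's bounding-box intersection operator, and CAST from a WKT string to GEOMETRY is assumed to be available):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BoundingBoxFilter {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:spatial", "sa", "");
             Statement stat = conn.createStatement()) {
            stat.execute("CREATE TABLE places(id INT PRIMARY KEY, geom GEOMETRY)");
            stat.execute("CREATE SPATIAL INDEX places_geom ON places(geom)");
            stat.execute("INSERT INTO places VALUES (1, CAST('POINT (5 5)' AS GEOMETRY))");
            // && compares only bounding boxes; with the envelope stored in-place
            // (or in the spatial index) this filter needs no read from the LOB backend.
            try (ResultSet rs = stat.executeQuery(
                    "SELECT id FROM places WHERE geom && "
                    + "CAST('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))' AS GEOMETRY)")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}
```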

@ebocher
Contributor

ebocher commented Jan 27, 2021

it would be nice to store them in-place even for large objects to avoid reads from the LOB backend during row filtration

That would be excellent!

@ebocher
Contributor

ebocher commented Jan 27, 2021

Thanks for your nice comments

@ebocher
Contributor

ebocher commented Feb 4, 2021

@katzyn
Is it possible to temporarily remove this size limitation on geometry while waiting for a more sophisticated fix?
This issue is really blocking for us.

@katzyn
Contributor Author

katzyn commented Feb 4, 2021

When you compile your own snapshot jar you can make any modifications to the sources. You need to increase the value of Constants.MAX_STRING_LENGTH.

But there are no good reasons to touch this constant in the mainline H2. We already decided how to deal with the issue properly, and it will hopefully be fixed in the near future.

@ebocher
Contributor

ebocher commented Feb 4, 2021

We already decided how to deal with the issue properly, and it will hopefully be fixed in the near future.

We hope that it will be integrated into the H2 2.0.202 release, because accumulating geometries is a classic task in a spatial database.
Don't hesitate to ping us if some discussions are needed for this fix.

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

@grandinj
Looking at the amount of various compatibility code in our storage backends and the weird encoding of some values, I'm wondering: should we continue to support all this stuff, or can we reimplement the encoding of values from scratch and forbid operations on 1.X.X databases?

We could create an alternative driver for older databases based on 1.4.200 with renamed packages, similar to org.h2.upgrade.v1_1.Driver for migrations from 1.1 to 1.2.

@grandinj
Contributor

grandinj commented Feb 6, 2021

I'd be very reluctant to drop support for older stuff, especially since it's been so long since we put out a release.
Maintaining backwards compatibility is the price of popularity... to paraphrase: "there are two kinds of database engine - the kind that everybody complains about, and the kind that nobody cares about"

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

Currently we don't have any safe upgrade logic for older database files. When a 1.4.X file is opened by current H2, it is silently patched on the fly and re-marked as a file in the new format. It actually isn't guaranteed to be consistent, even for PageStore.

We really need to enforce full database re-creation in some way to avoid possible corruption caused by some unexpected incompatibility. The file may actually already be partially corrupted due to bugs in 1.4.X.

New (and patched old) files can't be opened by 1.4, so they may use any encoding of values in them.

The only problem is upgrade procedure. There are only two options.

  1. H2 refuses to open an old file with some sane error message, and the migration guide describes how to use SCRIPT / RUNSCRIPT properly (see the sketch after this list). In this case there are no compatibility issues in encodings.

  2. H2 creates a new file automatically and moves the old data into it. The old file is preserved as a backup copy. In this case we need some code able to read old files, possibly in a separate optional jar (like the old addon for 1.2), or we can continue to support all the old encodings and also much other compatibility stuff in MVStore.
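
A hedged sketch of option 1 from the application side (file and database names are illustrative; since 2.0 and 1.4.200 can't be loaded from the same classloader, the two steps would run with different classpaths):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ScriptMigration {
    // Step 1: run with H2 1.4.200 on the classpath.
    static void dumpOld() throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./olddb", "sa", "");
             Statement stat = conn.createStatement()) {
            stat.execute("SCRIPT TO 'dump.sql'");
        }
    }

    // Step 2: run with H2 2.x on the classpath, against a fresh database file.
    static void loadNew() throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./newdb", "sa", "");
             Statement stat = conn.createStatement()) {
            stat.execute("RUNSCRIPT FROM 'dump.sql'");
        }
    }
}
```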

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

Some time ago I tried to make these on-the-fly patches reliable enough, but after some tests with various old databases I must say this way isn't very safe, at least for MVStore. A new database file with a copy of the old data is more reliable.

But how should the old file be read? Using current code, or code from 1.4.200? It looks like the old code (including the old storage backends) is safer, but I'm not sure.

@grandinj
Contributor

grandinj commented Feb 6, 2021

It's always safer to read and dump from an older version.

Yes, we should check for older versions and warn.

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

How about a tool based on 1.4.200 with renamed packages that is able to read 1.4.200 files and export them into SQL that is more or less valid for newer versions, optionally using a configuration with adjusted data types of columns?

An export from 1.4.200 itself may need manual adjustments or compatibility flags during import, and we can't load 2.0 and 1.4.200 from the same classloader; this limitation complicates the upgrade procedure for applications.

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

It is also not that hard to
(a) grab an older version from maven
(b) load that jar from an older version inside a custom classloader, and invoke stuff inside it.

Nice idea; I think we can include an implementation of it in H2. But the driver's registration needs special attention; some quirks may be needed.

But I don't see how we can guess the required version automatically. Unfortunately, H2 doesn't preserve its version in the database file. We can detect all older databases, because H2 2.0 uses a different file format version, but it's hard to determine the version that was last used with the file. Databases from 1.4.200 can have CREATE_BUILD = 197, for example, if someone was lucky enough and the database was automatically upgraded to 1.4.200 without any significant issues.

So the upgrade procedure needs a configuration parameter with the version of H2 that was last used with this file.
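
For illustration, a minimal sketch of loading an older H2 jar in an isolated classloader and invoking its driver directly (the jar path, database URL, and use of SCRIPT are assumptions; DriverManager is bypassed because it only accepts drivers visible to the caller's classloader):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.Driver;
import java.sql.Statement;
import java.util.Properties;

public class LegacyExport {
    public static void main(String[] args) throws Exception {
        URL oldJar = Path.of("h2-1.4.200.jar").toUri().toURL();
        // Parent is the platform classloader, so the old org.h2 classes
        // don't clash with the new H2 on the application classpath.
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { oldJar }, ClassLoader.getPlatformClassLoader())) {
            Driver driver = (Driver) Class.forName("org.h2.Driver", true, loader)
                    .getDeclaredConstructor().newInstance();
            Properties props = new Properties();
            props.setProperty("user", "sa");
            props.setProperty("password", "");
            try (Connection conn = driver.connect("jdbc:h2:./olddb", props);
                 Statement stat = conn.createStatement()) {
                stat.execute("SCRIPT TO 'dump.sql'");
            }
        }
    }
}
```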

@ebocher
Contributor

ebocher commented Feb 6, 2021

Not sure I understand everything in this discussion. Do you want to find a way to migrate from an old H2 database to a new one that will support the new geometry storage? Am I right?

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

No, we need a safe way to migrate from 1.4 to 2.0; the lack of such a way is a release blocker. It isn't implemented yet. If it doesn't use the current code to read old databases, it will be possible to change the storage formats of some data types and drop various old variants. Some data types already have multiple storage formats.

To fix this issue about the length of JSON and GEOMETRY, we need to introduce new storage formats for them. Without the requirement to read older files directly, there will be no need to preserve additional support for their old storage formats, and the code will be simpler.

@katzyn
Contributor Author

katzyn commented Mar 28, 2021

For JSON we may want to store the object type (ARRAY, OBJECT, or SCALAR) in the value itself.

For GEOMETRY, its type and dimension system, SRID, and envelope are needed. The envelope is used by spatial indexes and the && operator; that's why I included it in this list.

Maybe H2 should also store the octet length for CLOBs. Currently only the character length is persisted, but H2 stores them in UTF-8, so H2 knows the octet length when a CLOB is created; this information just isn't stored.

@katzyn
Contributor Author

katzyn commented Apr 11, 2021

@grandinj
I'm not sure what to do with the current hierarchy of LOB value classes. H2 uses the same classes for both character and binary large objects, because they are very similar to each other, but JSON has more differences, and GEOMETRY is very different from all the others. H2 has the ValueLob abstract class and four implementations.

  1. A new abstract class like LobData can be introduced, with similar four implementations. This class would be only the holder of the binary data; the main value classes would hold the additional information and a reference to one of these LobData implementations.

  2. Additional optional fields for JSON-specific and GEOMETRY-specific properties can be added directly to the existing ValueLob class.

@grandinj
Contributor

@katzyn yeah, that is tricky. The current sub-types of ValueLob are "wrong" in that they are really storage/runtime strategies instead of being something related to the SQL type.
So possibly we should refactor ValueLob to have a child object that holds the storage strategy, and then we can have a ValueJson and ValueGeometry that subclass ValueLob.
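
A compilable sketch of that shape (all class names here are hypothetical stand-ins, not H2's actual classes): the storage strategy lives in a holder object, while subclasses carry the type-specific metadata discussed above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Storage strategy: where the bytes live (in memory, temp file, database, ...).
abstract class LobData {
    abstract InputStream getInputStream();
    abstract long octetLength();
}

final class LobDataInMemory extends LobData {
    private final byte[] bytes;
    LobDataInMemory(byte[] bytes) { this.bytes = bytes; }
    @Override InputStream getInputStream() { return new ByteArrayInputStream(bytes); }
    @Override long octetLength() { return bytes.length; }
}

// Value classes hold SQL-type-specific metadata and delegate storage.
abstract class LobValue {
    final LobData data;
    LobValue(LobData data) { this.data = data; }
}

final class JsonValue extends LobValue {
    enum ItemType { ARRAY, OBJECT, SCALAR }
    final ItemType itemType; // kept in the value for the IS JSON predicate
    JsonValue(LobData data, ItemType itemType) { super(data); this.itemType = itemType; }
}

final class GeometryValue extends LobValue {
    final int srid;
    final double[] envelope; // minX, maxX, minY, maxY for && and spatial indexes
    GeometryValue(LobData data, int srid, double[] envelope) {
        super(data); this.srid = srid; this.envelope = envelope;
    }
}
```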

@katzyn
Contributor Author

katzyn commented Apr 24, 2021

@grandinj
Do we need some limit on the size of string and numeric values inside JSON data? The JSON normalization and validation procedure currently allows very large strings and numbers that may lead to an OOME. Perhaps we can use the same limits as we have for SQL character string and numeric values.
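
For illustration only, a hypothetical guard inside a streaming JSON reader (the constant and method are made up; the idea is just to fail before an unbounded token is buffered):

```java
// Hypothetical guard: reject a single JSON string token once it exceeds
// the same limit that is assumed for SQL character strings.
final class JsonLimits {
    static final int MAX_STRING_LENGTH = 1_048_576; // assumed to mirror the SQL limit

    static void checkStringLength(int lengthSoFar) {
        if (lengthSoFar > MAX_STRING_LENGTH) {
            throw new IllegalStateException(
                    "JSON string value exceeds " + MAX_STRING_LENGTH + " characters");
        }
    }
}
```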

@katzyn
Contributor Author

katzyn commented Apr 24, 2021

It also looks like we shouldn't cache overly large geometry objects from the optional JTS library in ValueGeometry instances. If the internal EWKB representation isn't longer than our limit for binary strings, the JTS org.locationtech.jts.geom.Geometry can be cached; in other cases it should be safer to construct it only on demand.

The cache could be removed completely, but that may slow down some operations. H2's own conversion from EWKB to Geometry is relatively fast (it was faster than the code from JTS), but it is still slower than Geometry.copy() (we can't return the same instance, because geometries are mutable).
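
A sketch of such size-capped caching (the limit is illustrative, and JTS's WKBReader stands in here for H2's own EWKB converter):

```java
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

final class CachedGeometry {
    private static final int CACHE_LIMIT = 1_048_576; // only cache smallish geometries

    private final byte[] ewkb;
    private Geometry cached; // populated lazily, and only for small values

    CachedGeometry(byte[] ewkb) { this.ewkb = ewkb; }

    Geometry getGeometry() throws ParseException {
        if (ewkb.length > CACHE_LIMIT) {
            return parse(); // too large: build on demand, don't retain
        }
        if (cached == null) {
            cached = parse();
        }
        // Geometries are mutable, so hand out a copy, never the cached instance.
        return cached.copy();
    }

    private Geometry parse() throws ParseException {
        return new WKBReader().read(ewkb);
    }
}
```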

@grandinj
Contributor

@grandinj
Do we need some limit on the size of string and numeric values inside JSON data? The JSON normalization and validation procedure currently allows very large strings and numbers that may lead to an OOME. Perhaps we can use the same limits as we have for SQL character string and numeric values.

Sounds sane to me.

like we shouldn't cache overly large geometry objects

Agreed, the existing ValueLob only caches smallish LOBs in memory; we should use the same limiting mechanism.

@ebocher
Contributor

ebocher commented Apr 26, 2021

H2GIS uses geometry objects intensively, so performance is an important issue. To give you an idea, we have just processed the OpenStreetMap data set for Europe with H2-H2GIS (see http://monitoring.orbisgis.org/). It represents more than 800 million geometries, and we use a bulk parallel processing task. Thanks to the great H2 database ;-)

@katzyn
Contributor Author

katzyn commented Apr 30, 2021

I think we can also store the octet length for CLOB values in ValueClob. H2 stores them as bytes anyway, so this length is known during the write, but it isn't persisted anywhere. Its evaluation on demand can be expensive.

@grandinj
Contributor

I think we can also store the octet length for CLOB values in ValueClob. H2 stores them as bytes anyway, so this length is known during the write, but it isn't persisted anywhere. Its evaluation on demand can be expensive.

sounds like a nice win

@katzyn
Contributor Author

katzyn commented Apr 30, 2021

There is a minor complication with JSON and geometry persistence. Our current conversion / normalization / validation code is controlled by the source side, but LOB backends use an input stream. A similar problem exists in our XML support in the JDBC layer, but that layer uses additional threads.

I think such a solution is too expensive there. Maybe their processing should be rewritten to be controlled by the destination side, or LOB backends should be able to work with output streams too.

In the worst case, small objects can be processed by the current in-memory code, and large objects can be processed with a helper thread, like XML values.
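
A sketch of the helper-thread variant (names are hypothetical; the normalizer below just copies, standing in for the real source-driven validation code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class StreamedNormalizer {
    // Stand-in for the source-driven normalizer / validator.
    static void normalizeJson(Reader source, Writer target) throws IOException {
        source.transferTo(target);
    }

    // The helper thread drives the source-side code into a pipe, so the
    // LOB backend can keep pulling from an InputStream as it does today.
    static InputStream normalizedStream(Reader source) throws IOException {
        PipedInputStream in = new PipedInputStream(8192);
        PipedOutputStream out = new PipedOutputStream(in);
        Thread helper = new Thread(() -> {
            try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                normalizeJson(source, writer);
            } catch (IOException e) {
                // real code would propagate the failure to the reading side
            }
        }, "json-normalizer");
        helper.start();
        return in;
    }
}
```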

@katzyn
Contributor Author

katzyn commented Jun 28, 2021

Sorry, I didn't have time to finish my work on this issue; I will try to do that next weekend.

@ebocher
Contributor

ebocher commented Jun 28, 2021

Don't worry, you are already doing an important job.
Ping us when you are ready; we will test it in the H2GIS unit tests.

@katzyn
Contributor Author

katzyn commented Aug 5, 2021

There is a problem with the TCP layer. When the server and client have different versions of H2, we need to use the old way to pass the data over the network.

It isn't very hard, but all these alternative code paths aren't covered by our test suite.

@grandinj
Maybe we need a new test case that checks the basic functionality of all data types over the network with the new server and some older clients, and with the new client and some older servers?

@grandinj
Contributor

grandinj commented Aug 5, 2021

not sure how to write such a test case, but yes, that would be a good idea

@ebocher
Contributor

ebocher commented Aug 18, 2021

@katzyn
Did you find the time to fix this issue?
Thanks

@katzyn
Contributor Author

katzyn commented Aug 24, 2021

Not yet, and I also found a design problem in my draft implementation for the remote protocol: H2 can be fooled into preserving an object with wrong metadata by a malicious client.

ebocher added a commit to ebocher/h2database that referenced this issue Sep 2, 2021
@ebocher
Contributor

ebocher commented Oct 5, 2021

This PR has been closed, but I think this issue is only partially solved. @katzyn, any feedback or plan to solve large geometry and JSON objects in the next few weeks?
Is there any plan to release before the end of the year?
Bests

@ebocher
Contributor

ebocher commented Nov 25, 2021

@grandinj

A new H2 release is in the pipeline: #3210
We have a compiled version of H2 on our side that disables the value byte length check:
https://github.com/orbisgis/h2database/blob/master/h2/src/main/org/h2/value/ValueBytesBase.java#L33
It seems that @katzyn is no longer in the H2 community, but this length constraint is a hard limitation for the H2GIS extension.
So is it possible to integrate this fix in the next release?
@andreitokar

@grandinj
Contributor

What (from your perspective) would be a reasonable limit to set on these objects?

@ebocher
Contributor

ebocher commented Nov 25, 2021

Hi @grandinj

Thanks for the comment.
It's quite difficult to give a limit, because the size of a geometry can be very significant, and many H2GIS users use the st_accum function (which aggregates geometries), one of the most important functions for building spatial processes.

So I want to say that a reasonable limit would be a custom size. What about a set valuebyte_limit function?

grandinj added a commit to grandinj/h2database that referenced this issue Nov 25, 2021
push the check for large sizes down to minimise unwanted changes to other types
andreitokar added a commit that referenced this issue Nov 25, 2021