Large JSON and GEOMETRY objects #3016

Closed

katzyn opened this issue Jan 24, 2021 · 54 comments
@katzyn
Contributor

katzyn commented Jan 24, 2021

In current H2, the sizes of JSON and GEOMETRY objects are limited to 1,048,576 bytes. Sometimes that isn't enough (orbisgis/h2gis#1179).

Hypothetically we could make this limit configurable, but from my point of view all really large objects must be kept off-heap on the server side (excluding possible usage in user-defined functions, which we can't control). That means they should be stored in the same way as LOBs.

There are two possible approaches: we can allocate a couple of new data types (JSON LARGE OBJECT and GEOMETRY LARGE OBJECT), or we can add storage specifiers (similar to FORMAT JSON) to the existing CHARACTER LARGE OBJECT and BINARY LARGE OBJECT data types. I think new data types are less intrusive. We could also convert these data types to LOB data types, but with a large in-place limit.

@grandinj
@ebocher

@ebocher
Contributor

ebocher commented Jan 24, 2021

Thanks. What do you mean by "allocate a couple of new data types (JSON LARGE OBJECT and GEOMETRY LARGE OBJECT)"?

Does this mean that we will have to use this type in the SQL syntax? Or is it an internal type that lets H2 adapt its behaviour when the geometry is too big?

@nicolas-f @SPalominos

@katzyn
Contributor Author

katzyn commented Jan 24, 2021

The first option is about new data types that allow larger values but may not be usable by some of H2's functions and indexes. Spatial indexes can still be supported, because they use only the 2D bounding box. Regular indexes on such values aren't really usable.

PostgreSQL is far from the Standard and other database systems here. It doesn't have standard binary string and large object data types, but has various non-standard features. PostGIS simply uses these features. But H2 is not PostgreSQL.

The third option is to convert these data types to LOB data types under the same names (JSON and GEOMETRY).

@ebocher
Contributor

ebocher commented Jan 24, 2021

Even if PostgreSQL is far from the Standard, geometry is a standardised data type. I'm not in favour of a specific data type: it would break compatibility with other spatial extensions like PostGIS and be confusing for the user. So I'm more comfortable with the third option. If I remember correctly, the first versions of H2GIS used the LOB format.

@SPalominos

I'm also in favor of keeping the GEOMETRY type as it is defined in the standard, in order to stay compatible with other spatial databases and libraries and to avoid confusion between a custom H2 GEOMETRY type and the standardized one.
@katzyn What will be the impact of converting the GEOMETRY and JSON types into LOBs on their indexing?

@grandinj
Contributor

This:

We could also convert these data types to LOB data types, but with a large in-place limit.

is probably the best option.

@nicolas-f
Contributor

Using LOBs is a great idea: it will reduce cases of multiple in-memory copies of the geometry. We may still have to make some modifications in the H2GIS code to accommodate the streaming nature of this data type.

@katzyn
Contributor Author

katzyn commented Jan 25, 2021

What will be the impact of converting the GEOMETRY and JSON types into LOBs on their indexing?

Spatial indexes on GEOMETRY columns will work; other indexes will be disallowed.

as it is defined in the standard

I don't see this data type in the SQL Standard, and the Standard actually doesn't cover data types even for standard functionality, such as JSON.

Some DBMSs, however, have it, but its parameters are different. In SQL Server this data type doesn't have any parameters. The PostGIS extension has some parameters. Recent versions of H2 have parameters copied from PostGIS, with a minor intentional difference in SRID handling.

It looks like in SQL Server this data type uses external storage too.

BTW, maybe add an additional optional maximum encoded length parameter to GEOMETRY? Something like
GEOMETRY [([length[lengthUnit],] geometryType [, srid])]:

GEOMETRY
GEOMETRY(10M)
GEOMETRY(POINT Z)
GEOMETRY(POINT Z, 4326)
GEOMETRY(10M, POINT Z)
GEOMETRY(10M, POINT Z, 4326)

H2 had it in the past, but it was removed some time ago. LOBs in H2 aren't limited to 2 GiB, they can be much larger, but in some cases developers may want to disallow too-large values.

OK, let's convert them into specialized LOBs. Our own validators / normalizers of JSON and geometry values should be able to handle streamed data larger than the amount of available memory after minor modifications; they don't construct whole objects in memory.

@katzyn
Contributor Author

katzyn commented Jan 25, 2021

In H2 with default settings, BLOBs not larger than 256 bytes are stored directly in the row. CLOBs have the same limit, but I don't remember whether it is set in bytes (in UTF-8 encoding) or in characters. This limit is configurable.

  1. Should JSON use the same default limit and configuration option?
  2. Should GEOMETRY use the same default limit and configuration option?

If not, we need to choose limits for them.

@ebocher
Contributor

ebocher commented Jan 25, 2021

I don't see this data type in the SQL Standard, and the Standard actually doesn't cover data types even for standard functionality, such as JSON.

You are right. PostGIS uses the geometry type names as defined by SQL/MM (Figure: SQL Geometry Type hierarchy). Oracle uses SDO_GEOMETRY, SpatiaLite also has its own definition, and MySQL is close to the PostGIS types.

@ebocher
Contributor

ebocher commented Jan 25, 2021

In H2 with default settings, BLOBs not larger than 256 bytes are stored directly in the row. CLOBs have the same limit, but I don't remember whether it is set in bytes (in UTF-8 encoding) or in characters. This limit is configurable.

1. Should `JSON` use the same default limit and configuration option?

2. Should `GEOMETRY` use the same default limit and configuration option?

If not, we need to choose limits for them.

It seems that for PostGIS it is the maximum value of a 32-bit signed integer (2,147,483,647), defined in the GSERIALIZED structure.

@ebocher
Contributor

ebocher commented Jan 25, 2021

@nicolas-f @SPalominos

@katzyn
Contributor Author

katzyn commented Jan 26, 2021

I'm not asking about the maximum possible length; LOBs in H2 use 64-bit lengths, and that's not going to change for these specialized types.

My question is about the maximum length of inlined objects; it should be small enough. Objects larger than this size are stored in the LOB storage. Smaller objects are stored in the row like non-LOB values; access to them is faster, but table scans are slower, because these smaller objects are read together with the whole row.

@grandinj
Contributor

Yeah, the default INPLACE length is a little small for MVStore; it was probably configured with PageStore in mind.

@nicolas-f
Contributor

Could it be defined with a DbSettings constant, with the default value being appropriate for MVStore?

@katzyn
Contributor Author

katzyn commented Jan 27, 2021

https://h2database.com/html/commands.html#set_max_length_inplace_lob
(The published documentation is not correct; the default is 256.)
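
For reference, a minimal JDBC sketch of changing that setting (the database URL and the value 4096 are illustrative, not recommendations):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class InplaceLobLimit {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./test", "sa", "");
             Statement stat = conn.createStatement()) {
            // Values up to this many bytes stay inline in the row;
            // larger values go to the separate LOB storage.
            stat.execute("SET MAX_LENGTH_INPLACE_LOB 4096");
        }
    }
}
```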

@nicolas-f
Contributor

Nice, so we could use this existing setting?

@ebocher
Contributor

ebocher commented Jan 27, 2021

But this setting is for all LOB types, not reserved to geometry?

@katzyn
Contributor Author

katzyn commented Jan 27, 2021

It is for all types of LOBs. GEOMETRY is not yet a LOB, but it will be, because your project needs to store large values.

We can add a separate setting, but who would really need to set different in-place limits for CHARACTER LARGE OBJECT, BINARY LARGE OBJECT, GEOMETRY, and JSON? I don't see a good case for that.

On the other hand, we can define different defaults for PageStore and MVStore. MVStore, with its copy-on-write storage, may work better with a higher value, but that needs to be tested.

@katzyn
Contributor Author

katzyn commented Jan 27, 2021

For these specialized LOBs we may want to store some additional properties in the LOB reference to improve the performance of some operations. For JSON, the type of its content may be needed for the IS JSON predicate. For geometry values, the type, dimension system, bounding box, and possibly SRID are needed. Bounding boxes are especially needed for the && operator; it would be nice to store them in-place even for large objects to avoid reads from the LOB backend during row filtration.
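
For illustration, how such a bounding-box filter is used (schema and data are made up; && is H2's bounding-box intersection operator, and CAST from a WKT string to GEOMETRY is assumed to be available):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BoundingBoxFilter {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:spatial", "sa", "");
             Statement stat = conn.createStatement()) {
            stat.execute("CREATE TABLE places(id INT PRIMARY KEY, geom GEOMETRY)");
            stat.execute("CREATE SPATIAL INDEX places_geom ON places(geom)");
            stat.execute("INSERT INTO places VALUES (1, CAST('POINT (5 5)' AS GEOMETRY))");
            // && compares only bounding boxes; with the envelope stored in-place
            // (or in the spatial index) this filter needs no read from the LOB backend.
            try (ResultSet rs = stat.executeQuery(
                    "SELECT id FROM places WHERE geom && "
                    + "CAST('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))' AS GEOMETRY)")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}
```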

@ebocher
Contributor

ebocher commented Jan 27, 2021

it would be nice to store them in-place even for large objects to avoid reads from the LOB backend during row filtration

That would be excellent!

@ebocher
Contributor

ebocher commented Jan 27, 2021

Thanks for your nice comments

@ebocher
Contributor

ebocher commented Feb 4, 2021

@katzyn
Is it possible to temporarily remove this size limitation on geometry while waiting for a more sophisticated fix?
This issue is really blocking for us.

@katzyn
Contributor Author

katzyn commented Feb 4, 2021

When you compile your own snapshot jar you can make any modifications to the sources. You need to increase the value of Constants.MAX_STRING_LENGTH.

But there are no good reasons to touch this constant in the mainline H2. We already decided how to deal with the issue properly, and it will hopefully be fixed in the near future.

@ebocher
Contributor

ebocher commented Feb 4, 2021

We already decided how to deal with the issue properly, and it will hopefully be fixed in the near future.

We hope that it will be integrated into the H2 2.0.202 release, because accumulating geometries is a classic task in a spatial database.
Don't hesitate to ping us if some discussions are needed for this fix.

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

@grandinj
Looking at the amount of various compatibility code in our storage backends and the weird encoding of some values, I'm wondering: should we continue to support all this stuff, or can we reimplement the encoding of values from scratch and forbid operations on 1.X.X databases?

We could create an alternative driver for older databases based on 1.4.200 with renamed packages, similar to org.h2.upgrade.v1_1.Driver for migrations from 1.1 to 1.2.

@grandinj
Contributor

grandinj commented Feb 6, 2021

I'd be very reluctant to drop support for older stuff, especially since it's been so long since we put out a release.
Maintaining backwards compatibility is the price of popularity... to paraphrase: "there are two kinds of database engine - the kind that everybody complains about, and the kind that nobody cares about"

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

Currently we don't have any safe upgrade logic for older database files. When a 1.4.X file is opened by current H2, it is silently patched on the fly and re-marked as a file in the new format. It actually isn't guaranteed to be consistent, even for PageStore.

We really need to enforce full database re-creation in some way to avoid possible corruption caused by some unexpected incompatibility. The file may actually already be partially corrupted due to bugs in 1.4.X.

New (and patched old) files can't be opened by 1.4, so they may use any encoding of values in them.

The only problem is upgrade procedure. There are only two options.

  1. H2 refuses to open an old file with some sane error message, and the migration guide describes how to use SCRIPT / RUNSCRIPT properly (see the sketch after this list). In this case there are no compatibility issues in encodings.

  2. H2 creates a new file automatically and moves the old data into it. The old file is preserved as a backup copy. In this case we need some code able to read old files, possibly in a separate optional jar (like the old addon for 1.2), or we can continue to support all the old encodings and also much other compatibility stuff in MVStore.
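
A hedged sketch of option 1 from the application side (file and database names are illustrative; since 2.0 and 1.4.200 can't be loaded from the same classloader, the two steps would run with different classpaths):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ScriptMigration {
    // Step 1: run with H2 1.4.200 on the classpath.
    static void dumpOld() throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./olddb", "sa", "");
             Statement stat = conn.createStatement()) {
            stat.execute("SCRIPT TO 'dump.sql'");
        }
    }

    // Step 2: run with H2 2.x on the classpath, against a fresh database file.
    static void loadNew() throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./newdb", "sa", "");
             Statement stat = conn.createStatement()) {
            stat.execute("RUNSCRIPT FROM 'dump.sql'");
        }
    }
}
```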

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

Some time ago I tried to make these on-the-fly patches reliable enough, but after some tests with various old databases I must say this way isn't very safe, at least for MVStore. A new database file with a copy of the old data is more reliable.

But how should the old file be read? Using current code, or code from 1.4.200? It looks like the old code (including the old storage backends) is safer, but I'm not sure.

@grandinj
Contributor

grandinj commented Feb 6, 2021

It's always safer to read and dump from an older version.

Yes, we should check for older versions and warn.

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

How about a tool based on 1.4.200 with renamed packages that is able to read 1.4.200 files and export them into SQL that is more or less valid for newer versions, optionally using a configuration with adjusted data types of columns?

An export from 1.4.200 itself may need manual adjustments or compatibility flags during import, and we can't load 2.0 and 1.4.200 from the same classloader; this limitation complicates the upgrade procedure for applications.

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

It is also not that hard to
(a) grab an older version from maven
(b) load that jar from an older version inside a custom classloader, and invoke stuff inside it.

Nice idea; I think we can include an implementation of it in H2. But the driver's registration needs special attention; some quirks may be needed.

But I don't see how we can guess the required version automatically. Unfortunately, H2 doesn't preserve its version in the database file. We can detect all older databases, because H2 2.0 uses a different file format version, but it's hard to determine the version that was last used with the file. Databases from 1.4.200 can have CREATE_BUILD = 197, for example, if someone was lucky enough and the database was automatically upgraded to 1.4.200 without any significant issues.

So the upgrade procedure needs a configuration parameter with the version of H2 that was last used with this file.
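
For illustration, a minimal sketch of loading an older H2 jar in an isolated classloader and invoking its driver directly (the jar path, database URL, and use of SCRIPT are assumptions; DriverManager is bypassed because it only accepts drivers visible to the caller's classloader):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.Driver;
import java.sql.Statement;
import java.util.Properties;

public class LegacyExport {
    public static void main(String[] args) throws Exception {
        URL oldJar = Path.of("h2-1.4.200.jar").toUri().toURL();
        // Parent is the platform classloader, so the old org.h2 classes
        // don't clash with the new H2 on the application classpath.
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { oldJar }, ClassLoader.getPlatformClassLoader())) {
            Driver driver = (Driver) Class.forName("org.h2.Driver", true, loader)
                    .getDeclaredConstructor().newInstance();
            Properties props = new Properties();
            props.setProperty("user", "sa");
            props.setProperty("password", "");
            try (Connection conn = driver.connect("jdbc:h2:./olddb", props);
                 Statement stat = conn.createStatement()) {
                stat.execute("SCRIPT TO 'dump.sql'");
            }
        }
    }
}
```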

@ebocher
Contributor

ebocher commented Feb 6, 2021

Not sure I understand everything in this discussion. Do you want to find a way to migrate from an old H2 database to a new one that will support the new geometry storage? Am I right?

@katzyn
Contributor Author

katzyn commented Feb 6, 2021

No, we need a safe way to migrate from 1.4 to 2.0; the lack of such a way is a release blocker. It isn't implemented yet. If it doesn't use the current code to read old databases, it will be possible to change the storage formats of some data types and drop various old variants. Some data types already have multiple storage formats.

To fix this issue about the length of JSON and GEOMETRY, we need to introduce new storage formats for them. Without the requirement to read older files directly, there will be no need to preserve additional support for their old storage formats, and the code will be simpler.

@katzyn
Contributor Author

katzyn commented Mar 28, 2021

For JSON we may want to store the object type (ARRAY, OBJECT, or SCALAR) in the value itself.

For GEOMETRY, its type and dimension system, SRID, and envelope are needed. The envelope is used by spatial indexes and the && operator; that's why I included it in this list.

Maybe H2 should also store the octet length for CLOBs. Currently only the character length is persisted, but H2 stores them in UTF-8, so H2 knows the octet length when a CLOB is created; this information just isn't stored.

@katzyn
Contributor Author

katzyn commented Apr 11, 2021

@grandinj
I'm not sure what to do with the current hierarchy of LOB value classes. H2 uses the same classes for both character and binary large objects, because they are very similar to each other, but JSON has more differences, and GEOMETRY is very different from all the others. H2 has the ValueLob abstract class and four implementations.

  1. A new abstract class like LobData can be introduced, with similar four implementations. This class would be only the holder of the binary data; the main value classes would hold the additional information and a reference to one of these LobData implementations.

  2. Additional optional fields for JSON-specific and GEOMETRY-specific properties can be added directly to the existing ValueLob class.

@grandinj
Contributor

@katzyn yeah, that is tricky. The current sub-types of ValueLob are "wrong" in that they are really storage/runtime strategies instead of being something related to the SQL type.
So possibly we should refactor ValueLob to have a child object that holds the storage strategy, and then we can have a ValueJson and ValueGeometry that subclass ValueLob.
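
A compilable sketch of that shape (all class names here are hypothetical stand-ins, not H2's actual classes): the storage strategy lives in a holder object, while subclasses carry the type-specific metadata discussed above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Storage strategy: where the bytes live (in memory, temp file, database, ...).
abstract class LobData {
    abstract InputStream getInputStream();
    abstract long octetLength();
}

final class LobDataInMemory extends LobData {
    private final byte[] bytes;
    LobDataInMemory(byte[] bytes) { this.bytes = bytes; }
    @Override InputStream getInputStream() { return new ByteArrayInputStream(bytes); }
    @Override long octetLength() { return bytes.length; }
}

// Value classes hold SQL-type-specific metadata and delegate storage.
abstract class LobValue {
    final LobData data;
    LobValue(LobData data) { this.data = data; }
}

final class JsonValue extends LobValue {
    enum ItemType { ARRAY, OBJECT, SCALAR }
    final ItemType itemType; // kept in the value for the IS JSON predicate
    JsonValue(LobData data, ItemType itemType) { super(data); this.itemType = itemType; }
}

final class GeometryValue extends LobValue {
    final int srid;
    final double[] envelope; // minX, maxX, minY, maxY for && and spatial indexes
    GeometryValue(LobData data, int srid, double[] envelope) {
        super(data); this.srid = srid; this.envelope = envelope;
    }
}
```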

@katzyn
Contributor Author

katzyn commented Apr 24, 2021

@grandinj
Do we need some limit on the size of string and numeric values inside JSON data? The JSON normalization and validation procedure currently allows very large strings and numbers that may lead to an OOME. Perhaps we can use the same limits as we have for SQL character string and numeric values.
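
For illustration only, a hypothetical guard inside a streaming JSON reader (the constant and method are made up; the idea is just to fail before an unbounded token is buffered):

```java
// Hypothetical guard: reject a single JSON string token once it exceeds
// the same limit that is assumed for SQL character strings.
final class JsonLimits {
    static final int MAX_STRING_LENGTH = 1_048_576; // assumed to mirror the SQL limit

    static void checkStringLength(int lengthSoFar) {
        if (lengthSoFar > MAX_STRING_LENGTH) {
            throw new IllegalStateException(
                    "JSON string value exceeds " + MAX_STRING_LENGTH + " characters");
        }
    }
}
```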

@katzyn
Contributor Author

katzyn commented Apr 24, 2021

It also looks like we shouldn't cache overly large geometry objects from the optional JTS library in ValueGeometry instances. If the internal EWKB representation isn't longer than our limit for binary strings, the JTS org.locationtech.jts.geom.Geometry can be cached; in other cases it should be safer to construct it only on demand.

The cache could be removed completely, but that may slow down some operations. H2's own conversion from EWKB to Geometry is relatively fast (it was faster than the code from JTS), but it is still slower than Geometry.copy() (we can't return the same instance, because geometries are mutable).
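
A sketch of such size-capped caching (the limit is illustrative, and JTS's WKBReader stands in here for H2's own EWKB converter):

```java
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

final class CachedGeometry {
    private static final int CACHE_LIMIT = 1_048_576; // only cache smallish geometries

    private final byte[] ewkb;
    private Geometry cached; // populated lazily, and only for small values

    CachedGeometry(byte[] ewkb) { this.ewkb = ewkb; }

    Geometry getGeometry() throws ParseException {
        if (ewkb.length > CACHE_LIMIT) {
            return parse(); // too large: build on demand, don't retain
        }
        if (cached == null) {
            cached = parse();
        }
        // Geometries are mutable, so hand out a copy, never the cached instance.
        return cached.copy();
    }

    private Geometry parse() throws ParseException {
        return new WKBReader().read(ewkb);
    }
}
```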

@grandinj
Contributor

@grandinj
Do we need some limit on the size of string and numeric values inside JSON data? The JSON normalization and validation procedure currently allows very large strings and numbers that may lead to an OOME. Perhaps we can use the same limits as we have for SQL character string and numeric values.

Sounds sane to me.

like we shouldn't cache overly large geometry objects

Agreed, the existing ValueLob only caches smallish LOBs in memory; we should use the same limiting mechanism.

@ebocher
Contributor

ebocher commented Apr 26, 2021

H2GIS uses geometry objects intensively, so performance is an important issue. To give you an idea, we have just processed the OpenStreetMap data set for Europe with H2-H2GIS (see http://monitoring.orbisgis.org/). It represents more than 800 million geometries, and we use a bulk parallel processing task. Thanks to the great H2 database ;-)

@katzyn
Contributor Author

katzyn commented Apr 30, 2021

I think we can also store the octet length for CLOB values in ValueClob. H2 stores them as bytes anyway, so this length is known during the write, but it isn't persisted anywhere. Its evaluation on demand can be expensive.

@grandinj
Contributor

I think we can also store the octet length for CLOB values in ValueClob. H2 stores them as bytes anyway, so this length is known during the write, but it isn't persisted anywhere. Its evaluation on demand can be expensive.

sounds like a nice win

@katzyn
Contributor Author

katzyn commented Apr 30, 2021

There is a minor complication with JSON and geometry persistence. Our current conversion / normalization / validation code is controlled by the source side, but LOB backends use an input stream. A similar problem exists in our XML support in the JDBC layer, but that layer uses additional threads.

I think such a solution is too expensive there. Maybe their processing should be rewritten to be controlled by the destination side, or LOB backends should be able to work with output streams too.

In the worst case, small objects can be processed by the current in-memory code, and large objects can be processed with a helper thread, like XML values.
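
A sketch of the helper-thread variant (names are hypothetical; the normalizer below just copies, standing in for the real source-driven validation code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class StreamedNormalizer {
    // Stand-in for the source-driven normalizer / validator.
    static void normalizeJson(Reader source, Writer target) throws IOException {
        source.transferTo(target);
    }

    // The helper thread drives the source-side code into a pipe, so the
    // LOB backend can keep pulling from an InputStream as it does today.
    static InputStream normalizedStream(Reader source) throws IOException {
        PipedInputStream in = new PipedInputStream(8192);
        PipedOutputStream out = new PipedOutputStream(in);
        Thread helper = new Thread(() -> {
            try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                normalizeJson(source, writer);
            } catch (IOException e) {
                // real code would propagate the failure to the reading side
            }
        }, "json-normalizer");
        helper.start();
        return in;
    }
}
```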

@katzyn
Contributor Author

katzyn commented Jun 28, 2021

Sorry, I didn't have time to finish my work on this issue; I will try to do that next weekend.

@ebocher
Contributor

ebocher commented Jun 28, 2021

Don't worry, you are already doing an important job.
Ping us when you are ready; we will test it in the H2GIS unit tests.

@katzyn
Contributor Author

katzyn commented Aug 5, 2021

There is a problem with the TCP layer. When the server and client have different versions of H2, we need to use the old way to pass the data over the network.

It isn't very hard, but all these alternative code paths aren't covered by our test suite.

@grandinj
Maybe we need a new test case that checks the basic functionality of all data types over the network with the new server and some older clients, and with the new client and some older servers?

@grandinj
Contributor

grandinj commented Aug 5, 2021

not sure how to write such a test case, but yes, that would be a good idea

@ebocher
Contributor

ebocher commented Aug 18, 2021

@katzyn
Did you find the time to fix this issue?
Thanks

@katzyn
Contributor Author

katzyn commented Aug 24, 2021

Not yet, and I also found a design problem in my draft implementation for the remote protocol: H2 can be fooled into preserving an object with wrong metadata by a malicious client.

ebocher added a commit to ebocher/h2database that referenced this issue Sep 2, 2021
@ebocher
Contributor

ebocher commented Oct 5, 2021

This PR has been closed, but I think this issue is only partially solved. @katzyn, any feedback or plan to solve large geometry and JSON objects in the next few weeks?
Is there any plan to release before the end of the year?
Bests

@ebocher
Contributor

ebocher commented Nov 25, 2021

@grandinj

A new H2 release is in the pipeline: #3210
We have a compiled version of H2 on our side that disables the value byte length check:
https://github.com/orbisgis/h2database/blob/master/h2/src/main/org/h2/value/ValueBytesBase.java#L33
It seems that @katzyn is no longer in the H2 community, but this length constraint is a hard limitation for the H2GIS extension.
So is it possible to integrate this fix in the next release?
@andreitokar

@grandinj
Contributor

What (from your perspective) would be a reasonable limit to set on these objects?

@ebocher
Contributor

ebocher commented Nov 25, 2021

Hi @grandinj

Thanks for the comment.
It's quite difficult to give a limit, because the size of a geometry can be very significant, and many H2GIS users use the st_accum function (which aggregates geometries), one of the most important functions for building spatial processes.

So I want to say that a reasonable limit would be a custom size. What about a set valuebyte_limit function?

grandinj added a commit to grandinj/h2database that referenced this issue Nov 25, 2021
push the check for large sizes down to minimise unwanted changes to other types
andreitokar added a commit that referenced this issue Nov 25, 2021