Large JSON and GEOMETRY objects #3016
Comments
Thanks. What do you mean by "allocate a couple of new data types (JSON LARGE OBJECT and GEOMETRY LARGE OBJECT)"? Does this mean that we will have to use this type in the SQL syntax? Or is it an internal type that lets H2 adapt its behaviour when the geometry is too big?
The first option is about new data types allowing larger values but possibly not usable by some of H2's functions and indexes. Spatial indexes may be supported, because they only use the 2D bounding box. Regular indexes on these values aren't really usable. PostgreSQL is far from the Standard and from other database systems here. It doesn't have standard binary string and large object data types, but has various non-standard features. PostGIS simply uses these features. But H2 is not PostgreSQL. The third option is about conversion of these data types to LOB data types under the same names.
Even if PostgreSQL is far from the Standard, geometry is a standardised data type. I'm not in favour of a specific data type: it would break compatibility with other spatial extensions like PostGIS and be confusing for the user. So I'm more comfortable with the third option. If I remember correctly, the first versions of H2GIS used the LOB format.
I'm also in favor of keeping the existing type names.
Using LOBs is a great idea; it will reduce the cases of multiple in-memory copies of the same geometry. We may still have to make some modifications in the H2GIS code in order to accommodate the streaming nature of this data type.
Spatial indexes on
I don't see this data type in the SQL Standard, and the Standard actually doesn't cover data types even for standard functionality such as JSON. Some DBMS, however, have it, but its parameters are different. In SQL Server this data type doesn't have any parameters. The PostGIS extension has some parameters. Recent versions of H2 have parameters copied from PostGIS with a minor intentional difference in SRID handling. It looks like in SQL Server this data type uses external storage too. BTW, maybe add an additional optional maximum encoded length parameter to it?
H2 had it in the past, but it was removed some time ago. LOBs in H2 aren't limited to 2 GiB, they can be much larger, but in some cases developers may want to disallow too large values. OK, let's convert them into specialized LOBs. Our own validators / normalizers of JSON and geometry values should be able to handle streamed data larger than the amount of available memory after minor modifications; they don't construct whole objects in memory.
In H2 with default settings, BLOBs not larger than 256 bytes are stored directly in the row. CLOBs have the same limit, but I don't remember whether it is set in bytes (in UTF-8 encoding) or in characters. This limit is configurable.
If not, we need to choose limits for them.
You are right. PostGIS uses the geometry type names as defined by SQL/MM (Figure: SQL Geometry Type hierarchy). Oracle uses SDO_GEOMETRY, SpatiaLite also has its own definition, and MySQL is close to the PostGIS types.
It seems that for PostGIS it is the maximum value of a 32-bit signed integer (2,147,483,647), as defined in the GSERIALIZED structure.
I'm not asking about the maximum possible length; LOBs in H2 use 64-bit lengths and that's not going to change for these specialized types. My question is about the maximum length of inlined objects, which should be small enough. Objects larger than this size are stored in the LOB storage. Smaller objects are stored in the row like non-LOB values; access to them is faster, but a table scan is slower, because these smaller objects are read together with the whole row.
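The trade-off described here can be sketched in plain Java. This is a minimal illustration of the inline-vs-LOB decision, not H2's real API; the constant and names are hypothetical:

```java
// Hypothetical sketch of the inline-vs-LOB placement decision.
// INPLACE_LIMIT mirrors H2's default 256-byte in-place limit for BLOBs.
class LobPlacement {
    static final int INPLACE_LIMIT = 256;

    enum Placement { INLINE_IN_ROW, LOB_STORAGE }

    static Placement choose(long encodedLength) {
        // Small values live in the row: faster point access, but they are
        // read during every full table scan. Large values go to the LOB
        // storage and are streamed on demand.
        return encodedLength <= INPLACE_LIMIT ? Placement.INLINE_IN_ROW
                                              : Placement.LOB_STORAGE;
    }
}
```

Raising the limit trades slower table scans for fewer round-trips to the LOB storage, which is exactly why a per-engine default could matter.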
Yeah, the default in-place length is a little small for MVStore; it was probably configured with PageStore in mind.
Could it be defined with a DbSettings constant, with the default value being appropriate for MVStore?
https://h2database.com/html/commands.html#set_max_length_inplace_lob |
Nice, so we could use this existing setting?
But isn't this setting for all LOB types, not reserved to geometry?
It is for all types of LOBs. We can add a separate setting, but who would really need to set different in-place limits for them? On the other hand, we can define different defaults for PageStore and MVStore. MVStore with its copy-on-write storage may work better with a higher value, but it needs to be tested.
For these specialized LOBs we may want to store some additional properties in the LOB reference to improve the performance of some operations. For JSON, the type of its content can be needed.
That would be excellent!
Thanks for your nice comments |
@katzyn |
When you compile your own snapshot jar you can add any modifications to the sources; you need to increase the value of that constant. But there are no good reasons to touch this constant in the mainline H2. We have already decided how to deal with the issue properly and it will hopefully be fixed in the near future.
We hope that it will be integrated into the H2 2.0.202 release, because accumulating geometries is a classic task in spatial databases.
@grandinj We can create an alternative driver for older databases based on 1.4.200 with renamed packages, similar to |
I'd be very reluctant to drop support for older stuff, especially since it's been so long since we put out a release. |
Currently we don't have any safe upgrade logic for older database files. When a 1.4.X file is opened by current H2 it is silently patched on the fly and re-marked as a file in the new format. Actually it isn't guaranteed to be consistent, even for PageStore. We really need to enforce full database re-creation in some way to avoid possible corruption caused by some unexpected incompatibility. Actually the file may already be partially corrupted due to bugs in 1.4.X. New (and patched old) files can't be opened by 1.4, so they may use any encoding of values in them. The only problem is the upgrade procedure. There are only two options.
Some time ago I tried to make these on-the-fly patches reliable enough, but after some tests with various old databases I must say this way isn't very safe, at least for MVStore. A new database file with a copy of the old data is more reliable. But how should the old file be read? Using current code or code from 1.4.200? It looks like the old code (including old storage backends) is safer, but I'm not sure.
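One concrete shape the dump-and-recreate approach could take today uses H2's existing Script/RunScript command-line tools (jar names, paths, and the database URL below are placeholders):

```shell
# Export with the old engine, which reads its own files safely
java -cp h2-1.4.200.jar org.h2.tools.Script \
    -url jdbc:h2:~/mydb -user sa -script backup.sql

# Re-create the database from the SQL dump with the new engine
java -cp h2-2.x.jar org.h2.tools.RunScript \
    -url jdbc:h2:~/mydb-new -user sa -script backup.sql
```

The dump may still need manual adjustments or compatibility flags during import, as noted below, but it avoids ever letting the new engine parse the old binary format.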
It's always safer to read and dump from an older version. Yes, we should check for older versions and warn. |
How about a tool based on 1.4.200 with renamed packages that is able to read 1.4.200 files and export them into SQL that is more or less valid for newer versions, optionally using a configuration with adjusted data types of columns? Export from 1.4.200 itself may need manual adjustments or compatibility flags during import, and we can't load 2.0 and 1.4.200 from the same classloader; this limitation complicates the upgrade procedure for applications.
Nice idea, I think we can include its implementation in H2, but the driver's registration needs special attention and some quirks may be needed. I don't see how we can guess the required version automatically, though. Unfortunately, H2 doesn't preserve its version in the database file. We can detect all older databases because H2 2.0 uses a different file format version, but it's hard to determine the version last used with a given file; databases from 1.4.200 can have the same format version as earlier 1.4.X files. So the upgrade procedure needs a configuration parameter with the version of H2 used last time with this file.
I'm not sure I understand everything in this discussion. Do you want to find a way to migrate from an old H2 database to a new one that will support the new geometry storage, am I right?
No, we need a safe way to migrate from 1.4 to 2.0; the lack of such a way is a release blocker. It isn't implemented yet. If this way doesn't use current code to read old databases, it will be possible to change the storage formats of some data types and drop various old variants. Some data types already have multiple storage formats. That would help fix this issue about the length of JSON and GEOMETRY values.
Maybe H2 should also store the octet length for CLOBs. Currently only the character length is persisted, but H2 stores them in UTF-8, so it knows the octet length when a CLOB is created; this information just isn't stored.
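The point about character length versus octet length is easy to see in plain Java: since the value is encoded to UTF-8 anyway, the octet length falls out for free at creation time. A minimal illustration:

```java
import java.nio.charset.StandardCharsets;

// Character length and UTF-8 octet length of the same CLOB payload differ
// as soon as non-ASCII characters appear, so persisting only the character
// length loses information that was available when the value was encoded.
class OctetLength {
    static long utf8OctetLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, `"hello"` is 5 characters and 5 octets, while `"héllo"` is still 5 characters but 6 octets, because 'é' encodes to two bytes in UTF-8.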
@grandinj
@katzyn yeah, that is tricky. The current sub-types of ValueLob are "wrong" in that they are really storage/runtime strategies instead of being something related to the SQL type. |
@grandinj |
It also looks like we shouldn't cache too-large geometry objects from the optional JTS library. The cache can be completely removed, but that may slow down some operations; H2's own conversion from EWKB to JTS objects isn't free.
Sounds sane to me.
Agreed, the existing ValueLob only caches smallish lobs in memory, we should use the same limiting mechanism |
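The limiting mechanism mentioned here can be sketched in plain Java. This is a hypothetical illustration of the idea, not H2's actual ValueLob code; all names are invented:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: cache the decoded object only when its encoded form
// is small enough, so one huge geometry cannot pin a large chunk of heap.
class SmallObjectCache<K, V> {
    private final int maxCachedBytes;
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    SmallObjectCache(int maxCachedBytes) {
        this.maxCachedBytes = maxCachedBytes;
    }

    /** Returns the decoded value, caching it only if it is small. */
    V get(K key, int encodedSize, Function<K, V> decoder) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        V value = decoder.apply(key);
        if (encodedSize <= maxCachedBytes) {
            cache.put(key, value); // large objects are decoded on every access
        }
        return value;
    }

    boolean isCached(K key) {
        return cache.containsKey(key);
    }
}
```

Large values pay the decode cost on every access, but the cache's memory footprint stays bounded by construction.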
H2GIS uses geometry objects intensively, so performance is an important issue. To give you an idea, we have just processed the OpenStreetMap data set for Europe with H2-H2GIS (see http://monitoring.orbisgis.org/). It represents more than 800 million geometries, and we use a bulk parallel processing task. Thanks to the great H2 database ;-)
I think we can also store the octet length for these values.
sounds like a nice win |
There is a minor complication with JSON and geometry persistence. Our current conversion / normalization / validation code is controlled by the source side, but LOB backends use an input stream. A similar problem exists in our XML support in the JDBC layer, but that layer uses additional threads, and I think such a solution is too expensive there. Maybe the processing should be rewritten to be controlled by the destination side, or LOB backends should be able to work with output streams too. In the worst case, small objects can be processed by the current in-memory code, and large objects can be processed with a helper thread like XML values.
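The helper-thread pattern mentioned for XML values can be sketched with standard java.io piped streams: source-side code that can only write (such as a validator/normalizer) runs on one thread, while the consumer reads an InputStream on another. This is a generic illustration of the pattern, not H2's implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

class PipedTransfer {
    // A producer thread writes (where real code would validate/normalize),
    // while the consumer side reads from an InputStream, never holding
    // more than one buffer of the value in memory at a time.
    static byte[] viaHelperThread(byte[] source) throws Exception {
        PipedInputStream in = new PipedInputStream(8192);
        PipedOutputStream out = new PipedOutputStream(in);
        Thread producer = new Thread(() -> {
            try (OutputStream o = out) {
                o.write(source);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        producer.start();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            sink.write(buf, 0, n);
        }
        producer.join();
        return sink.toByteArray();
    }
}
```

The cost is one extra thread and a pipe buffer per transfer, which is why restricting it to large values, as suggested above, makes sense.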
Sorry, I didn't have time to finish my work on this issue, will try to do that on the next weekend. |
Don't worry, you are already doing an important job.
There is a problem with the TCP layer: when the server and client have different versions of H2, we need to use the old way to pass the data over the network. It isn't very hard, but all these alternative code paths aren't covered by our test suite. @grandinj
Not sure how to write such a test case, but yes, that would be a good idea.
@katzyn |
Not yet, and I also found a design problem in my draft implementation of the remote protocol: H2 can be fooled into preserving an object with wrong metadata by a malicious client.
This PR has been closed, but I think this issue is only partially solved. @katzyn, any feedback or plan to solve large geometry and JSON objects in the next few weeks?
A new H2 release is in the pipeline: #3210
What (from your perspective) would be a reasonable limit to set on these objects? |
Hi @grandinj, thanks for the comment. I would say that a reasonable limit would be a custom, user-defined size. What about a SET VALUEBYTE_LIMIT function?
push the check for large sizes down to minimise unwanted changes to other types
In the current H2, sizes of `JSON` and `GEOMETRY` objects are limited to 1,048,576 bytes. Sometimes it isn't enough (orbisgis/h2gis#1179). Hypothetically we can make this limit configurable, but from my point of view all really large objects must be off-heap on the server side (excluding possible usage in user-defined functions, which we can't control). It means they should be stored in the same way as LOBs.

There are two possible approaches. We can allocate a couple of new data types (`JSON LARGE OBJECT` and `GEOMETRY LARGE OBJECT`), or we can add some storage specifiers (similar to `FORMAT JSON`) to the existing `CHARACTER LARGE OBJECT` and `BINARY LARGE OBJECT` data types. I think new data types are less intrusive. We also can convert these data types to LOB data types, but with a large in-place limit.

@grandinj
@ebocher
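For reference, the kind of size guard that currently rejects oversized values can be sketched as follows; the constant name, method, and exception are illustrative, not H2's actual code:

```java
// Illustrative sketch mirroring the 1,048,576-byte cap described in the issue.
class ValueSizeCheck {
    static final int MAX_OBJECT_BYTES = 1_048_576;

    static byte[] check(byte[] encoded) {
        if (encoded.length > MAX_OBJECT_BYTES) {
            // A real implementation would raise a VALUE_TOO_LONG SQL error.
            throw new IllegalArgumentException(
                    "value too long: " + encoded.length + " > " + MAX_OBJECT_BYTES);
        }
        return encoded; // small enough to keep on the heap
    }
}
```

Replacing this hard cap with LOB-style off-heap storage is exactly what the two approaches above are weighing.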