
DBZ-1413 Support PostgreSQL Domain and Enum types #1079

Merged
merged 13 commits from the DBZ-1413 branch into debezium:master on Dec 10, 2019

Conversation

Naros
Member

@Naros Naros commented Oct 15, 2019

@Naros Naros requested a review from jpechane October 15, 2019 14:31
@Naros Naros force-pushed the DBZ-1413 branch 2 times, most recently from b4abf2a to 6972269 on November 5, 2019 23:06
@Naros
Member Author

Naros commented Nov 5, 2019

@jpechane If you have time, could you give this a skim and let me know if I am on the right track?

Contributor

@jpechane jpechane left a comment

@Naros Generally, this looks nice to me! I've left a few comments in the code.
I also like the extended test coverage. Just please make sure that all domain use cases are covered and that domain of domain also works.

What will happen when an unknown type that is not a domain type is encountered?
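
(For reference, "domain of domain" means a chain like the one sketched below; this is a minimal illustration using plain JDBC rather than the connector's test harness, and the connection URL and credentials are placeholders.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DomainOfDomainExample {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; adjust for a real environment.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/postgres", "postgres", "postgres");
                 Statement stmt = conn.createStatement()) {
                // A domain over a base type, and a second domain over the first one.
                stmt.execute("CREATE DOMAIN numeric82 AS numeric(8, 2)");
                stmt.execute("CREATE DOMAIN numeric82_positive AS numeric82 CHECK (VALUE > 0)");
                // The connector must resolve numeric82_positive all the way down to numeric(8, 2).
                stmt.execute("CREATE TABLE orders (pk SERIAL PRIMARY KEY, amount numeric82_positive)");
                stmt.execute("INSERT INTO orders (amount) VALUES (19.99)");
            }
        }
    }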

@@ -36,12 +38,17 @@
private static final Logger LOGGER = LoggerFactory.getLogger(AbstractColumnValue.class);

@Override
- public LocalDate asLocalDate() {
+ public Object asLocalDate() {
Contributor

Why does this need to be changed?

Member Author

This came down to the fact that PgProtoColumnValue extended AbstractColumnValue and it needed to return a Duration rather than a LocalDate. In fact, all the return type changes I had to make to the abstract class and its interface were due to keeping the same return types as the prior code this replaced. Do you have a preferred solution for this?

Member

I still don't quite understand this. Why wasn't this needed before?

Member Author

When AbstractColumnValue was first implemented, it was only used by pgoutput and wal2json, both of which returned a LocalDate type. When migrating decoderbufs to use this class while implementing PgProtoColumnValue, I changed this to account for the fact that PgProtoReplicationMessage#getValue returned different types.

In hindsight I guess I could have left this as-is and instead implemented PgColumnValue#asLocalDate as follows:

    @Override
    public LocalDate asLocalDate() {
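        // int32 datum: interpret as epoch days; otherwise fall back to parsing the textual form below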
        if (value.hasDatumInt32()) {
            return LocalDate.ofEpochDay((long) value.getDatumInt32());
        }

        final String s = asString();
        return s != null ? DateTimeFormat.get().date(s) : null;
    }

Is there a preference here?

Member

Yes, that one seems definitely nicer. So perhaps most (all?) return type changes in AbstractColumnValue could be reverted then?

Member Author

For now I've reverted all but the ones related to asTime() and asLocalTime() given that I have yet to find a solution that works with PostgreSQL allowing 24:00:00 in TIME fields; see commit 4389c16.
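
(For context, a minimal illustration of the 24:00:00 wrinkle, not the connector's actual converter code: java.time.LocalTime cannot represent 24:00:00, so the converter has to return something wider, such as a Duration.)

    import java.time.Duration;
    import java.time.LocalTime;

    public class TimeValueSketch {
        static Object asTime(String value) {
            if (value.startsWith("24:00:00")) {
                // LocalTime.parse("24:00:00") would throw; PostgreSQL nevertheless allows it in TIME columns.
                return Duration.ofHours(24);
            }
            return Duration.between(LocalTime.MIN, LocalTime.parse(value));
        }

        public static void main(String[] args) {
            System.out.println(asTime("24:00:00")); // PT24H
            System.out.println(asTime("13:37:00")); // PT13H37M
        }
    }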

@Naros
Member Author

Naros commented Nov 11, 2019

While adding more test coverage for snapshot phase, I uncovered the following: pgjdbc/pgjdbc#1604

The cliff notes version here is that if a user defines a column's data type as:

CREATE DOMAIN varbit2 AS varbit(3);
CREATE TABLE test_table (pk serial, value varbit2, primary key(pk));

When we build the table's schema during snapshot, we get the column's dimensions from the JDBC driver's DatabaseMetaData#getColumns call, which currently reports the value column's length as 2147483647 rather than the anticipated 3.

Obviously this would lead to erroneous schema change events being emitted when a streaming change is detected for the table, as streaming properly resolves a column's dimensions based on the column type's modifier value.
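
(A minimal reproduction sketch of what was reported upstream, using plain JDBC; the connection details are placeholders, and COLUMN_SIZE is the standard JDBC metadata column for the reported length.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    public class DomainLengthRepro {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/postgres", "postgres", "postgres")) {
                ResultSet rs = conn.getMetaData().getColumns(null, "public", "test_table", "value");
                while (rs.next()) {
                    // Expected 3 (from varbit(3)); the unpatched driver reports 2147483647.
                    System.out.println(rs.getString("COLUMN_NAME") + " -> " + rs.getInt("COLUMN_SIZE"));
                }
            }
        }
    }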

@davecramer
Contributor

@Naros OK, so I see a fairly easy way to fix this in the driver. However... I'm curious: what else is needed? Are you planning on supporting UDTs?

@Naros
Member Author

Naros commented Nov 12, 2019

@davecramer A recent comment on the JIRA issue leads me to say yes.
Specifically mentioned on the issue is CREATE TYPE mytype as ENUM ( ... );

@davecramer
Contributor

ugh... ENUMs suck in pg.

@gunnarmorling
Member

@Naros, could you clarify whether this one would then also resolve the enum issue? If so, that'd be great. A user was just asking for that again on DBZ-920.

@davecramer
Contributor

@gunnarmorling as far as I know it resolves the enum issue, and I just pushed pgjdbc/pgjdbc#1611 for numeric. I haven't released a new version yet, though.

@Naros
Member Author

Naros commented Nov 25, 2019

@gunnarmorling Commit 36e1b06 should provide Enum support.

When we emit events, enum-based columns now use io.debezium.data.Enum schemas. If a user wants the underlying database type information as well, they can enable the column source type propagation configuration, column.propagate.source.type, for the column.

For example, CREATE TYPE test_type as ENUM ('V1', 'V2') will specify that the type is TEST_TYPE when column source type propagation is enabled for a column which uses test_type.
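
(Roughly what the emitted field schema looks like, sketched with Kafka Connect's SchemaBuilder; this is not Debezium's actual builder code, and the parameter names shown are included for illustration only.)

    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.data.SchemaBuilder;

    public class EnumSchemaSketch {
        public static void main(String[] args) {
            Schema enumField = SchemaBuilder.string()
                    .name("io.debezium.data.Enum")                           // logical type used for enum columns
                    .parameter("allowed", "V1,V2")                           // allowed enum values
                    .parameter("__debezium.source.column.type", "TEST_TYPE") // added when source type propagation is enabled
                    .optional()
                    .build();
            System.out.println(enumField.name() + " " + enumField.parameters());
        }
    }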

The tests will likely fail for this until we update the driver, but I did test this with a local build of pgjdbc using commit pgjdbc/pgjdbc@0507979.

@Naros Naros changed the title from "DBZ-1413 Support PostgreSQL domain types" to "DBZ-1413 Support PostgreSQL Domain and Enum types" on Nov 25, 2019
Member

@gunnarmorling gunnarmorling left a comment

@Naros, huge piece of work, thanks a lot! A few comments inline, mostly minor things. I'm still struggling a bit though to understand the larger picture of the change. Could you perhaps provide a few sentences that explain the overall design/strategy and the refactorings you did? That'd help a lot. Thanks!

@@ -47,6 +54,17 @@ public boolean isArrayType() {
return elementType != null;
}

/**
* @return true if this type is a base type
Member

What is a "base type"? Can you clarify in the JavaDoc?

Member Author

@Naros Naros Dec 4, 2019

In the recent commits I decided to expand on this: a PostgresType has both a base type, which is now documented, and also allows fetching a type's root type, the topmost type in the type hierarchy. This proved nice for abstracting away the iterative traversal in the places where we needed it.
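
(A hypothetical sketch of that idea; the names below are modeled on this discussion rather than the exact Debezium API.)

    final class TypeNode {
        private final String name;
        private final TypeNode parent; // null for a base type

        TypeNode(String name, TypeNode parent) {
            this.name = name;
            this.parent = parent;
        }

        boolean isBaseType() {
            return parent == null;
        }

        // Walk the hierarchy iteratively so callers never have to.
        TypeNode getRootType() {
            TypeNode current = this;
            while (!current.isBaseType()) {
                current = current.parent;
            }
            return current;
        }

        public static void main(String[] args) {
            TypeNode varbit = new TypeNode("varbit", null);
            TypeNode varbit3 = new TypeNode("varbit3", varbit);            // CREATE DOMAIN varbit3 AS varbit(3)
            TypeNode varbit3child = new TypeNode("varbit3child", varbit3); // domain of a domain
            System.out.println(varbit3child.getRootType().name);           // prints "varbit"
        }
    }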

@@ -907,4 +920,54 @@ protected Object convertString(Column column, Field fieldDefn, Object data) {
}
return super.convertString(column, fieldDefn, data);
}

/**
Member

Let's avoid such empty comments. But could you add one to the method in the base class, describing its purpose?

Member Author

By associating the right JDBC type and native type with a column at construction time, these Types.DISTINCT-specific handler methods could be removed as they're no longer needed.

// fact the resolved base type is bool, not some oid that resolves to an unhandled type.
//
// Perhaps there are better ways - TBD.
return column.edit().jdbcType(baseType.getJdbcId()).nativeType(baseType.getOid()).create();
Member

Hmm, this side-effect of the get... method is unexpected. Could this be moved to construction time?

Member Author

Quite possibly, I'll look and let you know my findings.

Member Author

Another option I explored was to instead make a small change to JdbcValueConverters where we move what the switch statements are based on into a method argument and then delegate to those methods as needed, e.g.

    @Override
    public SchemaBuilder schemaBuilder(Column column) {
        return resolveSchemaBuilderByJdbcType(column.jdbcType(), column);
    }

    protected SchemaBuilder resolveSchemaBuilderByJdbcType(int jdbcType, Column column) {
        switch (jdbcType) {
            ...
        }
    }

This allows PostgresValueConverter to call those new resolve methods directly; rather than creating a new Column instance solely for the sake of controlling the switch argument, we can instead change those methods like so:

    @Override
    protected SchemaBuilder distinctSchema(Column column) {
        final int rootTypeJdbcId = typeRegistry.get(column.nativeType()).getRootType().getJdbcId();
        return resolveSchemaBuilderByJdbcType(rootTypeJdbcId, column);
    }

I am still going to test your suggestion with Column construction to be sure there are no side effects, in the event you don't particularly care for all this method delegation here.

Member Author

I decided to go with your construction approach as it seemed to solve all the corner nuances without introducing any additional side effects.
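
(A simplified sketch of the construction-time approach, assuming debezium-core's Column/ColumnEditor API as used in the excerpt above; the ResolvedType interface is a hypothetical stand-in for what the type registry resolves.)

    import io.debezium.relational.Column;
    import io.debezium.relational.ColumnEditor;

    public class DomainColumnFactory {

        // Hypothetical stand-in for what the type registry resolves for a domain type.
        interface ResolvedType {
            ResolvedType getRootType();
            int getJdbcId();
            int getOid();
        }

        // Resolve the root type once, while the column is built, instead of
        // re-editing the column inside a getter and surprising callers with a side effect.
        public static Column createColumn(String name, ResolvedType domainType) {
            ResolvedType root = domainType.getRootType();
            ColumnEditor editor = Column.editor()
                    .name(name)
                    .jdbcType(root.getJdbcId())  // e.g. Types.BIT for a varbit-based domain
                    .nativeType(root.getOid());  // root type's OID, not the domain's
            return editor.create();
        }
    }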

@Naros
Member Author

Naros commented Dec 2, 2019

There are quite a number of changes here and it can definitely be difficult to see the bigger picture; let me know if you need any more details than the below:

When working with decoderbufs to start, I noticed that user-defined domain types are always sent as byte arrays. After talks with Jiri, we concluded we'd make changes in the connector to be able to parse those byte arrays when detected. We decided it made sense to align the connector's handler for decoderbufs with the value resolution we had already done for wal2json and pgoutput. This is what led to the changes to AbstractColumnValue and its subclasses.

The remaining changes revolve around supporting PostgresType as a part of a hierarchy of types, allowing us to traverse from some point in the hierarchy upward. This was important to be able to accurately resolve length/scale for columns for domain types as well as the appropriate schema builder and value converters.

@Naros
Member Author

Naros commented Dec 2, 2019

@gunnarmorling @jpechane While I was working on porting the streaming test that checks a slew of domain types to the snapshot integration test, I noticed a comment that we probably should discuss before merging this.

Let's assume we have a table with a column defined with a data type of BOX.

During snapshot, a record exists where it was inserted using the value (0,0),(1,1); the emitted value is a byte buffer with the contents (0.0,0.0),(1.0,1.0). I want to point out that the emitted value contains a .0 for each of the box's coordinates. The odd behavior comes into play during streaming, however: when using wal2json or pgoutput, the emitted value is a byte buffer with the same contents as the snapshot's; when using decoderbufs, it's actually a byte buffer with the contents (0,0),(1,1) (notice the missing .0).

I checked master and the same behavior is observed there. Should we open another issue to try and standardize this across the decoders? The test in RecordsStreamProducerIT shows this is an inconsistency for the box, circle, line, lseg, path, and polygon data types afaict. Thoughts?

@Naros
Member Author

Naros commented Dec 5, 2019

During snapshot, a record exists where it was inserted using the value (0,0),(1,1); the emitted value is a byte buffer with the contents (0.0,0.0),(1.0,1.0). I want to point out that the emitted value contains a .0 for each of the box's coordinates. The odd behavior comes into play during streaming, however: when using wal2json or pgoutput, the emitted value is a byte buffer with the same contents as the snapshot's; when using decoderbufs, it's actually a byte buffer with the contents (0,0),(1,1) (notice the missing .0).

As a follow-up, this was resolved by the changes added in commit 4389c16, where we aligned the decoderbufs decoder to return the PG types for boxes, circles, etc. instead of the explicit byte array.

@Naros Naros requested a review from jpechane December 5, 2019 00:30
@gunnarmorling
Member

align the connector's handler for decoderbufs with the value resolution we had already done for wal2json and pgoutput.

Ok, thanks. That makes sense 👍

As a follow-up, this was resolved by the changes added in commit

Ok, cool. I had already filed the follow-up; should have read to the end :)

@gunnarmorling
Member

@Naros, there are a few test failures which seem related. Can you take a look?

Another question: can or should we merge this one before the PG driver update is there? My preference would be to get this in quickly (to avoid potential merge conflicts), even if that means the behavior isn't actually user-visibly fixed until the driver update is pulled in subsequently. This might imply having to disable some tests temporarily. WDYT?

Member

@gunnarmorling gunnarmorling left a comment

@Naros LGTM, great work! Grew quite a bit bigger than we originally anticipated, thanks for pulling through with it. Last items I see:

  • Address test failures on CI
  • Log any follow up JIRA issues (if needed)
  • Decide on strategy in regards to awaiting the PG driver release

Once these are done, it's good to go from my perspective. Thanks again!

@Naros
Member Author

Naros commented Dec 5, 2019

@gunnarmorling the test failures are most definitely due to the driver incompatibility; this comes into play when we refresh the schema from the database metadata, because the column's length and scale are incorrectly reported in the metadata result set. These tests should pass once we can upgrade to a production-ready version of the 42.2.9 driver.

I think we only really have 3 options here, unfortunately:

  • Use 42.2.9-SNAPSHOT in CR1, hopefully being able to upgrade to 42.2.9 in our CR2/Final.
  • Wait on the 42.2.9 driver then incorporate it in the PR, rebasing/merging conflicts as needed.
  • Disable all tests pertinent to this behavior and incorporate it, with follow-up JIRAs to upgrade the driver and enable/verify the tests thereafter.

I'm fine with any of these. It would seem your preference is the latter so I can work toward that if that's what we'd rather do.

@gunnarmorling
Member

It would seem your preference is the latter, so I can work toward that if that's what we'd rather do.

Yes, exactly. We cannot publish anything depending on SNAPSHOT versions (Central will reject the deployment) and I'd prefer to get this merged, avoiding any dependency on the PG team's schedule for our own release. So let's disable the tests requiring it, go back to the stable driver version, make this one ready to merge and then log a follow-up JIRA for updating again and enabling the tests. Could you also clarify what will work then with this intermediary version: enums, domains partially, domains not at all?

@Naros
Member Author

Naros commented Dec 5, 2019

Could you also clarify what will work then with this intermediary version: enums, domains partially, domains not at all?

Support for enum data types is good to go even with the intermediate solution. The changes needed involved correctly identifying an enum type and propagating that through the pipeline. No changes were needed from the driver's perspective to get this to work.

Support for domain types is a slightly different beast. When we first snapshot a table and refresh the schema, we obtain the column's length/scale from the driver's metadata. This is where the driver's fix is important.

If a domain type inherits from a base type and defines a custom type modifier, then length/scale isn't read correctly. For example:

CREATE DOMAIN varbit2 AS varbit(3);
CREATE TABLE t (pk SERIAL, data varbit2, PRIMARY KEY(pk));
INSERT INTO t (data) VALUES (B'101');

This leads to the data column having a schema of Bits.builder(Integer.MAX_VALUE) rather than Bits.builder(3). Additionally, when column.propagate.source.type is enabled, the length will be emitted with the value Integer.MAX_VALUE instead of 3. Scale is also affected, although in this example it would be 0 in either case.

However, in the highly unlikely situation where a user extends a base type but does not explicitly set any type modifiers on the definition, e.g. CREATE DOMAIN varbit2 AS varbit, the emitted schema for the snapshot event would have the expected values, as they're the defaults.

The reason this isn't an issue for streaming is that we use the TypeRegistry to resolve length/scale when we build the columns from the logical replication stream.

That, however, does raise the question of whether we could override JdbcConnection#readTableColumn for the Postgres connector and use the TypeRegistry to resolve a type's length/scale rather than sourcing those two attributes from the driver metadata. That might actually work around the driver changes entirely. I'm going to give that a test drive, as it might be a workaround solution while we wait for the driver.
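
(A conceptual sketch of that idea only; the method shape below is simplified and is not the real JdbcConnection#readTableColumn signature, and TypeLookup is a hypothetical stand-in for the TypeRegistry.)

    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class DomainAwareColumnReader {

        // Hypothetical lookup abstraction standing in for Debezium's TypeRegistry.
        interface TypeLookup {
            boolean isDomainType(String typeName);
            int resolveLength(String typeName);
            int resolveScale(String typeName);
        }

        // Prefer registry-resolved length/scale over driver metadata for domain types.
        static int[] lengthAndScale(ResultSet columnMetadata, TypeLookup registry) throws SQLException {
            String typeName = columnMetadata.getString("TYPE_NAME"); // e.g. "varbit2" for the domain above
            if (registry.isDomainType(typeName)) {
                // The driver's metadata reports Integer.MAX_VALUE here for constrained domains,
                // so derive the real values from the type hierarchy instead.
                return new int[]{ registry.resolveLength(typeName), registry.resolveScale(typeName) };
            }
            return new int[]{ columnMetadata.getInt("COLUMN_SIZE"), columnMetadata.getInt("DECIMAL_DIGITS") };
        }
    }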

@Naros
Member Author

Naros commented Dec 5, 2019

That, however, does raise the question of whether we could override JdbcConnection#readTableColumn for the Postgres connector and use the TypeRegistry to resolve a type's length/scale rather than sourcing those two attributes from the driver metadata. That might actually work around the driver changes entirely. I'm going to give that a test drive, as it might be a workaround solution while we wait for the driver.

As a follow-up, this does indeed allow us to ship this PR without the driver changes.

One major advantage of this strategy is that we can also resolve the length/scale of types in a deeply nested hierarchy, thanks to how our TypeRegistry works. This is something the driver changes introduced by @davecramer won't be able to support, as they're capped at a depth of 1 rather than being unbounded, most likely for performance reasons. This means that the test case I added where we expected a failure will now pass.

I'm going to push this small change, and if we decide we'd rather not have these, I can always remove the commit and force push; but I think this is probably a good change to have.

WDYT @gunnarmorling ?

@gunnarmorling
Member

Excellent, thanks a lot, @Naros. Merging.

@gunnarmorling gunnarmorling merged commit c32b771 into debezium:master Dec 10, 2019