
[CDCSDK][PG Parity] Dynamic addition of table: Error encountered due to Catalog Version Mismatch: Unable to Open Relation with OID 16913 #22398

Closed
1 task done
shamanthchandra-yb opened this issue May 15, 2024 · 0 comments
Assignees
Labels
2024.1.1_blocker area/cdcsdk CDC SDK kind/bug This issue is a bug priority/high High Priority

Comments


shamanthchandra-yb commented May 15, 2024

Jira Link: DB-11300

Description

Please find the stress report in JIRA.

Test steps:

1. Create a table table1.
2. Create 20 tables.
3. Deploy the source connector with publication.autocreate.mode set to "filtered", including only table1.
4. Create 20 more tables.
5. Run ALTER PUBLICATION ... ADD TABLE 20 + 20 times, adding one table each time.
6. Deploy the sink connector.
7. Run the workload on all 40 tables.
8. Verify the sink.
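The dynamic table-addition step can be scripted. Below is a minimal sketch that only generates the DDL; the table names (t_21..t_40) are invented for illustration, while pn_ybconnector_regex is the publication name from the connector configuration further down:

```shell
# Minimal sketch: generate DDL for the dynamic-table-addition step.
# Table names (t_21..t_40) are hypothetical; pn_ybconnector_regex is the
# publication name from the connector config below.
gen_ddl() {
  for i in $(seq 21 40); do
    printf 'CREATE TABLE t_%d (id INT PRIMARY KEY, v TEXT);\n' "$i"
    printf 'ALTER PUBLICATION pn_ybconnector_regex ADD TABLE t_%d;\n' "$i"
  done
}
gen_ddl
```

The output can then be piped into ysqlsh against the test database, e.g. `gen_ddl | ysqlsh -d cdc_07cf27` (host and port flags omitted here).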

The following error was observed in the connector log:

2024-05-14 17:18:11,897 ERROR  Postgres|db_cdc|streaming  Producer failure   [io.debezium.pipeline.ErrorHandler]
com.yugabyte.util.PSQLException: ERROR: could not open relation with OID 16913
  Where: Catalog Version Mismatch: A DDL occurred while processing this query. Try again.
	at com.yugabyte.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
	at com.yugabyte.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1263)
	at com.yugabyte.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1163)
	at com.yugabyte.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:44)
	at com.yugabyte.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:160)
	at com.yugabyte.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:125)
	at com.yugabyte.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:82)
	at io.debezium.connector.postgresql.connection.PostgresReplicationConnection$1.readPending(PostgresReplicationConnection.java:622)
	at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.processMessages(PostgresStreamingChangeEventSource.java:218)
	at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:180)
	at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:36)
	at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:271)
	at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:194)
	at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Source connector version

fourpointfour/ybdb-debezium:0.6

Connector configuration

add connector connector_name='ybconnector_regex' stream_id='7e2749cdb345938f8d400473f526210a' db_name='cdc_07cf27' connector_host='172.151.16.154' table_list=['test_cdc_55b8df']

{
  'name': 'ybconnector_regex',
  'config': {
    'database.master.addresses': '172.151.24.220:7100,172.151.18.163:7100,172.151.30.178:7100',
    'database.hostname': '172.151.24.220:5433,172.151.18.163:5433,172.151.30.178:5433',
    'database.port': 5433,
    'database.masterhost': '172.151.30.178',
    'database.masterport': '7100',
    'database.user': 'yugabyte',
    'database.password': 'yugabyte',
    'database.dbname': 'cdc_07cf27',
    'snapshot.mode': 'never',
    'admin.operation.timeout.ms': 600000,
    'socket.read.timeout.ms': 300000,
    'max.connector.retries': '10',
    'operation.timeout.ms': 600000,
    'topic.creation.default.compression.type': 'lz4',
    'topic.creation.default.cleanup.policy': 'delete',
    'topic.creation.default.partitions': 2,
    'topic.creation.default.replication.factor': '1',
    'tasks.max': '10',
    'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
    'topic.prefix': 'db_cdc',
    'plugin.name': 'pgoutput',
    'slot.name': '7e2749cdb345938f8d400473f526210a_from_con_e6d1',
    'publication.autocreate.mode': 'filtered',
    'publication.name': 'pn_ybconnector_regex',
    'table.include.list': 'public.test_cdc_55b8df'
  }
}

YugabyteDB version

2.23.0.0-b325

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels May 15, 2024
@yugabyte-ci yugabyte-ci changed the title [CDCSDK][PG Parity] Dynamic addition pf table: Error encountered due to Catalog Version Mismatch: Unable to Open Relation with OID 16913 [CDCSDK][PG Parity] Dynamic addition of table: Error encountered due to Catalog Version Mismatch: Unable to Open Relation with OID 16913 May 15, 2024
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue status/awaiting-triage Issue awaiting triage labels May 15, 2024
siddharth2411 added a commit that referenced this issue May 29, 2024
Summary:
In Walsender, when the 1st DML arrives, we had special startup logic that set yb_read_time to the value stored in the `record_id_commit_time` field instead of using the record's commit_time. This causes a problem when, after a restart, the 1st record comes from a table that was created after stream creation. Consider the following scenario:

  # create table named BEFORE
  # create slot
  # create table named AFTER
  # wait for pub refresh to happen
  # insert into AFTER  --> When WS receives this, the yb_read_time would have been reset from pub_refresh_time to consistent_snapshot_time, so as_of query will try to find table AFTER as of the consistent_snapshot_time and will hit the "could not open relation" error.
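The failure mode in this scenario can be sketched as a toy model. All timestamps, constant names, and helper functions below are invented stand-ins for the real HybridTime bookkeeping, not the actual Walsender code:

```python
# Toy model of the yb_read_time startup bug described above.
# All timestamps are arbitrary integers standing in for HybridTime values.

CONSISTENT_SNAPSHOT_TIME = 100  # taken when the slot is created
TABLE_AFTER_CREATED_AT = 150    # table AFTER is created after the slot
PUB_REFRESH_TIME = 160          # publication refresh picks up AFTER

def visible(table_created_at, read_time):
    """An as-of catalog read only finds a table that existed at read_time."""
    return table_created_at <= read_time

def old_startup_read_time(first_record_commit_time):
    # Old logic: on the 1st DML after (re)start, reset yb_read_time back to
    # the consistent snapshot time instead of the record's commit time.
    return CONSISTENT_SNAPSHOT_TIME

def new_startup_read_time(first_record_commit_time):
    # Fixed logic: always use the record's commit time for the cache refresh.
    return first_record_commit_time

first_insert_commit_time = 170  # INSERT into AFTER, streamed after restart

# Old behaviour: AFTER does not exist as of the snapshot time, so the as_of
# lookup fails with "could not open relation with OID ...".
assert not visible(TABLE_AFTER_CREATED_AT,
                   old_startup_read_time(first_insert_commit_time))

# New behaviour: reading as of the record's commit time finds AFTER.
assert visible(TABLE_AFTER_CREATED_AT,
               new_startup_read_time(first_insert_commit_time))
```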

To fix this, we removed the startup logic and simplified it to always use the record's commit_time to perform a cache refresh. Additionally, after the initVirtualWAL RPC call from Walsender, we set yb_read_time to the `record_id_commit_time` field from the cdc_state entry of the replication slot.

Added some debug logs for shipping the RELATION message from Walsender and for updating yb_read_time.
Jira: DB-11300

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDynamicTableAdditionForTablesCreatedAfterStreamCreation'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDDLWithDynamicTableAddition'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDDLWithRestart'

Reviewers: stiwary, asrinivasan, skumar, sumukh.phalgaonkar

Reviewed By: asrinivasan

Subscribers: yql, ycdcxcluster, stiwary

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35228
siddharth2411 added a commit that referenced this issue May 30, 2024
…che refresh

Summary:
**Backport description:**
Had compilation failures while creating a slot in Java UTs, since this [[ https://phorge.dev.yugabyte.com/D35189 | diff ]], which renamed the slot-creation method in the Java UTs, has not been backported yet. Fixed it by using the slot-creation method present in 2024.1.

**Original description:**
Original commit: ec7f2ef / D35228
In Walsender, when the 1st DML arrives, we had special startup logic that set yb_read_time to the value stored in the `record_id_commit_time` field instead of using the record's commit_time. This causes a problem when, after a restart, the 1st record comes from a table that was created after stream creation. Consider the following scenario:

  # create table named BEFORE
  # create slot
  # create table named AFTER
  # wait for pub refresh to happen
  # insert into AFTER  --> When WS receives this, the yb_read_time would have been reset from pub_refresh_time to consistent_snapshot_time, so as_of query will try to find table AFTER as of the consistent_snapshot_time and will hit the "could not open relation" error.

To fix this, we removed the startup logic and simplified it to always use the record's commit_time to perform a cache refresh. Additionally, after the initVirtualWAL RPC call from Walsender, we set yb_read_time to the `record_id_commit_time` field from the cdc_state entry of the replication slot.

Added some debug logs for shipping the RELATION message from Walsender and for updating yb_read_time.
Jira: DB-11300

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDynamicTableAdditionForTablesCreatedAfterStreamCreation'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDDLWithDynamicTableAddition'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDDLWithRestart'

Reviewers: stiwary, asrinivasan, skumar, sumukh.phalgaonkar

Reviewed By: asrinivasan

Subscribers: stiwary, ycdcxcluster, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35393