KSQL application for denormalizing data for data warehouses #10251

anhpngt · 2024-02-27T05:36:10Z

My current task requires extracting data (with Kafka Connect) from several Postgres tables, performing joins (with ksqlDB), and then inserting the denormalized data into a data warehouse (also a Postgres table in this case). Each ETL pipeline may involve joining 3-4 tables, and each table has ~1M rows. I can't use stream-stream or windowed joins, since there is no constraint on the time range when joining records. Therefore, all the JOINs are full table-table joins.

An example of my ksqlDB queries:

CREATE TABLE sink1 AS SELECT ... FROM source1 JOIN source2 ...;
CREATE TABLE sink2 AS SELECT ... FROM sink1 JOIN source3 ...;
-- etc...
CREATE SINK CONNECTOR sinkconnector WITH (...);
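
To illustrate why I expect these chained table-table joins to be state-heavy, here is a toy Python model of a single inner join on a shared key. It is only a sketch of my understanding, not ksqlDB's actual implementation: the class `TableTableJoin`, its methods, and the sample rows are all hypothetical. The point is that the latest row per key from both inputs has to stay in state, plus the materialized output, and that any change on either side re-emits the joined row.

```python
class TableTableJoin:
    """Toy inner join of two keyed tables (illustrative only, not ksqlDB code)."""

    def __init__(self):
        # Latest row per key for each input table.
        self.tables = {"left": {}, "right": {}}
        # Materialized join output per key.
        self.result = {}

    def upsert(self, side, key, row):
        """Apply one change record and refresh the joined row for that key."""
        if side not in self.tables:
            raise ValueError(f"unknown side: {side!r}")
        self.tables[side][key] = row
        left, right = self.tables["left"], self.tables["right"]
        if key in left and key in right:
            self.result[key] = {**left[key], **right[key]}
        return self.result.get(key)

    def rows_held(self):
        """Total rows kept in state across both inputs and the output."""
        return sum(len(t) for t in self.tables.values()) + len(self.result)


join = TableTableJoin()
first = join.upsert("left", 1, {"a": "x"})    # no match on the right yet
second = join.upsert("right", 1, {"b": "y"})  # join result appears
third = join.upsert("left", 1, {"a": "z"})    # update re-emits the joined row
```

In this model, one key already costs three stored rows (left, right, output), and `sink2` in my queries would repeat that on top of `sink1`.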

My current setup (replica count/CPU/memory) on k8s looks like this:

# Units: mCPU and MB
NAME↑                                    READY STATUS     AGE    RESTARTS CPU   MEM PF CPU/R:L     MEM/R:L %CPU/R %CPU/L %MEM/R %MEM/L 
ksqldb-server-558c658d5b-7zm5q           1/1   Running    67m           3 131 20457 ●   2000:0 22528:22528      6    n/a     90     90 
ksqldb-server-558c658d5b-f7wbt           1/1   Running    8d            1 127 21162 ●   2000:0 22528:22528      6    n/a     93     93 
ksqldb-server-558c658d5b-l82x4           1/1   Running    6d21h         2 143 20159 ●   2000:0 22528:22528      7    n/a     89     89 
ksqldb-server-558c658d5b-t8xzg           1/1   Running    59m           0 290 20844 ●   2000:0 22528:22528     14    n/a     92     92 
schema-registry-6d4f7f5765-kcnrz         1/1   Running    10d           3   3   336 ●  10:1000    400:1024     30      0     84     32 
kafka-0                                  1/1   Running    10d           0 397  3916 ●    500:0   4096:4096     79    n/a     95     95 
kafka-connect-8569b5dd48-9d2bk           2/2   Running    6d17h         0  34  2896 ●    550:0   3104:3272      6    n/a     93     88
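
As a quick sanity check on the listing above, this small Python sketch computes the remaining memory headroom for each ksqldb-server pod. The numbers are copied from the MEM and MEM/R:L columns; the helper name `headroom` is just something I made up for illustration.

```python
# (MEM in use, memory limit) in MB, copied from the ksqldb-server rows above.
ksqldb_pods = {
    "ksqldb-server-558c658d5b-7zm5q": (20457, 22528),
    "ksqldb-server-558c658d5b-f7wbt": (21162, 22528),
    "ksqldb-server-558c658d5b-l82x4": (20159, 22528),
    "ksqldb-server-558c658d5b-t8xzg": (20844, 22528),
}


def headroom(used_mb, limit_mb):
    """Return (free MB, percent of the limit in use, rounded down)."""
    return limit_mb - used_mb, used_mb * 100 // limit_mb


report = {name: headroom(*values) for name, values in ksqldb_pods.items()}
for name, (free_mb, pct) in report.items():
    print(f"{name}: {free_mb} MB free, {pct}% of limit used")
```

Every pod has less than about 2.4 GB left before hitting its 22528 MB limit, which matches the 89-93% shown in the %MEM/L column.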

I'm observing that the ksqldb-server pods often get OOMKilled or run at high saturation. It also takes a long time to process each new record coming from source tables that are very large.

Would increasing CPU/memory for kafka or ksqldb-server help here? Or should I have chosen another solution, such as batch processing, instead?

anhpngt added needs-triage question labels Feb 27, 2024
