KSQL application for denormalizing data for data warehouses #10251

Open
anhpngt opened this issue Feb 27, 2024 · 0 comments

My current task requires extracting data (with Kafka Connect) from several Postgres tables, joining them (with ksqlDB), and then inserting the denormalized data into a data warehouse (also a Postgres table in this case). Each ETL pipeline may join 3-4 tables, and each table has ~1M rows. I can't use stream-stream or windowed joins, since there is no constraint on the time range between the records being joined. Therefore, all the joins are full table-table joins.

An example of my ksqlDB queries:

CREATE TABLE sink1 AS SELECT ... FROM source1 JOIN source2 ...;
CREATE TABLE sink2 AS SELECT ... FROM sink1 JOIN source3 ...;
-- etc...
CREATE SINK CONNECTOR sinkconnector WITH (...);
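
As a rough mental model of why the chained joins above are state-heavy, here is a toy Python sketch of a keyed table-table inner join. It is not ksqlDB's actual implementation; the class `TableTableJoin` and its methods are hypothetical names used only for illustration. The point it shows is that both inputs have to be kept in full (latest row per key), because a new row on either side may need to be matched against any existing row on the other side.

```python
from typing import Dict, Optional

Row = Dict[str, object]


class TableTableJoin:
    """Toy inner join of two keyed tables (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Both sides are fully materialized: latest row per key.
        self.left: Dict[str, Row] = {}
        self.right: Dict[str, Row] = {}

    def upsert_left(self, key: str, row: Row) -> Optional[Row]:
        self.left[key] = row
        return self._emit(key)

    def upsert_right(self, key: str, row: Row) -> Optional[Row]:
        self.right[key] = row
        return self._emit(key)

    def _emit(self, key: str) -> Optional[Row]:
        # A joined row exists only once both sides have a row for the key.
        if key in self.left and key in self.right:
            return {**self.left[key], **self.right[key]}
        return None

    def state_size(self) -> int:
        # Number of rows held in memory across both sides.
        return len(self.left) + len(self.right)


join = TableTableJoin()
first = join.upsert_left("1", {"a": 1})    # no match yet on the right side
second = join.upsert_right("1", {"b": 2})  # both sides present, joined row emitted
```

In this toy model, a chain such as `source1 JOIN source2` feeding `sink1 JOIN source3` holds the intermediate result as state again in the second join, so the retained state grows with every stage.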

My current setup (replica count/CPU/memory) is something like this on k8s:

# Units: mCPU and MB
NAME↑                                    READY STATUS     AGE    RESTARTS CPU   MEM PF CPU/R:L     MEM/R:L %CPU/R %CPU/L %MEM/R %MEM/L 
ksqldb-server-558c658d5b-7zm5q           1/1   Running    67m           3 131 20457 ●   2000:0 22528:22528      6    n/a     90     90 
ksqldb-server-558c658d5b-f7wbt           1/1   Running    8d            1 127 21162 ●   2000:0 22528:22528      6    n/a     93     93 
ksqldb-server-558c658d5b-l82x4           1/1   Running    6d21h         2 143 20159 ●   2000:0 22528:22528      7    n/a     89     89 
ksqldb-server-558c658d5b-t8xzg           1/1   Running    59m           0 290 20844 ●   2000:0 22528:22528     14    n/a     92     92 
schema-registry-6d4f7f5765-kcnrz         1/1   Running    10d           3   3   336 ●  10:1000    400:1024     30      0     84     32 
kafka-0                                  1/1   Running    10d           0 397  3916 ●    500:0   4096:4096     79    n/a     95     95 
kafka-connect-8569b5dd48-9d2bk           2/2   Running    6d17h         0  34  2896 ●    550:0   3104:3272      6    n/a     93     88 
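
To make the memory pressure in the listing above easier to read, here is a small Python sketch that recomputes utilization and remaining headroom for the four ksqldb-server pods. The numbers are copied from the MEM and MEM/R:L columns; the function name `memory_report` is a hypothetical helper, and integer division is an assumption chosen because it reproduces the %MEM/L values shown.

```python
# (pod name, memory in use in MB, memory limit in MB), copied from the listing
PODS = [
    ("ksqldb-server-558c658d5b-7zm5q", 20457, 22528),
    ("ksqldb-server-558c658d5b-f7wbt", 21162, 22528),
    ("ksqldb-server-558c658d5b-l82x4", 20159, 22528),
    ("ksqldb-server-558c658d5b-t8xzg", 20844, 22528),
]


def memory_report(pods):
    """Return (name, percent of limit used, MB of headroom left) per pod."""
    report = []
    for name, used_mb, limit_mb in pods:
        pct = used_mb * 100 // limit_mb  # truncated, matching the listing
        report.append((name, pct, limit_mb - used_mb))
    return report


for name, pct, headroom_mb in memory_report(PODS):
    print(f"{name}: {pct}% of limit, {headroom_mb} MB headroom")
```

Every ksqldb-server pod sits within roughly 1.4 to 2.4 GB of its 22528 MB limit, while CPU usage stays far below the 2000 mCPU request.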

I'm observing that the ksqldb-server pods often get OOMKilled or run at high saturation. It also takes a long time to process each new record coming from the very large source tables.

Would increasing CPU/memory for Kafka or ksqldb-server help here? Or should I have chosen another solution, such as batch processing, instead?
