Materialize rehydration of Kafka/Kinesis from S3? #9154

danthegoodman1 · 2021-11-17T15:55:42Z

Example scenario: I have data stored in Kafka or Kinesis for only 7 days before it is automatically purged, but I also copy all messages into S3 for archival in close to real time.

Would I be able to tell Materialize to read from Kafka/Kinesis while it can but, if it were to restart, backfill the messages older than 7 days from S3, without pulling all of the current data from S3?

Specifically, I want to avoid joining against S3 for data that is less than 7 days old, or for data we have already gotten from Kinesis/Kafka (i.e. it is older than 7 days, but we have not restarted, so the relevant Kafka/Kinesis data is still in memory). Put another way, I only want to rehydrate from S3 if we restart.

If this is not clear I apologize, let me know how I can clarify!

Answered by bobbyiliev

bobbyiliev · 2021-12-03T14:50:44Z

bobbyiliev
Dec 3, 2021
Maintainer

Hi there,

This should be doable with a UNION ALL statement: you filter the most recent records out of the S3 data, so that you only pull the S3 records that are no longer available in Kinesis/Kafka.

The SELECT statement would look something like this:

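-- Historical rows: only S3 records older than the oldest record still in Kinesis
SELECT * FROM s3_source
WHERE event_timestamp < (SELECT min(event_timestamp) FROM kinesis_source)
UNION ALL
-- Recent rows: everything currently available in Kinesis
SELECT * FROM kinesis_source;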

First, we select all of the records from the S3 bucket and filter them with a subquery, so that we only keep the S3 records that are not present in Kafka/Kinesis.
Then we use UNION ALL to combine that S3 data with all of the records currently available in Kinesis.

That way, even if you restart, the historical data will be pulled from the S3 bucket, while the latest records come from Kinesis.
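
To sanity-check the filter logic locally, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite is only a stand-in for Materialize here, and the `event_id` column, the integer "day" timestamps, and the sample rows are all assumptions made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the two Materialize sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s3_source (event_id INTEGER, event_timestamp INTEGER);
    CREATE TABLE kinesis_source (event_id INTEGER, event_timestamp INTEGER);
""")

# Hypothetical data: the S3 archive holds every event (days 1..10),
# while the stream has already purged everything before day 4.
conn.executemany("INSERT INTO s3_source VALUES (?, ?)", [(d, d) for d in range(1, 11)])
conn.executemany("INSERT INTO kinesis_source VALUES (?, ?)", [(d, d) for d in range(4, 11)])

# Same shape as the query above, with an ORDER BY added for readable output.
UNION_QUERY = """
    SELECT event_id, event_timestamp FROM s3_source
    WHERE event_timestamp < (SELECT min(event_timestamp) FROM kinesis_source)
    UNION ALL
    SELECT event_id, event_timestamp FROM kinesis_source
    ORDER BY event_timestamp
"""

rows = conn.execute(UNION_QUERY).fetchall()
event_ids = [row[0] for row in rows]
print(event_ids)
```

With this sample data, S3 contributes only days 1 to 3 and Kinesis contributes days 4 to 10, so each event appears exactly once. One thing worth checking in your own setup: in SQLite, `min()` over an empty `kinesis_source` yields NULL, so the filter drops every S3 row until the stream has at least one record.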

Here is a useful case study on the same topic:

https://github.com/dbanalyticsco/materialize-backfill

I believe that we are planning to post a tutorial based on the above case study.

Hope that this helps! Let me know if you have any questions.

4 replies

danthegoodman1 Dec 3, 2021
Author

Amazing! Thank you!

danthegoodman1 Dec 3, 2021
Author

@bobbyiliev Won't this continue to read from S3 as new data is added though?

ruf-io Dec 10, 2021

@danthegoodman1 by default, Materialize will do a one-time scan of the S3 bucket when you create a materialized source or view against it.

You can add extra configuration parameters to your source definition to continually listen for new objects in the bucket, but that would be unnecessary with this UNION approach.

If you just create a regular S3 source, the main thing you need to ensure is that there is always overlap between S3 messages and Kinesis messages. Then, every time you restart Materialize or create a new unioned view, Materialize will rescan both Kinesis and the S3 bucket, filter out the S3 events that it already found in Kinesis, and keep adding Kinesis messages as they appear.
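
The overlap requirement can be expressed as a small pre-flight check. This is only a sketch: `sources_overlap` is a hypothetical helper (not part of Materialize), and it assumes you can cheaply obtain the event timestamps, or at least the newest archived and oldest retained ones, from each side, and that the archive is written without holes.

```python
from typing import Sequence


def sources_overlap(s3_timestamps: Sequence[float], kinesis_timestamps: Sequence[float]) -> bool:
    """Return True when the newest archived event is at least as recent as
    the oldest event still retained in the stream."""
    if not s3_timestamps or not kinesis_timestamps:
        # With either side empty there is nothing to stitch together.
        return False
    return max(s3_timestamps) >= min(kinesis_timestamps)


# Archive lags the stream slightly but still overlaps it.
healthy = sources_overlap([1, 2, 3, 4, 5], [4, 5, 6, 7])
# Archive stopped at 3 while the stream has already purged everything before 6.
gapped = sources_overlap([1, 2, 3], [6, 7, 8])
print(healthy, gapped)
```

In the second case, events 4 and 5 exist in neither source, so the unioned view would silently miss them; a check like this before a restart is one way to catch a stalled archival job.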

Keep in mind that if you have years of data in S3, the initial scan will take some time to process.

danthegoodman1 Dec 10, 2021
Author

Perfect, thanks so much for clarifying!
