Refactor: KafkaJS ingress #627

Open
microwavekonijn opened this issue Sep 23, 2022 · 0 comments
Assignees
Labels
discussion Issues requiring extra discussion enhancement New feature or request


microwavekonijn commented Sep 23, 2022

Currently our data ingress uses RabbitMQ, a message broker that distributes messages to queues as they come in. Kafka, on the other hand, is an event streaming platform that acts as a data pipeline. Where RabbitMQ and Kafka differ is that Kafka is designed to store messages for a configured retention time, during which window they can be read (and re-read). This allows a more flexible approach to developing PS2Alerts: the same data can be used to create new features or fix bugs, and the results can be compared directly against the production version of PS2Alerts. In the case of OW, where certain matches were missed and needed to be back-filled, Kafka would allow a more relaxed approach: a fix can be implemented and tested, and the missing data then created from the retained messages.

Background

Kafka is by design a distributed service which means it is useful to introduce some of the inner workings of Kafka.

Topics

While in RabbitMQ it is optional to use routing keys and topics, Kafka always uses topics, as there are no queues in the traditional sense. A topic needs to be created before it can be used and is configured with a message retention time and a name (similar to routing keys in RMQ). Each topic is append-only, meaning a message can only be appended to the end of the topic. As long as a message is retained it can be read and iterated over using a consumer.
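To illustrate the two properties above (append-only, retention-bounded reads), here is a minimal in-memory model of a topic. All names are illustrative; this is not the KafkaJS API, just the concept:

```typescript
// Minimal model of a Kafka topic: an append-only log whose records stay
// readable until their retention time expires.
interface LogRecord {
  value: string;
  appendedAt: number; // ms timestamp
}

class TopicLog {
  private records: LogRecord[] = [];

  constructor(private retentionMs: number) {}

  // Records can only be appended to the end of the log.
  append(value: string, now: number): void {
    this.records.push({ value, appendedAt: now });
  }

  // Only records still inside the retention window can be read.
  read(now: number): string[] {
    return this.records
      .filter((r) => now - r.appendedAt < this.retentionMs)
      .map((r) => r.value);
  }
}
```

A consumer iterating the topic within the retention window sees every message in append order; once the window passes, the message is gone for good.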

Partitions

Each topic can be partitioned, with various strategies for distributing messages over the partitions. Partitions are a means to spread the workload over multiple consumers. If there is only one consumer, all partitions are assigned to that consumer, negating the benefit of having multiple partitions.
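A common distribution strategy is hashing a message key, so messages with the same key always land on the same partition and keep their relative order. A sketch, with a simple rolling hash standing in for Kafka's actual default partitioner:

```typescript
// Key-hash partitioning sketch: same key -> same partition, which
// preserves per-key ordering across a partitioned topic.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    // 32-bit rolling hash; illustrative, not Kafka's murmur2 partitioner.
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % numPartitions;
}
```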

Group

When reading a topic we want to make sure that each message is read only once, hence the introduction of groups. A group is identified by a group id. Within a group, each consumer is handed a subset of a topic's partitions, and for each partition the group remembers the offset of the last message that has been read.
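The per-partition bookkeeping can be sketched as follows (illustrative model, not the KafkaJS consumer API):

```typescript
// Consumer-group offset bookkeeping: for each partition the group tracks
// the next offset to read, so the group as a whole reads each message once.
class GroupOffsets {
  // partition -> next offset to read
  private offsets = new Map<number, number>();

  nextOffset(partition: number): number {
    return this.offsets.get(partition) ?? 0;
  }

  commit(partition: number, offset: number): void {
    // Resume after the committed record on the next read.
    this.offsets.set(partition, offset + 1);
  }
}
```

If a consumer crashes, another consumer in the same group picks up its partitions and resumes from the committed offsets rather than from the start.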

Implementation

The idea is to distribute messages over topics where each topic has one partition, as we only have one consumer. Each topic will have a name consisting of the event name of the messages it contains, the world id, and optionally the zone id. The last part means we can simply set up consumers for the relevant topics without having to filter out redundant messages.
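The naming scheme could look like the following. The `.` separator and the function name are assumptions, not something decided in this issue; Kafka topic names allow `[a-zA-Z0-9._-]`:

```typescript
// Hypothetical topic-name builder for the scheme described above:
// event name + world id + optional zone id, e.g. "Death.10.2".
function topicName(event: string, worldId: number, zoneId?: number): string {
  const parts = [event, String(worldId)];
  if (zoneId !== undefined) parts.push(String(zoneId));
  return parts.join(".");
}
```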

Initially it will be a one-to-one replacement for RabbitMQ; afterwards the focus will lie on developing the back-fill functionality. It is important that the back-fill functionality accounts for the properties of how the data is generated. For instance, while a live alert spans 1.5h, a back-fill can happen in a few minutes, maybe less. The crux is that the message marking the end of the alert must also be in our system. Also, when back-filling there are certain things that cannot be done, like fetching map states or fetching characters that no longer exist (when we receive a message during a live alert, the chances of this happening are very low). Furthermore, there needs to be consideration of processed data going to the API, for instance how to overwrite an existing alert and all its data. Also consider the workload on the API when back-filling, as a lot of messages can come in at once; bulk processing may be worth considering.
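The termination condition for a back-fill run can be sketched as: replay the retained messages from the start of the topic and stop once the alert-end message is seen. The message shape and the `MetagameEvent`/`ended` field values below are assumptions for illustration:

```typescript
// Back-fill termination sketch: replay retained messages until the
// end-of-alert message appears. If that message was never retained,
// the back-fill cannot know the alert is complete.
interface AlertMessage {
  type: string;
  state?: string;
}

function replayForBackfill(messages: AlertMessage[]): AlertMessage[] {
  const replayed: AlertMessage[] = [];
  for (const msg of messages) {
    replayed.push(msg);
    if (msg.type === "MetagameEvent" && msg.state === "ended") break;
  }
  return replayed;
}
```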

A good addition would be shifting data collection, currently split between the collector and the aggregator, fully to the collector. Think of character fetching on character events, or a cron job that fetches map states. Especially the latter gives us a topic with a historic overview of map states for a specific zone, which is also important for back-filling.

Caveats

With Kafka it is out of the question to retry messages by re-adding them to the same topic. That is outside the scope of what a topic should provide, which is essentially a historically accurate set of messages; re-adding messages would result in duplicates. Moreover, some thought needs to be given to how to handle exceptions. If the collector fetches all the data itself, exception handling could be greatly simplified, mainly because it eliminates API exceptions while processing the data. Most errors would then be the result of a bug in need of fixing. Other services could still cause exceptions due to a lost connection; however, RabbitMQ (for publishing) and Redis both have queues implemented to mitigate this. The only remaining problem could be the internal API endpoint, but if Kafka collects the map state, this dependency can be eliminated.

Benefits

While RabbitMQ is very configurable with regard to exchanges, queues, delays, and retries, Kafka comes out of the box with a better system for our purposes: messages are retained whether or not they are successfully processed. In essence it is a sort of database.

Notes

What Kafka won't do is replace RabbitMQ entirely. The processed data will still be queued using RabbitMQ, as it does not make sense to use Kafka there: the functionality won't be used and the system is rather expensive to run.

There is also the question of data retention, which is largely a question of resource usage. I would say the retention of normal messages should be about 2-3 days. For special messages, for instance from OW, it can be longer, say a week (this data set will also be smaller and will not grow after the match is done).
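Shaped as input for KafkaJS `admin.createTopics`, that retention split could look like the following. The topic names and exact durations are assumptions; `retention.ms` is the standard Kafka topic config key:

```typescript
// Hypothetical topic definitions with per-topic retention, in the shape
// accepted by KafkaJS admin.createTopics({ topics }).
const DAY_MS = 24 * 60 * 60 * 1000;

const topics = [
  {
    topic: "Death.10", // hypothetical live-event topic: event.world
    numPartitions: 1, // single consumer, so one partition suffices
    configEntries: [{ name: "retention.ms", value: String(3 * DAY_MS) }],
  },
  {
    topic: "ow.Death.10", // hypothetical Outfit Wars topic, kept longer
    numPartitions: 1,
    configEntries: [{ name: "retention.ms", value: String(7 * DAY_MS) }],
  },
];
```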

@microwavekonijn microwavekonijn added enhancement New feature or request discussion Issues requiring extra discussion labels Sep 23, 2022
@microwavekonijn microwavekonijn self-assigned this Sep 23, 2022