KafkaJSProtocolError: The broker received an out of order sequence number #598
Comments
Here's a minimal reproducible case:

```javascript
const { Kafka, logLevel } = require('kafkajs')
const delay = Number(process.argv[2]) || 100
console.log({ delay })
const kafka = new Kafka({
clientId: 'test-app',
brokers: ['localhost:9092'],
})
const createProducer = async () => {
const producer = kafka.producer({
allowAutoTopicCreation: false,
idempotent: true,
maxInFlightRequests: 1,
})
await producer.connect()
producer.logger().setLogLevel(logLevel.INFO)
return producer
}
const sendMessages = async (producer) => {
setInterval(async () => {
const messageId = Math.random()
console.log(`sending message ${messageId}`)
try {
await producer.send({
topic: 'test-topic',
messages: [
{ value: `Hello KafkaJS user! ${messageId}` },
],
})
console.log(`confirmed message ${messageId}`)
} catch (err) {
console.error(`crash for message ${messageId}`)
console.error(err)
process.exit(1)
}
}, delay)
}
createProducer()
.then(sendMessages).catch((err) => {
console.error(err)
process.exit(1)
})
```

If this code runs with a "large delay" between requests (like the default 100ms), everything is fine:
But if you run this code with a small delay, the out of order sequence number error shows up quickly.
Here's the log output:
Error in the Kafka logs:
As soon as 2 consecutive "confirmed message id" log lines appear, the next send crashes. In the end, I don't think the problem is about "incrementing by one" in |
I've registered to the Slack channel and found some other users who faced the same issue in the past. I'm adding links here so they can be contacted at a later time if progress is made: by Vykimo https://app.slack.com/client/TF528A1BJ/CF6RFPF6K/thread/CF6RFPF6K-1556120666.002100 |
Hi @Delapouite, thanks for the detailed investigation. Most of the team is already on vacation, so bear with me 😄 We implemented transactions right after the feature came out, so we might have gotten some things wrong. I will try to dig up the docs around it to confirm some things. Can you make the changes you proposed locally and try that? I could also help you land a PR. |
Hey @Delapouite @tulios, |
No solutions found on my side yet. I had to disable idempotency for the moment. |
Thank you, just started using KafkaJS and started seeing this issue. Disabling idempotency got us through. I noticed that the producer is pretty much dead after receiving this error. |
Yeah. This is also affecting us. Which is kinda bad, since the only solution right now is to disable idempotency. :/ |
Any progress on that issue? |
Not as far as I'm aware, but we are open to contributions if anyone is willing to invest some time into it. |
I was able to circumvent this issue with some naive throttling:

```typescript
import {Producer, Kafka, CompressionTypes, ProducerConfig} from 'kafkajs';
import {IAction, IQueuedRecord} from './types';
import {createActionMessage} from './create_action_message';
import {TopicAdministrator} from './topic_administrator';
import {isKafkaJSProtocolError} from './type_guard';
import Bluebird from 'bluebird';
import pino from 'pino';
import uuid from 'uuid';
export class ThrottledProducer {
public recordsSent = 0;
private producer: Producer;
private topicAdministrator: TopicAdministrator;
private isConnected: boolean = false;
private intervalTimeout: NodeJS.Timeout;
private createdTopics = new Set<string>();
private recordQueue: IQueuedRecord[] = [];
private isFlushing = false;
private logger: pino.Logger;
constructor(
protected kafka: Kafka,
protected producerConfig: Omit<
ProducerConfig,
'allowAutoTopicCreation' | 'maxInFlightRequests' | 'idempotent'
> & {maxOutgoingBatchSize?: number; flushIntervalMs?: number} = {
maxOutgoingBatchSize: 10000,
flushIntervalMs: 1000
},
topicAdministrator?: TopicAdministrator,
logger?: pino.Logger
) {
this.topicAdministrator = topicAdministrator || new TopicAdministrator(kafka);
this.logger = logger
? logger.child({class: 'KafkaSagasThrottledProducer'})
: pino().child({class: 'KafkaSagasThrottledProducer'});
this.createProducer();
}
// tslint:disable-next-line: cyclomatic-complexity
public putAction = async <Action extends IAction>(action: Action) => {
if (!this.isConnected) {
throw new Error('You must .connect before producing actions');
}
if (!this.createdTopics.has(action.topic)) {
this.logger.debug({topic: action.topic}, 'Creating topic');
await this.topicAdministrator.createTopic(action.topic);
this.createdTopics.add(action.topic);
}
return new Promise<void>((resolve, reject) => {
this.recordQueue = [
...this.recordQueue,
{
resolve,
reject,
record: {
topic: action.topic,
messages: [createActionMessage({action})]
}
}
];
return;
});
};
public connect = async () => {
if (this.isConnected) {
return;
}
const flushIntervalMs = this.producerConfig.flushIntervalMs || 1000;
this.logger.debug('Connecting producer');
await this.producer.connect();
this.logger.debug('Connected producer');
this.logger.debug({flushIntervalMs}, 'Creating flush interval');
this.intervalTimeout = setInterval(this.flush, flushIntervalMs);
this.logger.debug('Created flush interval');
this.isConnected = true;
};
public disconnect = async () => {
if (!this.isConnected) {
return;
}
this.logger.debug('Disconnecting');
clearInterval(this.intervalTimeout);
await this.producer.disconnect();
this.logger.debug('Disconnected');
this.isConnected = false;
};
private createProducer = () => {
this.logger.debug('Creating a new producer');
this.producer = this.kafka.producer({
maxInFlightRequests: 1,
idempotent: true,
...this.producerConfig
});
this.logger.debug('Created a new producer');
};
// tslint:disable-next-line: cyclomatic-complexity
private flush = async (
retryRecords?: IQueuedRecord[],
retryCounter = 0,
retryBatchId?: string
) => {
if (!retryRecords && this.isFlushing) {
return;
}
if (retryCounter) {
/** Wait for a max of 30 seconds before retrying */
const retryDelay = Math.min(retryCounter * 1000, 30000);
this.logger.debug({retryDelay}, 'Waiting before attempting retry');
await Bluebird.delay(retryDelay);
}
/**
* Ensures that if the interval call ends up being concurrent due to latency in sendBatch,
* unintentionally overlapping cycles are deferred to the next interval.
*/
this.isFlushing = true;
const batchSize = this.producerConfig.maxOutgoingBatchSize || 1000;
const outgoingRecords = retryRecords || this.recordQueue.slice(0, batchSize);
this.recordQueue = this.recordQueue.slice(batchSize);
const batchId = retryBatchId || uuid.v4();
if (!outgoingRecords.length) {
this.logger.debug({batchId}, 'No records to flush');
this.isFlushing = false;
return;
}
this.logger.debug(
{
remaining: this.recordQueue.length,
records: outgoingRecords.length,
batchId
},
'Flushing queue'
);
try {
await this.producer.sendBatch({
topicMessages: outgoingRecords.map(({record}) => record),
acks: -1,
compression: CompressionTypes.GZIP
});
this.recordsSent += outgoingRecords.length;
this.logger.debug({batchId}, 'Flushed queue');
outgoingRecords.map(({resolve}) => resolve());
this.isFlushing = false;
return;
} catch (error) {
/**
* If for some reason this producer is no longer recognized by the broker,
* create a new producer.
*/
if (isKafkaJSProtocolError(error) && error.type === 'UNKNOWN_PRODUCER_ID') {
await this.producer.disconnect();
this.createProducer();
await this.producer.connect();
this.logger.debug(
{batchId},
'Retrying failed flush attempt due to UNKNOWN_PRODUCER_ID'
);
await this.flush(outgoingRecords, retryCounter + 1, batchId);
return;
}
outgoingRecords.map(({reject}) => reject(error));
this.isFlushing = false;
return;
}
};
}
```
|
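A minimal usage sketch for the throttled producer above, assuming the same hypothetical module layout as that comment (`./throttled_producer` and the `IAction` shape are assumptions): queueing records and flushing them on a fixed interval keeps only one produce request in flight at a time, which is what the workaround relies on.

```typescript
import {Kafka} from 'kafkajs';
// ThrottledProducer is the class from the comment above; the module path is an assumption.
import {ThrottledProducer} from './throttled_producer';

const kafka = new Kafka({clientId: 'test-app', brokers: ['localhost:9092']});

// Small batches flushed once per second keep a single produce request in flight.
const producer = new ThrottledProducer(kafka, {
  maxOutgoingBatchSize: 500,
  flushIntervalMs: 1000
});

const run = async () => {
  await producer.connect();

  // putAction resolves only once the record's batch has been acknowledged.
  // The real IAction shape lives in the commenter's ./types; only `topic` is used here.
  await producer.putAction({topic: 'test-topic', payload: {hello: 'world'}} as any);

  await producer.disconnect();
};

run().catch(console.error);
```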
Also hitting this issue. (Using Kafka Streams, when trying to pre-seed a KTable.) I'm going to try the throttle approach to get past this immediate hurdle, but this is a serious problem -- if |
👀 seeing with latest kafkajs |
experiencing the same issue in |
This one is with me. I think the proposed fix is a step forward but still has issues in failure cases. I will keep making progress. |
Any estimated date for this to be released? |
I can confirm that #1050 fixes this issue in normal situations, but if the broker restarts, these errors start happening again until the producer is restarted. |
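Until a proper fix is released, one mitigation in the spirit of the `UNKNOWN_PRODUCER_ID` handling shown earlier in this thread is to recreate the producer when this protocol error appears (for example after a broker restart). A rough sketch; the wrapper function and the exact error-type string are assumptions, not something guaranteed by kafkajs:

```typescript
import {Kafka, Message, Producer} from 'kafkajs';

const kafka = new Kafka({clientId: 'test-app', brokers: ['localhost:9092']});

let producer: Producer | undefined;

const getProducer = async (): Promise<Producer> => {
  if (!producer) {
    producer = kafka.producer({idempotent: true, maxInFlightRequests: 1});
    await producer.connect();
  }
  return producer;
};

// Hypothetical wrapper: on an out-of-order-sequence protocol error, discard the
// producer (and with it the producer id / sequence state), recreate it, and retry once.
export const sendWithRecovery = async (topic: string, messages: Message[]) => {
  const current = await getProducer();
  try {
    await current.send({topic, messages});
  } catch (error: any) {
    if (error?.type === 'OUT_OF_ORDER_SEQUENCE_NUMBER') {
      await current.disconnect();
      producer = undefined;
      await (await getProducer()).send({topic, messages});
      return;
    }
    throw error;
  }
};
```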
Oh, my fork is 11 days old, and I see newer commits in your branch. Will check once more, thanks. |
In the logs of kafkajs, I see reconnects to the broker prior to this error. The broker did not restart. There are hundreds of log lines `Cluster has disconnected, reconnecting: Closed connection`, followed by reconnect, authentication, and hundreds of `Request Metadata(key: 3, version: 6)`, then hundreds of `Response Metadata(key: 3, version: 6)`. Then there are hundreds of produce requests (they seem to have been buffered during the disconnect): `Request Produce(key: 0, version: 7)`. Some of these produce requests were successful, but others resulted in `Response Produce(key: 0, version: 7)` with `error: "The broker received an out of order sequence number"`.
|
Is this a high-volume producer? How many produce calls per second? How many brokers? How many topic partitions?
I ask because I believe there is still an issue with error recovery when the load is high.
|
About 1-2 produce calls per second per instance, with total 50 instances. There are 3 brokers. The topic has 20 partitions and 3 replicas. |
Sure, thanks. Will post back when I have any results. |
Hello @aikoven, did you manage to get any results? We are experiencing the same problem as everybody above, and the way our code runs, we really need exactly-once delivery :/ |
@anlauren I'm on a vacation right now, will return to this issue somewhere next week. |
@t-d-d After upgrading to the latest code from #1172 and running for a few days, I don't see the "out of order" errors anymore. However, there is another issue: after some time we start getting a very high rate of created connections. I collected the logs spanning one second, and it says |
It seems that this bug was present in older versions of |
I'm using my own fork that includes the fixes from both of these PRs. Haven't seen any issues for quite a while. |
Gotcha. Note that I solved my own issue by using transactions as described here: https://kafka.js.org/docs/transactions |
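For reference, a minimal sketch of that transactional approach following the linked docs; the `transactionalId` and topic name are placeholders:

```typescript
import {Kafka} from 'kafkajs';

const kafka = new Kafka({clientId: 'test-app', brokers: ['localhost:9092']});

// Transactions require a transactionalId, idempotence, and a single in-flight request.
const producer = kafka.producer({
  transactionalId: 'test-app-producer-1',
  idempotent: true,
  maxInFlightRequests: 1
});

const run = async () => {
  await producer.connect();
  const transaction = await producer.transaction();
  try {
    await transaction.send({
      topic: 'test-topic',
      messages: [{value: 'Hello KafkaJS user!'}]
    });
    await transaction.commit();
  } catch (err) {
    await transaction.abort();
    throw err;
  } finally {
    await producer.disconnect();
  }
};

run().catch(console.error);
```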
Hello everyone, I was facing this same problem with the following producer configuration:
I basically had to check which of my objects had gone 1 minute without making contact and send each one's information through Kafka like this:
However, only the first object was sent, because I was doing this task inside a forEach. I only stopped getting the error in question, and the objects were sent correctly, after removing the |
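For readers hitting the same pattern: `Array.prototype.forEach` does not await async callbacks, so every send was fired concurrently from the same idempotent producer. A sequential loop avoids that. Below is a minimal sketch; the topic name and object shape are placeholders, not taken from the original comment:

```typescript
import {Kafka} from 'kafkajs';

const kafka = new Kafka({clientId: 'test-app', brokers: ['localhost:9092']});
const producer = kafka.producer({idempotent: true, maxInFlightRequests: 1});

// Placeholder for whatever domain objects went a minute without making contact.
const staleObjects = [
  {id: 'device-1', lastSeenAt: Date.now() - 60_000},
  {id: 'device-2', lastSeenAt: Date.now() - 60_000}
];

const run = async () => {
  await producer.connect();
  // for...of awaits each send before starting the next, unlike forEach(async ...).
  for (const obj of staleObjects) {
    await producer.send({
      topic: 'test-topic',
      messages: [{key: obj.id, value: JSON.stringify(obj)}]
    });
  }
  await producer.disconnect();
};

run().catch(console.error);
```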
It sounds like forks using the open PRs do not have issues. Is there any chance we can get those merged and a new version released? Or is there something holding them back? |
@Nevon It worked! Thank you very much! This is a huge help. |
Hi

I have a single producer:

As you can see, I'm using the `idempotent` feature introduced in #203. I'm also forcing `maxInFlightRequests` to `1`.

It works fine when my producer sends messages to the Kafka server at a slow pace. By that I mean no requests are stored in the `RequestQueue`. But during spikes in activity, messages are sometimes automatically sent in a batch. Kafka then returns the following error:

In this bug case, we have a `156` → `158` jump, because 2 messages were sent in the batch. According to the Kafka Java code, it expects only a 1-unit jump, so `156` → `157`:
https://github.com/apache/kafka/blob/fecb977b257888e2022c1b1e04dd7bf03e18720c/core/src/main/scala/kafka/log/ProducerStateManager.scala#L241-L251

My current understanding/intuition is that the `firstSequence` number should always be incremented by `1` and not by the number of messages present in the batch request. Which roughly means that the `updateSequence` method of the `EOSManager` should not take an `increment` param but should always increment by 1 only.

kafkajs/src/producer/eosManager/index.js
Lines 171 to 197 in eae5ea4

I'm still quite a noob on this codebase, so my reasoning is probably flawed. @ianwsperber, as you're the main implementer of the EOS code, what's your take on this? Thanks a lot.
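For context, the broker-side check in the linked `ProducerStateManager` code can be paraphrased roughly as follows. This is an illustrative sketch, not Kafka's actual implementation, and the field names are assumptions:

```typescript
// Simplified paraphrase of the broker-side sequence validation (illustrative only).
interface BatchMetadata {
  producerEpoch: number;
  firstSeq: number; // sequence number of the first record in the batch
  lastSeq: number;  // firstSeq + recordCount - 1
}

interface ProducerStateEntry {
  producerEpoch: number;
  lastSeq: number;  // last sequence successfully appended for this producer/partition
}

const validateSequence = (current: ProducerStateEntry, incoming: BatchMetadata): void => {
  if (incoming.producerEpoch !== current.producerEpoch) {
    return; // a new producer epoch resets the sequence expectations (details omitted)
  }
  if (incoming.firstSeq !== current.lastSeq + 1) {
    throw new Error(
      `Out of order sequence number: expected ${current.lastSeq + 1}, got ${incoming.firstSeq}`
    );
  }
};

// In the failure reported above, the broker's last appended sequence was 156, so it
// expected the next batch to start at 157 but received 158:
try {
  validateSequence(
    {producerEpoch: 0, lastSeq: 156},
    {producerEpoch: 0, firstSeq: 158, lastSeq: 159}
  );
} catch (e) {
  console.error((e as Error).message); // expected 157, got 158
}
```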