Timeout while acquiring lock (15 waiting locks)
Hi (-:
kafkajs@1.15.0
We are using KafkaJS and it is solid and works great most of the time, but occasionally, during small traffic spikes (though not always), the following error is thrown when we try to produce messages:

`KafkaJSLockTimeout: Timeout while acquiring lock (15 waiting locks)`
When this happens, it lasts about 30 seconds, and many messages are dropped and never get persisted.

The full error looks like this:

[screenshot of the full error]
We run on Heroku with 3 Large Dynos (in a Private Space) and use Confluent's fully managed Kafka on a type:standard production cluster. The app is Node with Express.
This is the code that connects to the cluster. It runs once at the start of each dyno, and only after connect() resolves do we start listening for requests:
```js
await kafkaProducer.connect();
// Only after a successful connection do we start serving requests.
app.listen(app.get("port"), function () {
  sendLog(`heroku dyno running ${process.env.HEROKU_RELEASE_VERSION} ${process.env.HEROKU_SLUG_COMMIT}`);
  console.log("Node app is running on port", app.get("port"));
});
```
This is the configuration of the producer:
```js
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
  clientId: 'lkc-9dgq0',
  ssl: true,
  sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
});

export const kafkaProducer = kafka.producer();
```
This is the code that sends the messages:
```js
try {
  // Wrap each payload as a Kafka message with a JSON-serialized value.
  messages = messages.map(m => ({ value: JSON.stringify(m) }));
  const res = await kafkaProducer.send({ topic, messages });
} catch (e) {
  logException(e, context);
}
```
The error quoted at the top is what gets caught in this catch block.
These errors happen on this topic in Confluent:

[screenshot of the topic in the Confluent console]

This topic has 2 partitions. We are raising this because it may be related: Confluent's default is 6 partitions, but we chose 2.
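For reference, if more partitions do turn out to help, the partition count of an existing topic can be raised through the KafkaJS admin client. A minimal sketch; the topic name and target count below are hypothetical, and note that partitions can only ever be increased, never decreased:

```js
import { Kafka } from 'kafkajs';

// Same brokers/ssl/sasl settings as the producer config above.
const kafka = new Kafka({
  brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
  ssl: true,
  sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
});

const admin = kafka.admin();
await admin.connect();
// Grow the (hypothetical) topic from 2 to 6 partitions.
await admin.createPartitions({
  topicPartitions: [{ topic: 'my-topic', count: 6 }],
});
await admin.disconnect();
```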
So, are these errors related to a high burst of traffic, and would increasing the number of partitions help? And if not, could it be the code that connects to the cluster? Should we add some retry or other configuration that we are missing here?
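On the retry question: KafkaJS exposes a client-level `retry` option with exponential backoff. A minimal sketch of what tuning it could look like; the values below are illustrative, not a recommendation:

```js
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
  clientId: 'lkc-9dgq0',
  ssl: true,
  sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
  // Retries failed operations with exponential backoff.
  retry: {
    initialRetryTime: 300, // ms before the first retry
    retries: 8,            // illustrative; the library default is 5
  },
});
```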
Any help on the issue and directions for further investigation would be much appreciated! 🙏
@tulios very interesting!
So I’ve increased the timeout on both parameters to 5000:
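(A sketch of what that configuration presumably looks like, assuming the two parameters are `connectionTimeout` and `authenticationTimeout`, both documented KafkaJS client options:)

```js
const kafka = new Kafka({
  brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
  clientId: 'lkc-9dgq0',
  ssl: true,
  sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
  connectionTimeout: 5000,     // up from the 1000 ms default
  authenticationTimeout: 5000, // up from the 1000 ms default in this version
});
```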
I’ll keep you updated if it works. Thanks!
Closing since this seems to have just been an issue with connection timeouts. The next release will increase the default authentication timeout.