Timeout while acquiring lock (15 waiting locks)


Hi (-:

kafkajs@1.15.0

We are using KafkaJS. It's solid and works great most of the time, but occasionally, during small traffic spikes (though not on every spike), we get this error when trying to produce messages:

“KafkaJSLockTimeout Timeout while acquiring lock (15 waiting locks)”

When this happens, it lasts about 30 seconds, and many messages are dropped and never get persisted:

[Screenshot: CleanShot 2021-09-29 at 09 03 10@2x]

The full error looks like this:

[Screenshot: CleanShot 2021-09-29 at 09 05 01@2x]

We use Heroku, with 3 Large Dynos (In Private Space). We use Confluent fully managed Kafka on a type:standard production cluster. We are using Node with Express.

This is the code that connects to the cluster. It runs once at the start of each dyno. We start listening for requests only after connect() has resolved:

await kafkaProducer.connect();

// Only after a successful connection do we start serving requests.
app.listen(app.get("port"), function () {
  sendLog(`heroku dyno running ${process.env.HEROKU_RELEASE_VERSION} ${process.env.HEROKU_SLUG_COMMIT}`);
  console.log("Node app is running on port", app.get("port"));
});

This is the configuration of the producer:

import Kafka from 'kafkajs';

const kafka = new Kafka.Kafka({
    brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
    clientId: 'lkc-9dgq0',
    ssl: true,
    sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
});

const producer = kafka.producer();

export const kafkaProducer = producer;

This is the code that sends the messages:

try {
  // Serialize each message to a JSON string before producing.
  messages = messages.map(m => ({ value: JSON.stringify(m) }));
  const res = await kafkaProducer.send({ topic, messages });
} catch (e) {
  logException(e, context);
}

The error shown at the top is caught in this catch block.
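Because the catch block only logs the exception, a failed send means the batch is lost. As a rough sketch of the "additional retry" idea raised later in this post, an application-level wrapper could retry the send a few times before giving up. The helper name sendWithRetry and its attempts and delayMs options are hypothetical and not part of KafkaJS; the only KafkaJS call it relies on is producer.send(). Retrying a send that partially succeeded can produce duplicates, so treat this as an illustration rather than a fix.

```javascript
// Hypothetical helper (not part of KafkaJS): retries producer.send()
// a fixed number of times with a fixed delay, then rethrows the last error.
async function sendWithRetry(producer, record, { attempts = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await producer.send(record);
    } catch (e) {
      lastError = e;
      // Wait before the next attempt, but not after the final one.
      if (attempt < attempts) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Usage inside the existing try/catch:
// const res = await sendWithRetry(kafkaProducer, { topic, messages });
```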

These errors happen on this topic in Confluent:

[Screenshot: CleanShot 2021-09-29 at 09 26 19@2x]

This topic has 2 partitions. I'm raising this because it may be related: Confluent's default is 6 partitions, and we chose 2.

So are these errors related to a high burst of traffic, and would increasing the number of partitions help? If not, could it be the code that connects to the cluster? Are we missing some additional retry or other configuration?

Any help on the issue and directions for further investigation would be much appreciated! 🙏

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

4 reactions
yaronlevi commented, Sep 29, 2021

@tulios very interesting!

So I’ve increased the timeout on both parameters to 5000:

[Screenshot: CleanShot 2021-09-29 at 12 02 37@2x]

I’ll keep you updated if it works. Thanks
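Since the screenshot is not reproduced here, the following is a sketch of what raising both timeouts to 5000 ms might look like. It assumes the two parameters are the KafkaJS client options connectionTimeout and authenticationTimeout (both in milliseconds), which matches the maintainer's closing remark about connection timeouts and the authentication timeout. The helper name buildKafkaConfig is hypothetical; the other values are copied from the configuration in the original post.

```javascript
// Hypothetical helper: builds the client config from the original post
// with both timeouts raised to 5000 ms.
// Assumption: "both parameters" means connectionTimeout and
// authenticationTimeout, which are KafkaJS client options in milliseconds.
function buildKafkaConfig(overrides = {}) {
  return {
    brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
    clientId: 'lkc-9dgq0',
    ssl: true,
    sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
    connectionTimeout: 5000,     // time to wait for a successful connection
    authenticationTimeout: 5000, // time to wait for SASL authentication
    ...overrides,
  };
}

// Usage (requires the kafkajs package):
// const { Kafka } = require('kafkajs');
// const kafka = new Kafka(buildKafkaConfig());
// const producer = kafka.producer();
```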

1 reaction
Nevon commented, May 3, 2022

Closing since this seems to have just been an issue with connection timeouts. The next release will increase the default authentication timeout.
