Timeout while acquiring lock (15 waiting locks)

Hi (-:

kafkajs@1.15.0

We are using KafkaJS. It is solid and works great most of the time, but during small traffic spikes (not all of them) we sometimes get this error when trying to produce messages:

“KafkaJSLockTimeout Timeout while acquiring lock (15 waiting locks)”

When this happens, it lasts about 30 seconds, and many messages are dropped and never persisted:

[Screenshot: CleanShot 2021-09-29 at 09 03 10@2x]

The full error looks like this:

[Screenshot: CleanShot 2021-09-29 at 09 05 01@2x]

We use Heroku, with 3 Large Dynos (In Private Space). We use Confluent fully managed Kafka on a type:standard production cluster. We are using Node with Express.

This is the code that connects to the cluster. It runs once at the start of each dyno, and we only start listening for requests after connect() resolves:

await kafkaProducer.connect();

// only after a successful connection, we start serving requests.
app.listen(app.get("port"), function () {
  sendLog(`heroku dyno running ${process.env.HEROKU_RELEASE_VERSION} ${process.env.HEROKU_SLUG_COMMIT}`);
  console.log("Node app is running on port", app.get("port"));
});

This is the configuration of the producer:

import Kafka from 'kafkajs';

const kafka = new Kafka.Kafka({
    brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
    clientId: 'lkc-9dgq0',
    ssl: true,
    sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
});

const producer = kafka.producer();

export const kafkaProducer = producer;

This is the code that sends the messages:

try {
    messages = messages.map(m => { return { value: JSON.stringify(m) } });
    const res = await kafkaProducer.send({ topic, messages });
} catch (e) {
    logException(e, context);
}

The error shown at the top is caught in this catch block.

These errors happen on this topic in Confluent:

[Screenshot: CleanShot 2021-09-29 at 09 26 19@2x]

This topic has 2 partitions. I mention it in case it is related: the Confluent default is 6 partitions, and we chose 2.

So, are these errors related to a burst of traffic, and would increasing the number of partitions help? If not, could the problem be in the code that connects to the cluster? Is there a retry setting or other configuration we are missing?
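On the retry question: the catch block in the send code only logs the exception, so a batch that fails is lost. Below is a minimal sketch of one way to retry instead of dropping, assuming the thrown error carries the name "KafkaJSLockTimeout" as the error message suggests. The names sendWithRetry, sleep, attempts and baseDelayMs are hypothetical and not part of KafkaJS, and this does not address whatever causes the lock timeout in the first place.

```javascript
// Hypothetical helper (not a KafkaJS API): retry a send when acquiring the lock times out.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function sendWithRetry(send, payload, { attempts = 3, baseDelayMs = 200 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await send(payload);
    } catch (e) {
      // Assumption: the lock timeout is identifiable by its error name.
      const retriable = Boolean(e) && e.name === "KafkaJSLockTimeout";
      // Give up on other errors, or once the attempt budget is spent.
      if (!retriable || attempt >= attempts) throw e;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Possible usage with the producer from the post:
// await sendWithRetry((p) => kafkaProducer.send(p), { topic, messages });
```

The send function is passed in as an argument so the helper can be exercised with a stub; the caller should still log and handle the error that is rethrown after the last attempt.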

Any help on the issue and directions for further investigation would be much appreciated! 🙏


Top GitHub Comments

yaronlevi commented, Sep 29, 2021

@tulios very interesting!

So I’ve increased the timeout on both parameters to 5000:

[Screenshot: CleanShot 2021-09-29 at 12 02 37@2x]

I’ll keep you updated if it works. Thanks

Nevon commented, May 3, 2022

Closing since this seems to have just been an issue with connection timeouts. The next release will increase the default authentication timeout.
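
For anyone landing here later, a minimal sketch of what raising the timeouts could look like. It assumes the "both parameters" mentioned earlier in the thread are the KafkaJS client options connectionTimeout and authenticationTimeout (the screenshot is not preserved, so this is an inference from the closing comment). buildClientConfig and the "my-app" client ID are hypothetical, and the broker address and credentials are placeholders.

```javascript
// Hypothetical helper: builds a KafkaJS client config with both timeouts raised.
// Assumption: the two parameters raised to 5000 in the thread are
// connectionTimeout and authenticationTimeout (both in milliseconds).
function buildClientConfig({ brokers, username, password, timeoutMs = 5000 }) {
  return {
    clientId: "my-app", // placeholder client ID
    brokers,
    ssl: true,
    sasl: { mechanism: "plain", username, password },
    connectionTimeout: timeoutMs,     // time to wait for a successful connection
    authenticationTimeout: timeoutMs, // time to wait for SASL authentication
  };
}

// Possible usage with the client from the original post:
// const kafka = new Kafka.Kafka(buildClientConfig({
//   brokers: ["<broker-host>:9092"],
//   username: "xxxxxx",
//   password: "xxxxxx",
// }));
```

Whether 5000 ms is the right value depends on the network path to the cluster; the thread only reports that it was tried.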
