Caffeine cache causing memory leak and OOM
Description
Recently we updated our services to Java 17 (v17.0.1), which also forced us to move to the New Relic Java agent version 7.4.0.
After the migration, we noticed our services regularly crash due to OOM. At first we suspected some issue with Java 17, but after obtaining several heap dumps, it looks like the problem is the Caffeine caching introduced in 7.3.0 (#258); we migrated from an older version of the agent.
We believe it might be related to the MariaDB driver we’re using. In one of the heap dumps, the com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.WS map holds 2.1 million entries of prepared statements. That would fit, since the cache relies on the mariadb-java-client implementation of PreparedStatement, which does not implement equals/hashCode.
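For illustration only, here is a minimal sketch (hypothetical names, not the agent’s actual code) of a weak-keyed Caffeine cache keyed on PreparedStatement, similar in shape to the WS map seen in the heap dumps; with weakKeys(), stale entries are only physically removed once Caffeine’s maintenance task drains its internal reference queue:

```java
import java.sql.PreparedStatement;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class WeakStatementCacheSketch {
    // Weak keys: an entry becomes eligible for removal once its PreparedStatement
    // is garbage collected, but the dead entry is only removed when Caffeine's
    // maintenance task (scheduled on the cache's executor) drains the key
    // reference queue.
    private final Cache<PreparedStatement, String> statementSql =
            Caffeine.newBuilder()
                    .weakKeys()
                    .build();

    void record(PreparedStatement statement, String sql) {
        statementSql.put(statement, sql);
    }
}
```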
After taking two more heap dumps, one right after startup of the app and one an hour later, this is the comparison:

It looks like the cache contained only PreparedStatement entries.
Expected Behavior
Agent and service running smoothly, no memory leak.
Troubleshooting or NR Diag results
Steps to Reproduce
Your Environment
NewRelic agent: 7.4.0
Java: eclipse-temurin:17 docker image (currently 17.0.1)
Spring Boot: 2.5.6
MariaDB driver: 2.7.4 (managed by Spring Boot BOM)
Additional context
Caffeine’s fix was released in v2.9.3 and v3.0.5. It allows the cache to recover if the executor malfunctions, but affected users should still apply a workaround for the JDK bug.
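One mitigation discussed later in this thread is to pin Caffeine’s maintenance work to the calling thread so it no longer depends on ForkJoinPool.commonPool(). A minimal sketch of that configuration (hypothetical names, not the agent’s actual code):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class SameThreadMaintenanceSketch {
    // executor(Runnable::run) makes Caffeine perform maintenance (including
    // draining the weak-key reference queue) on the calling thread instead of
    // submitting it to ForkJoinPool.commonPool(), the pool affected by the JDK bug.
    static <K, V> Cache<K, V> newWeakKeyedCache() {
        return Caffeine.newBuilder()
                .weakKeys()
                .executor(Runnable::run)
                .build();
    }
}
```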
Today the application used less memory, though there was still a slight upward trend. I created two heap dumps; in the meantime the application was restarted for unrelated reasons. Key findings:
- The WS instance referenced by ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache that contains the prepared statements is now small (just 267776 and 37752 bytes retained, respectively), with keyReferenceQueue.queueLength == 0, as it should be. 🎉
- All other ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache instances are now also small (on the order of kilobytes). I went back to yesterday’s heap dump and saw that a few other such instances were also larger back then (on the order of megabytes).
- All in all: whatever the issue/bug is, Caffeine.executor(Runnable::run) avoids it. ✔️
- Two other WS instances that take up a significant amount of memory remain:
  - JdbcHelper.connectionToURL, comprising 22.17% / 8.12% of the heap, with a keyReferenceQueue.queueLength of 616446 / 102386.
  - JdbcHelper.connectionToIdentifier, comprising 18.47% / 5.74% of the heap, with a keyReferenceQueue.queueLength of 501032 / 76343.
- IIUC these caches are created by AgentCollectionFactory.createConcurrentWeakKeyedMap, for which we did not set Caffeine.executor(Runnable::run). It’s safe to say that these two caches explain why memory usage was still trending upwards today.

As previously discussed, the next step would be to test whether the issue is specific to Java 17 by downgrading to (e.g.) Java 16. Unfortunately my schedule today was even busier than expected, and tomorrow will certainly be no better. So instead I have started a different experiment that took a bit less time to set up: I have now updated the New Relic agent to use Executors.newCachedThreadPool() rather than Runnable::run (this time also for the cache in AgentCollectionFactory). This experiment should hopefully tell us whether (a) the issue is somehow related to any kind of scheduling/thread switching or (b) it is more likely to be specific to ForkJoinPool.commonPool(). Stay tuned 😃
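To make the experiment concrete, here is a rough sketch (hypothetical names, not the agent’s actual code) of swapping the cache’s maintenance executor from Runnable::run to a dedicated cached thread pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class CachedThreadPoolExperimentSketch {
    // A dedicated pool replaces both Runnable::run and the default
    // ForkJoinPool.commonPool(): if the leak stays away, the problem is likely
    // specific to commonPool(); if it returns, asynchronous maintenance in
    // general is implicated.
    private static final ExecutorService MAINTENANCE_EXECUTOR =
            Executors.newCachedThreadPool();

    static <K, V> Cache<K, V> newWeakKeyedCache() {
        return Caffeine.newBuilder()
                .weakKeys()
                .executor(MAINTENANCE_EXECUTOR)
                .build();
    }
}
```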