Caffeine cache causing memory leak and OOM
Description
Recently we updated our services to Java 17 (v17.0.1), which also forced us to move to the New Relic Java agent version 7.4.0.
After the migration, we noticed our services regularly crash due to OOM. At first we suspected some issue with Java 17, but after obtaining several heap dumps, it looks like the problem is the Caffeine caching introduced in 7.3.0 (#258); we migrated from an older version of the agent.
We believe it might be related to the MariaDB driver we’re using. In one of the heap dumps, the com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.WS map holds 2.1 million entries of prepared statements. That would fit, since the cache relies on the mariadb-java-client implementation of PreparedStatement, which does not implement equals/hashCode.
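For illustration only, here is a minimal sketch (hypothetical names, not the agent’s actual code) of a weak-keyed Caffeine cache keyed on PreparedStatement, similar in shape to the WS map seen in the heap dumps; with weakKeys(), stale entries are only physically removed once Caffeine’s maintenance task drains its internal reference queue:

```java
import java.sql.PreparedStatement;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class WeakStatementCacheSketch {
    // Weak keys: an entry becomes eligible for removal once its PreparedStatement
    // is garbage collected, but the dead entry is only removed when Caffeine's
    // maintenance task (scheduled on the cache's executor) drains the key
    // reference queue.
    private final Cache<PreparedStatement, String> statementSql =
            Caffeine.newBuilder()
                    .weakKeys()
                    .build();

    void record(PreparedStatement statement, String sql) {
        statementSql.put(statement, sql);
    }
}
```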
After taking two more heap dumps, one right after startup of the app and one an hour later, this is the comparison:

It looks like the cache contained only PreparedStatement entries.
Expected Behavior
Agent and service running smoothly, no memory leak.
Troubleshooting or NR Diag results
Steps to Reproduce
Your Environment
NewRelic agent: 7.4.0
Java: eclipse-temurin:17 docker image (currently 17.0.1)
Spring Boot: 2.5.6
MariaDB driver: 2.7.4 (managed by Spring Boot BOM)
Additional context
Caffeine’s fix was released in v2.9.3 and v3.0.5. It allows the cache to recover if the executor malfunctions, but affected users should still apply a workaround for the JDK bug.
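One mitigation discussed later in this thread is to pin Caffeine’s maintenance work to the calling thread so it no longer depends on ForkJoinPool.commonPool(). A minimal sketch of that configuration (hypothetical names, not the agent’s actual code):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class SameThreadMaintenanceSketch {
    // executor(Runnable::run) makes Caffeine perform maintenance (including
    // draining the weak-key reference queue) on the calling thread instead of
    // submitting it to ForkJoinPool.commonPool(), the pool affected by the JDK bug.
    static <K, V> Cache<K, V> newWeakKeyedCache() {
        return Caffeine.newBuilder()
                .weakKeys()
                .executor(Runnable::run)
                .build();
    }
}
```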
Today the application used less memory, though there was still a slight upward trend. I created two heap dumps; in the meantime the application was restarted for unrelated reasons. Key findings:
- The WS instance referenced by ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache that contains the prepared statements is now small (just 267776 and 37752 bytes retained, respectively), with keyReferenceQueue.queueLength == 0, as it should be. 🎉
- All other ExtensionHolderFactoryImpl.ExtensionHolderImpl.instanceCache instances are now also small (on the order of kilobytes). I went back to yesterday’s heap dump and saw that a few other such instances were also larger back then (on the order of megabytes).
- All in all: whatever the issue/bug is, Caffeine.executor(Runnable::run) avoids it. ✔️
- Two other WS instances that take up a significant amount of memory remain:
  - JdbcHelper.connectionToURL, comprising 22.17% / 8.12% of the heap, with a keyReferenceQueue.queueLength of 616446 / 102386.
  - JdbcHelper.connectionToIdentifier, comprising 18.47% / 5.74% of the heap, with a keyReferenceQueue.queueLength of 501032 / 76343.
- IIUC these caches are created by AgentCollectionFactory.createConcurrentWeakKeyedMap, for which we did not set Caffeine.executor(Runnable::run). It’s safe to say that these two caches explain why memory usage was still trending upwards today.

As previously discussed, the next step would be to test whether the issue is specific to Java 17 by downgrading to (e.g.) Java 16. Unfortunately my schedule today was even busier than expected, and tomorrow will certainly be no better. So instead I have started a different experiment that took a bit less time to set up: I have now updated the New Relic agent to use Executors.newCachedThreadPool() rather than Runnable::run (this time also for the cache in AgentCollectionFactory). This experiment should hopefully tell us whether (a) the issue is somehow related to any kind of scheduling/thread switching or (b) it is more likely to be specific to ForkJoinPool.commonPool(). Stay tuned 😃
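To make the experiment concrete, here is a rough sketch (hypothetical names, not the agent’s actual code) of swapping the cache’s maintenance executor from Runnable::run to a dedicated cached thread pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class CachedThreadPoolExperimentSketch {
    // A dedicated pool replaces both Runnable::run and the default
    // ForkJoinPool.commonPool(): if the leak stays away, the problem is likely
    // specific to commonPool(); if it returns, asynchronous maintenance in
    // general is implicated.
    private static final ExecutorService MAINTENANCE_EXECUTOR =
            Executors.newCachedThreadPool();

    static <K, V> Cache<K, V> newWeakKeyedCache() {
        return Caffeine.newBuilder()
                .weakKeys()
                .executor(MAINTENANCE_EXECUTOR)
                .build();
    }
}
```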