Many Slow Consumers All At Once On ActiveMQ
In a previous post, we discussed consumer buffering and caching. We had to dig a little deeper into that when we noticed that several of our consumers seemed to be running slowly while the entire system (broker, producers, and consumers) was under extremely heavy load.
The initial thought was that we'd reached a hard limit in our system - max CPU, memory or even disk IO. However, none of those showed any real signs of stress or high values.
Digging a little deeper, it seemed that we had a few slow consumers, but that had always been the case without affecting other queues or consumers.
Combining the two issues and looking even deeper into the memory usage limits and consumer behaviors, we started to see a problem: heavy load plus a few high volume, slow consumers can cause a knock-on effect to many other consumers.
The key background has to do with the ActiveMQ memory limits and the way that the destination (queue) cursor works. ActiveMQ has a global memoryUsage setting - this is the max space it can use. It typically doesn't let destinations use all of it, because it holds some in reserve: keeping the broker up is more important.
Meanwhile, incoming messages for a persistent queue get assigned memory in the destination cursor. A somewhat healthy cursor looks like the image above - where it goes up and down as the messages come in and get read off. The image is from a test case with limited memoryUsage and large messages (and some settings described below), so a single message takes a decent share of the overall space available.
In order to preserve the broker, the cursors are generally limited to only 70% of the broker memory usage. Once that has been hit, the cursors for destinations and consumers can be blocked from taking and sending more messages until the slow queues have reduced their usage. Again, this shouldn't be a problem most of the time, especially on a broker with fast consumers or small, slow consumer queues. However, if a few slow consumers build up a large backlog, it can affect all consumers, slowing them down to the clearing rate of the slow consumers. (Btw, I'm talking about persistent queues here - non-persistent queues, topics, and older versions (before 5.5?) of ActiveMQ can behave differently.)
What's the fix? Basically, the fix is to stop the slow consumers from hogging the memory, and there are a few ways to do that. Again, the memory allocated for all queue cursors is 70% of the global memoryUsage limit. One option is to increase the memoryUsage limit - this turns out to be the easiest way to deal with it!
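As a rough sketch of that first option, the global limit lives in the systemUsage section of activemq.xml. The "1 gb" value here is only an illustrative assumption; pick a value that fits the JVM heap the broker actually has.

```xml
<!-- Inside the <broker> element of activemq.xml -->
<systemUsage>
  <systemUsage>
    <memoryUsage>
      <!-- Illustrative value only; must fit within the broker's JVM heap -->
      <memoryUsage limit="1 gb"/>
    </memoryUsage>
  </systemUsage>
</systemUsage>
```

Since the queue cursors get 70% of this value by default, raising it raises the point at which slow queues start to block everyone else.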
Another option is to impose memory limits per queue. This is best done with memoryLimit (often in conjunction with producer flow control) like this:
<policyEntry queue=">" cursorMemoryHighWaterMark="50" memoryLimit="4 mb" producerFlowControl="true" />
Notice that we've also added cursorMemoryHighWaterMark. It defaults to 70 and is the setting behind the 70% of memoryUsage described above. When a memoryLimit is set, the high water mark should be applied to that memoryLimit rather than to the global memoryUsage. In the above XML, we've applied it to all queues, but that could be narrowed to known slow queues. memoryLimit and cursorMemoryHighWaterMark are not normally set, so that queues have the flexibility to use what they need. Producer flow control is on by default, btw. More details on these settings can be found in the per-destination policies documentation.
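To target only known slow queues, the same entry can sit in a policy map with a narrower queue pattern. This is a minimal sketch: the "SLOW.>" pattern is a hypothetical naming convention, and the limits simply reuse the values from the entry above.

```xml
<!-- Inside the <broker> element of activemq.xml -->
<destinationPolicy>
  <policyMap>
    <policyEntries>
      <!-- "SLOW.>" is a hypothetical pattern; substitute your own slow queue names -->
      <policyEntry queue="SLOW.>" cursorMemoryHighWaterMark="50"
                   memoryLimit="4 mb" producerFlowControl="true"/>
    </policyEntries>
  </policyMap>
</destinationPolicy>
```

Queues that don't match the pattern keep the default behavior and can use whatever memory they need.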
We aren't the only ones that have come across this problem - an ActiveMQ committer posted a test case so that you can try out various settings (some updates such as the pom.xml source/target and ActiveMQ version might be needed).
Our tests showed that the best and simplest solution was to increase memoryUsage. However, if that isn't workable or you want a more targeted solution, then the memoryLimit and cursorMemoryHighWaterMark settings worked well, even more so if you can target the specific slow queues. You'll see producer flow control kick in.
One interesting point to note though is that by watching the JMX MBeans, it was clear that the slow consumer in the above test case still consumed a much larger portion of memory than these settings should allow. So, we'll need to dig into the code more to understand the workings and why that happens. However, the memory limits and high water marks should help. Btw, we also noticed differences in behavior due to differences in ActiveMQ version, but that is to be expected.
If this still isn't enough, there are a few other settings to try on the queues. We found prefetch and store cursor (<pendingQueuePolicy> <storeCursor/> </pendingQueuePolicy>) not to make a difference. Combined, these settings might make a difference, but we didn't observe any in a simple test.
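For reference, prefetch and the store cursor can both be set on a policy entry. The sketch below is illustrative only: the "SLOW.>" pattern and the queuePrefetch value of 10 are assumptions, not necessarily the values used in the tests described here.

```xml
<!-- Hypothetical queue pattern and prefetch value, for illustration only -->
<policyEntry queue="SLOW.>" queuePrefetch="10">
  <pendingQueuePolicy>
    <storeCursor/>
  </pendingQueuePolicy>
</policyEntry>
```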
There's also a setting to try if none of the above is enough. At the broker level, the memory used by producers and consumers is shared/common. However, that can be split out using this setting: splitSystemUsageForProducersConsumers. This setting at the <broker> level splits the memory between the producers and consumers - 60/40 by default:
<broker splitSystemUsageForProducersConsumers="true" producerSystemUsagePortion="60" consumerSystemUsagePortion="40" >
We didn't try that approach, but might come back to it if needed.
A few references if it helps: