Wednesday, April 24, 2013

Performance Issues in ActiveMQ

At times we've seen messages take a long time to be consumed even though the consumers are active and fast. The consumers aren't working hard and the message queue isn't under load, but the messages aren't being processed quickly (in our case at 20% of the normal pace).

Using the activemq-admin command (see the post here), we queried the stats on the queue.  The results confirmed our observations and pointed to the problem.  The output was like this:

For reference (see this link on some stats meanings and this similar link):
EnqueueCount =  number of messages sent and committed to the queue
DequeueCount = number of messages read from queue and committed or acknowledged by consumer
DispatchCount = number of messages read from the queue (DequeueCount + InFlightCount)
InFlightCount = number of read messages waiting for acknowledgement from consumer

A DispatchCount or InFlightCount that is high relative to DequeueCount is not good: it means messages are being handed to consumers but not acknowledged.
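
As a rough illustration of how these counters relate, here is a minimal Java sketch. It does not talk to ActiveMQ; it only takes counter values you have already read (for example from activemq-admin) and applies the definitions above. The class name, the method names and the threshold parameter are all hypothetical, and what counts as "too high" is an assumption you would tune for your own queues.

```java
// Hypothetical helper; none of these names come from ActiveMQ itself.
final class QueueStatsCheck {

    // Per the definitions above: DispatchCount = DequeueCount + InFlightCount,
    // so the in-flight messages are the dispatched ones not yet acknowledged.
    static long inFlight(long dispatchCount, long dequeueCount) {
        return dispatchCount - dequeueCount;
    }

    // Dispatches per acknowledged message; 1.0 means every dispatch was acknowledged.
    static double dispatchRatio(long dispatchCount, long dequeueCount) {
        if (dequeueCount == 0) {
            return dispatchCount == 0 ? 1.0 : Double.POSITIVE_INFINITY;
        }
        return (double) dispatchCount / dequeueCount;
    }

    // The threshold is an assumed, user-chosen value, not an ActiveMQ setting.
    static boolean looksUnhealthy(long dispatchCount, long dequeueCount, double threshold) {
        return dispatchRatio(dispatchCount, dequeueCount) > threshold;
    }

    public static void main(String[] args) {
        // Made-up counter values, for illustration only.
        System.out.println(inFlight(1050, 1000));
        System.out.println(looksUnhealthy(12000, 1000, 2.0));
    }
}
```

A ratio close to 1 matches the healthy case; a ratio far above 1 means far more messages have been dispatched than acknowledged.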

The most notable instance of this slow message delivery occurred on our network-of-brokers cluster, which runs two ActiveMQ servers with a network connection between the two.  Clients connect to either broker as they wish. The main signs of the issue were:
  • a backlog of messages was waiting to be consumed (i.e. the producer was working fine)
  • the app had active consumers
  • messages were being consumed at what appeared to be 1/10th the normal speed
We were monitoring the consumer's processing time per message (about 1s per message - each message required plenty of work), and the consumer could process messages much faster than they were actually being consumed.  The consumer app wasn't under heavy load; in fact, it wasn't working enough.  There was one other odd bit of behavior: the consumers were switching back and forth between the two brokers very quickly, every few seconds.

The interesting part of the stats is DequeueCount vs DispatchCount. DequeueCount is a count of the messages delivered and acknowledged (committed) by a consumer; DispatchCount is the number of messages sent to consumers regardless of whether they were acknowledged.  Ideally, these two would match, but normally DispatchCount will be a little higher than DequeueCount.  Here DispatchCount is more than 10 times DequeueCount - messages are being sent again and again without being processed.  ActiveMQ has prefetch limits which might allow a consumer to grab hundreds or even a thousand messages at once - that's generally good for performance.  However, we'd seen the consumers moving from broker to broker every few seconds.  That meant a consumer couldn't process all of its prefetched messages before it reconnected, and on every reconnect it would grab another group of messages that it couldn't process fast enough.
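
To make that reasoning concrete, here is a toy model in Java. It is not ActiveMQ code and it simplifies heavily: it assumes a single consumer, that every reconnect prefetches a full batch, that only a fixed number of messages get acknowledged before the next reconnect, and that the unacknowledged rest go back to the queue and are counted again when redispatched. The class name, method name and parameters are all hypothetical.

```java
// Toy model only: hypothetical names, no ActiveMQ dependency.
final class PrefetchChurnModel {

    // Drains a backlog and returns {dispatchCount, dequeueCount}.
    // prefetch               = messages handed to the consumer on each (re)connect
    // processedPerConnection = messages acknowledged before the consumer reconnects
    static long[] simulate(int backlog, int prefetch, int processedPerConnection) {
        if (backlog < 0 || prefetch < 1 || processedPerConnection < 1) {
            throw new IllegalArgumentException(
                "backlog must be >= 0; prefetch and processedPerConnection must be >= 1");
        }
        long dispatchCount = 0;
        long dequeueCount = 0;
        int remaining = backlog;
        while (remaining > 0) {
            // Each connection is handed up to a full prefetch batch.
            int dispatched = Math.min(prefetch, remaining);
            dispatchCount += dispatched;
            // Only part of the batch is acknowledged before the reconnect;
            // the rest stays in the backlog and is dispatched again later.
            int acknowledged = Math.min(processedPerConnection, dispatched);
            dequeueCount += acknowledged;
            remaining -= acknowledged;
        }
        return new long[] {dispatchCount, dequeueCount};
    }

    public static void main(String[] args) {
        long[] churning = simulate(100, 10, 2);
        System.out.println("dispatched=" + churning[0] + " dequeued=" + churning[1]);
    }
}
```

With a backlog of 100, a prefetch of 10 and only 2 messages acknowledged per connection, the model reports 480 dispatches for 100 acknowledged messages. When the consumer finishes each batch before reconnecting (for example simulate(100, 10, 10)), the two counts match.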

To fix this, we added randomize=false to the connection string on the consumer. This tells the consumer to prefer one broker and stick to that broker as much as possible.  See more here: ActiveMQ Failover Transport.  The first broker listed will have priority with randomize=false.  Setting this solved the problem quickly when we really needed it and has kept things running much better.
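
For reference, a failover connection string with this option has the general shape failover:(uri1,uri2)?randomize=false. The Java sketch below only assembles such a string. The class and method names are hypothetical, the broker host names are placeholders, and port 61616 is assumed because it is ActiveMQ's default OpenWire port. Our real connection string is not shown here.

```java
// Hypothetical helper that builds a failover URI string; it does not connect to anything.
final class FailoverUri {

    // Brokers are listed in order; with randomize=false the first one is preferred.
    static String build(boolean randomize, String... brokerUris) {
        if (brokerUris.length == 0) {
            throw new IllegalArgumentException("at least one broker URI is required");
        }
        return "failover:(" + String.join(",", brokerUris) + ")?randomize=" + randomize;
    }

    public static void main(String[] args) {
        // Placeholder host names, not our real brokers.
        System.out.println(build(false, "tcp://broker-a:61616", "tcp://broker-b:61616"));
        // prints failover:(tcp://broker-a:61616,tcp://broker-b:61616)?randomize=false
    }
}
```

The resulting string is what you would pass as the broker URL to the client's connection factory, with the preferred broker listed first.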

Since then, we have occasionally seen instances where consumers with randomize=false are still struggling.  Sometimes this has required a restart (of the app or of the broker), but this is rarer.  When this has happened, we've noticed that a large number of clients had disconnected, which appears to be doing something to the broker connection - a rolling restart of the brokers fixes the problem quickly.  The issue seems to be that the consumer respects randomize=false and sticks to one broker, but the messages are on the other broker and moving slowly; the broker restarts fix the connection and tend to move the messages to the consumer's broker.