Monday, July 30, 2012

Network of Brokers Revisited

Is an ActiveMQ network of brokers a reliable choice? As mentioned in the performance improvements post, a network of brokers is a way to scale ActiveMQ horizontally. Is it a reliable choice, though? Based on our experience, it looks increasingly unlikely.

We switched to a KahaDB-backed network of brokers configuration when our MS SQL Server-backed master/slave configuration couldn't handle some heavy load. We didn't have a shared filesystem (such as a NAS device), so a network of brokers was the only other failover option.

Our network connector is simple. One broker establishes a duplex connection to the other broker - only one broker has the connector.
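As a rough sketch of that setup (the broker name, hostname and port below are illustrative assumptions, not our actual values), the connector lives in only one broker's activemq.xml:

```xml
<!-- On broker A only: a single duplex network connector to broker B.        -->
<!-- duplex="true" lets messages flow in both directions over this one       -->
<!-- connection, so broker B needs no connector of its own.                  -->
<networkConnectors>
  <networkConnector name="bridge-to-brokerB"
                    uri="static:(tcp://brokerB:61616)"
                    duplex="true"/>
</networkConnectors>
```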

Initially, the network of brokers ran well and was faster than before, suffering an issue only about once every three months. As the number of our queues grew (rapid development), the network of brokers became increasingly troublesome. It became clear that one broker (the receiver of the network connector) was so burdened by thread load from both the queues and the network connector (the ActiveMQ connector, not the TCP/IP connection) that it was doing nothing except being a burden on the other broker. We used a number of the vertical scaling features mentioned in the performance post to bring that thread load under control and get both brokers back into operation.

The new configuration was running well until someone dumped 500k+ messages onto a couple of queues in a short amount of time. Even with the new configuration, the network connector broke under this load. We'd seen this happen in a few of our heavy, repeated load tests, but thought it might be an artifact of the way we were running the tests. Sadly, it doesn't look like it was.

We now believe that under heavy load the network of brokers loses connections on certain queues and the two brokers end up working in a split-brain setup - often with message producers on one broker and consumers on the other. The fix is to restart a broker, which resets the network connection and invokes failover behavior (consumers and producers move to one broker). Expect some delay (up to a few minutes) in restarting if this happens during heavy load or, rather, when there are many KahaDB journal files, as ActiveMQ/KahaDB has to read a lot of file data on startup.
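For context, the failover behavior comes from the client-side failover transport: clients list both brokers in their connection URI and reconnect to whichever one is up. A sketch of such a URI (broker hostnames and ports are assumptions, not our real ones):

```
failover:(tcp://brokerA:61616,tcp://brokerB:61616)
```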

The network of brokers was our configuration to handle failover and heavy load, but if it is unreliable during very heavy load, then it's not right for us.

What's the next step? The next step is to configure a shared filesystem (using NFS v4) and try an active/passive configuration with a shared KahaDB data store.
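As a sketch of what that would look like (the mount path is an assumption), both brokers point their persistence adapter at the same NFSv4-mounted directory; whichever broker acquires the KahaDB file lock first becomes the active master, and the other blocks as a passive slave until the lock is released:

```xml
<!-- Identical on both brokers: shared KahaDB store on an NFSv4 mount.    -->
<!-- The directory path is illustrative. The broker that acquires the     -->
<!-- file lock becomes master; the other waits as a passive slave.        -->
<persistenceAdapter>
  <kahaDB directory="/mnt/nfs/activemq/kahadb"/>
</persistenceAdapter>
```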

Sunday, July 15, 2012

KahaDB log files not clearing in ActiveMQ

After we upgraded to ActiveMQ 5.6.0 and changed a number of configuration options listed in the vertical scaling post, things were moving brilliantly (see the post on testing the new configuration as well).

However, after a couple of weeks, disk space usage on one of the brokers continued to grow. Looking at the data/kahadb directory, we saw that log files going all the way back to the first log still existed. This sounded like a frequent ActiveMQ problem where it doesn't clear its log files after use (users seem to log this issue every other release). Only the broker that received the duplex network connection was suffering; the broker that established that connection was fine. It seemed like a problem with acknowledgement of consumed messages.

We turned on logging as detailed at http://activemq.apache.org/why-do-kahadb-log-files-remain-after-cleanup.html:
log4j.appender.kahadb=org.apache.log4j.RollingFileAppender
log4j.appender.kahadb.file=${activemq.base}/data/kahadb.log
log4j.appender.kahadb.maxFileSize=1024KB
log4j.appender.kahadb.maxBackupIndex=5
log4j.appender.kahadb.append=true
log4j.appender.kahadb.layout=org.apache.log4j.PatternLayout
log4j.appender.kahadb.layout.ConversionPattern=%d [%-15.15t] %-5p %-30.30c{1} - %m%n
log4j.logger.org.apache.activemq.store.kahadb.MessageDatabase=TRACE, kahadb
and used jconsole to invoke the reload-log4j operation on the broker MBean, reloading the logging configuration without a restart.

This showed that the broker didn't find any logs to clear - the first attempt to find a free log produced no candidates and the cleanup failed. This didn't add much information except that there was a fundamental problem. We asked a question on the ActiveMQ forums, which was as useless as our previous questions. This left us with two obvious options:
1) restart the troublesome broker
2) clear any needed messages from the broker, shut it down, clear off the KahaDB files and start fresh.

We started with option 1, but after the broker started taking a while to load the many, many GBs of old logs, we grew concerned about message replay due to unacknowledged message consumption. So we shut the broker down (we had already saved any pending messages), cleared off the KahaDB files and started it again.

After several hours, the troubled broker looked healthy again and was clearing off early log files.  Issue closed for now!