Monday, July 30, 2012

Network of Brokers Revisited

Are ActiveMQ Network of Brokers a reliable choice? As mentioned on the performance improvements post, network of brokers is a way to horizontally scale ActiveMQ. Is it a reliable choice, though? Looks increasingly unlikely based on our experience.

We switched to a KahaDB backed network of brokers configuration when our MS SQL Server backed master/slave configuration couldn't handle some heavy load. We didn't have shared filesystems (like NAS devices) so network of brokers was the only other failover option.

Our network connector is simple. One broker establishes a duplex connection to the other broker - only one broker has the connector.

Initially, the network of brokers ran well and was/is faster than before only suffering from an issue about once every three months. As the number of our queues grew (rapid development), the network of brokers became increasingly troublesome. It became clear that one broker (the receiver of the network connector) was so burdened by thread load from both the queues and the network connector (ActiveMQ, not the tcp/ip network connection) that it was doing nothing except being a burden on the other broker. We used a number of the vertical scaling features mentioned in the performance post here to bring that thread load under control and get both brokers back into operation.

The new configuration was running well until someone dumped 500k+ messages onto a couple of queues in a short amount of time. Even with the new configuration, the network connector had broken under this load. We'd seen this happen in a few of our heavy, repeated load tests, but thought it might be an artifact of the way we were running the tests. Sadly, it doesn't look like it was an artifact.

We now feel that under heavy load the network of brokers will lose connections on certain queues and the two brokers will work in a split brain setup - often with messages producers on one broker and consumers on the other. The fix is to restart a broker which resets the network connection and invokes failover behavior (consumers and producers on one broker). Expect some delay (up to a few minutes) in restarting if this happens during heavy load or rather with loads of db files as ActiveMQ/KahaDB has to read loads of file data.

The network of brokers was our configuration to handle failover and heavy load, but if it is unreliable during very heavy load, then it's not right for us.

What's the next step?  The next step is to configure a shared filesystem (using NFS v4) and try an active/passive configuration with shared KahaDB data store.

1 comment:

  1. Thank you JM for sharing your experiences here. They are really helpful. I just wanted to thank you in comment.

    ReplyDelete