Monday, September 28, 2015

Replicated LevelDB

We've recently decided to upgrade our ActiveMQ instances and thought we'd try replicated LevelDB, hoping it would answer our shared-storage problems: the shared SQL store was too slow (even on fast SQL Servers), and shared locking on NFS simply didn't lock. We used three ActiveMQ brokers and three ZooKeeper instances, with the general setup and configuration following the Replicated_LevelDB page. We also put HAProxy (any load balancer would do) in front of the brokers so that one URL would reach whichever broker was active. One feature we were not able to run with replicated LevelDB was delayed message queuing, but we could live without that.
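For reference, the persistence adapter side of that setup follows the pattern from the Replicated_LevelDB page. A minimal sketch of the relevant section of activemq.xml, with placeholder hostnames and paths rather than our actual values:

```xml
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker1">
  <persistenceAdapter>
    <!-- Three replicas: a write is acknowledged once a quorum (2 of 3) has it.
         zk1/zk2/zk3 are placeholder ZooKeeper hostnames. -->
    <replicatedLevelDB
        directory="${activemq.data}/leveldb"
        replicas="3"
        bind="tcp://0.0.0.0:61619"
        zkAddress="zk1:2181,zk2:2181,zk3:2181"
        zkPath="/activemq/leveldb-stores"
        hostname="broker1"/>
  </persistenceAdapter>
</broker>
```

Each broker points at the same zkAddress and zkPath; only the elected master opens its transport connectors, which is what lets a load balancer's health checks route clients to whichever node is currently master.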

This three-node replicated configuration worked fine for weeks, although occasionally one of the nodes would fail and we'd need to bring it back; the logs indicated it had timed out and shut down. Then we had a cascade of these shutdowns: node 1 would fail and restart, then node 3, then node 2. To their credit, the nodes did try to restart themselves, but after a few minutes of rapid failover and recovery they gave up and shut down entirely, often leaving one node up - but not as master, since a quorum wasn't available.

After restarting the nodes to get the system running again, we checked the logs and saw messages about ZooKeeper timeouts on the order of a few seconds (2-3s). These nodes are all in the same rack in the same data center; network round trips should be 1-2ms (sometimes spiking to 7-10ms), not a thousand times higher. We left the cluster running to see if it would recur, and within a month it did, in much the same way. For operational sanity, we turned off replicated LevelDB and are back to a non-HA solution while we investigate. To put this in perspective, the problem could be in our ZooKeeper setup - we have had to put effort into tuning ZooKeeper in the past. The ActiveMQ developers also mention (can't find the link now) that replicated LevelDB is cutting edge and might not be ready for full production use.
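One knob we would look at before retrying is the ZooKeeper session timeout: the replicatedLevelDB adapter exposes a zkSessionTimeout attribute, and we believe its default is around 2s, which lines up suspiciously well with the 2-3s timeouts in our logs. A sketch of loosening it, assuming the stock attribute names (values here are illustrative, not tested recommendations):

```xml
<persistenceAdapter>
  <!-- Raise the ZooKeeper session timeout so transient multi-second stalls
       (GC pauses, brief network blips) don't trigger a master re-election. -->
  <replicatedLevelDB
      directory="${activemq.data}/leveldb"
      replicas="3"
      zkAddress="zk1:2181,zk2:2181,zk3:2181"
      zkPath="/activemq/leveldb-stores"
      zkSessionTimeout="10s"/>
</persistenceAdapter>
```

The trade-off is that a longer session timeout also delays legitimate failover, since ZooKeeper waits that long before declaring a dead master's session expired.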

Sunday, March 1, 2015

ActiveMQ 6 - ActiveMQ + HornetQ

Having looked at the latest ActiveMQ releases (5.11 specifically, although 5.11.1 is out) and the future of the Apollo project, which uses a threading and message-dispatch model based on HawtDispatch (the Java port of libdispatch/Grand Central's reactor design), we started to wonder when we'd see something new, presumably in ActiveMQ 6. It looks like something new is indeed coming.

ActiveMQ 6 looks to be a combination of HornetQ and ActiveMQ, with more HornetQ flavor than ActiveMQ. Since we're still using mostly JMS, HornetQ has been our most likely alternative to ActiveMQ should we want to change; HornetQ has also won some speed tests. It turns out the two have more in common than we realized - not only is there a decent amount of overlap in capabilities, but both sets of developers largely work for the same company. Commercially, ActiveMQ is part of Red Hat's JBoss Fuse ESB, and HornetQ is the open-sourced child of Red Hat's JBoss broker.

HornetQ's project manager seems to have raised the idea, and the ActiveMQ dev community seems fairly eager to join forces rather than split mind share. Commits to the ActiveMQ repos for version 6 look to be working through the incorporation of the code, although the page on the official site (at the moment) still mentions Apollo as the future core of AMQ 6. The HornetQ dev forums also see the combination as very likely to happen.

So, what is going on with the Apollo reworking?  The main developer mentions that the use of Scala may have caused the slow uptake by other developers, and that he was considering a port to Java to fix that.  The HornetQ donation may have saved him the work.

All in all, the inclusion of HornetQ (and its user base) or the Apollo core in AMQ 6 would be a good step forward - we'll have to wait and see what happens. We might put a few experiences with HornetQ on this blog just in case.