Sunday, April 28, 2019

Sizes of Large Directories with Gluster, SSHFS, NFS

This post covers a few things: checking the number of entries or rough sizes of a directory, a look at the behavior of Gluster, NFS, SSHFS, and SFTP for remote directory sizes, and some info on mounting remote file systems using NFS or SSHFS (Gluster is a topic for another day).

First, how to check the rough size of a directory. We know that checking the number of files in a large directory locally can be slow. Doing that check over a remote mount like Gluster can be much slower and even cause the Gluster mount to crash.
Generally, under Linux/Unix, you can get a rough estimate of the size of a directory by looking at the output of ls -l in its parent directory. For example, let's use a test directory to check the behavior: ~/test. We've added two subdirectories: dir_small, with no files in it, and dir_large, with 5000 files in it. ls -l ~/test gives:
ls -l ~/test
drwxrwxr-x. 2 jm jm 143360 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

Here dir_small has the smallest size possible and dir_large is larger due to the 5000 files in it. Remember that in Linux/Unix, a directory is just a special type of file that keeps information about the list of files in that directory. 
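
The same sizes can be read directly, without scanning the ls output by eye. This is only a minimal sketch: it assumes GNU stat (for the -c option), and dir_sizes is a made-up helper name rather than a standard command. It prints the size of each immediate subdirectory's directory file, largest first, without listing what is inside those subdirectories:

```shell
# dir_sizes is a hypothetical helper; assumes GNU coreutils stat (-c) and sort.
dir_sizes() {
    parent="${1:-.}"
    for d in "$parent"/*/; do
        [ -d "$d" ] || continue      # skips the unexpanded glob when there are no subdirectories
        stat -c '%s %n' "$d"         # size of the directory file itself, then its name
    done | sort -rn
}
# usage: dir_sizes ~/test
```

Like ls -l on the parent, this is only a rough guide, and the numbers depend on the file system underneath.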

To count the files, or just to get a guide as to which directory is the largest, the safest options are:
ls -l ../ # only a rough guide, check parent directory to see directory 'size'
find dir_large -type f -print | wc -l # lists files one by one
echo * | awk '{print NF}' # expands the names onto one line; miscounts names containing spaces
ls | wc -l # performs more ops than echo * (as checked by strace)
If you think you have a large directory, don't run ls -l inside it, as that will stat each file, requiring far more operations and time.
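
If file names might contain spaces or newlines, the counts above can be off. Here is a minimal sketch of a count that avoids that, assuming GNU find (for -mindepth, -maxdepth and -printf); count_entries is a made-up helper name:

```shell
# count_entries is a hypothetical helper; -mindepth, -maxdepth and -printf assume GNU find.
count_entries() {
    # Emit one dot per directory entry and count the dots, so a name containing
    # spaces or newlines is still counted exactly once. Dotfiles are included.
    find "${1:-.}" -mindepth 1 -maxdepth 1 -printf '.' | wc -c
}
# usage: count_entries dir_large
```

Unlike echo * and plain ls, this also counts hidden files, so the number can be higher than the other commands report.
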
I point all of this out because we tried it with Gluster and ran into some issues. In case you don't know, Gluster or GlusterFS is a clustered file system. We've used it for standalone clusters within our data centers (DC). It can do cross-DC replication, although we've found too many problems with that feature, so we don't use it.

We had an issue the other day with a system mounting a Gluster volume and we wanted to check on files in a large collection of directories.

Interestingly, Gluster's directory sizes don't show the usual file size info. For the same directories as above, it showed
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

Running other commands like ls and find can be painful with a large Gluster volume. Some have given it a reputation for being almost unusable for certain interactive operations like ls, find, du, etc. Ultimately, the solution was to use find and its one-file-at-a-time checking; the alternative is to run these commands on the Gluster server rather than the client system.

Since Gluster reports directory sizes differently, we wanted to see whether that comes from Gluster itself or from the FUSE layer it uses. Interestingly, the Gluster brick (the file system being exported for remote mounting) shows the normal directory size. That is expected, as it's a normal Linux file system. We also tried NFS, which passes through the usual directory size:
drwxrwxr-x. 2 jm jm 143360 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

NFS doesn't use FUSE though so this only confirms that normal directory sizes are passed through with other remote file systems. 

We also tried SSHFS which uses FUSE for mounting. Just like NFS, SSHFS showed the right directory sizes:
drwxrwxr-x. 2 jm jm 143360 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

By the way, SFTP also showed the normal directory sizes, but you might assume that by now.

So it looks like Gluster is the cause here: it has chosen to reinterpret the meaning of a directory's size.
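
To repeat the comparison on your own mounts, something like the following may help. It is a rough sketch, assuming GNU stat (for -c and -f); report_dir is a made-up helper name, and the paths in the usage line are placeholders for wherever your Gluster, NFS, or SSHFS mounts live:

```shell
# report_dir is a hypothetical helper; assumes GNU coreutils stat (-c and -f).
report_dir() {
    for p in "$@"; do
        fstype=$(stat -f -c '%T' "$p") || continue   # file system type as seen by the client
        size=$(stat -c '%s' "$p") || continue        # size of the directory file in bytes
        printf '%s\t%s\t%s\n' "$fstype" "$size" "$p"
    done
}
# usage (placeholder paths): report_dir /mnt/gluster/dir_large /tmp/mount_point/dir_large
```

Printing the file system type next to the size makes it easier to see which kind of mount is reporting which number.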

Information on mounting NFS and SSHFS file systems
Instructions for SSHFS:
SSHFS isn't installed on all systems. On my test system (Fedora 29), I needed to install with
sudo dnf install fuse-sshfs
# make a mount point
mkdir ~/test/mount_point
# to mount, do:
sshfs remote_user@remote_host:/remote_directory ~/test/mount_point
# to unmount, do:
fusermount -u ~/test/mount_point
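
In a script, it can be handy to check whether the directory is actually mounted before unmounting. This is a small sketch, assuming the mountpoint command from util-linux is available; unmount_if_mounted is a made-up helper name:

```shell
# unmount_if_mounted is a hypothetical helper; assumes mountpoint (util-linux) and fusermount.
unmount_if_mounted() {
    if mountpoint -q "$1"; then
        fusermount -u "$1"           # same unmount command as above
    else
        echo "not mounted: $1"       # nothing to do, so this is not treated as an error
    fi
}
# usage: unmount_if_mounted ~/test/mount_point
```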

Instructions for NFS:
# edit the exports file and add your export directory
sudo vi /etc/exports
# add the line:  /home/jm/test localhost(ro) *.local.domain(ro)
# start nfs-server
sudo systemctl start nfs-server
sudo exportfs -a
# check the exported file system is available from the local or client system:
showmount -e test-nfs-server.local.domain
showmount -e localhost
# make a mount point
mkdir /tmp/mount_point
sudo mount -t nfs test-nfs-server.local.domain:/home/jm/test /tmp/mount_point
# to unmount, use the usual umount:
sudo umount /tmp/mount_point
# on the nfs server, un-export the fs:
sudo exportfs -ua
# stop nfs if you want:
sudo systemctl stop nfs-server
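
If you want to script the showmount check rather than read its output, one possible approach is below. It is a sketch that assumes showmount -e prints a header line followed by one export per line with the path in the first column; has_export is a made-up helper name:

```shell
# has_export is a hypothetical helper: reads `showmount -e` output on stdin and
# succeeds only if the given path appears as an exported directory.
has_export() {
    # NR > 1 skips the "Export list for ..." header line
    awk -v p="$1" 'NR > 1 && $1 == p { found = 1 } END { exit !found }'
}
# usage: showmount -e test-nfs-server.local.domain | has_export /home/jm/test && echo exported
```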

Monday, April 22, 2019

MySQL Galera Split Brain

The Galera cluster option for MySQL is one advantage MySQL has over Postgres. The clustering allows high availability and good performance. However, it's not without its issues, and one of them is the split-brain problem. There are two different kinds of split brain: a poorly configured cluster that can't achieve a quorum after the split you're worried about and, more rarely, an update issue between the nodes, which is interesting if frustrating.

MySQL Galera clusters have worked well and provided good uptime. The normal configuration is to have more active nodes in the primary location or data center. In the event of the link between the primary and secondary sites failing, the primary cluster should continue to run. It will continue running provided it has the majority of quorum votes on its own. It is important, therefore, to make sure that the quorum is achievable without the secondary location. One option is to keep the number of active nodes higher in the primary location. Another is to adjust the pc.weight of each node to make sure that the weight is larger in the primary location. See the Galera docs about setting the weight of a node. Either of these options makes the primary location safe from failures of the other locations, but still presents a problem if the primary data center has an issue. The remaining option is to use a 3rd location or witness to break ties or provide a quorum - you could do that with your own software, with a full set of servers in a 3rd location or use Galera's own solution. Galera's solution is garbd, the Galera Arbitrator, which acts as a witness or voting system when you only really have two main locations.
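
The quorum arithmetic can be checked on paper before a link ever fails. Below is a simplified sketch of the majority rule: a partition keeps quorum only if its weights add up to strictly more than half of the total. has_quorum is a made-up helper name, the weights stand in for each node's pc.weight, and the sketch ignores details such as nodes that left the cluster gracefully:

```shell
# has_quorum is a hypothetical helper, not a Galera tool.
# $1 = weights of all nodes in the cluster, $2 = weights of the nodes still reachable.
has_quorum() {
    total=0
    part=0
    for w in $1; do total=$((total + w)); done
    for w in $2; do part=$((part + w)); done
    # strictly more than half: an exact 50/50 split does not keep quorum
    [ $((part * 2)) -gt "$total" ]
}
# usage: has_quorum "1 1 1 1 1" "1 1 1" && echo "primary site keeps quorum"
```

For example, with two nodes per site and equal weights, neither side has a majority after a split, which is the case where a higher weight in the primary location or a witness such as garbd makes the difference.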

The second split brain issue is more interesting - i.e. it isn't a simple configuration or quorum issue. In this one, a Galera cluster shuts down on its own after detecting an issue. All of the active nodes except one would report something like this:

 "Duplicate entry 'entry_value_being_inserted' for key 'Key_for_column', Error_code: 1062 "
It might include "handler error HA_ERR_FOUND_DUPP_KEY" as well.

The issue here is that Galera replication has pushed updates to every node. Replication pushes the change synchronously but applies it asynchronously; flow control is used to prevent nodes from getting too far behind (the Galera docs describe this as 'commits asynchronously'). As you'd expect, it's RBR (row-based replication), not statement-based. What happens here is a case of a node falling behind. Each of the other nodes sees an inconsistency and, in order to protect the cluster, shuts itself down. Unfortunately, this can mean every node shuts down except the one lagging, inconsistent node. With only one node active, Galera will realize it doesn't have the majority of votes needed to maintain the cluster and shuts the remaining node down. To recover the cluster, you need to find the last node that was running and start it with the bootstrap option. Then start every other node as normal. This issue doesn't happen often, but it's good to understand it and how to recover when it does happen. By the way, there are some related issues that could do the same, so see the linked guide for how to fix this and related issues.
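
One way to compare nodes when looking for the one to bootstrap is to read Galera's saved state file on each of them. The sketch below assumes the file is grastate.dat in the MySQL data directory (often /var/lib/mysql, but that depends on your install) and that it contains "name: value" lines such as seqno and, on newer versions, safe_to_bootstrap. grastate_field is a made-up helper name:

```shell
# grastate_field is a hypothetical helper.
# $1 = field name (for example seqno), $2 = path to a grastate.dat file.
grastate_field() {
    # split on the colon plus any following spaces, then print the value of the matching field
    awk -F': *' -v k="$1" '$1 == k { print $2 }' "$2"
}
# usage (path is an assumption): grastate_field seqno /var/lib/mysql/grastate.dat
```

A seqno of -1 usually means the node did not shut down cleanly, so the file alone can't tell you which node is most advanced; the exact bootstrap command also varies by distribution and packaging.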

Saturday, April 20, 2019

ActiveMQ Network of Brokers Again

Network of Brokers with ActiveMQ
In the past, I've written about the ActiveMQ network of brokers and some of the issues that can come up.

That was some years ago, so what is it like now? Overall, it's very good and works reliably between different data centers over a WAN link. It also provides for an important use case: the ability to write in one location and read anywhere. The write and read go through the broker 'local' to the apps, so if both the producer and consumer are in the same location, the exchange is done 'locally' without having to commit to all brokers (or a quorum) everywhere first. That can have important performance benefits. However, there does seem to be a limit to the number of queues that can be bridged across a network of brokers. I don't know the number or what affects it, but assume a few hundred at most. The good news is that you can have several network connectors between two brokers, which means you can bridge many more than a few hundred queues between the same pair of brokers. The key is to create new networkConnectors as needed to handle groups of related (or similarly named) queues or topics.

Here's an example of different groups of queues per networkConnector, within the networkConnectors part of the ActiveMQ configuration XML:
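        <networkConnector name="QUEUE_SET_1" duplex="true" uri="static:(tcp://192.x.x.x:61616)">
                <!-- wrapper element is an assumption; staticallyIncludedDestinations is the alternative -->
                <dynamicallyIncludedDestinations><queue physicalName="stock.nasdaq.>"/></dynamicallyIncludedDestinations>
        </networkConnector>
        <networkConnector name="QUEUE_SET_2" duplex="true" uri="static:(tcp://192.x.x.x:61616)">
                <dynamicallyIncludedDestinations><queue physicalName="stock.nyse.>"/></dynamicallyIncludedDestinations>
        </networkConnector>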

Splitting out groups of queues like this allows more queues (or topics, the same applies to them) to be bridged between the brokers without any issues. For another example, look at the Duplex Connector example in the ActiveMQ docs or in this article.