Saturday, August 8, 2020

Trying out GraalVM

Interested in running Java a bit like running C/C++ or Go? I wanted to give it a try. GraalVM is a very interesting project for a few reasons: a new Java JIT compiler, the ability to run a variety of languages on the JVM, and, most importantly here, the native image capability. With native images, programs for the JVM are compiled down to binaries just like C, C++, Go, and other languages, improving startup time, decreasing the size of the deployable artifact, and removing the need to install a JVM on the target container or computer.


Most important to me were the performance and efficiency - the promise of Java code that could run faster and/or more efficiently (Java is already efficient according to this study, but still has some way to go to reach C-level performance). The other features are also appealing: a small, deployable binary; the question of whether Python code on GraalVM would run faster than it did on Jython; memory savings; faster startup times; and so on.

The takeaway is that many of these are delivered, but not all at once or all equally. Performance of native images sometimes lagged the normal JVM significantly.

To get started, download GraalVM. I used Java 11 rather than Java 8, although I later found that Java 11 has poorer performance than Java 8 or the newest (at the moment) Java 14 (OpenJDK 14). Once it's downloaded and unpacked (tar xzvf ...), cd into the unpacked directory. Then set some environment variables:
export PATH=$PWD/bin:$PATH
which java; java -version
which gu
The last few commands check that you're accessing the right Java location and version and that gu, the "GraalVM Component Updater", is available. java -version produced output including this: OpenJDK Runtime Environment GraalVM CE 20.1.0.

Using this trivial Java code, let's give GraalVM a try:
public class SimpleGraalExample {
    public static void main(String[] args) {
        System.out.println("Hello from the simplest GraalVM Example");
    }
}
Compile that with normal Java and run it:
javac SimpleGraalExample.java
date +%s%N; java SimpleGraalExample
Hello from the simplest GraalVM Example

The date call returns the time since the epoch in seconds followed by nanoseconds, while the Java code prints the time in milliseconds. This gives a rough way to see how long the JVM took to start up and reach the first statement: 719ms - 474ms = 245ms in this case. (Things happening on the command line account for some of that time; I think there's a way to get the JVM to report the startup time itself, so I will update.)
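As a follow-up to that rough shell-based measurement, the JVM can report its own startup time. Here's a quick sketch of mine (not from the original timing run) using the standard java.lang.management API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

public class JvmStartup {
    public static void main(String[] args) {
        RuntimeMXBean rt = ManagementFactory.getRuntimeMXBean();
        // getStartTime(): JVM start, in milliseconds since the epoch.
        // getUptime(): milliseconds elapsed since the JVM started.
        System.out.println("JVM started at (epoch ms): " + rt.getStartTime());
        System.out.println("JVM startup to this statement: " + rt.getUptime() + " ms");
    }
}
```

Printing getUptime() as the first statement of main gives a JVM-measured startup overhead without the shell-pipeline noise.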

To make a native image, you'll need gcc and a few libraries installed (glibc-devel, zlib-devel, and libstdc++), then install the Graal native-image component using gu:
gu install native-image

Next run:
native-image SimpleGraalExample
On my system, it took about 10s and used about 2GB of memory to produce the native image. Once done, you should have an executable file named simplegraalexample (all lower case - Linux/Unix standard). Run that:
date +%s%N; ./simplegraalexample  
Hello from the simplest GraalVM Example
The startup time has dropped to roughly 9ms. So: a single-file executable with a fast startup time. That's definitely useful for containers set to autoscale under load, especially if it shrinks startup times for larger applications that would normally take several seconds to load.

In this example, we've not really stretched the JIT aspect at all. Java has two tiers to the current JIT compiler - the C1 "quick and decent improvements" tier and the C2 "deeper optimizations that take longer" tier. Here's a description from the C++ JDK code:

 *  The system supports 5 execution levels:
 *  * level 0 - interpreter
 *  * level 1 - C1 with full optimization (no profiling)
 *  * level 2 - C1 with invocation and backedge counters
 *  * level 3 - C1 with full profiling (level 2 + MDO)
 *  * level 4 - C2

The idea behind the new JIT in Graal is to write it in Java so that it's easier to maintain and extend, and so that it avoids the memory issues of the current C++ implementation.

To test the JIT and Graal's performance more, use the CountUppercase example from the GraalVM site. I've added a call to System.currentTimeMillis() at the start.
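For reference, the benchmark has roughly this shape (a simplified reconstruction of mine, not the actual GraalVM source): timed iterations that repeatedly scan the argument text and count uppercase characters, which gives the JIT something to optimize.

```java
public class CountUppercaseSketch {
    static final int ITERATIONS = 10;
    static final int REPEATS = 1_000_000;  // enough work per iteration to trigger JIT compilation

    public static void main(String[] args) {
        String sentence = String.join(" ", args);
        long total = 0;
        long grandStart = System.currentTimeMillis();
        for (int iter = 1; iter <= ITERATIONS; iter++) {
            long start = System.currentTimeMillis();
            for (int i = 0; i < REPEATS; i++) {
                for (int j = 0; j < sentence.length(); j++) {
                    if (Character.isUpperCase(sentence.charAt(j))) total++;
                }
            }
            // prints per-iteration timing in the same style as the real example
            System.out.println(iter + " (" + (System.currentTimeMillis() - start) + " ms)");
        }
        System.out.println("total: " + total + " (" + (System.currentTimeMillis() - grandStart) + " ms)");
    }
}
```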
date +%s%N; java CountUppercase What kind of Performance would anyOne expect from Graal and the new JIT compiler
1 (687 ms)
2 (199 ms)
3 (159 ms)
4 (158 ms)
5 (140 ms)
6 (142 ms)
7 (169 ms)
8 (151 ms)
9 (141 ms)
total: 69999993 (2096 ms)

The time for the first vs second and subsequent iterations shows the effects of compilation. If you want to see the compilation happen, add -Dgraal.PrintCompilation=true to the java execution above.

Run this again with the new JIT compiler turned off, falling back to the default:
date +%s%N; java -XX:-UseJVMCICompiler CountUppercase What kind of Performance would anyOne expect from Graal and the new JIT compiler
1 (1051 ms)
2 (957 ms)
3 (954 ms)
4 (940 ms)
5 (943 ms)
6 (950 ms)
7 (942 ms)
8 (953 ms)
9 (958 ms)
total: 69999993 (9596 ms)

Below, I tried to "turn on" tiered compilation (C1 + C2) another way, but since tiered compilation has been the default behavior since Java 8, I'm not sure what it really did; the results look much less like -UseJVMCICompiler and much more like the Graal JIT.

date +%s%N; java -XX:+TieredCompilation CountUppercase What kind of Performance would anyOne expect from Graal and the new JIT compiler
1 (602 ms)
2 (279 ms)
3 (194 ms)
4 (235 ms)
5 (165 ms)
6 (220 ms)
7 (154 ms)
8 (271 ms)
9 (168 ms)
total: 69999993 (2451 ms)

To stop at level 2 JIT, use a command like this:
java -XX:-UseJVMCICompiler -XX:TieredStopAtLevel=2 CountUpper...

To see the native image performance:
native-image CountUppercase
date +%s%N; ./countuppercase What kind of Performance would anyOne expect from Graal and the new JIT compiler
1 (1467 ms)
2 (1227 ms)
3 (1222 ms)
4 (1211 ms)
5 (1219 ms)
6 (1216 ms)
7 (1233 ms)
8 (1252 ms)
9 (1277 ms)
total: 69999993 (12585 ms)
An almost instant startup, but we've lost steady-state performance here - seemingly a lot of it.

The native-image tool has a profile-guided code optimization that is only available in the enterprise version, and a tracing agent that helps identify which classes will be used. The latter can help reduce time by initializing classes at build time. To use the tracing agent, run the code normally and exercise the code paths. For this simple example, run it like this:
java -agentlib:native-image-agent=config-output-dir=./META-INF/native-image/ CountUppercase What kind of Performance would anyOne expect from Graal and the new JIT compiler
This will put four files into META-INF/native-image. In this case, the files were close to empty, so I knew this wouldn't make much of a difference. However, in a larger application, the trace will identify class usage better than static analysis.
Now, run:
native-image -cp ./META-INF/ CountUppercase
date +%s%N; ./countuppercase What kind of Performance would anyOne expect from Graal and the new JIT compiler
1 (1364 ms)
2 (1177 ms)
3 (1168 ms)
4 (1174 ms)
5 (1154 ms)
6 (1163 ms)
7 (1152 ms)
8 (1165 ms)
9 (1209 ms)
total: 69999993 (12008 ms)
Unfortunately, not much of a gain at all.

While this example runs slower as a native image, tests with a simple web app (using HttpServer) showed very little difference in performance either way, suggesting that for APIs and Spring Boot apps a native image would have little impact. (Will add code...)

Testing the Java SciMark program, here are the steps and results (javac and javac -O produced similar results, as you'd expect if the JIT rather than javac is doing most of the optimization):
javac jnt/scimark2/*.java
time java -classpath . jnt.scimark2.commandline
SciMark 2.0a
Composite Score: 1050.056478100177
FFT (1024): 1036.3878030994783
SOR (100x100):   857.5678798364443
Monte Carlo : 408.1116711801417
Sparse matmult (N=1000, nz=5000): 1338.152124290685
LU (100x100): 1610.0629120941362
java.vendor: GraalVM Community
java.version: 11.0.7
os.arch: amd64 Linux
os.version: 5.7.8-100.fc31.x86_64
real    0m31.560s
user    0m31.959s
sys    0m0.098s

Using the -XX:-UseJVMCICompiler option to turn off the new compiler:

time java -XX:-UseJVMCICompiler -classpath . jnt.scimark2.commandline
SciMark 2.0a
Composite Score: 1442.2801308185528
FFT (1024): 914.6516464184309
SOR (100x100):   1068.4530497649632
Monte Carlo : 664.6498065711132
Sparse matmult (N=1000, nz=5000): 1135.3139812371085
LU (100x100): 3428.3321701011487

Again, trying the +TieredCompilation option gave results more in line with the GraalVM JIT:
time java -XX:+TieredCompilation -classpath . jnt.scimark2.commandline
SciMark 2.0a
Composite Score: 1053.134652575841
I suspect that it's not turning off the new JIT.

Creating a native image and running it: first add a manifest file, create a jar, and then run native-image:
Main-Class: jnt.scimark2.commandline
jar cmvf META-INF/MANIFEST.MF scimark2.jar jnt/scimark2/*.class
java -jar scimark2.jar
native-image -jar scimark2.jar commandline # takes about 40s to run

SciMark 2.0a
Composite Score: 665.0761252118398
FFT (1024): 678.1650940478376
SOR (100x100):   877.8838572232665
Monte Carlo : 31.565788470303204
Sparse matmult (N=1000, nz=5000): 695.5266550424169
LU (100x100): 1042.2392312753743
java.vendor: Oracle Corporation
java.version: 11.0.7
os.arch: amd64 Linux
os.version: 5.7.8-100.fc31.x86_64

Again, the performance here has dropped (also note the change in java.vendor). While investigating this, I came across a comment from adinn that I'd summarize as: "why would you expect the static compiler to produce code as fast as the JIT?" He carries on with an excellent explanation. The reason I'd expect it to run as fast or faster than the JIT, though, is that C/C++ code built with static optimizing compilers runs very fast (faster than Java). Undoubtedly, there is more performance available. Overall, the ability to make native images (no JVM installation on the production system), fast startups, support for other programming languages, and the potential of the new Java-based compiler make GraalVM a very interesting project.

I haven't added stats for running GraalVM against a project that combines with large frameworks like Spring. I'll add that later. However, this is another place that Graal shines - by evaluating code paths and throwing away unnecessary code, the binaries are much smaller and start up much faster (as above) than normal.

Wednesday, May 27, 2020


What are Quaternions and why would anyone care?

Quaternions are the next step up from complex numbers. Complex numbers are real numbers with an 'imaginary' part. 'Imaginary' is in quotes because that's simply what we call it; it isn't fiction, it's the square root of -1, often represented by i (or j in electrical engineering). Written differently, i^2 = -1. In the early days, people considered this strange and useless, so 'imaginary' was fitting. By the way, if you go back far enough, people didn't see any point in 0 either.

A real number is represented by a - any real number like 2, 1.5781739, 10000, etc. A complex number is represented by a + bi, where a and b are real numbers and i^2 = -1. Complex numbers are very good at representing rotations (as in a circle around 0), which is why they're important in many scientific and engineering activities.

Quaternions are the next step up: a + bi + cj + dk, or a_0 + i*a_1 + j*a_2 + k*a_3. Here i^2 = j^2 = k^2 = -1. All good so far, but it gets more complex in that ij = k = -ji, jk = i = -kj, and ki = j = -ik. That leads to ijk = -1, which can be seen from ij = k and k*k = -1. The fact that ij = k and not -1 is a little confusing, but i, j, and k are more than just square roots of -1; they are directions in 'quaternion space'. Their behavior matches the standard vector cross product: i x j = k and j x i = -k.

What's interesting about quaternions is that they have almost all the properties of real and complex numbers in terms of addition, subtraction, multiplication, division, and inverses, except that multiplication is not commutative. In other words, for normal numbers a*b = b*a, but as we saw for quaternions, ij = -ji.
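These multiplication rules are easy to check in code. Here's a minimal sketch of mine using the Hamilton product formula, confirming that ij = k while ji = -k:

```java
public class Quaternion {
    final double a, b, c, d;  // represents a + b*i + c*j + d*k

    Quaternion(double a, double b, double c, double d) {
        this.a = a; this.b = b; this.c = c; this.d = d;
    }

    // Hamilton product: expand (a1 + b1 i + c1 j + d1 k)(a2 + b2 i + c2 j + d2 k)
    // using i^2 = j^2 = k^2 = -1, ij = k = -ji, jk = i = -kj, ki = j = -ik.
    Quaternion times(Quaternion q) {
        return new Quaternion(
            a * q.a - b * q.b - c * q.c - d * q.d,
            a * q.b + b * q.a + c * q.d - d * q.c,
            a * q.c - b * q.d + c * q.a + d * q.b,
            a * q.d + b * q.c - c * q.b + d * q.a);
    }

    public String toString() {
        return a + " + " + b + "i + " + c + "j + " + d + "k";
    }

    public static void main(String[] args) {
        Quaternion i = new Quaternion(0, 1, 0, 0);
        Quaternion j = new Quaternion(0, 0, 1, 0);
        System.out.println("i*j = " + i.times(j));  // k
        System.out.println("j*i = " + j.times(i));  // -k: multiplication is not commutative
    }
}
```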

In terms of their use, quaternions are useful for rotations in higher dimensions than the plane, where complex numbers are good. In many ways, the 4-dimensional approach resembles relativity theory, where time plus the three spatial coordinates are linked, or common differential equations where time and space are linked - especially Schrodinger's equation, where time and space are linked via a multiplier of i. For fun, quaternions can replace complex numbers to create higher-dimensional Mandelbrot sets. Don't stop there: there are also tessarines, another 4-dimensional extension of the complex numbers where j^2 = 1, and octonions, an 8-dimensional extension.

The story is that Hamilton had been struggling to extend complex numbers to triplets for a difficult problem, and the idea came to him while out on a walk with his wife by the Royal Canal in Dublin. Supposedly, he was so struck by it that he carved i^2 = j^2 = k^2 = -1 and ijk = -1 into the stone of a bridge.

Sunday, February 16, 2020

Other Useful Linux Commands

A collection of other bash and Linux commands ... or solutions

There's obviously no point in detailing all of the Linux/Unix/Bash commands available. Here are a few commands and solutions I wanted to remember.

Get motherboard or DIMM info:
dmidecode reads information from the DMI (Desktop Management Interface) table, which is closely related to the SMBIOS (System Management BIOS). It needs sudo to run:

dmidecode -t 4 # for CPU info
dmidecode -t 2 # for motherboard info
dmidecode -t memory # for all memory
dmidecode -t 17 # for sodimm information
dmidecode -t 16 # for motherboard info on memory 

There's also lshw, but it wasn't installed on my system.
Related commands: lspci, lsusb, lscpu, lsscsi, lsblk

(Other ls* commands of interest: lsmem, lslocks, lsns, lsipc, lslogins)

Weather on the command line:
curl wttr.in # one common option: the wttr.in service returns a text forecast

While loop failing:
I had a problem where I wanted to check ssh on a number of hosts. Simple - cat the file, pipe into while, use ssh with timeout to do the check like this:
cat host_list.txt | while read h; do timeout 3 ssh $h; done

It didn't work. It just hung on the ssh command until timeout cut it off. ssh seemed to be grabbing standard input (and maybe stdout) and interfering with the while loop's input (ssh -n, which redirects stdin from /dev/null, is one fix). Switching to a for loop helped with the input, but then ssh still didn't seem to be the best choice, nor did telnet to port 22. Netcat in a for loop worked best:
nc -v $host 22 

Problem solved and the hosts with ssh running were found....

ssh starting remote processes:
Here's another one that might be easy or not... With ssh, you can run a remote command:
ssh "ls -l"
What if you want to start a remote process and want it to keep running - something like this:
ssh "nohup java example.jar &"

The problem with the above is that it often won't work - the remote process won't keep running. This mostly has to do with terminal control: despite nohup saying that it is redirecting output to nohup.out, there's still a problem. One easy solution is to redirect all output explicitly:
ssh "nohup java example.jar > ./output_file.out 2>&1 &"

This grabs both standard out and standard error and puts them into output_file.out. Of course, another option is to create a systemd (or init.d) script and use that.

Watching for file changes: inotifywait will watch file system objects for events that you specify.
inotifywait -r -e access,modify /var/log
This will watch for access or modification updates on files in /var/log.
Combine inotifywait with a while loop, as in:
while inotifywait -e access /var/log; do
    # some shell commands here
done

The only downside to inotifywait is that inotify-tools often needs to be installed (yum/dnf install or apt install).

Monday, May 27, 2019

Quantum Computing

 Spinning coin - curvature is due to CMOS rolling shutter
One of the biggest conceptual and architectural changes to computing is quantum computing - using the nature of quantum mechanics to change the way calculations are done. Modern computers already use quantum mechanics in some fashion due to semiconductors. However, that use is mainly to create logic gates that mirror the gates once built with vacuum tubes and relays. In a quantum computer, the logic directly relies on key features of quantum mechanical probabilities: superposition and entanglement of states. Quantum computers are believed to be able to quickly solve problems that the fastest modern computers struggle with - hence 'quantum supremacy' (of quantum computers over classical computers for a type of problem). Theoretical work shows that widely used asymmetric key cryptography would be easily broken, meaning secure data like credit card or password information would be exposed. There are limits, though, as quantum computers won't solve problems that normal computers cannot solve at all. Richard Feynman, the noted physicist, first mentioned the underlying ideas in 1959, and in the 80s, Paul Benioff and Yuri Manin developed the ideas for these computers.

Some Background 

Probability is the likelihood of something happening. Flip a coin and the probability of it landing heads up is 1/2. That's the same probability for landing tails up. Added together, the probability for heads or tails is 1 (assuming that the coin won't land and remain stationary on its edge). We think of probabilities as values between 0 and 1. 0 is no chance of an event occurring while a value of 1 indicates it will occur. Usually, the value is somewhere in between. With quantum mechanics, this is more complicated - probability in these systems comes from squaring the probability amplitude. That amplitude can be positive or negative and combining amplitudes that are positive or negative will create regions of stronger or weaker amplitude (it's more interesting as the amplitudes are complex numbers). Square those modified amplitudes to get probabilities. This is how atoms bond together in molecules, liquids or solids - the electrons (quantum mechanical wave-particles) of an atom can have positive and negative amplitudes that combined with electrons of other atoms create bonding and anti-bonding combinations. In mathematical terms, classical probability is like:
Sum(p_i) = 1
where each p_i >= 0.
In quantum mechanical terms, probability is like this:
Sum(|a_i|^2) = 1
where each a_i is a complex number and can be negative

Superposition is an important concept in understanding quantum mechanical systems. It's the concept behind Schrodinger's Cat if you've heard of that. A quantum mechanical system can be composed of multiple states all at the same time. It's like a spinning coin - heads, tails, heads, tails. Only when you measure the system/state or check the coin do you get a definite answer. Measurement causes a collapse of the superposition of states into one state. Before measuring, the 'spinning' or unchecked state of the coin could be seen like this:
|Total State> = (|head> + |tail>)/sqrt(2)
Squaring the amplitudes gives the probabilities (the cross head-tail terms drop out on measurement):
1/2 Heads + 1/2 Tails = 1 (as the total probability needs to be 1).
While the measured state would be, if the coin is heads up:
|Total State> = |head>
with probability being 1 and Heads = 1.

Entanglement is when two quantum systems combine so that, again, there is a way of describing the total system as a sum of the two. Imagine that we had two coins and each must show the opposite of the other; if one is heads, the other has to be tails. These two coins would be entangled - and while the coin analogy is simplistic, this does happen with electrons and something called spin. Two electrons can be in the same 'state' as long as one is spin up and the other is spin down. (I put 'state' in quotes as the total state of each electron actually includes the spin.) If you measure one of the coins, checking whether it's heads or tails, then you instantly know what the other coin is. Einstein referred to this as 'spooky action at a distance' (see a description of EPR). Recent experiments have shown this holds for particles separated by hundreds of miles that are still entangled. Entropy (information-theoretic entropy) can also be used as a measure of entanglement (more entropy means more entanglement). To say this another way: for entangled, think correlated in probability - the outcomes are related. If two probabilities are uncorrelated, then an outcome is simply the product of the individual probabilities (P_coin1(heads) times P_coin2(tails)). Correlated outcomes aren't that simple and can't be decomposed like that; entanglement makes the system an inseparable whole. If heads is 1 and tails is 0, then the two entangled states would be: |01> + |10>.
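To make "can't be decomposed" concrete, here's a small sketch of mine (plain arrays, real amplitudes only): the entangled state (|01> + |10>)/sqrt(2) assigns zero probability to outcomes where the 'coins' match, something no pair of independent coin probabilities can reproduce as a product.

```java
public class EntanglementSketch {
    public static void main(String[] args) {
        double s = 1.0 / Math.sqrt(2.0);
        // Amplitudes over the two-coin basis {00, 01, 10, 11} for (|01> + |10>)/sqrt(2).
        double[] amp = { 0.0, s, s, 0.0 };
        String[] basis = { "00", "01", "10", "11" };
        for (int i = 0; i < 4; i++) {
            // Probability is the squared amplitude.
            System.out.printf("P(%s) = %.2f%n", basis[i], amp[i] * amp[i]);
        }
        // Two independent fair coins would assign 0.25 to every outcome; no choice
        // of P_coin1 and P_coin2 gives {0, 0.5, 0.5, 0} as a simple product.
        System.out.printf("independent coins: P(00) = %.2f%n", 0.5 * 0.5);
    }
}
```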

If this is a little confusing, there's Feynman's comment: "I think I can safely say that nobody understands quantum mechanics." That probably isn't literally true, as Feynman understood it, but it's not easy to understand. The important part is that quantum mechanics allows parts of a system to interact in such a way that 'calculations' by the system can use the entangled probabilities to reach a solution much faster than a classical computer.

Quantum Computing

Normal computers have bits, 0s and 1s, in registers or memory and use these to perform calculations with normal digital logic gates. These 0s and 1s are held in memory with electricity and have to be refreshed frequently to keep them at 0 or 1. In a quantum computer, the qubit is the basic unit. As above, a qubit doesn't hold a 0 or 1 but a superposition of 0 and 1 - it holds those amplitudes mentioned above. Having a larger number of qubits means that they can be entangled, and entangling them means the group of qubits can hold all possible states at once. Two qubits can hold the states 00, 01, 10, 11 all at once; a classical computer would have to hold these as 4 separate states, requiring more bits. Ultimately, the number of possible states that a quantum computer can hold at once scales as 2^n, where n is the number of qubits: two qubits allow a superposition of 4 states, three qubits allow 8 states, ..., n qubits allow 2^n states. A normal computer would need 2^n bits of memory for the same number of states. Qubits hold the state of the interaction while reducing the memory required.

The result of a quantum calculation with n qubits would only return an n 'bit' answer even though the calculation ran over 2^n bit space. Again, with n bits in a classical computer, n amount of information can be encoded; a quantum computer will span 2^n space, but only return n bits of information.
This rapid increase in possible states and the ability to cover all of the states is what leads to the potential speed up in quantum computers.

Calculations are done by 'loading' the information into the qubits, influencing their state, much like setting the state of memory in a normal computer even if the method is very different. Depending on the type of quantum computer, this loading is done with microwaves or lasers. Operations are reversible and operate on the quantum mechanical analog of a normal n-bit register; classical computer gates are not reversible - you can't know the input from the output. (In normal computers, the NOT gate is reversible; also normal computers could be built with Toffoli gates which can be reversed using extra memory.) One of the most well-known quantum computing gates is the Hadamard gate which transforms specific states into a superposition of states and back again. As an example, in a classical computer running a probability calculation (random/stochastic/probability/transition matrices), the goal is to preserve that the sum of the probabilities = 1. In a quantum computer, operations like the Hadamard gate (unitary matrices) preserve that sum of squares of the amplitude = 1. Positive and negative amplitudes combine to increase or decrease the probabilities of some outcomes while keeping the overall probability of an outcome correct. The entanglement of the qubits means that affecting one affects all - in the two coins example above, setting one coin to heads meant the other was tails. Another analogy is pulling a red ball out of a bag of red and green balls - a normal computer would pull one, but quantum entanglement means that pulling one is like pulling two or more out.
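A tiny sketch of mine illustrates the "preserve the sum of squares" point (real amplitudes only; actual quantum amplitudes are complex): applying the Hadamard gate to |0> yields an even superposition, and applying it again makes the positive and negative amplitudes cancel, returning |0>.

```java
public class HadamardSketch {
    // H = (1/sqrt(2)) * [[1, 1], [1, -1]], applied to a qubit's two amplitudes.
    static double[] hadamard(double[] amp) {
        double s = 1.0 / Math.sqrt(2.0);
        return new double[] { s * (amp[0] + amp[1]), s * (amp[0] - amp[1]) };
    }

    public static void main(String[] args) {
        double[] zero = { 1.0, 0.0 };       // |0>
        double[] plus = hadamard(zero);     // superposition: amplitudes 1/sqrt(2) each
        System.out.printf("after H:  P(0) = %.2f, P(1) = %.2f%n",
                plus[0] * plus[0], plus[1] * plus[1]);
        double[] back = hadamard(plus);     // amplitudes cancel: back to |0>
        System.out.printf("after HH: P(0) = %.2f, P(1) = %.2f%n",
                back[0] * back[0], back[1] * back[1]);
    }
}
```

At every step, the squared amplitudes sum to 1, which is exactly what a unitary operation guarantees.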

Again, the quantum computer qubits hold all values until measured; a classical computer would hold one value in the same number of bits. Measurements are then done (again with microwaves or lasers) to 'collapse' the 2^n states into one that is the result with n bits of info (the original number of qubits). Measurement is not reversible.

Issues with Quantum Computers
Due to the nature of quantum computing, the outcomes are probabilistic - they only provide a valid solution with some probability. This is true in normal computers, too, but decades of work have reduced the error to near 0 so that we trust the outcome; almost no one thinks about error correction when using a normal computer these days. Error correction in quantum computing is in its infancy. The computers have to be kept super cold and isolated as much as possible, as any noise can affect the calculation. Realistically, calculations need to be run several times to be certain of the result. The errors aren't just value errors as in a classical computer, but errors in the way the qubits interact. While a quantum computer is built to have its qubits interact with each other as much as possible (entanglement), it's also built to have them interact with the environment as little as possible! Noise can cause decoherence, which means the qubits stop interacting as one group. Environmental issues like heat, noise, and electromagnetic or other radiation all work against the system. The redundancy of calculations needed for error correction has been estimated to be as high as hundreds, thousands, or even millions of times; fixing this is a key area of research. A recent advance reduced the estimated number of qubits needed to crack 2048-bit PKI, with error correction, from billions to millions. Scaling the lifetime of qubits before decoherence is also key - IBM in 2017 achieved 90 microseconds for 50 qubits.

Scaling quantum computers is another challenge, beyond eliminating environmental factors. They need to be scaled in coherence time and number of qubits as well as programming techniques. A small number of qubits easily fit close to each other, but when the number of qubits grows very large, the distances between them become an issue. In terms of programming, the same trade-off facing classical computers exists - simple hardware with complex languages or complex hardware with simple languages (or the worst of both). The difficulty with quantum computers is the complexity of understanding the operations: a computer with n qubits might require n^3 logic 'gates' or operations to achieve the results.

Current Work

There are two main types of quantum computers now: ion/atom traps and superconducting circuits. Ion traps use lasers to write, read, and trap the ions suspended in electromagnetic fields. The ions in the computer interact in a way that you could think of as vibrating together. Similarly, atoms can be trapped in an optical lattice.

The other main type is superconductor-based. These are often modeled on SQUIDs (superconducting quantum interference devices), small rings of current, with oscillating microwaves used to set and read values. The devices are connected together via radio-frequency channels, which can be part of a printed circuit board. A company called D-Wave has been working on quantum computers for years. They structure the computer to match the problem, so it requires almost no code; their computers don't connect all qubits together and can't solve all types of problems. Their machines resemble annealing systems, using transverse tunneling fields (analogous to temperature in real annealing). Google and IBM are working on more generic computers that link all qubits. In 2017, IBM produced systems with up to 50 qubits; in 2018, Google raised that to 72 qubits. Compared to classical computers doing simulated annealing or quantum Monte Carlo calculations, Google says quantum annealing is 100M times faster. IBM offers a test quantum computing environment if you want to sign up.

Quantum supremacy is when a quantum computer is able to solve a classical computer-science problem faster than a supercomputer (or any classical computer). It's uncertain how large a quantum computer will have to be to achieve this; some expect it soon, while others doubt it will happen. To be clear, quantum computers will solve the same types of problems that classical computers can - they're both Turing machines - they'll just solve some of them much faster. Problems a classical computer can't solve at all, like the halting problem, will be unsolvable by a quantum computer too, and NP-complete problems aren't expected to become easy either. In acronyms, quantum computers correspond to BQP (the bounded-error, quantum, polynomial-time set), thought to be bigger than BPP (the bounded-error, probabilistic, polynomial-time set). In particular, quantum computers should be good at factoring large numbers, calculating optimized paths (like courier routes), quantum search (Grover's algorithm), and physics and chemistry problems around the quantum nature of the universe - nature isn't 'classical', it's quantum mechanical.

With the ability to factor large numbers quickly, quantum computers would crack traditional cryptography easily. In 1994, Shor developed an algorithm that can quickly factor large numbers or compute discrete logarithms. This algorithm, based on modular exponentiation and quantum Fourier transforms, can find the factors behind public key cryptography (PKI, for public key infrastructure) and render PKI insecure. In 2016, NIST requested submissions of quantum-resistant cryptography algorithms; currently, 26 algorithms are in the later stages of evaluation. The key weakness of PKI is the asymmetric-key portion, which relies on large prime numbers or discrete logarithms being difficult for classical computers to factor. Symmetric keys are much more resistant to quantum computing, but Grover's algorithm does make them susceptible - thankfully, doubling the symmetric key size is often enough in a post-quantum world.
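To see the number theory Shor's algorithm speeds up, here's a classical sketch of mine for the toy case N = 15 with base a = 7: find the period r of a^x mod N (the step a quantum computer does quickly; brute force here), then gcd(a^(r/2) +/- 1, N) yields the factors.

```java
import java.math.BigInteger;

public class PeriodFactorSketch {
    public static void main(String[] args) {
        BigInteger N = BigInteger.valueOf(15);
        BigInteger a = BigInteger.valueOf(7);

        // Find the period r: the smallest r > 0 with a^r mod N == 1. This brute-force
        // loop is exactly the step Shor's algorithm replaces with a quantum computation.
        int r = 1;
        while (!a.modPow(BigInteger.valueOf(r), N).equals(BigInteger.ONE)) r++;
        System.out.println("period r = " + r);  // r = 4 here, and it's even as required

        // With even r, gcd(a^(r/2) - 1, N) and gcd(a^(r/2) + 1, N) give the factors.
        BigInteger half = a.modPow(BigInteger.valueOf(r / 2), N);
        BigInteger f1 = N.gcd(half.subtract(BigInteger.ONE));
        BigInteger f2 = N.gcd(half.add(BigInteger.ONE));
        System.out.println(N + " = " + f1 + " x " + f2);  // 15 = 3 x 5
    }
}
```

For a real RSA modulus the period-finding loop is hopeless classically; that single step is where the quantum Fourier transform earns its keep.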

Some approaches to post-quantum cryptography include one time keys (hashes or OTP), simultaneous/multivariate equations, lattice-based and error correcting codes. All rely on the difficulty of decoding the information without all the input or the time it takes to decode. Current PKI also relies on the hardness of solving problems. It's possible a classical computer can crack a PKI key as well, but the algorithms have been designed to make it take thousands of years.

The drive to improve quantum computers is growing, as being the first with a viable quantum computer matters financially and strategically. Microsoft has entered the race with Google, IBM, D-Wave, and researchers around the world. There's a course from Microsoft and Google to learn to program quantum computers, IBM has a system online for learning more, Google has its own playground and scripting language, Microsoft has announced the Q# language, and researchers at labs are driving this forward.

Sunday, April 28, 2019

Sizes of Large Directories with Gluster, SSHFS, NFS

This post covers a few things: checking the number of entries or rough sizes of a directory, a look at the behavior of Gluster, NFS, SSHFS, and SFTP for remote directory sizes, and some info on mounting remote file systems using NFS or SSHFS (Gluster is a topic for another day).

First, how to check the rough size of a directory. We know that checking the number of files in a large directory locally can be slow. Doing that check over a remote mount like Gluster can be much slower and even cause the Gluster mount to crash.
Generally, under Linux/Unix, you can get a rough estimate of the size of a directory by looking at the output of ls -l in the parent directory of that directory. For example, let's use a test directory to check behavior: ~/test. We've added two subdirectories dir_small with no files in it and dir_large with 5000 files in it. ls -l ~/test gives:
ls -l ~/test
drwxrwxr-x. 2 jm jm 143360 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

Here dir_small has the smallest size possible and dir_large is larger due to the 5000 files in it. Remember that in Linux/Unix, a directory is just a special type of file that keeps information about the list of files in that directory. 

To count the files, or just to get a rough guide as to which is the largest directory, the safest options are:
ls -l ../ # only a rough guide; check the parent directory to see the directory 'size'
find dir_large -type f -print | wc -l # lists files one by one
echo * | awk -F ' ' '{print NF}' # counts the files listed on one line (run inside the directory)
ls | wc -l # performs more ops than echo * (as checked by strace)
If you think you have a large directory, don't do ls -l in it, as that will stat each file, requiring far more ops and time.
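As a small, self-contained sketch of these counting options (the directory and file names here are made up for illustration):

```shell
# Create a scratch directory with a known number of files.
scratch=$(mktemp -d)
for i in 1 2 3 4 5; do touch "$scratch/file_$i"; done

# find prints one path at a time, so it stays cheap even for huge directories.
count_find=$(find "$scratch" -type f -print | wc -l)

# ls without -l avoids stat-ing each file; still more ops than echo *.
count_ls=$(ls "$scratch" | wc -l)

echo "find: $count_find, ls: $count_ls"
rm -rf "$scratch"
```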
I point all of this out because we tried this using Gluster and hit some issues. In case you don't know, Gluster, or GlusterFS, is a clustered file system. We've used it for stand-alone clusters within our data centers (DCs). It can do cross-DC replication, although we've found too many problems with that, so we don't use that feature.

We had an issue the other day with a system mounting a Gluster volume and we wanted to check on files in a large collection of directories.

Interestingly, Gluster's directory sizes don't show the usual file size info. For the same directories as above, it showed:
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

Running other commands like ls and find can be painful with a large Gluster volume. This has given it a reputation for being almost unusable for certain interactive operations like ls, find, du, etc. Ultimately, the solution was to use find and its one-file-at-a-time checking; the alternative is to run these commands on the Gluster server rather than the client system.

Since Gluster has a different meaning for directory size, we thought we'd see whether that comes from Gluster or from the FUSE system that it uses. Interestingly, the Gluster brick - the file system being exported for remote mounting - shows the normal directory size. That would be expected, as it's a normal Linux file system. We also tried NFS, which passes through the usual directory size:
drwxrwxr-x. 2 jm jm 143360 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

NFS doesn't use FUSE, though, so this only confirms that normal directory sizes are passed through by other remote file systems.

We also tried SSHFS, which uses FUSE for mounting. Just like NFS, SSHFS showed the right directory sizes:
drwxrwxr-x. 2 jm jm 143360 Apr 26 16:41 dir_large
drwxrwxr-x. 2 jm jm   4096 Apr 26 16:56 dir_small

By the way, SFTP also showed the normal directory sizes, but you might have guessed that by now.

So it looks like Gluster is the issue here: it has chosen to re-interpret the meaning of a directory size.

Information on mounting NFS and SSHFS file systems
Instructions for SSHFS:
SSHFS isn't installed on all systems. On my test system (Fedora 29), I needed to install with
sudo dnf install fuse-sshfs
# make a mount point
mkdir ~/test/mount_point
# to mount, do:
sshfs remote_user@remote_host:/remote_directory ~/test/mount_point
# to unmount, do:
fusermount -u ~/test/mount_point

Instructions for NFS:
# edit the export file and add your export directory
vi /etc/exports
# add /home/jm/test localhost(ro) *.local.domain(ro)
# start nfs-server
sudo systemctl start nfs-server
sudo exportfs -a
# check the exported file system is available from the local or client system:
showmount -e test-nfs-server.local.domain
showmount -e localhost
# make a mount point
mkdir /tmp/mount_point
sudo mount -t nfs test-nfs-server.local.domain:/home/jm/test /tmp/mount_point
# to unmount, use the usual umount:
sudo umount /tmp/mount_point
# on the nfs server, un-export the fs:
sudo exportfs -ua
# stop nfs if you want:
sudo systemctl stop nfs-server

Monday, April 22, 2019

MySQL Galera Split Brain

The Galera cluster option for MySQL is one advantage MySQL has over Postgres. The clustering allows high availability and good performance. However, it's not without its issues, and one of those is the split-brain problem. There can be two different kinds of split brain: a poorly configured cluster that can't achieve a quorum for the split you're worried about and, more rarely, an update issue with the nodes, which is interesting if frustrating.

MySQL Galera clusters have worked well and provided good uptime. The normal configuration is to have more active nodes in the primary location or data center. In the event of the link between the primary and secondary sites failing, the primary cluster should continue to run. It will continue running provided it has the majority of quorum votes on its own. It is important, therefore, to make sure that the quorum is achievable without the secondary location. One option is to keep the number of active nodes higher in the primary location. Another is to adjust the pc.weight of each node to make sure that the weight is larger in the primary location. See the Galera docs about setting the weight of a node. Either of these options makes the primary location safe from failures of the other locations, but still presents a problem if the primary data center has an issue. The remaining option is to use a 3rd location or witness to break ties or provide a quorum - you could do that with your own software, with a full set of servers in a 3rd location or use Galera's own solution. Galera's solution is garbd, the Galera Arbitrator, which acts as a witness or voting system when you only really have two main locations.
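As a sketch of the weighting approach, pc.weight can be set through wsrep_provider_options in each node's MySQL config. The values below are hypothetical, chosen so the primary site always outweighs the secondary even if one primary node is down:

```ini
# my.cnf on each primary-site node (e.g. 2 nodes x weight 2 = 4 votes)
[mysqld]
wsrep_provider_options="pc.weight=2"

# my.cnf on each secondary-site node (e.g. 2 nodes x weight 1 = 2 votes)
[mysqld]
wsrep_provider_options="pc.weight=1"
```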

The second split brain issue is more interesting - i.e. it isn't a simple configuration or quorum issue. In this one, a Galera cluster shuts down on its own after detecting an issue. All of the active nodes except one would report something like this:

 "Duplicate entry 'entry_value_being_inserted' for key 'Key_for_column', Error_code: 1062 "
It might include "handler error HA_ERR_FOUND_DUPP_KEY" as well.

The issue here is that the Galera replication has pushed updates to every node. The replication pushes the change synchronously but applies it asynchronously - flow control is used to prevent nodes from getting too far behind (see 'commits asynchronously' in the Galera docs). As you'd expect, it's RBR - row-based replication, not statement-based. What happens here is a case of a node falling behind. Each of the other nodes sees an inconsistency and, in order to protect the cluster, shuts itself down. Unfortunately, this can mean every node shuts down except the one lagging, inconsistent node. With only one node active, Galera will realize it doesn't have the majority of votes to maintain the cluster and shuts the remaining node down. To recover the cluster, you need to find the last node that was running and start it with the bootstrap option, then start every other node as normal. This issue doesn't happen often, but it's good to understand it and how to recover when it does happen. By the way, there are some related issues that can behave the same way; see this link for fixing this and related issues.
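A sketch of finding the bootstrap candidate: each node records its last applied transaction sequence number in grastate.dat, and the node with the highest seqno is the one to start with the bootstrap option. The file contents and paths below are made up for illustration (the real file normally lives in the MySQL data directory):

```shell
# Fake grastate.dat files for two nodes of a stopped cluster.
cat > /tmp/grastate_node1.dat <<'EOF'
# GALERA saved state
version: 2.1
seqno:   1523
safe_to_bootstrap: 0
EOF
cat > /tmp/grastate_node2.dat <<'EOF'
# GALERA saved state
version: 2.1
seqno:   1530
safe_to_bootstrap: 0
EOF

# Pull out each node's seqno; the highest is the most up to date.
seqno1=$(awk '/^seqno:/ {print $2}' /tmp/grastate_node1.dat)
seqno2=$(awk '/^seqno:/ {print $2}' /tmp/grastate_node2.dat)
echo "node1=$seqno1 node2=$seqno2"
# Here node2 is the most advanced: bootstrap it first, then start the others.
```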

Saturday, April 20, 2019

ActiveMQ Network of Brokers Again

Network of Brokers with ActiveMQ
In the past, I've written about the ActiveMQ network of brokers and some of the issues that can come up (see my earlier posts, for example).

That was some years ago, so what is it like now? Overall, it's very good and reliably works between different data centers over a WAN link. It also provides for an important use case: the ability to write in one location and read anywhere. This write-read is done via the broker 'local' to the apps, so that if both the producer and consumer are in the same location, it's handled 'locally' without having to commit to all brokers (or a quorum) everywhere first. That can have important performance benefits. However, there does seem to be a limit to the number of queues that can be bridged across a network of brokers - I don't know the exact number or what affects it, but assume a few hundred at most. The good news is that you can have a number of network connectors between two brokers, which means you can bridge far more than a few hundred queues between the same pair of brokers. The key is to create new networkConnectors as needed to handle groups of related (or similarly named) queues or topics. Here's an example of different groups of queues per networkConnector within the networkConnectors part of the ActiveMQ configuration XML:

Here's an example of the network connector config (note that each networkConnector needs a closing tag, and included destinations go inside a dynamicallyIncludedDestinations element):
        <networkConnectors>
                <networkConnector name="QUEUE_SET_1" duplex="true" uri="static:(tcp://192.x.x.x:61616)">
                        <dynamicallyIncludedDestinations>
                                <queue physicalName="stock.nasdaq.>"/>
                        </dynamicallyIncludedDestinations>
                </networkConnector>
                <networkConnector name="QUEUE_SET_2" duplex="true" uri="static:(tcp://192.x.x.x:61616)">
                        <dynamicallyIncludedDestinations>
                                <queue physicalName="stock.nyse.>"/>
                        </dynamicallyIncludedDestinations>
                </networkConnector>
        </networkConnectors>

Splitting out groups of queues like this allows the brokers to bridge more queues (or topics - the same applies to them) between the brokers without any issues. For another example, look at the Duplex Connector example in the ActiveMQ docs or in this article.