High availability is a concept that says your users will see zero downtime, even when individual pieces of the system fail.
Consider an old world situation. You have a server that is serving exactly one website. On that server you are running an Operating System, a database engine, a web server (Apache/Nginx), an interpreter, and a bunch of code and HTML.
In your browser (client) you type “http://www.awa-example.com”. This causes your computer to send a request to a DNS server to translate “www.awa-example.com” into an IP address. Dozens of computers, working in a distributed fashion, cooperate to produce that answer. Your browser then opens a TCP/IP connection to the address it was told to use.
When that connection finishes traveling across multiple routers, it arrives at my server. My server examines the packet and determines that it is addressed to the web server. The web server looks at the request and sees that it is for www.awa-example.com. It looks through its configuration files, decides which interpreter to use, and hands the request to that interpreter.
The interpreter loads its configuration files and the code to execute. That code runs and opens a connection to the database engine. The code queries the database, the database returns a result, the code formats it and sends a response back to your browser, which displays it.
If any link in that long chain of computers and software fails, your browser doesn’t get an answer to display.
We have a “demark” (a demarcation point) that marks the boundary of responsibility. Anything outside the demark is “their” problem. Anything inside the demark, including the demark itself, is our issue.
What that means is that in a high availability system, there have to be at least two of everything. On the outside, we must have two “theirs” and two demarks. Linode provides us those multiple “theirs” and demarks. If one link into their data center dies, the others take up the load and everything continues as if nothing were amiss.
If we are worried about the data center, they offer data centers all around the world. We are happy with just one data center.
At their demark they send traffic to one of two “node balancers” we have purchased. These are in different racks. Each is capable of handling all traffic into our cluster. If they need to update a node balancer, whether the software or the hardware, they can update one, wait until it is back up and running, then update the other. They can physically turn off one rack, and we won’t even notice.
We use a cluster to support our clients. There are six nodes (servers) in our cluster: four are ours, two are theirs. Their nodes run the “control-plane”, which is what controls our cluster. When we tell our cluster to do something, it is the control-plane that orchestrates the other nodes.
We run two ingress pods. The node balancers send traffic directly to the nodes running those pods, round-robin. If we need to upgrade our ingress, the cluster will create a new ingress pod, make sure it is up and running, then terminate one of the old ingress pods. It then launches another ingress pod; when that is up and running, it terminates the last old one.
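Roughly, in Kubernetes terms, that rollout behavior is just a Deployment with two replicas and a rolling-update strategy. The manifest below is an illustrative sketch, not our actual config; the names, image tag, and health probe follow the standard ingress-nginx setup and may differ from what we run:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-nginx-controller      # illustrative name
  namespace: ingress-nginx
spec:
  replicas: 2                         # two ingress pods, as described above
  selector:
    matchLabels:
      app: ingress-nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                     # bring up one new pod at a time
      maxUnavailable: 0               # never kill an old pod before its replacement is ready
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.9.4   # example version
          ports:
            - containerPort: 80
            - containerPort: 443
          readinessProbe:             # “up and running” before the old pod is terminated
            httpGet:
              path: /healthz
              port: 10254
```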
There is NO downtime as this happens.
The ingress handles external SSL and sends internal traffic on to services. The cluster receives traffic at the service and forwards it to whichever pod is providing the service. If we run at least two pods, we will not have downtime. We set things up so that pods run on different nodes, if possible, so a node failure doesn’t take down all the pods.
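Here is what that looks like as a sketch, assuming a plain Service plus a two-replica Deployment with anti-affinity. All names and the image tag are illustrative, not our production manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: wordpress                # illustrative name
spec:
  selector:
    app: wordpress               # forwards to whichever pods carry this label
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress
spec:
  replicas: 2                    # at least two pods so one failure doesn’t take the site down
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      affinity:
        podAntiAffinity:         # “pods run on different nodes, if possible”
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: wordpress
                topologyKey: kubernetes.io/hostname
      containers:
        - name: wordpress
          image: wordpress:6.4-apache   # example tag
          ports:
            - containerPort: 80
```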
Which brings us to the tail end of all of this. Our pod.
Our pod, in this case, is running WordPress. If the pods can mount a file system in ReadWriteMany mode, then multiple pods can access the same files at the same time. WordPress has a content directory (wp-content). These are the files we upload: images, PDFs, videos, themes, and a boatload of other things. We want that directory accessible by all our WordPress pods.
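What we want is a claim along these lines, mounted into every WordPress pod from the deployment sketched above. The storage class name is a placeholder; it has to come from something that actually supports ReadWriteMany, which is the whole problem described next:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wp-content                 # illustrative name
spec:
  accessModes:
    - ReadWriteMany                # every WordPress pod mounts the same files
  resources:
    requests:
      storage: 20Gi                # example size
  storageClassName: some-rwx-class # placeholder; nothing on Linode provides this out of the box
```

Each pod would then mount that claim at /var/www/html/wp-content.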
We don’t have to worry about the database; that runs in its own cluster. If one of the database engines dies, the others take over with no loss of function. We run a hot-spare style of replication. We could use a multi-master setup, but it isn’t worth it at this time.
And this brings us to The Issue. Linode provides us with persistent volumes. This works perfectly for many situations. Unfortunately, those persistent volumes are ReadWriteOnce. This means that only one pod (on one node) can mount the files at a time.
Since there is only one pod, there is no redundancy. If that pod fails, the site goes down. If the node the pod is running on fails, the pod fails and the site goes down.
On good days, when a pod fails, it is restarted and the replacement is up and running shortly thereafter. Downtime is low, but not zero.
Linode isn’t going to offer a ReadWriteMany option anytime soon.
Which brings me to Ceph!
Ceph is a distributed block storage system with the ability to run a distributed file system (CephFS) on top of that block storage.
All I should need to do is deploy it to my cluster. Sure, if I want to buy three more nodes/servers and a bunch of disk space for them. Think thousands of dollars per month.
But there is a Kubernetes operator for Ceph called “Rook”. It can even use persistent volumes. After a few days of fighting this on my local Kubernetes cluster, I finally got it working. All that was required was to deploy it to Linode.
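For the curious, this is roughly what “it can even use persistent volumes” looks like: Rook’s CephCluster resource can build its storage daemons on top of ordinary volume claims. The sketch below uses Linode’s storage class with placeholder sizes and versions; note the volumeMode: Block line, which is where the next paragraph goes wrong:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18              # example version
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                                  # three monitors for quorum
  storage:
    storageClassDeviceSets:                   # OSDs backed by ordinary PVCs
      - name: osd-set
        count: 3
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              storageClassName: linode-block-storage   # Linode’s CSI storage class
              volumeMode: Block                        # raw block mode — the sticking point
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 100Gi               # example size
```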
12 hours of fighting and I finally got it mostly functional. Until I went to allocate the block storage for Ceph. Linode doesn’t allow raw block persistent volumes (volumeMode: Block)! ARGH! I’m stopped.
Then around 2300 Monday I got it. I took the same Linode volumes that back their persistent volumes and attached them directly to my nodes as block devices. Amazing! It works.
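In Rook terms, that means pointing the CephCluster at devices on the nodes instead of at volume claims. A hedged sketch, since the actual device names will vary per node:

```yaml
  # replaces the storageClassDeviceSets section in the CephCluster sketch above
  storage:
    useAllNodes: true          # let every worker node contribute storage
    useAllDevices: false
    deviceFilter: ^sdc         # hypothetical pattern matching the attached Linode volume
```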
Today I got it all configured and running. I will be upgrading GFZ to be an HA site in the upcoming weeks.
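The payoff, sketched with illustrative names: Rook runs a CephFS filesystem on top of that block storage and exposes it as a storage class, and the wp-content claim from earlier can finally ask for ReadWriteMany by pointing at it.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: wp-fs                   # illustrative name
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3                   # three copies across the cluster
  dataPools:
    - name: replicated
      replicated:
        size: 3
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com    # Rook’s CephFS CSI driver
parameters:
  clusterID: rook-ceph
  fsName: wp-fs
  pool: wp-fs-replicated
  # plus the standard Rook CSI secret parameters, omitted here for brevity
reclaimPolicy: Delete
```

Set storageClassName: rook-cephfs on the wp-content claim, run two or more WordPress pods, and the last single point of failure is gone.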
The world is better.
Ummm, ookay, very good, nice work. Awa, now if you could make a digital time machine to transport us to the time of the Wild Wild West, I believe most of us would be on board. I’d like a window seat please.