The alarm went off, I opened my blurry eyes and reached for my phone. Click… click… 503?!?!!?
I started looking. I logged into my server from my phone, clicking away to get a status. The database engine was stuck in a CrashLoopBackOff.
About that time, I noticed that Miguel had contacted me with a very polite “503?” Whiskey Tango Foxtrot.
As I have talked about, I’m upgrading the infrastructure that GFZ uses. The previous round of downtime resulted in me opening tickets with Linode and escalating to the point where less than a week ago I got an update, “We resolved the issue you reported”. They had known about the issue for over a year. It just wasn’t important enough to fix until their client, me, raised a fuss.
One of the side effects of this upgrade process is that I’ve had to increase the number of nodes and the size of nodes. All of that is going well.
It is unclear to me why the database engine crashed, only that it did.
To that end, I have removed that database engine from production and moved all the data to the larger, more stable database engine. This database engine is using the new persistent (Ceph) storage engine. While it is not “crash proof”, it is less prone to failures because of the way the data is now stored.
In addition, it is much easier to get backups of the data.
I’m going to take the plunge later today and move the assets from the storage it is currently using to the new storage system. This offers numerous benefits, not the least of which is that I can do rolling upgrades of the software.
Yesterday I upgraded ‘WordPress’ on multiple sites. With the new infrastructure being used by some of those sites, there was zero downtime. K8S started a new pod with the new software. When it was stable, it terminated one of the old pods. It then started a second pod with the new software. When it was stable, it terminated the last old pod. Zero downtime.
For GFZ, which is still on the older infrastructure, the old pod was terminated first and the new pod was started afterward. Service resumed only once the new pod was stable, so there was a short outage.
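The difference between those two upgrade paths can be sketched as a toy simulation. This is only a Python illustration of the ordering described above, not what K8S actually executes. The function names are made up for the example, and the pod counts (two for the sites on the new infrastructure, one for GFZ) are assumptions read off the description.

```python
def rolling_update(replicas):
    """Simulate a one-at-a-time rolling update.

    Returns the number of pods serving traffic after each step.
    Hypothetical helper for illustration only.
    """
    old, new = replicas, 0
    serving = []
    while old > 0:
        new += 1                    # start a new pod and wait until it is stable
        serving.append(old + new)
        old -= 1                    # only then terminate one old pod
        serving.append(old + new)
    return serving


def recreate_update(replicas):
    """Simulate terminate-first behavior: old pods go away before new ones start.

    Returns the number of pods serving traffic after each step.
    Hypothetical helper for illustration only.
    """
    serving = [0]                   # all old pods terminated, nothing serving yet
    serving.append(replicas)        # new pods started and stable, service resumes
    return serving


if __name__ == "__main__":
    print(rolling_update(2))        # [3, 2, 3, 2] -> never drops below 2
    print(recreate_update(1))       # [0, 1]       -> a window with 0 pods serving
```

The rolling path never dips below the original pod count, which is where the zero downtime comes from. The terminate-first path always passes through zero.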
Regardless, I’m hoping for a quiet day.
AWA, Thank You for what you do for all of us out here in the aether
I have done support for big and small clients as a contractor and employee. In almost all cases, management refuses to tell users/clients why something went wrong. It is easier for them to just say “It’s fixed” than to write a report for the outside world to read.
I am currently fighting one long-term issue and one short-term issue. The long-term issue is that I have to have a ReadWriteMany capability. It resolves many stability issues. The short-term issue is that, having created a ReadWriteMany option, I am having nodes go OutOfMemory. When that happens, the node terminates pods. I upgraded all of my nodes to the next larger size, but it is still happening. I am tuning the entire K8S cluster to reduce the memory load. It is just painful and slow.