It is 2100, and after 6 hours of working with our cloud provider, everything is back up.
A hardware glitch caused a node to fail. The website automatically moved to a new node and attempted to restart. Unfortunately, the same glitch led the cluster to believe that the failed node was still there and still working. Because the cluster thought the node was healthy, none of the resources (disk space) used by GFZ were released.
Because those resources were never released, the website could not start on the new node.
Linode took 8 calls from me, handled 22 ticket updates, and worked the entire 6 hours to get things working again.
I’m sorry the site was down for so long. I’m working with Linode management to make sure it doesn’t happen again. I’m also looking at options for shared file systems so that a pod can move from node to node seamlessly.