The site has not been as stable as I want it to be. We are experiencing a failure about once every 48 to 72 hours. An outage normally lasts less than 5 minutes. Today's outage ran longer than that.
I know what the issue is. K8S is killing off parts of the infrastructure. Normally, it is the database engine.
When the database goes down, the site tells K8S that it is sick. This results in the 503 errors you might have seen.
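For readers wondering how a site "tells K8S that it is sick", here is a minimal sketch of one common pattern: a health endpoint that K8S polls, which answers 503 when the database can't be reached. This is not the site's actual code. The `/healthz` path, port 8080, and the `DB_HOST`/`DB_PORT` settings are all placeholders I made up for the example.

```python
import os
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical settings; the real site's database host and port are not given.
DB_HOST = os.environ.get("DB_HOST", "db")
DB_PORT = int(os.environ.get("DB_PORT", "3306"))


def database_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to the database port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_status(db_ok):
    """Map database reachability to an HTTP status code and response body."""
    if db_ok:
        return 200, b"ok"
    return 503, b"database unavailable"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the (assumed) health path is served by this sketch.
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_status(database_reachable(DB_HOST, DB_PORT))
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for the example.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

With a probe pointed at an endpoint like this, K8S stops sending traffic to the site while the database is down, and visitors get a 503 instead of a half-working page.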
The root cause is that K8S doesn’t think there are enough resources available and “reaps” something, normally the RDBMS.
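I haven't said which mechanism does the reaping, so treat this as background rather than a diagnosis. When a node runs short of resources, K8S considers each pod's resource requests and limits when choosing what to kill, and it sorts pods into three quality-of-service classes based on them. Here is a rough sketch of that classification. The function name and the dict layout are my own invention for the example, and it compares quantities as plain strings, which the real thing does not.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    `containers` is a list of dicts, each with optional "requests" and
    "limits" dicts keyed by "cpu" and "memory" (an illustrative layout).
    """
    any_set = False      # has any container set any request or limit?
    guaranteed = True    # does every container have limits == requests?

    for container in containers:
        requests = dict(container.get("requests", {}))
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            # When only a limit is given, the request defaults to the limit.
            if resource in limits and resource not in requests:
                requests[resource] = limits[resource]
            if resource in requests or resource in limits:
                any_set = True
            # Simplification: values are compared as strings, so "1" and
            # "1000m" would count as different here.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# A database pod with only a memory request lands in the middle class.
print(qos_class([{"requests": {"memory": "1Gi"}}]))
```

Pods that use more than they requested are the first candidates for eviction under node pressure, and Guaranteed pods are among the last.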
The fix for this is to move from rook-ceph with an internal cluster to rook-ceph with an external cluster. The advantage of an external cluster is that it requires fewer resources within K8S, and I have better control over it.
I have created an external cluster within my own K8S test system. I’m in the process of documenting how to bring up a K8S external cluster. It isn’t working yet. I’ll get there.