2020-08-28 - Memory overload on WER cluster#
Summary#
On 2020-08-28, WER reported that pages were stuck for students. This was a total outage: the hub was unusable.
After investigation, we determined that the core pods didn’t have appropriate resource guarantees set. There was also no dedicated core pool, so the WER students' servers ran on the same nodes as the core pods and exhausted the CPU and RAM of those nodes. This starved the core pods of resources, and the hub froze.
This was resolved by:
Giving the core pods higher resource guarantees
Removing memory overcommit for WER students, since they typically use just under 1G of their 2G memory limit
Timeline#
All times in IST
08:52 PM#
Incoming report that many students cannot access the hub, which appears frozen
09:02 PM#
An activity bump is noticed, but the usual fixes (incognito window, restarting servers, etc.) don’t help
09:21 PM#
Looking at resource utilization on the nodes, resource exhaustion is clear
$ kubectl top node
NAME                                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-low-touch-hubs-cluster-core-pool-b7edea69-00sc   220m         11%    6151Mi          58%
gke-low-touch-hubs-cluster-core-pool-b7edea69-gwrg   1944m        100%   10432Mi         98%
There were only core nodes - no separate user nodes. The suspicion is that the user pods are using up just enough resources that the core pods are being starved.
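The check done by eye on the `kubectl top node` output can be sketched in a few lines of Python. This is only an illustration: the function name `saturated_nodes` and the 90% threshold are assumptions, not part of the original response, and the parser assumes every row has the five columns shown above (it does not handle rows where metrics are unavailable).

```python
# Sample text copied from the `kubectl top node` output in this report.
SAMPLE = """\
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-low-touch-hubs-cluster-core-pool-b7edea69-00sc 220m 11% 6151Mi 58%
gke-low-touch-hubs-cluster-core-pool-b7edea69-gwrg 1944m 100% 10432Mi 98%
"""


def saturated_nodes(top_output, threshold_pct=90):
    """Return names of nodes whose CPU% or MEMORY% is at or above threshold_pct.

    Hypothetical helper; the 90% default is an arbitrary illustrative value.
    """
    flagged = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        name, _cpu, cpu_pct, _mem, mem_pct = line.split()
        worst = max(int(cpu_pct.rstrip("%")), int(mem_pct.rstrip("%")))
        if worst >= threshold_pct:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for node in saturated_nodes(SAMPLE):
        print(node)
```

On the output above, only the `-gwrg` node is flagged, which matches the observation that one node was fully saturated while the other still had headroom.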
09:23 PM#
Based on tests of how much RAM WER needs, we had set a memory limit of 2G but a guarantee of only 512M, a 4x overcommit, as we often do. However, those tests showed that users almost always use just under 1G of RAM, so the overcommit should have been only 2x. For now, we remove the overcommit entirely. This will probably also spawn another node, easing pressure on the existing nodes.
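The overcommit arithmetic can be made concrete with a rough sketch. The limit, guarantee, and typical usage figures come from this report (with "just under 1G" rounded up to 1024M and 2G treated as 2048M); the 10240M node size is an illustrative assumption, and the calculation ignores core pods and system reservations. The function names are hypothetical.

```python
LIMIT_MB = 2048       # memory limit per user (2G, from the report)
GUARANTEE_MB = 512    # memory guarantee per user (from the report)
TYPICAL_USE_MB = 1024 # "just under 1G", rounded up
NODE_MB = 10240       # assumed node memory, for illustration only


def overcommit_ratio(limit_mb, guarantee_mb):
    """Ratio of what a user may consume to what the scheduler reserves."""
    return limit_mb / guarantee_mb


def users_scheduled(node_mb, guarantee_mb):
    """How many user pods fit on a node when packing by guarantee alone."""
    return node_mb // guarantee_mb


if __name__ == "__main__":
    print(overcommit_ratio(LIMIT_MB, GUARANTEE_MB))    # 4.0, the ratio in use
    print(overcommit_ratio(LIMIT_MB, TYPICAL_USE_MB))  # 2.0, the ratio usage supports

    for guarantee in (GUARANTEE_MB, TYPICAL_USE_MB, LIMIT_MB):
        users = users_scheduled(NODE_MB, guarantee)
        demand = users * TYPICAL_USE_MB
        print(guarantee, users, demand, demand <= NODE_MB)
```

Under these assumptions, a 512M guarantee lets 20 users land on one node whose real demand is about twice the node's memory, while a guarantee of 1024M or 2048M keeps typical demand within the node.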
09:24 PM#
We bump resource guarantees for all the core pods as well, so they will have enough to operate even if the nodes get full. This restarts the pods, and moves some to a new node - which also helps. Things seem to return to normal.
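One way to spot pods that lack a memory guarantee is to scan the JSON that `kubectl get pods -o json` produces, where each container's guarantee appears under `spec.containers[].resources.requests`. The sketch below is a hedged illustration, not the procedure used during the incident: the function name is hypothetical, and the pod and container names in the sample are made up.

```python
def containers_without_memory_request(pod_list):
    """Return (pod, container) pairs that set no memory request.

    Expects the parsed JSON of `kubectl get pods -o json` (a dict with "items").
    """
    missing = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for container in pod["spec"]["containers"]:
            requests = (container.get("resources") or {}).get("requests") or {}
            if "memory" not in requests:
                missing.append((pod_name, container["name"]))
    return missing


# Made-up sample data; these names do not come from the incident.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "example-hub"},
            "spec": {"containers": [
                {"name": "hub", "resources": {"requests": {"memory": "256Mi"}}},
            ]},
        },
        {
            "metadata": {"name": "example-proxy"},
            "spec": {"containers": [
                {"name": "proxy", "resources": {}},
            ]},
        },
    ]
}

if __name__ == "__main__":
    for pod_name, container_name in containers_without_memory_request(SAMPLE_PODS):
        print(pod_name, container_name)
```

In practice the input would come from something like `json.load` on the saved command output; containers that show up in the result can be evicted or starved first when a node fills up.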
09:46 PM#
The issue is closed and everything seems fine
Action Items#
Make sure user pods are in a separate pool, so they do not create pressure on the core pods 2i2c-org/infrastructure#89
Set limits on the support infrastructure (prometheus, grafana, ingress) as well 2i2c-org/infrastructure#90
Document and think about overcommit ratios for memory usage 2i2c-org/infrastructure#91
Set up better Grafana dashboards to monitor resource usage 2i2c-org/infrastructure#92
Document how folks can get kubectl access to the cluster, so others can look into issues too 2i2c-org/infrastructure#87