2020-08-28 - Memory overload on WER cluster

Summary

On 2020-08-28, WER reported that students' pages were stuck and unusable, amounting to a total outage.

After investigation, we determined that the core pods did not have appropriate resource guarantees set. There was also no separate user node pool, so the WER students' servers ran on the core nodes and overloaded their CPU & RAM. This starved the core pods of resources, which is what made the hub unusable.

This was resolved by:

  1. Increasing the resource guarantees for the core pods

  2. Removing the memory overcommit for WER students, since they typically use a large share of their memory limit (see the note on guarantees vs. limits below).
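
For context, a “guarantee” here corresponds to a Kubernetes resource request and a “limit” to a Kubernetes resource limit. A minimal way to check what a pod currently has set, with illustrative pod and namespace names:

# Requests (guarantees) and limits appear per container in the describe output
$ kubectl describe pod <hub-pod-name> -n <hub-namespace> | grep -A 3 -E 'Requests|Limits'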

Timeline

All times in IST

08:52 PM

Incoming report that many students cannot access the hub and that it appears frozen

09:02 PM

An activity bump is noticed, but the usual fixes (incognito mode, restarting user servers, etc.) don’t seem to resolve the issue

09:21 PM

Looking at resource utilization on the nodes, resource exhaustion is clear:

$ kubectl top node
NAME                                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-low-touch-hubs-cluster-core-pool-b7edea69-00sc   220m         11%    6151Mi          58%
gke-low-touch-hubs-cluster-core-pool-b7edea69-gwrg   1944m        100%   10432Mi         98%

There were only core nodes, with no separate user nodes. The suspicion is that the user pods are using up just enough resources that the core pods are being starved.
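
To see which pods are actually consuming the memory, something along these lines helps (a general sketch, not the exact commands run during the incident):

# Pod-level usage across all namespaces, heaviest memory consumers first
$ kubectl top pod -A --sort-by=memory
# How much of a node’s allocatable CPU/memory is already requested by pods
$ kubectl describe node gke-low-touch-hubs-cluster-core-pool-b7edea69-gwrg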

09:23 PM

Based on earlier tests of how much RAM WER needs, we had set a memory limit of 2G but a guarantee of only 512M, a 4x overcommit, as we often do. However, those tests showed that users almost always use just under 1G of RAM, so the overcommit should have been closer to 2x. For now, we simply remove the overcommit (guarantee = limit). This will also probably cause another node to be spawned, easing pressure on the existing ones.
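
A minimal sketch of what removing the overcommit could look like, assuming the hub is deployed with the Zero to JupyterHub Helm chart (which exposes singleuser.memory.guarantee and singleuser.memory.limit); the release and namespace names are illustrative, and the real deployment may wrap this in its own tooling:

# Set the memory guarantee (request) equal to the limit, i.e. no overcommit
$ helm upgrade <release> jupyterhub/jupyterhub \
    --namespace <hub-namespace> \
    --reuse-values \
    --set singleuser.memory.limit=2G \
    --set singleuser.memory.guarantee=2G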

09:24 PM

We bump the resource guarantees for all the core pods as well, so they have enough to operate even when the nodes fill up. This restarts the pods and moves some of them to a new node, which also helps. Things seem to return to normal.
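
Again assuming a Zero to JupyterHub style deployment, the core pod guarantees map to the chart’s resource requests; the exact keys depend on the chart version and the values below are purely illustrative:

# Raise hub and proxy requests so the scheduler reserves room for them
$ helm upgrade <release> jupyterhub/jupyterhub \
    --namespace <hub-namespace> \
    --reuse-values \
    --set hub.resources.requests.memory=1Gi \
    --set hub.resources.requests.cpu=500m \
    --set proxy.chp.resources.requests.memory=512Mi \
    --set proxy.chp.resources.requests.cpu=500m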

09:46 PM

The issue is closed and everything seems fine

Action Items