Public Cloud Server Interruption
Incident Report for Pagoda Box

Series of Events

At approximately 7:50am MST, a hypervisor abruptly rebooted as a result of a kernel panic. The hypervisor booted back up normally within 10 minutes, and after approximately 15 minutes of virtual-network synchronization, the containers began to boot. Within 20 minutes all of the containers were running, and we began our post-recovery checklist to ensure all apps were restored to proper function. At approximately 8:53am MST, our post-recovery verification was complete and the hypervisor was added back into the pool.

Root Cause Analysis

Inspection of the core dump (the kernel stack trace triggered by the panic) revealed an attempt to dereference a null pointer. The offending kernel module was the dnlc (Directory Name Lookup Cache) module. The purpose of the dnlc module is to maintain a mapping of file handles to zones (containers) so that filesystem lookups within zones are fast. Approximately 14 months ago, we identified a performance bottleneck within this module that was negatively impacting zone create, destroy, start, stop, and reboot operations.
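For context, the sketch below is a purely illustrative C simplification of what such a cache looks like: a hash table whose entries map a file handle to the zone that owns it, so repeated lookups avoid walking the filesystem. The names (zone_t, dnlc_entry_t, dnlc_enter, dnlc_lookup, DNLC_NBUCKETS) are our own and do not reflect the actual kernel source.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical zone record; the real kernel structure is far richer. */
    typedef struct zone {
        int zone_id;
    } zone_t;

    /* One cache entry: a file handle and the zone that owns it. */
    typedef struct dnlc_entry {
        uint64_t            fh;     /* file handle being cached */
        zone_t             *zone;   /* owning zone (container)  */
        struct dnlc_entry  *next;   /* hash-chain link          */
    } dnlc_entry_t;

    #define DNLC_NBUCKETS 1024

    static dnlc_entry_t *dnlc_hash[DNLC_NBUCKETS];

    /* Cache a handle-to-zone mapping. */
    static void
    dnlc_enter(uint64_t fh, zone_t *zone)
    {
        dnlc_entry_t *e = malloc(sizeof (*e));

        if (e == NULL)
            return;
        e->fh = fh;
        e->zone = zone;
        e->next = dnlc_hash[fh % DNLC_NBUCKETS];
        dnlc_hash[fh % DNLC_NBUCKETS] = e;
    }

    /* Return the zone that owns a handle, or NULL on a miss. */
    static zone_t *
    dnlc_lookup(uint64_t fh)
    {
        dnlc_entry_t *e;

        for (e = dnlc_hash[fh % DNLC_NBUCKETS]; e != NULL; e = e->next) {
            if (e->fh == fh)
                return (e->zone);
        }
        return (NULL);
    }

    int
    main(void)
    {
        zone_t web = { .zone_id = 7 };

        dnlc_enter(1234, &web);
        printf("zone %d\n", dnlc_lookup(1234)->zone_id);   /* zone 7 */
        return (0);
    }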

Prior to our investigation, the last modification to that module was authored by AT&T in 1988, so it is understandable that current circumstances weren't considered nearly 30 years ago. We made a significant refactor and enhancement to the dnlc module, which reduced zone operations from roughly 5 minutes to 2 seconds. This was a success, and after a series of rigorous tests the new kernel was released. Approximately 3 months after the enhanced module was released, it was discovered that a rare scenario could result in a null pointer dereference. A patch was made, and the kernel was released again.
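To make the failure mode concrete, the hedged sketch below shows the class of bug and the shape of the fix: a lookup that can return NULL for a stale entry, and a guard that turns what would have been a panic into a harmless cache miss. The names and logic are our own illustration under those assumptions, not the actual patch.

    #include <stdio.h>

    /* Hypothetical zone record, as in the earlier sketch. */
    typedef struct zone {
        int zone_id;
    } zone_t;

    /*
     * If a cache entry can briefly outlive its zone (for example while a
     * zone is halting), a lookup may hand back NULL. Dereferencing that
     * pointer unconditionally is the class of bug that caused the panic.
     */
    static int
    zone_id_of(zone_t *z)
    {
        /* Unpatched shape of the code: return z->zone_id;  -- panics on NULL */

        if (z == NULL)      /* patched shape: treat a stale entry as a miss */
            return (-1);

        return (z->zone_id);
    }

    int
    main(void)
    {
        zone_t running = { .zone_id = 42 };

        printf("%d\n", zone_id_of(&running));   /* 42 */
        printf("%d\n", zone_id_of(NULL));       /* -1 instead of a crash */
        return (0);
    }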

At that point it was determined that the scenario was quite rare, and that the likelihood of it being triggered did not warrant a forced-reboot maintenance window. Today's incident was caused by that rare scenario. Now that the hypervisor has come back online, the running kernel has the patch in place.

Conclusion

We sincerely apologize for the disruption that this has caused. This rare, isolated incident marks the first kernel panic in 10 months. We truly appreciate your understanding and patience.

Lastly, we would like to remind you that while we are constantly working to eliminate any and all potentially disruptive scenarios, rare incidents like today's can still happen despite our best efforts. Please remember to add redundancy to all of your mission-critical applications. To that end, we would like to confirm that all applications that had redundancy enabled remained online during the brief disruption. Thank you again for your understanding and support.

Mar 28, 2016 - 15:39 UTC

Resolved
All affected services have been restored. A post mortem will be posted shortly.
Mar 28, 2016 - 15:36 UTC
Monitoring
All affected applications should now be back online. Engineers are working through stuck transactions. If your app was affected and is still offline, try triggering a rebuild or a repair on the affected service.
Mar 28, 2016 - 14:54 UTC
Update
User instances and network connections are coming back online.
Mar 28, 2016 - 14:36 UTC
Update
Engineers have brought the affected server back online and are working to restore network connections to user instances.
Mar 28, 2016 - 14:27 UTC
Identified
A public cloud server has rebooted unexpectedly. All application services on the affected server are temporarily unavailable. Engineers are bringing the server back online now.
Mar 28, 2016 - 14:03 UTC