Public Cloud Server Interruption
Incident Report for Pagoda Box

Two public cloud servers were affected by a known race condition in the virtual filesystem. If a file path is resolved to a vnode inside an NFS mount at the same time that the NFS mount is unmounted and the file is deleted, a race condition occurs: the Directory Name Lookup Cache (DNLC) is cleared at the same time that a vnode is being removed from the cache, which results in a host machine reboot.
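To illustrate the general shape of this race (this is not the actual kernel code), the Go sketch below models a DNLC-like lookup cache. One goroutine plays the unmount path that purges the cache, and another plays the file-deletion path that removes a single entry; the mutex represents the serialization whose absence in the affected code path allows the two to collide. The names dnlc, Purge, and Remove are purely illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// dnlc is a toy stand-in for the Directory Name Lookup Cache: path -> vnode id.
type dnlc struct {
	mu      sync.Mutex
	entries map[string]int
}

func newDNLC() *dnlc {
	return &dnlc{entries: make(map[string]int)}
}

// Remove drops a single entry, as happens when a file is deleted.
func (c *dnlc) Remove(path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, path)
}

// Purge clears every entry, as happens when the mount is unmounted.
func (c *dnlc) Purge() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = make(map[string]int)
}

func main() {
	cache := newDNLC()
	cache.entries["/mnt/nfs/app/data"] = 42

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); cache.Remove("/mnt/nfs/app/data") }() // file deleted
	go func() { defer wg.Done(); cache.Purge() }()                     // NFS mount unmounted
	wg.Wait()

	fmt.Println("entries left:", len(cache.entries))
}
```

Running the sketch with Go's race detector (go run -race) shows the serialized version is clean; removing the mutex reproduces the class of unsynchronized cache access described above.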

This was a known race condition that had already been patched in an updated server image, but applying the patch required a server reboot. The race condition was, and is, extremely rare. Because of its rarity, we had decided to forgo proactively rebooting servers to apply the new image. Each of the affected servers has now been booted with the updated image and is no longer susceptible to the race condition.

The first server was affected by the race condition at approximately 16:30 UTC and was fully recovered at approximately 17:10 UTC. As the first server came back online, the same race condition was triggered on a second server, which housed a service dependent on an NFS service on the first server. The second server rebooted at approximately 17:05 UTC and was fully recovered at approximately 17:45 UTC.

Jun 09, 2015 - 18:16 UTC

Resolved
This incident has been resolved.
Jun 09, 2015 - 17:51 UTC
Update
All user services should now be back online. If your app is still affected, please submit a ticket or let us know in our IRC.
Jun 09, 2015 - 17:49 UTC
Update
100% Recovered
Jun 09, 2015 - 17:47 UTC
Update
91% Recovered
Jun 09, 2015 - 17:44 UTC
Update
70% Recovered
Jun 09, 2015 - 17:39 UTC
Update
56% Recovered
Jun 09, 2015 - 17:36 UTC
Update
30% Recovered
Jun 09, 2015 - 17:31 UTC
Update
User services on the 2nd affected cloud server are coming back online - 13% Recovered
Jun 09, 2015 - 17:29 UTC
Update
A 2nd public cloud server has been affected by the same race condition. This is a direct result of the interruption on the 1st public cloud server. The 2nd server is now rebooting with the updated image that includes the patch for the race condition.
Jun 09, 2015 - 17:17 UTC
Update
100% Recovered
Jun 09, 2015 - 17:11 UTC
Update
92% Recovered
Jun 09, 2015 - 17:07 UTC
Update
85% Recovered
Jun 09, 2015 - 17:03 UTC
Update
75% Recovered
Jun 09, 2015 - 17:00 UTC
Update
65% Recovered
Jun 09, 2015 - 16:56 UTC
Update
54% Recovered
Jun 09, 2015 - 16:53 UTC
Update
42% Recovered
Jun 09, 2015 - 16:49 UTC
Update
32% Recovered
Jun 09, 2015 - 16:47 UTC
Update
20% Recovered
Jun 09, 2015 - 16:44 UTC
Update
10% Recovered
Jun 09, 2015 - 16:43 UTC
Update
User services are coming back online.
Jun 09, 2015 - 16:41 UTC
Identified
A public cloud server is in the process of restarting. Engineers have identified the cause of the interruption: a known race condition related to NFS mounts. The server is now rebooting with an updated image that fixes the race condition and is coming back online.

The patch that fixes the race condition required a server reboot, which is why it had not yet been applied. This reboot has provided the opportunity to apply the patch.
Jun 09, 2015 - 16:35 UTC