Public Cloud Server Reboot
Incident Report for Pagoda Box

At approximately 19:10 UTC on April 17, 2015, a public cloud server stopped reporting and rebooted. User services on the affected server were unresponsive while the server came back online. Service was restored to all user services and instances after approximately 40 minutes.

The Cause

A race condition was introduced in the upstream build of SmartOS, the operating system used on Pagoda Box servers. The technical explanation of the race condition can be found here.

What We're Doing to Fix It

A patch has been released to fix the issue, and we are running an emergency OS image build that includes it. However, the affected server booted back into the image that contains the race condition, so it still needs to be updated. We will schedule a maintenance window that minimizes the impact on users, during which we'll replace the current server image(s) with the patched image.

Posted almost 2 years ago. Apr 17, 2015 - 20:27 UTC

Resolved
This incident has been resolved.
Posted almost 2 years ago. Apr 17, 2015 - 20:25 UTC
Monitoring
All user instances should now be fully functional. If you are still experiencing issues, you may need to rebuild code services or repair data services.
Posted almost 2 years ago. Apr 17, 2015 - 19:54 UTC
Update
100% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:54 UTC
Update
98% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:52 UTC
Update
93% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:50 UTC
Update
89% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:48 UTC
Update
80% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:44 UTC
Update
75% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:41 UTC
Update
69% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:39 UTC
Update
60% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:35 UTC
Update
50% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:32 UTC
Update
42% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:29 UTC
Update
38% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:27 UTC
Update
30% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:25 UTC
Update
22% Recovered.
Posted almost 2 years ago. Apr 17, 2015 - 19:23 UTC
Update
User instances are coming back online now.
Posted almost 2 years ago. Apr 17, 2015 - 19:18 UTC
Identified
A public cloud server has rebooted. Engineers have identified a CPU spike just before the server last reported and are investigating the cause.
Posted almost 2 years ago. Apr 17, 2015 - 19:09 UTC