Public Cloud Server Reboot
Incident Report for Pagoda Box

At approximately 19:10 UTC, a public cloud server stopped reporting and rebooted. User services on the affected server were unresponsive while the server came back online. After approximately 40 minutes, all user services/instances were restored.

The Cause

A race condition was introduced in the upstream build of SmartOS, the operating system used on Pagoda Box servers. The technical explanation of the race condition can be found here.

What We're Doing to Fix It

A patch has been released to fix the issue, and we are running an emergency OS image build that includes it. However, the affected server booted back into the image that still contains the race condition, so it will need to be updated. We will schedule a maintenance window that minimizes the impact on users, during which we'll replace the current server image(s) with the patched image.

Posted about 2 years ago. Apr 17, 2015 - 20:27 UTC

Resolved
This incident has been resolved.
Posted about 2 years ago. Apr 17, 2015 - 20:25 UTC
Monitoring
All user instances should now be fully functional. If you are still experiencing issues, you may need to rebuild code services or repair data services.
Posted about 2 years ago. Apr 17, 2015 - 19:54 UTC
Update
100% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:54 UTC
Update
98% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:52 UTC
Update
93% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:50 UTC
Update
89% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:48 UTC
Update
80% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:44 UTC
Update
75% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:41 UTC
Update
69% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:39 UTC
Update
60% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:35 UTC
Update
50% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:32 UTC
Update
42% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:29 UTC
Update
38% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:27 UTC
Update
30% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:25 UTC
Update
22% Recovered.
Posted about 2 years ago. Apr 17, 2015 - 19:23 UTC
Update
User instances are coming back online now.
Posted about 2 years ago. Apr 17, 2015 - 19:18 UTC
Identified
A public cloud server has rebooted. Engineers have identified a CPU spike just before the server last reported and are investigating the cause.
Posted about 2 years ago. Apr 17, 2015 - 19:09 UTC