Public Cloud Server Connection Interruption
Incident Report for Pagoda Box

On Friday, April 10, at about 18:30 UTC, a public cloud server rebooted due to a kernel panic. Whenever a Pagoda Box server reboots, we automatically export a core dump that identifies the exact processes running on the server at the time of failure. Engineers immediately began reviewing the data and identified a known bug in the ZFS virtualization on SmartOS (https://smartos.org/bugview/OS-3838).

At about the same time, the server fully recovered and began processing queued transactions. Unfortunately, one of them was the transaction that had triggered the initial outage. Resuming that paused transaction caused the server to reboot again immediately, while engineers were still searching for a transaction matching the reported bug.

While the server recovered the second time, engineers identified the transaction responsible for the outage, isolated it, and removed it from the queued transaction list. Once the server was back up, the remaining queued transactions resumed without further incident.
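The failure mode described above, where replaying a queue resumes the very transaction that caused the crash, can be illustrated with a small sketch. This is not Pagoda Box's implementation: the state layout, the `replay` and `run_until_recovered` functions, and the `SimulatedCrash` exception are all hypothetical names chosen for illustration, and the sketch automates a quarantine step that engineers performed by hand in this incident. It assumes the ID of the transaction in flight is recorded somewhere that survives a reboot.

```python
class SimulatedCrash(Exception):
    """Stands in for a server reboot in this sketch (hypothetical)."""


def replay(state, apply):
    """Replay queued transactions, setting aside any transaction
    that was in flight when the previous run crashed."""
    suspect = state["in_flight"]
    if suspect is not None:
        # The last run died while applying this transaction, so
        # quarantine it instead of resuming it.
        state["queue"] = [t for t in state["queue"] if t != suspect]
        state["quarantined"].append(suspect)
        state["in_flight"] = None

    while state["queue"]:
        txn = state["queue"][0]
        state["in_flight"] = txn   # record before applying
        apply(txn)                 # may raise SimulatedCrash
        state["queue"].pop(0)
        state["in_flight"] = None
        state["done"].append(txn)


def run_until_recovered(state, apply):
    """Restart the replay after each simulated crash; return the reboot count."""
    reboots = 0
    while True:
        try:
            replay(state, apply)
            return reboots
        except SimulatedCrash:
            reboots += 1


def apply(txn):
    # Hypothetical workload: one transaction always crashes the server.
    if txn == "bad":
        raise SimulatedCrash(txn)


state = {"queue": ["a", "bad", "c"], "in_flight": None,
         "quarantined": [], "done": []}
reboots = run_until_recovered(state, apply)
```

Under these assumptions the offending transaction causes a single crash and is then skipped on the next replay, rather than triggering a second reboot as it did here. A real system would still need an engineer to review whatever lands in the quarantine list.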

What We’re Doing to Fix Outages

In the last 30 days, engineers have significantly optimized the server reboot and recovery process. Compared to the server outage on March 9, today's recovery took one-seventh the time and required only minimal intervention (as detailed above).

Specific to this outage, Pagoda Box engineers have been in contact with both ZFS and SmartOS engineers today and will continue to review the issue in depth to help identify a permanent solution.

Posted Apr 10, 2015 - 23:26 UTC

Resolved
This incident has been resolved.
Posted Apr 10, 2015 - 19:46 UTC
Update
Recovery is 100% complete, and queued transactions are resuming.
Posted Apr 10, 2015 - 19:45 UTC
Update
94% complete.
Posted Apr 10, 2015 - 19:36 UTC
Update
81% complete.
Posted Apr 10, 2015 - 19:33 UTC
Update
71% complete.
Posted Apr 10, 2015 - 19:30 UTC
Update
60% complete.
Posted Apr 10, 2015 - 19:28 UTC
Update
51% complete.
Posted Apr 10, 2015 - 19:26 UTC
Update
41% complete.
Posted Apr 10, 2015 - 19:24 UTC
Update
30% complete.
Posted Apr 10, 2015 - 19:22 UTC
Update
20% complete.
Posted Apr 10, 2015 - 19:20 UTC
Update
The server has been restarted, and services are being restarted.
Posted Apr 10, 2015 - 19:20 UTC
Identified
A Public Cloud server is offline. Engineers are working to restore services.
Posted Apr 10, 2015 - 18:38 UTC