Public Cloud Server Connection Interruption
Incident Report for Pagoda Box

On Friday, April 10 at about 6:30 UTC, a public cloud server rebooted due to a kernel panic. Whenever a Pagoda Box server reboots, we automatically export a core dump that identifies the exact processes running on the server at the time of failure. Engineers immediately began reviewing the data and identified a known bug in the ZFS virtualization on SmartOS (https://smartos.org/bugview/OS-3838).

At about the same time, the server fully recovered and began to process queued transactions. Unfortunately, one of those queued transactions was the one that had triggered the initial outage. Resuming that paused transaction caused the server to reboot again immediately, while engineers were still searching for an offending transaction matching the reported bug.

While the server was recovering the second time, engineers identified and isolated the transaction responsible for the outage and removed it from the queued transaction list. Once that second recovery completed, queued transactions resumed without further incident.
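
The failure mode described above (a queued transaction that crashes the server each time it is replayed) can be illustrated with a minimal sketch. This is not Pagoda Box's actual tooling: the `ReplayQueue` class, the `SimulatedCrash` exception, and the transaction names are all hypothetical, and a real system would persist the in-flight marker to disk so that it survives a reboot. The idea is to record which transaction is running before executing it, so that after a crash the offending transaction can be set aside instead of being replayed again.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional


class SimulatedCrash(Exception):
    """Stands in for a kernel panic in this illustration."""


@dataclass
class ReplayQueue:
    """Hypothetical replay queue; all names here are assumptions."""
    pending: Deque[str] = field(default_factory=deque)
    quarantined: List[str] = field(default_factory=list)
    # In a real system this marker would be written to durable storage.
    in_flight: Optional[str] = None

    def recover(self) -> None:
        """Run once after a reboot, before replaying anything."""
        if self.in_flight is not None:
            # The marker was still set, so this transaction was running
            # when the server went down. Quarantine it rather than retry.
            self.quarantined.append(self.in_flight)
            self.pending.remove(self.in_flight)
            self.in_flight = None

    def replay(self, execute: Callable[[str], None]) -> None:
        """Execute pending transactions in order."""
        while self.pending:
            txn = self.pending[0]
            self.in_flight = txn      # mark before executing
            execute(txn)              # may "crash" (raise)
            self.pending.popleft()    # only reached on success
            self.in_flight = None


def simulate():
    """Replay three transactions, one of which crashes the server."""
    completed: List[str] = []

    def execute(txn: str) -> None:
        if txn == "delete-corrupt-dataset":
            raise SimulatedCrash(txn)
        completed.append(txn)

    queue = ReplayQueue(deque(["write-a", "delete-corrupt-dataset", "write-b"]))
    try:
        queue.replay(execute)
    except SimulatedCrash:
        pass                          # the server "reboots" here
    queue.recover()                   # quarantine the offending transaction
    queue.replay(execute)             # remaining transactions resume
    return completed, queue.quarantined
```

In this sketch the quarantine step happens automatically on recovery; in the incident above, engineers identified and removed the transaction by hand.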

What We’re Doing to Fix Outages

In the last 30 days, engineers have significantly optimized the server reboot and recovery process. Compared to the server outage on March 9, today's recovery took one-seventh of the time and required only minimal intervention (as detailed above).

Specific to this outage, Pagoda Box engineers have already been in communication with both ZFS and SmartOS engineers today, and will continue to review this issue in depth to help identify a permanent solution.

Posted almost 2 years ago. Apr 10, 2015 - 23:23 UTC

Resolved
This incident has been resolved.
Posted almost 2 years ago. Apr 10, 2015 - 20:29 UTC
Update
100% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:18 UTC
Update
94% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:17 UTC
Update
88% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:15 UTC
Update
79% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:13 UTC
Update
68% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:11 UTC
Update
57% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:09 UTC
Update
47% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:07 UTC
Update
31% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:05 UTC
Update
11% Recovered.
Posted almost 2 years ago. Apr 10, 2015 - 20:02 UTC
Update
We have identified a corrupt data set (file system) on the affected server that crashes the server on 'delete'. We are putting in a temporary patch to keep the server online, and will address the actual bug following the outage (https://smartos.org/bugview/OS-3838).
Posted almost 2 years ago. Apr 10, 2015 - 20:02 UTC
Investigating
Engineers are investigating connection interruptions with one of the public cloud servers. The server successfully rebooted and started all services, but immediately crashed again.
Posted almost 2 years ago. Apr 10, 2015 - 19:49 UTC