On Friday, April 10, at about 06:30 UTC, a public cloud server rebooted due to a kernel panic. Whenever a Pagoda Box server reboots this way, we automatically export a core dump that identifies the exact processes running on the server at the time of failure. Engineers immediately began reviewing the dump and matched the failure to a known bug in the ZFS virtualization on SmartOS (https://smartos.org/bugview/OS-3838).
At about the same time, the server fully recovered and began to process queued transactions. Unfortunately, one of those queued transactions was the one that had triggered the initial outage. Resuming it caused the server to reboot again immediately, while engineers were still searching the queue for a transaction matching the reported bug.
While the server was recovering from the second reboot, engineers identified and isolated the transaction responsible for the outage and removed it from the list of queued transactions. Once the server was back up, the remaining queued transactions resumed without further incident.
In the last 30 days, engineers have significantly optimized the server reboot and recovery process. Compared to the server outage on March 9, today's recovery took one-seventh of the time and required only the minimal intervention detailed above.
Regarding this outage specifically, Pagoda Box engineers have already been in contact with both ZFS and SmartOS engineers today, and will continue to review the issue in depth to help identify a permanent solution.