Service Interruption - Shared and Reseller Hosting

Incident Report for Digital Pacific

Postmortem

This incident is now fully resolved and we have had the chance to look back over what happened, summarise it, learn from it and outline the measures we have put in place to ensure that this never happens again. This is the first time in 15 years that we have suffered an outage of this scale on our shared hosting platform. Thankfully our disaster recovery process came into play and restored all services to normal. We know how important our services are to you and we are truly sorry for the effect this outage had on your business.

Summary

At 1:42am on October 27 our systems administrators became aware of an issue affecting multiple servers in our shared hosting environment. We determined that the issue was caused by a script associated with MailScanner, an email scanning tool used by most cPanel hosting providers around the world.

The MailScanner quarantine script contained an invalid check and ran on a number of shared web hosting servers, removing core system files older than 7 days and rendering the servers inoperable. These files are required to run the operating system and the services associated with hosting websites. A team was assembled and the script was immediately stopped and deleted to prevent further file removal.

Our disaster recovery process was then initiated and a plan was devised to systematically restore the system files that had been removed, so that the servers could be recovered as quickly as possible. As we back up our servers 3 times daily and retain these backups for 7 days, there was no loss of existing data.

A file-based restore takes significantly longer than a full restoration, as each individual file (potentially millions) has to be checked for existence and restored only if it is missing. This restore process does, however, ensure that the servers are returned to the state they were in prior to the issue, and it preserves the integrity of data created after the most recent backup.
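As an illustration only, the "restore only if missing" idea can be sketched in a few lines of Python. This is not the tooling used during the incident; the function name restore_missing and the backup_root and live_root parameters are hypothetical, and a real restore would also need to handle ownership, permissions and very large file counts.

```python
import shutil
from pathlib import Path


def restore_missing(backup_root, live_root):
    """Copy files from a backup tree into a live tree, but only where the
    live file is missing. Files that already exist are never overwritten,
    so data written after the backup was taken is left alone.

    Returns the relative paths that were restored.
    """
    backup_root = Path(backup_root)
    live_root = Path(live_root)
    restored = []

    # sorted() visits parent directories before their contents.
    for src in sorted(backup_root.rglob("*")):
        rel = src.relative_to(backup_root)
        dest = live_root / rel

        # The existence check that makes this slower than a full restore:
        # every entry is examined individually (is_symlink also catches
        # broken symlinks, which exists() reports as missing).
        if dest.exists() or dest.is_symlink():
            continue

        if src.is_dir() and not src.is_symlink():
            dest.mkdir(parents=True)
            continue

        dest.parent.mkdir(parents=True, exist_ok=True)
        # copy2 keeps timestamps; symlinks are recreated rather than followed.
        shutil.copy2(src, dest, follow_symlinks=False)
        restored.append(rel)

    return restored
```

Because existing files are skipped rather than replaced, anything created or changed after the most recent backup survives the restore, which is the trade-off described above.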

By 11:30am on October 27 the first restorations were completing and some services had returned to normal operation. The restoration process continued until all services had been fully restored.

Once the restoration process had completed we ran full integrity checks over all of our servers. These checks were completed at 12pm on October 30 and the incident was marked resolved at 12:58pm on the same day.

Root Cause

The root cause of the issue was that the MailScanner script did not correctly check that it was operating in the quarantine directory before removing files from quarantine. The script was unable to access the MailScanner configuration file and, as a result, began to remove files from the root directory (/) of the server.
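For readers who maintain similar cleanup jobs, the missing safeguard can be sketched as follows. This is a minimal Python illustration and not MailScanner's code or its fix; the names UnsafeCleanupTarget, validated_quarantine_dir, purge_old_files and the allowed_parent parameter are all hypothetical. The idea is to refuse to delete anything unless the configured directory is present, non-empty and strictly inside an expected parent directory.

```python
import time
from pathlib import Path


class UnsafeCleanupTarget(Exception):
    """Raised when the cleanup target cannot be trusted."""


def validated_quarantine_dir(configured, allowed_parent):
    """Return the resolved quarantine directory, or raise if it looks wrong."""
    # A missing or empty setting must stop the job, not fall back to a default.
    if configured is None or not str(configured).strip():
        raise UnsafeCleanupTarget("quarantine directory is not configured")

    target = Path(configured).resolve()
    parent = Path(allowed_parent).resolve()

    # The target must be strictly below the allowed parent. "/" and the
    # parent itself are both rejected by this check.
    if parent not in target.parents:
        raise UnsafeCleanupTarget(f"{target} is not inside {parent}")
    if not target.is_dir():
        raise UnsafeCleanupTarget(f"{target} is not a directory")
    return target


def purge_old_files(configured, allowed_parent, max_age_days=7, now=None):
    """Delete regular files older than max_age_days under the validated
    quarantine directory and return the paths that were removed."""
    target = validated_quarantine_dir(configured, allowed_parent)
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400

    removed = []
    # Materialise the listing first so deletions do not disturb the walk.
    for path in sorted(target.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

With a check like this, an unreadable configuration file results in an error and no deletions, rather than a cleanup that starts from the root directory.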

Corrective and Preventative Measures

We have contacted the development lead of MailScanner and he has assured us that this bug will be resolved ASAP so that this issue cannot affect any other hosting providers.

We have now completely removed MailScanner from our systems and are using SpamExperts in its place.

We have conducted a full review of all third-party scripts running on our shared hosting environment.

We have checked all systems and ensured everything has returned to normal and all updates have been applied.

We are investigating further ways we can improve our restoration process to ensure that in the event of a disaster we can minimise downtime.

Posted Nov 04, 2016 - 10:05 AEDT

Resolved

Our technicians have completed checks and we are now marking this incident as resolved. An incident postmortem will be posted in the coming days. If you continue to have any further issues please submit a support request here: https://onepanel.digitalpacific.com.au. On behalf of the whole team at Digital Pacific we would like to thank you for your patience and we are sincerely sorry for the issues experienced this week.

Posted Oct 30, 2016 - 12:58 AEDT

Monitoring

Our team will continue to work over the weekend, checking every website, to ensure that there are no further issues resulting from this incident. If you have any concerns during this time please submit a support request here: https://onepanel.digitalpacific.com.au. If there are no further issues, we will be updating this incident to resolved on Monday.

Posted Oct 28, 2016 - 16:47 AEDT

Update

No new developments. As mentioned, the restorations are continuing into today for a very small number of sites that are still experiencing issues. Email should be functioning normally and the mail queue is still running.

Posted Oct 28, 2016 - 12:21 AEDT

Update

No new updates to report. We are keeping this incident open until we are sure that all customers are back online and there are no further issues. Most customers have been returned to full operation and we are focusing on finding and resolving any further issues.

Posted Oct 28, 2016 - 10:42 AEDT

Update

Mail delivery is flowing smoothly. The majority of the 503 errors are now resolved. Services are still being restored for a small number of customers and we are working through the issues as they are discovered.

Posted Oct 28, 2016 - 09:18 AEDT

Update

A solution for the 503 errors has been discovered and is being rolled out across affected sites which will bring them back online very soon. The mail queues have returned to normal and all backlogged email should be delivered soon. Once the 503 errors have been resolved, the majority of websites will be back to normal and we will continue to restore remaining services throughout the day.

Posted Oct 28, 2016 - 06:37 AEDT

Update

Several servers have completed restorations, and mail queues are nearly completed.

We are aware that some sites are returning 503 errors, and we're working with cPanel support on this.

We will aim to provide another update before 7AM.

Posted Oct 28, 2016 - 04:41 AEDT

Update

We're continuing on restoring services. Mail queues are progressing faster than expected. We will aim to provide another update before 5AM.

Posted Oct 28, 2016 - 03:08 AEDT

Update

Our team is continuing to work on restorations as per the previous update, and will continue throughout the night.

Mail queues are progressing as anticipated, and should be completed within a few hours.

Posted Oct 28, 2016 - 01:49 AEDT

Update

Thank you for your patience while we work through these issues. Our team is still working on resolving outstanding issues with websites which will continue through the night.

Recap: At around 2AM on the 27th of October 2016 a MailScanner script ran which removed important system files from some of our shared hosting servers. There was no hardware failure or data loss. We have been working since 2AM to restore access to your service. This is our number 1 priority and we will continue to work around the clock until this issue is completely resolved. We’ve disabled any further scripts from running until we have had the opportunity to fully review what happened.

We understand how important our services are to the operation of your business and we are truly sorry for the inconvenience that this outage will have caused you today.

We will provide you with another update before 4AM.

Posted Oct 28, 2016 - 00:02 AEDT

Update

The team is still working on resolving all of the issues and mail is being delivered to all but a handful of remaining accounts. The mail queue does have a large backlog which will be resolved over the coming hours as mail is delivered. We will provide another update at midnight tonight.

Posted Oct 27, 2016 - 22:23 AEDT

Update

Email service is now restored to over 95% of affected services. The mail queue is still being processed so any missed emails will continue to flow into your inbox throughout the evening. For the remaining accounts our team is still working hard on getting things back online and we will continue to update you on this process throughout the night.

Posted Oct 27, 2016 - 21:13 AEDT

Update

Over 85% of services now have email functioning smoothly again. We expect this to be 100% over the next few hours. Our sysadmins are still working towards complete resolution as soon as possible.

Posted Oct 27, 2016 - 19:56 AEDT

Update

Over 60% of affected websites are now back online. We expect that overnight we will resume 100% email deliverability across all affected accounts, but a small number of website issues may continue into tomorrow. At this time we are unable to provide specific time-frames on individual websites but our team is working non-stop to ensure that all services are restored in the quickest possible time-frame.

Thank you for your patience while we resolve these issues and again we are truly very sorry that this incident has occurred.

Posted Oct 27, 2016 - 17:37 AEDT

Update

Many more sites are now up and the number is growing by the minute. A team of sysadmins has been formed to continue working into the night. We have also tripled the number of support staff overnight to be able to continue responding to any enquiries you may have during this difficult time.

Recap: At around 2am on the 27th of October 2016 a MailScanner script ran which removed important system files from some of our shared hosting servers. There was no hardware failure or data loss. We have been working since 2am to restore access to your service. This is our number 1 priority and we will continue to work around the clock until this issue is completely resolved. We’ve disabled any further scripts from running until we have had the opportunity to fully review what happened.

We understand how important our services are to the operation of your business and we are truly sorry for the inconvenience that this outage will have caused you today.

Posted Oct 27, 2016 - 16:11 AEDT

Update

No new information to report at this time. All servers are using the faster recovery method and a large number of sites have come back online in the last hour. We will continue to bring sites back online as soon as possible.

Posted Oct 27, 2016 - 14:26 AEDT

Update

The recovery is still progressing and we have discovered a few ways to increase the performance of this process. Sites are continuing to come back online as the day goes on. The recovery time-frame remains unchanged and we are doing everything we can to get you back online.

Posted Oct 27, 2016 - 13:14 AEDT

Update

Multiple servers are now completely recovered which proves that our plan for recovery is sound. The rest are in progress and all available resources have been allocated to these servers to ensure that they are recovered as quickly as possible.

To recap: Overnight a MailScanner script ran which removed important system files from some of our shared hosting servers. There was no hardware failure or data loss. We have been working since 2am to restore access to your service. This is our number 1 priority and we will continue to work around the clock until this issue is completely resolved. We’ve disabled any further scripts from running until we have had the opportunity to fully review what happened.

We understand how important our services are to the operation of your business and we are truly sorry for the inconvenience that this outage will have caused you today.

Posted Oct 27, 2016 - 12:04 AEDT

Update

No new information to report. Websites are coming back online as the affected servers become operational.

Posted Oct 27, 2016 - 11:30 AEDT

Update

We are continuing to progressively restore access to affected services. We have all staff working on this issue to ensure the quickest possible recovery time.

Posted Oct 27, 2016 - 10:58 AEDT

Update

We are experiencing an issue that is affecting some shared and reseller hosting accounts. No data has been lost and mail is queuing for delivery. We are working hard to restore access to services as quickly as possible. As the issues are resolved, affected websites and emails will come back online over the next 4 - 12 hours.

Overnight our technicians were alerted by our monitoring to an issue with our hosting environment. This issue was caused by a MailScanner update script removing some key files from the web server that were older than 7 days. These files are in our backups and are being systematically restored to bring our systems back online. We’ve disabled any further scripts from running until we have had the opportunity to fully review what happened.

We are truly sorry for this interruption and a full post-mortem will be released after the incident but for now all of our resources are directed at bringing things back online.

Posted Oct 27, 2016 - 09:53 AEDT

Update

We are continuing to work on restoring affected services. Further updates will be provided in the next 30-60 minutes.

Posted Oct 27, 2016 - 09:53 AEDT

Update

We are continuing to work on restoring affected services. Further updates will be provided in the next 30-60 minutes.

Posted Oct 27, 2016 - 08:21 AEDT

Update

Staff are actively working on this issue. We will provide another update within 30 to 60 minutes.

Posted Oct 27, 2016 - 07:09 AEDT

Update

A number of Personal, Business and Reseller shared-hosting servers are currently unavailable, which may be impacting access to some websites and email services.

We have identified the outage to be caused by malfunctioning software. This software has been disabled and we are currently executing our disaster-recovery procedures.

Our full technical team is working to restore all services as quickly as possible.

We will provide updates on our progress here every 30-60 minutes.

Posted Oct 27, 2016 - 06:01 AEDT

Identified

We have identified the issue and are continuing to work on resolving this as an urgent priority.

Posted Oct 27, 2016 - 03:44 AEDT

Update

We are continuing to investigate this issue and have engaged our upstream providers.

More details to follow.

Posted Oct 27, 2016 - 03:10 AEDT

Investigating

We are currently experiencing issues with our Shared Hosting Network. Our engineers are currently investigating.

Posted Oct 27, 2016 - 01:55 AEDT