This incident is now fully resolved, and we have had the chance to look back over what happened, summarise it, learn from it and outline the measures we have put in place to ensure that this will never happen again. This is the first time in 15 years we have suffered an outage of this scale on our shared hosting platform. Thankfully our disaster recovery process came into play and restored all services to normal. We know how important our services are to you and we are truly sorry for the effect this outage had on your business.
At 1:42am on October 27 our systems administrators became aware of an issue affecting multiple servers in our shared hosting environment. We determined that the issue was caused by a script associated with a product called MailScanner, an email scanning tool used by most cPanel hosting providers around the world.
The MailScanner quarantine script contained an invalid check and had run on a number of shared web hosting servers, removing core system files older than 7 days and rendering the servers inoperable. These files are required to run the operating system and the services associated with hosting websites. A team was assembled and the script was immediately stopped and deleted to prevent further file removal.
Our disaster recovery process was then initiated and a plan was devised to systematically restore the system files that had been removed, so that the servers could be recovered as quickly as possible. As we back up our servers 3 times daily and retain these backups for 7 days, there was no loss of existing data.
A file-based restore takes significantly longer than a full restoration, as each individual file (potentially millions) has to be checked for existence and restored only if it is missing. This restore process does, however, ensure that the servers are returned to the state they were in prior to the issue, and it preserves the integrity of data created after the most recent backup.
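To illustrate why a file-based restore is slower but safer, here is a minimal Python sketch of the "restore only if missing" idea. It is not the tooling we used during the incident. The function name restore_missing and the backup_root and live_root parameters are hypothetical, and a real restore would also need to handle ownership, symbolic links and special files.

```python
import shutil
from pathlib import Path
from typing import List


def restore_missing(backup_root: Path, live_root: Path) -> List[Path]:
    """Copy files from backup_root into live_root only where they are missing.

    Returns the relative paths that were restored. Files that already exist
    on the live side are never overwritten, so anything created or changed
    after the backup was taken is left untouched.
    """
    restored = []
    # Every file in the backup has to be visited and checked individually,
    # which is why this approach is slow on servers with millions of files.
    for src in sorted(backup_root.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(backup_root)
        dest = live_root / rel
        if dest.exists():
            continue  # present on the live server: keep the live copy
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps and mode
        restored.append(rel)
    return restored
```

Because an existing file is always skipped, running the sketch a second time restores nothing, which makes it safe to resume after an interruption.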
By 11:30am on October 27 the first restorations were completing and some services had returned to normal operation. The restoration process continued until all services had been fully restored.
Once the restoration process had completed, we ran full integrity checks over all of our servers. This process was completed at 12pm on October 30th, and the incident was marked resolved at 12:58am on the same day.
The root cause of the issue was that the MailScanner script did not properly check that it was removing files from the correct directory when clearing files from quarantine. The script was unable to access the MailScanner configuration file and, as a result, began to remove files from the root directory (/) of the server.
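The class of check that was missing can be sketched in Python as follows. This is an illustration only and not MailScanner's actual code or configuration format. The names read_quarantine_dir, purge_old_files and UnsafeCleanupError, and the "quarantine_dir = /path" setting, are all hypothetical. The idea is simply that a clean-up job should stop with an error when its configuration cannot be read, instead of falling back to a default such as /.

```python
import time
from pathlib import Path
from typing import List, Optional


class UnsafeCleanupError(RuntimeError):
    """Raised when the clean-up target cannot be established safely."""


def read_quarantine_dir(config_path: Path) -> Path:
    """Return the configured quarantine directory, or raise if in any doubt."""
    try:
        text = config_path.read_text()
    except OSError as exc:
        # Unreadable configuration: abort rather than guess a directory.
        raise UnsafeCleanupError(f"cannot read {config_path}") from exc
    value = None
    for line in text.splitlines():
        key, sep, rest = line.partition("=")
        if sep and key.strip() == "quarantine_dir":  # hypothetical setting name
            value = rest.strip()
    if not value:
        raise UnsafeCleanupError("quarantine_dir is not set")
    target = Path(value)
    # Heuristic assumption: refuse relative paths, / itself, and top-level
    # directories such as /var, so the target is at least two levels deep.
    if not target.is_absolute() or len(target.parts) < 3:
        raise UnsafeCleanupError(f"refusing to clean {target}")
    if not target.is_dir():
        raise UnsafeCleanupError(f"{target} is not a directory")
    return target


def purge_old_files(directory: Path, max_age_days: float,
                    now: Optional[float] = None) -> List[Path]:
    """Delete regular files under directory older than max_age_days."""
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed = []
    for path in sorted(directory.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue  # never follow links out of the quarantine area
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A scheduled job would call purge_old_files(read_quarantine_dir(config), 7) and let any UnsafeCleanupError end the run, so a missing or unreadable configuration file results in no deletions at all.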
We have contacted the development lead of MailScanner and he has assured us that this bug will be resolved ASAP so that this issue cannot affect any other hosting providers.
We have now completely removed MailScanner from our systems and are using SpamExperts in its place.
We have conducted a full review of all third-party scripts running on our shared hosting environment.
We have checked all systems and ensured everything has returned to normal and all updates have been applied.
We are investigating further ways we can improve our restoration process to ensure that in the event of a disaster we can minimise downtime.