It's OK, it’s in the Cloud: Lessons from the OVH Cloud Data Centre Fire
This week, I talk about the issues associated with the fire in the OVH cloud data centre and how ‘putting your IT in the cloud’ is not a risk-free solution.
Working from home: Is your business continuity problem solved?
There seemed to be a moment sometime last year when many issues associated with business continuity were solved and the world became more resilient. The go-to issue associated with business continuity was always the building burning down, and business continuity managers would spend large amounts of time, effort, and money to make sure that they were able to cope with this scenario.
COVID-19 forced organisations to ask their staff to start working from home. Almost overnight, one of the key risks business continuity managers were planning for was solved. Now, most organisations can easily work from home. Never again do we need to worry about the loss of our office buildings in the same way!
The loss of staff is not an issue:
In the same way, in my opinion, COVID-19 has proved that loss of staff is perhaps less of an issue than we originally thought. While working for a water and electricity provider in the Caribbean, I noticed that at one time they had 25% of their staff ill at home due to COVID-19. Most of their jobs cannot be done from home, yet they coped with this lack of staff.
Interestingly, most pandemic plans I have worked on or read were all about the loss of staff, looking at anything between 10% and 25% of staff being off sick. Organisations may have had this percentage of staff off at the peak of the pandemic, but overall they have been able to cope with it. Supply chain loss is always an issue but, with the trend of onshoring and in many cases multiple vendors available, it has become less risky.
That brings us to IT. I find that IT managers don’t like to admit that there is a risk associated with the provision of their organisation’s systems. Whether it be a cyber threat or a physical loss, they say they’ve got it covered. Sometimes, exercises can be a nightmare if you want to practise an IT scenario, especially when your IT person says, “that would never happen” and lists the mitigation measures in place.
The fire and impact:
The fire OVHcloud had in their data centre campus in Strasbourg reminds us that even when we have put our applications in the cloud, or have bought software that is hosted there, we are still vulnerable to IT outages.
The fire on 10th March destroyed one of the four data centres on the site and damaged another, and power was cut to the other two, which took them offline. The fire took six hours to put out, which shows how substantial it was! According to Netcraft, the fire took “3.6 million websites across 464,000 distinct domains” offline. Among them were "online banks, webmail services, news sites, online shops selling PPE, and several countries’ government websites, these included websites used by the Polish Financial Ombudsman, the Ivorian DGE, the French Plate-forme des achats de l’Etat, the Welsh Government’s Export Hub, and the UK Government’s Vehicle Certification Agency website."
The impact which proved to be catastrophic was the complete loss of data stored in the centre for 6000 customers. Either there was no backup, or the backups were encrypted and the keys, which were stored on a server, were also lost in the fire. Unfortunately, the data centre didn’t routinely back up all their customer data, as this was not a part of the basic service offered.
Why use cloud services:
There are four basic reasons organisations use cloud services:
- They are cheaper than hosting the servers yourself and providing your own boxes and infrastructure.
- The cloud provides a more resilient infrastructure, as the economies of scale allow for multiple power feeds, redundancy in chillers and generators, and strong physical security and access control. This is usually a higher level of infrastructure and resilience than individual organisations can afford.
- Large capital investment is not needed as services are rented and they are easy to scale up if your organisation needs more capacity.
- Software as a service (SaaS) is very convenient for many organisations, and it massively simplifies their IT management.
There are lots of benefits to having your organisation’s data, applications, and software as a service in the cloud. What are the lessons I identified from the OVHcloud fire?
- You need to understand where your data is housed. Just because it's in the cloud doesn’t mean it is floating around the sky. All clouds are data centres that have a physical presence somewhere.
- Check the infrastructure of your data centre. I worked for a German data centre provider and wrote a business continuity plan for them. Their building was huge and beautifully maintained, right down to the UPS batteries being lined up in neat rows, not one millimetre out of place. They had 16 standby generators, at least 4 of which were spare when the centre was running on generators at full load, as well as excess chillers, fire suppression systems, and everything else you want from a data centre. In the case of OVHcloud, I was somewhat surprised to learn that the data centre which burned down was housed in containers stacked on top of each other. The adage “if it’s cheap, it's probably cheap for a reason” applies to data centres as well as everything else.
- Check your contract. CEO Octave Klaba said that in the future he would offer all his clients backups as part of their contract. For the 6000 customers who have lost their data, it's a bit late. You need to read the contract and check what is backed up and what is not. Don’t just assume that because it’s in the cloud, all the risks are covered and your data will be backed up automatically.
- Fires are very destructive. The data centre is slowly being restored, and the SBG2 data centre's servers were taken to another OVHcloud data centre to be restored. Servers took 7 hours to clean, so the recovery was not completed quickly. Your organisation is completely at the mercy of the data centre provider and its schedule for cleaning and restoring systems. Once the incident has happened, it may be too late to try and make your recovery a priority.
- If you use SaaS, where your data is stored and what the backup arrangements are is often not very transparent. This is especially the case with very large providers such as Microsoft or Salesforce. They are hard to engage with on matters of data resiliency, and all you have is a statement on their website of what their backup arrangements are. Occasionally, data for SaaS customers is lost, so unless you are a large or powerful organisation you are at the mercy of the providers, and there is very little you can do.
- Some cloud providers will host multiple copies of your data, applications, and website, so if one data centre goes down then another provides the service, and your customers don’t notice. You need to check whether this is provided as part of the service or if it is an additional cost. You also need to check whether the alternative data centre is located next to the first data centre. In this case, the loss of 3.6 million websites would point to the fact that either there was no alternative data centre hosting, or the alternative data centre was physically on the same site as the other three.
The following tweet from Rust, the multiplayer-only survival video game, posted the day after the incident, says it all:
In conclusion, if you have services in the cloud, you should explore where they are hosted, what level of protection the hosting company provides, and what is stated in your contract in terms of backup and alternative hosting. The cloud is a great solution for many organisations but in outsourcing your IT hosting, you can “outsource the activity but not the risk” and so you need to ensure that you understand the risks associated with using a cloud provider.
Just a quick mention to our reader Chloe, who said she finds the bulletin extremely helpful, and to Brian Zwarda for his comments on my bulletin on Operational Resilience and Business Continuity. We love to hear your thoughts!