February 21, 2024

The best business continuity planning happens before an incident takes place, but IT teams can use examples of others’ failures to bolster their own planning.

No one likes publicizing their mistakes, and organizations that experience a business continuity crisis are no different. Because each business continuity failure presents a learning opportunity for other businesses, it’s unfortunate that real-life examples can be hard to track down — that is, unless the organization has a high-enough profile for the issue to make the news.

A news article will not reveal the details of a particular company’s business continuity plan or how it was meant to keep critical business functions running during a serious disruption or disaster. Still, IT teams can use such failures to infer which aspects of a company’s plan were likely missing or followed incorrectly.

Below are four examples of major business continuity failures, how they happened and what IT teams can do to prevent the same thing from happening at their organizations.

FAA system failure causes U.S. ground stop

On Jan. 11, 2023, thousands of flights across the U.S. were grounded due to an hourslong Federal Aviation Administration (FAA) system outage of the Notice to Air Missions (NOTAM) database. NOTAM is a critical system that pilots must consult before takeoff to inform them of hazards and runway closures.

The NOTAM system is also old.

While the FAA said the root issue was a deleted file, the outage time could have been significantly reduced if the legacy infrastructure had offered the high availability of more up-to-date systems. It might be a tall order to replace a longstanding, internationally used system such as NOTAM, but organizations that are resistant to replacing existing systems can learn from this business continuity failure. Outdated systems that cannot support current standards or meet modern recovery time expectations make business continuity more difficult than it already is.

Lessons: IT teams in organizations that, for whatever reason, cannot replace outdated legacy systems should prioritize business continuity strategies such as learning how to test without interrupting operations, identifying high availability options and verifying backup integrity. They can also point to high-profile incidents such as the FAA system outage as evidence of the need for new systems.
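As one illustration of verifying backup integrity, the minimal sketch below compares a backup file's SHA-256 hash against a checksum recorded when the backup was taken. The function names `sha256_of` and `backup_matches` are hypothetical, and the sketch assumes the backup is a single file with a checksum stored somewhere trustworthy. It is not a description of how the FAA or any particular backup product works.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(backup: Path, expected_sha256: str) -> bool:
    """Return True when the backup's current hash equals the recorded one.

    `expected_sha256` is assumed to be the hex digest saved at backup time.
    """
    return sha256_of(backup) == expected_sha256.lower()
```

A matching hash only shows that the file has not changed since the checksum was recorded. A periodic test restore is still needed to confirm the backup is usable.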

Figure: Business continuity planning lifecycle diagram. Organizations must regularly review and update business continuity plans to ensure their effectiveness.

Microsoft Azure/Office outage halts users internationally

Also in January 2023, Microsoft had a major outage that affected users across the globe, but especially in Europe.

The outage left many business and personal users unable to access email and files or manage Azure infrastructure. The root cause was eventually traced to a faulty change Microsoft made to its core routing infrastructure.

Lessons: Unfortunately, no one-size-fits-all fix for cloud computing exists. Larger businesses can mitigate outages by spreading workloads across multiple availability zones or regions. Zones within a region are physically separate data centers with independent power, cooling and networking, while separate regions can be hundreds of miles apart, so the loss of a single zone or region does not take down the environment.

Smaller companies might find it more useful to rely on built-in disaster recovery tools, such as those in Azure, to fail over completely and get back up and running quickly. This requires some preplanning, but it avoids the complexity and cost of a multizone setup with redundancy.

Larger organizations with higher availability requirements can instead use the platform’s availability features, which keep redundant resources running and reroute traffic when a data center goes down.
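The rerouting idea can be sketched in a few lines: given endpoints listed in priority order, send traffic to the first one that passes a health check. This is only an illustration under stated assumptions. `pick_endpoint` and the endpoint labels are hypothetical names, the health check is supplied by the caller, and none of this reflects an Azure API.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes the health check.

    Returns None when every endpoint is unhealthy, so the caller can
    escalate to a full disaster recovery failover.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Simulated health status; a real check would probe each endpoint.
status = {"zone-1": False, "zone-2": True, "secondary-region": True}
active = pick_endpoint(list(status), lambda name: status[name])
```

In this simulated case, `zone-1` is down, so `active` resolves to `zone-2`. Managed load balancers and traffic managers provide this behavior in practice, and the sketch only shows the decision they make.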

Fire damages OVHcloud’s data center — and reputation

Not even the biggest companies with seemingly endless resources can prevent every disaster from occurring. When a physical disaster such as a fire or extreme weather strikes, business continuity is a matter of being prepared. Unfortunately, OVHcloud was not.

In March 2021, one of the cloud provider’s data centers caught fire, and the fire suppression measures were not up to the job. Many clients woke up to find their rented servers offline. To make things worse, one of the backup arrays was completely destroyed in the fire, losing critical backups that the service provider could have used to recover customer data.

This crisis did not only affect immediate business functions — OVHcloud’s reputation suffered due to the outage, and it was the subject of a $10 million class-action lawsuit from more than 140 of its clients.

Lessons: The OVHcloud business continuity failure illustrates the importance of the 3-2-1 rule of data backup: keep at least three copies of data, on two different types of media, with at least one copy off-site. Keeping multiple backups on different hardware in different locations is the surest way to keep data safe in a fire or natural disaster. That way, if the data center is destroyed, there is still a data backup elsewhere that the client can restore to get services working again.
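A backup inventory can be checked against the 3-2-1 rule mechanically. The sketch below is a minimal example, assuming the inventory counts the production copy as one of the three copies and that each copy is labeled with a media type and a location. `BackupCopy` and `meets_3_2_1` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # e.g. "disk", "tape", "object-storage"
    location: str  # a site or region label


def meets_3_2_1(copies: Sequence[BackupCopy], primary_location: str) -> bool:
    """Check for three copies, two media types and one off-site copy.

    Assumes `copies` includes the production copy itself.
    """
    media_types = {copy.media for copy in copies}
    offsite = [copy for copy in copies if copy.location != primary_location]
    return len(copies) >= 3 and len(media_types) >= 2 and len(offsite) >= 1


inventory = [
    BackupCopy("production", "disk", "dc-1"),
    BackupCopy("local-backup", "tape", "dc-1"),
    BackupCopy("remote-backup", "object-storage", "dc-2"),
]
compliant = meets_3_2_1(inventory, primary_location="dc-1")
```

If the remote copy were also stored in `dc-1`, the check would fail on the off-site requirement, which is the situation where a single data center fire can destroy both the data and its backups.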

Ransomware compromises NHS Foundation Trust

The National Health Service (NHS) is one of the largest employers in the U.K. Downtime costs significant money and endangers public healthcare, making the Aug. 4, 2022, ransomware attack on the NHS a prime example of a disastrous business continuity failure.

The attack, which targeted a major software provider for the NHS, took several months to remediate fully. During the initial stages, front-line staff had to revert to pen and paper and make do with whatever records they had that were not computer-based. Part of the delay in service restoration was due to the impact on legacy systems.

However, there was a bigger problem with this failure: shadow IT systems installed by employees with little to no professional IT oversight.

Lessons: Legacy IT systems frequently cost more to maintain and are more likely to fall behind on patches and updates. It is easier said than done, but one way to avoid these issues is to replace legacy systems.

Organizations must also have strict policies regarding the acquisition and management of IT systems and software. Any purchase must be tightly managed and approved by IT staff, who are often aware of issues that less technically savvy managers might not know about.
