"Anything that can go wrong will go wrong."- Murphy's law
During World War II, researchers at the Center for Naval Analyses studied the damage done to aircraft that had returned from missions. The initial inclination was to add armor to the areas that showed the most damage, as those were perceived to be the most vulnerable. However, Abraham Wald, a mathematician, observed that the study covered only aircraft that had withstood the attacks and still returned safely (their damage is represented by the red dots in the figure). The areas that remained unscathed on these aircraft, such as the cockpit and engine, were overlooked, yet they told a different story: these are the areas that, if hit, would cause the plane to crash and be lost.
Image: Wikipedia
False assumptions can undermine our recovery capabilities. Reinforcing the aircraft in the most-hit areas would have been a decision shaped by survivorship bias, because crucial data from the fatally damaged planes was missing from the assessment. In the same way, we run the risk of overlooking significant factors and failing to cover the ‘unknowns’ that may impact our operations.
The answer to this question cannot be reduced to a simple yes or no. In the current dynamic business environment, where any amount of downtime is unacceptable, even a brief interruption can result in billions of dollars in losses. The repercussions of such downtime extend beyond the financial and include reduced productivity, damage to reputation, lower customer confidence, mental stress and more.
Regardless of whether your operations are on-premises or in the cloud, system outages are an inevitable reality. They can arise from several factors, including hardware and software faults, human error, system malfunctions, natural disasters, and others. This issue is not new; we have encountered it repeatedly in the past and will encounter it again in the future. Failures occur all the time, and while eradicating them completely may be unattainable, we must acknowledge and embrace them to improve our preparedness.
According to Uptime’s 2022 Data Center Resiliency Survey:
Networking-related problems have been the single biggest cause of all IT service downtime incidents – regardless of severity – over the past three years. Outages attributed to software, network and systems issues are on the rise due to complexities from the increasing use of cloud technologies, software-defined architectures and hybrid, distributed architectures.
Nearly 40% of organizations have suffered a major outage caused by human error over the past three years. Of these incidents, 85% stem from staff failing to follow procedures or from flaws in the processes and procedures themselves.
Resilience is often underestimated, but its significance cannot be emphasized enough. We all have experts who have helped us build a stable, functioning system that follows best practices, but no architecture is fail-safe.
We may have ensured high availability, replication, geographic redundancy, backup strategies, BCDR strategy etc., but are we still truly confident in their resilience? Merely assuming everything is taken care of can be a risky mindset.
Before we dive in, let us make sure we understand some terms very clearly:
The Oxford Dictionary definition of reliability is "the quality of being trustworthy or of performing consistently well," whereas resilience is "the capacity to recover quickly from difficulties."
In the world of cloud computing, reliability means that services run as intended at any given point in time, whereas resilience means that a service can withstand certain types of failure and still remain functional from the customer's perspective. In other words, reliability is the result we strive for, and resilience is the way to achieve it.
BCDR (Business Continuity and Disaster Recovery) is a set of strategies, policies, and procedures that help an organization respond, adapt, continue its essential operations, and recover when a disruption occurs. While BC (Business Continuity) deals with business processes and functions, DR (Disaster Recovery) focuses primarily on recovery on the IT side.
RPO (Recovery Point Objective) refers to the point in time in the past to which you will recover.
RTO (Recovery Time Objective) refers to the point in time in the future at which you will be up and running again.
Let us understand this with an analogy:
A baker bakes pies in batches that take 2 hours each, using an oven that runs continuously. One day, the oven breaks in the middle of a batch, and the pies are ruined. To get back to baking, the baker has another oven available, but it needs 1 hour to preheat.
RPO is 2 hours because, in the worst case, a full 2-hour batch of pies is ruined, and that is a loss the business must accept.
RTO is 1 hour, which is the time it takes to resume baking after preheating the second oven.
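The analogy can also be written down as a small calculation. The sketch below is only an illustration: the function names, the timestamps and the target values are assumptions made up for the bakery example, not part of any BCDR tool. It measures how much work was lost and how long the outage lasted, then compares both with the agreed objectives.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_state: datetime, failure: datetime) -> timedelta:
    """Work lost: time between the last recoverable state and the failure."""
    return failure - last_good_state


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the failure and the return to service."""
    return service_restored - failure


# Hypothetical timeline for the bakery analogy (worst case).
batch_started = datetime(2024, 1, 1, 8, 0)            # nothing to lose yet
oven_failed = datetime(2024, 1, 1, 10, 0)             # batch was almost done
second_oven_ready = oven_failed + timedelta(hours=1)  # 1-hour preheat

data_loss = achieved_rpo(batch_started, oven_failed)
downtime = achieved_rto(oven_failed, second_oven_ready)

# Objectives the business has agreed to accept (assumed values).
rpo_target = timedelta(hours=2)
rto_target = timedelta(hours=1)

print(f"Lost work: {data_loss}, within RPO: {data_loss <= rpo_target}")
print(f"Downtime: {downtime}, within RTO: {downtime <= rto_target}")
```

In an IT setting, the start of the batch would typically correspond to the last successful backup or replication point, and the preheat time to the time needed to restore service on standby infrastructure.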
A backup strategy is essential for data protection, business continuity, compliance with legal requirements, disaster recovery, and overall peace of mind. Many solutions exist for backing up data, including hardware-based, software-based, and cloud-based methods. Cloud providers offer different backup plans with strategies such as incremental, full, automated, hybrid, and multi-cloud backups. Assessing your needs helps you choose the backup strategy that is right for you.
Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. A harsh way to ensure that your failure recovery is working correctly is to intentionally crash your production servers.
There are many tools available on the market to assist with resilience testing, such as Netflix's famous Chaos Monkey, AWS Fault Injection Simulator (FIS), Azure Chaos Studio and many others.
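To give a feel for the idea at a very small scale, here is a minimal sketch of fault injection in plain Python. It does not use or imitate the API of Chaos Monkey, AWS FIS or Azure Chaos Studio; `flaky`, `fetch_price` and `resilient_fetch` are hypothetical names invented for this example. A dependency is wrapped so that roughly half of its calls fail on purpose, and the experiment checks that the caller still answers every request thanks to retries and a fallback value.

```python
import random


class InjectedFault(Exception):
    """Raised by the experiment to simulate a dependency outage."""


def flaky(func, failure_rate, rng):
    """Wrap func so that a fraction of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Hypothetical primary dependency."""
    return {"pie": 4.50}[item]


def resilient_fetch(primary, item, retries=3, fallback=None):
    """Retry the primary call, then fall back to a cached value."""
    for _ in range(retries):
        try:
            return primary(item)
        except InjectedFault:
            continue
    return fallback


rng = random.Random(42)  # fixed seed so the experiment is repeatable
chaotic_fetch = flaky(fetch_price, failure_rate=0.5, rng=rng)

results = [resilient_fetch(chaotic_fetch, "pie", fallback=4.00)
           for _ in range(1000)]
served = sum(r is not None for r in results)
from_fallback = results.count(4.00)
print(f"served {served}/1000 requests, {from_fallback} from fallback")
```

The hypothesis under test is that every request is served even while the dependency is failing. Real chaos experiments follow the same pattern on real infrastructure, usually starting with a small blast radius and a clear way to stop the experiment.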
So, to reiterate Murphy’s law: "Anything that can go wrong will go wrong." Are you prepared for your next outage?
While no one desires an outage, it is prudent to be prepared for the unexpected in our ever-changing world. It is wise to be proactive before such events take us by surprise. These strategies are designed to build trust in our applications, allowing us to detect any unforeseen problems at the earliest stages. Prioritizing resilience is vital for sustaining seamless operations, safeguarding customer trust, and securing long-term business prosperity.
We at Tietoevry have the expertise to ensure your business is safe and secure with us. Whether you are a small business or an organization on a transformation journey, our utmost responsibility lies in safeguarding the trust of your customers while you place your trust in us.
I hope you enjoyed reading this!
Charu Upadhyay
Office: Stockholm, Sweden
Tietoevry since: 2022
Background: Bachelor’s degree in engineering and information technology from The NorthCap University in Gurgaon, India
Fun-fact: Turns out to be somewhat of a natural when it comes to ice-skating.