Outage Incident Takeaway – Prioritize Development Practices And QA

A well-defined testing process at every stage of the software development life cycle drastically reduces the risk of operational disruption and promotes a culture of continuous feedback and improvement.

On July 19, 2024, the world experienced a massive IT outage, dubbed by some as the largest IT disruption in history. The incident was triggered by a faulty update released by cybersecurity company CrowdStrike, and resulted in widespread and severe repercussions, particularly for Microsoft Windows systems.

While most systems are prone to faults in one way or another, some measures can be taken to avert downtime and the damaging consequences that come with it. At the top of the list stands a well-implemented software development life cycle (SDLC) alongside rigorous testing protocols.

Let’s first understand what triggered the outage and then look at how optimized SDLC and QA practices could prevent similar events from happening in the future.

What happened? 

At 04:09 UTC on July 19, 2024, CrowdStrike rolled out a sensor configuration update for its security platform Falcon that contained a logic error. Customers’ systems running Falcon sensor for Windows version 7.11 and above that were online between 04:09 and 05:27 UTC and received the update suffered system crashes, displaying the dreaded blue screen of death (BSOD).

Diving into the technicalities, the update targeted malicious “named pipes” used by common C2 (command-and-control) frameworks in cyberattacks. The problematic configuration file in the update, referred to as Channel File 291, controlled how Falcon evaluated named pipe execution on Windows systems. The logic error in Channel File 291 caused the Falcon sensor, which runs with kernel-level privileges, to crash, taking the operating system down with it.
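
To make this failure mode concrete, the sketch below is a purely illustrative Python example (not CrowdStrike’s actual code, which runs in kernel mode; the file format, field names, and functions are all hypothetical). It shows how a content update whose shape the consumer never validates can crash that consumer outright, and how a defensive variant rejects the malformed input instead.

```python
# Purely illustrative sketch: a consumer that assumes every rule line in a
# content file has a fixed number of pipe-delimited fields.

EXPECTED_FIELDS = 5  # hypothetical format: name|target|pattern|action|flags
FIELD_NAMES = ("name", "target", "pattern", "action", "flags")


def load_rules_unsafe(path: str) -> list[dict]:
    """Parses a pipe-delimited content file, trusting its shape blindly."""
    rules = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("|")
            # No validation: if an update ships a line with fewer fields,
            # the indexing below raises and the whole consumer crashes.
            rules.append({
                "name": fields[0],
                "target": fields[1],
                "pattern": fields[2],
                "action": fields[3],
                "flags": fields[4],
            })
    return rules


def load_rules_defensive(path: str) -> list[dict]:
    """Same parser, but malformed lines are rejected instead of crashing."""
    rules = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.rstrip("\n").split("|")
            if len(fields) != EXPECTED_FIELDS:
                # In user-space code this is a log-and-skip; in kernel-mode
                # code it is the difference between a rejected update and a
                # blue screen.
                print(f"skipping malformed rule on line {lineno}")
                continue
            rules.append(dict(zip(FIELD_NAMES, fields)))
    return rules
```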

CrowdStrike reacted quickly and addressed the issue by updating the content in Channel File 291. But by the time the fix was out, roughly 8.5 million Windows systems across multiple industries had already crashed, leaving a trail of financial, operational, and reputational damage in their wake.

QA mishaps

The CrowdStrike incident sparked debate and speculation among experts, splitting them into two camps with different perspectives on where quality assurance broke down.

On one side, concerns were raised about CrowdStrike’s internal quality assurance processes. Experts argued that CrowdStrike should have conducted more comprehensive testing before rolling out the update. It has also been speculated that the User Acceptance Testing (UAT) phase was bypassed, a deviation from standard QA protocols.

On the other side, attention turned to the customers themselves: many had automatic updates enabled, which means the update was applied directly to live systems without first being vetted in separate staging environments.

Despite approaching the incident from different standpoints, both camps agreed on one critical point: SDLC best practices should be followed closely to minimize the chance of system-crashing defects reaching production.

Software Development Life Cycle and QA testing

When asked about the incident, Bassel Hassan, PMO Lead at CME, underlined the importance of implementing a well-structured SDLC to minimize the risks associated with any piece of software:

“A robust end-to-end testing methodology ensures that the flow of an application performs as designed from start to finish. It helps identify system dependencies and streamlines the passing of correct information between components. Without it, you’re flying blind.”


“Quality standards are key,” Bassel continues. “A quality control team should not be asked to test incomplete or broken work, and it should always view the product through the eyes of the end user: if it’s not good enough for them, it’s not good enough for us. High quality standards should be met across the board. All team members, regardless of seniority, should adhere to the same rigorous approach, because maintaining industry-level quality is non-negotiable.”

Bassel then walks through the different environments involved in an SDLC geared toward a crash-free application:

“Let’s start with the development environment. This is where developers create, test, and modify software applications. It’s a sandbox where the code is constantly changing and evolving. It’s used to identify and fix bugs, test new features, and experiment with different techniques.

Next, comes the testing environment, or QA. After development, the application is deployed here where QA teams perform various tests. This environment mimics the production environment as closely as possible so the application will work as expected when it’s released to users.

Then there’s the staging environment, a pre-production environment that’s almost identical to production. It’s used for final testing after QA and before the release to production. It serves as a final check to validate the application in an environment that closely resembles production.

In some cases, there’s a separate environment for UAT. Here, actual users test the application to make sure it meets their requirements and is ready for production. It’s an important step to get real-world feedback before the full release.

And finally, the production environment. That’s the live environment where the application is available to all users. It’s the culmination of all our efforts in development and testing, confirming that the application performs flawlessly for the end users.”
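
The environment progression Bassel describes is often encoded directly in a deployment pipeline, so that a build only moves forward once the checks for its current stage pass. Below is a minimal, hypothetical sketch in Python; the environment names, test commands, and directory layout are assumptions for illustration, not a prescribed implementation.

```python
# Hypothetical promotion gate: a build advances from one environment to the
# next only if that environment's checks pass. All names are illustrative.
import subprocess
import sys

# Ordered environments and the test command that must succeed in each one
# before the build is promoted to the next stage.
PIPELINE = [
    ("development", ["pytest", "tests/unit"]),
    ("qa",          ["pytest", "tests/integration"]),
    ("staging",     ["pytest", "tests/end_to_end"]),
    ("uat",         ["pytest", "tests/acceptance"]),
]


def promote(build_id: str) -> bool:
    """Runs each environment's checks in order; stops at the first failure."""
    for env, test_cmd in PIPELINE:
        print(f"[{build_id}] running checks for {env}: {' '.join(test_cmd)}")
        result = subprocess.run(test_cmd)
        if result.returncode != 0:
            print(f"[{build_id}] checks failed in {env}; promotion halted.")
            return False
        print(f"[{build_id}] {env} checks passed.")
    print(f"[{build_id}] all gates passed; build is cleared for production.")
    return True


if __name__ == "__main__":
    ok = promote(sys.argv[1] if len(sys.argv) > 1 else "build-001")
    sys.exit(0 if ok else 1)
```

In practice this gating usually lives in a CI/CD system rather than a standalone script, but the principle is the same: no build reaches production without clearing every earlier environment.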

“With development and testing best practices in place, potential issues are addressed at a granular level before the software goes public. This proactive approach to development drastically minimizes the risk of operational disruptions post-release and helps maintain customers’ trust,” concludes Bassel.

Wrapping up

Due to the highly competitive nature of the software industry, speed is always held in high regard, and rightfully so. But it often comes at the expense of the most crucial process of all: QA testing. That leaves practically everyone in the product development community exposed to incidents like this one. Let this outage be a wake-up call that thorough testing is not a luxury but a necessity for software reliability and security, and a reminder that the industry must prioritize quality alongside speed.
