Should Companies Be Held Liable For Software Flaws?

Following the CrowdStrike event two weeks ago, there has been an interesting exchange between Delta Airlines and CrowdStrike. In particular, Delta has threatened to sue CrowdStrike to pursue compensation for the estimated $500M of losses allegedly incurred during the outage. CrowdStrike has recently hit back at Delta claiming the airline’s recovery efforts took far longer than their peers and other companies impacted by the outage. This entire exchange prompts some interesting questions about whether a technology company should be held liable for flaws in their software and where the liability should start and end.

Strategic Technology Trends

Software quality, including defects that lead to vulnerabilities, has been identified as a strategic imperative according to CISA and the Whitehouse in the 2023 National Cybersecurity Strategy. Specifically, the United States wants to “shift liability for software products and services to promote secure development practices” and it would seem the CrowdStrike event falls into this category of liability and secure software development practices.

In addition to strategic directives, I am also seeing companies prioritize speed to market over quality (and even security). In some respects it makes sense to prioritize speed, particularly when pushing updates for new detections. However, there is clearly a conflict in priorities when a company optimizes for speed over quality for a critical detection update that causes an impact larger than if the detection update had not been pushed at all. Modern cloud infrastructure and software development practices prioritize speed to market over all else. Hyperscale cloud providers have made a giant easy button that allows developers to consume storage, network and compute resources without consideration for the down stream consequences. Attempts by the rest of the business to introduce friction, gates or restrictions on these development processes are met with derision and usually follow accusations of slowing down the business or impeding sales. Security often falls in this category of “bad friction” because they are seen as the “department of no”, but as the CrowdStrike event clearly shows, there needs to be a balance between speed and quality in order to effectively manage risk to the business.

One last trend is the reliance on “the cloud” as the only BCP / DR plan. While cloud companies certainly market themselves as globally available services, they are not without their own issues. Cloud environments still need to follow IT operations best practices by completing a business impact analysis and implementing a BCP / DR plan. At the very least, cloud environments should have a rollback option in order to revert to the last known good state.

…as the CrowdStrike event clearly shows, there needs to be a balance between speed and quality in order to effectively manage risk to the business.

What Can Companies Do Differently?

Companies that push software updates, new services or new products to their customers need to adopt best practices for quality control and quality assurance. This means rigorously testing your products before they hit production to make sure they are as free of defects as possible. CrowdStrike clearly failed to properly test their update due to a claimed flaw in their testing platform. While it is nice to know why the defect made it into production, CrowdStrike still has a responsibility to make sure their products are free from defects and should have had additional testing and observability in place.

Second, for critical updates (like detections), there is an imperative by companies to push the update globally as quickly as possible. Instead, companies like CrowdStrike should prioritize customers in terms of industry risk. They should then create a phased rollout plan that stages their updates with a ramping schedule. By starting small, monitoring changes and then ramping up the rollout, CrowdStrike could have minimized the impact to a handful of customers and avoided a global event.

Lastly, companies need to implement better monitoring and BCP / DR for their business. In the case of CrowdStrike, they should have had monitoring in place that immediately detected their products going offline and they should have had the ability to roll back or revert to the last known good state. Going a step further they could even change the behavior of their software where instead of causing a kernel panic that crashes the system, the OS recovers gracefully and automatically rolls back to the last known good state. However, the reality is sophisticated logic like this costs money to develop and it is difficult for development teams to justify this investment unless the company has felt a financial penalty for their failures.

Cloud environments still need to follow IT operations best practices by completing a business impact analysis and implementing a BCP / DR plan.

Contracts & Liability

Speaking of financial penalties, the big question is whether or not CrowdStrike can be held liable for the global outage. My guess is this will depend on what it says in their contracts. Most contracts have a clause that limits liability for both sides and so CrowdStrike could certainly face damages within those limits (probably only a few million at most). It is more likely CrowdStrike will face losses for new customers and existing customers that are up for contract renewal. Some customers will terminate their contracts. Others will negotiate better terms or expect larger discounts on renewal to make up for the outage. At most this will hit CrowdStrike for the next 3 to 5 years (depending on contract length) and then the pricing and terms will bounce back. It will be difficult for customers to exit CrowdStrike en masse because it is already a sunk cost and companies wont want to spend the time or energy to deploy a new technology. Some of the largest customers may have the best terms and ability to extract concessions from CrowdStrike, but overall I don’t think this will impact them for very long and I don’t think they will be held legally liable in any material sense.

Delta Lags Industry Standard

If CrowdStrike isn’t going to be held legally liable, what happens to Delta and their claimed lost $500M? Let’s look at some facts. First, as CrowdStrike has rightfully pointed out, Delta lagged the world for recovering from this event. They took about 20 times longer to get back to normal operations than other airlines and large companies. This points to clear underinvestment in identifying critical points of failure (their crew scheduling application) and developing sufficient plans to backup and recover if critical parts of their operation failed.

Second, Delta clearly hasn’t designed their operations for ease of management or resiliency. They have also failed to perform an adequate Business Impact Analysis (BIA) or properly test their BCP / DR plans. I don’t know any specifics about their underlying IT operations, but a few recommendations come to mind such as implementing active / active instances for critical services and moving to thin clients or PXE boot for airport kiosks and terminals. Remove the need for a human to touch any of these systems physically, and instead implement processes to remotely identify, manage and recover these systems from a variety of different failure scenarios. Clearly Delta has a big gap in their IT Operations processes and their customers suffered as a result.

Wrapping Up

What the CrowdStrike event highlights is the need for companies to prioritize quality, resiliency and stability over speed to market. The National Cybersecurity Strategy has identified software defects as a strategic imperative because they lead to vulnerabilities, supply chain compromise and global outages. Companies with the size and reach of CrowdStrike can no longer afford to prioritize speed over all else and instead need to shift to a more mature and higher quality SDLC. In addition, companies that use popular software need to consider diversifying their supply chain, implementing IT operations best practices (like SRE) and implementing a mature BCP and DR plan on par with industry standards.

What the CrowdStrike event highlights is the need for companies to prioritize quality, resiliency and stability over speed to market.

When it comes to holding companies liable for global outages, like the one two weeks ago, I think it will be difficult for this to play out in the courts without resorting to a legal tit-for-tat that no one wins. Instead, the market and customers need to weigh in and hold these companies accountable through share prices, contractual negotiation or even switching to a competitor. Given the complexity of modern software, I don’t think companies should be held liable for software flaws because it is impossible to eliminate all flaws. Additionally, modern SDLCs and CI/CD pipelines are exceptionally complex and this complexity can often result in failure. This is why BCP/DR and SRE is so important, so you can recover quickly if needed. Yes, CrowdStrike could have done better, but clearly Delta wasn’t even meeting industry standards. Instead of questioning whether companies should be held liable for software flaws, a better question is: At what point does a company become so essential that they by default become critical infrastructure?

Author: Lee Vorthman

I'm a U.S. Navy veteran and the Global Chief Security Officer (CSO) at a Fortune 100 cloud company where I've built a successful security program from the ground up and have partnered with the business to increase trust and reduce risk. I have over 25 years experience across a wide variety of industries such as technology, government & defense, education and oil & gas. I hold a number of professional certifications such as, EC-Council's Certified Chief Information Security Officer (C|CISO), Digital Director's Network (DDN) Board Certified Qualified Technology Expert (QTE) and ISC(2) Certified Information Systems Security Professional (CISSP). Previously I was the Chief Technology Officer (CTO) for Civilian Agencies and Cybersecurity Initiatives at NetApp U.S. Public Sector and the Chief Information Security Office for an Oil & Gas software company. I am available for consulting and speaking opportunities. Thoughts and opinions are my own and do not represent any employer past or present. View all posts by Lee Vorthman

Related

Author: Lee Vorthman

Leave a comment Cancel reply