Should Companies Be Held Liable For Software Flaws?

Following the CrowdStrike event two weeks ago, there has been an interesting exchange between Delta Air Lines and CrowdStrike. Delta has threatened to sue CrowdStrike for the estimated $500M in losses it allegedly incurred during the outage. CrowdStrike has hit back, claiming the airline’s recovery took far longer than that of its peers and of other companies impacted by the outage. This exchange prompts some interesting questions about whether a technology company should be held liable for flaws in its software and where that liability should start and end.

Strategic Technology Trends

Software quality, including defects that lead to vulnerabilities, has been identified as a strategic imperative by CISA and the White House in the 2023 National Cybersecurity Strategy. Specifically, the United States wants to “shift liability for software products and services to promote secure development practices,” and the CrowdStrike event would seem to fall squarely into this category of liability and secure software development practices.

In addition to strategic directives, I am also seeing companies prioritize speed to market over quality (and even security). In some respects it makes sense to prioritize speed, particularly when pushing updates for new detections. However, there is a clear conflict in priorities when a company optimizes for speed over quality and a critical detection update causes more damage than not pushing the update at all. Modern cloud infrastructure and software development practices prioritize speed to market over all else. Hyperscale cloud providers have built a giant easy button that lets developers consume storage, network and compute resources without considering the downstream consequences. Attempts by the rest of the business to introduce friction, gates or restrictions on these development processes are met with derision and are usually followed by accusations of slowing down the business or impeding sales. Security often falls into this category of “bad friction” because it is seen as the “department of no,” but as the CrowdStrike event clearly shows, there needs to be a balance between speed and quality in order to effectively manage risk to the business.

One last trend is the reliance on “the cloud” as the only BCP / DR plan. While cloud companies certainly market themselves as globally available services, they are not without their own issues. Cloud environments still need to follow IT operations best practices by completing a business impact analysis and implementing a BCP / DR plan. At the very least, cloud environments should have a rollback option in order to revert to the last known good state.


What Can Companies Do Differently?

Companies that push software updates, new services or new products to their customers need to adopt best practices for quality control and quality assurance. This means rigorously testing products before they hit production to make sure they are as free of defects as possible. CrowdStrike clearly failed to properly test their update, reportedly due to a flaw in their testing platform. While it is nice to know why the defect made it into production, CrowdStrike still has a responsibility to keep defects out of their products and should have had additional testing and observability in place.

Second, for critical updates (like detections), companies feel an imperative to push the update globally as quickly as possible. Instead, companies like CrowdStrike should prioritize customers in terms of industry risk and then create a phased rollout plan that stages updates on a ramping schedule. By starting small, monitoring changes and then ramping up the rollout, CrowdStrike could have limited the impact to a handful of customers and avoided a global event.
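To make the idea concrete, here is a minimal sketch of a ring-based rollout with a ramping schedule. Everything in it (the ring sizes, bake times, the `healthy` telemetry check and the commented-out push/rollback calls) is hypothetical and illustrative, not CrowdStrike’s actual release process:

```python
import time

# Hypothetical rollout rings, ordered from smallest to largest blast radius.
# Percentages and bake times are illustrative only.
ROLLOUT_RINGS = [
    {"name": "canary", "percent": 1,   "bake_minutes": 60},
    {"name": "early",  "percent": 10,  "bake_minutes": 120},
    {"name": "broad",  "percent": 50,  "bake_minutes": 240},
    {"name": "global", "percent": 100, "bake_minutes": 0},
]

def healthy(ring_name: str) -> bool:
    """Placeholder: query telemetry (crash rates, agents dropping offline)
    for hosts in this ring and return False if error budgets are exceeded."""
    return True

def staged_rollout(update_id: str) -> bool:
    for ring in ROLLOUT_RINGS:
        print(f"Pushing {update_id} to {ring['percent']}% of the fleet ({ring['name']})")
        # push_update(update_id, percent=ring["percent"])  # vendor-specific call
        time.sleep(ring["bake_minutes"] * 60)  # let telemetry accumulate
        if not healthy(ring["name"]):
            print(f"Regression detected in {ring['name']} ring; halting rollout")
            # rollback(update_id)  # revert affected hosts to last known good content
            return False
    return True
```

The key design choice is that the blast radius only grows after the previous ring has baked cleanly, so a bad update stalls at a handful of hosts instead of reaching the entire install base.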

Lastly, companies need to implement better monitoring and BCP / DR for their business. In the case of CrowdStrike, they should have had monitoring in place that immediately detected their product going offline, and they should have had the ability to roll back or revert to the last known good state. Going a step further, they could even change the behavior of their software so that instead of causing a kernel panic that crashes the system, the OS recovers gracefully and automatically rolls back to the last known good state. However, the reality is that sophisticated logic like this costs money to develop, and it is difficult for development teams to justify the investment unless the company has felt a financial penalty for its failures.
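As a rough illustration of the “last known good” idea, here is a minimal sketch assuming a hypothetical agent content store and a boot-attempt counter. The file paths and thresholds are made up, and a real endpoint agent would need to run this logic very early in boot (or from a recovery environment) to survive a crash loop:

```python
import shutil
from pathlib import Path

# Hypothetical content store and counter file; paths are illustrative only.
ACTIVE = Path("/var/lib/agent/content/active.dat")
LAST_GOOD = Path("/var/lib/agent/content/last_known_good.dat")
BOOT_ATTEMPTS = Path("/var/lib/agent/content/boot_attempts")

MAX_FAILED_BOOTS = 2  # illustrative threshold

def record_boot_attempt() -> int:
    attempts = int(BOOT_ATTEMPTS.read_text()) + 1 if BOOT_ATTEMPTS.exists() else 1
    BOOT_ATTEMPTS.write_text(str(attempts))
    return attempts

def mark_boot_successful() -> None:
    # Called once the system has been up and stable: promote the active
    # content to "last known good" and reset the failure counter.
    shutil.copy2(ACTIVE, LAST_GOOD)
    BOOT_ATTEMPTS.write_text("0")

def maybe_roll_back() -> None:
    # If we keep failing before reaching a stable state, revert the content.
    if record_boot_attempt() > MAX_FAILED_BOOTS and LAST_GOOD.exists():
        shutil.copy2(LAST_GOOD, ACTIVE)
        BOOT_ATTEMPTS.write_text("0")
```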


Contracts & Liability

Speaking of financial penalties, the big question is whether or not CrowdStrike can be held liable for the global outage. My guess is this will depend on what their contracts say. Most contracts have a clause that limits liability for both sides, so CrowdStrike could certainly face damages within those limits (probably only a few million dollars at most). It is more likely CrowdStrike will face losses from new customers and from existing customers that are up for contract renewal. Some customers will terminate their contracts. Others will negotiate better terms or expect larger discounts on renewal to make up for the outage. At most this will hit CrowdStrike for the next 3 to 5 years (depending on contract length), and then pricing and terms will bounce back. It will be difficult for customers to exit CrowdStrike en masse because it is already a sunk cost and companies won’t want to spend the time or energy to deploy a new technology. Some of the largest customers may have the best terms and the most leverage to extract concessions from CrowdStrike, but overall I don’t think this will impact them for very long, and I don’t think they will be held legally liable in any material sense.

Delta Lags Industry Standard

If CrowdStrike isn’t going to be held legally liable, what happens to Delta and its claimed $500M in losses? Let’s look at some facts. First, as CrowdStrike has rightfully pointed out, Delta lagged the world in recovering from this event. It took roughly 20 times longer to get back to normal operations than other airlines and large companies. This points to clear underinvestment in identifying critical points of failure (their crew scheduling application) and in developing sufficient plans to back up and recover if critical parts of their operation fail.

Second, Delta clearly hasn’t designed their operations for ease of management or resiliency. They have also failed to perform an adequate Business Impact Analysis (BIA) or to properly test their BCP / DR plans. I don’t know the specifics of their underlying IT operations, but a few recommendations come to mind, such as implementing active / active instances for critical services and moving to thin clients or PXE boot for airport kiosks and terminals. Remove the need for a human to physically touch any of these systems, and instead implement processes to remotely identify, manage and recover them from a variety of failure scenarios. Clearly Delta has a big gap in their IT operations processes, and their customers suffered as a result.

Wrapping Up

What the CrowdStrike event highlights is the need for companies to prioritize quality, resiliency and stability over speed to market. The National Cybersecurity Strategy has identified software defects as a strategic imperative because they lead to vulnerabilities, supply chain compromise and global outages. Companies with the size and reach of CrowdStrike can no longer afford to prioritize speed over all else and instead need to shift to a more mature, higher quality SDLC. In addition, companies that use popular software need to consider diversifying their supply chain, adopting IT operations best practices (like SRE) and implementing a mature BCP / DR plan on par with industry standards.


When it comes to holding companies liable for global outages like the one two weeks ago, I think it will be difficult for this to play out in the courts without devolving into a legal tit-for-tat that no one wins. Instead, the market and customers need to weigh in and hold these companies accountable through share prices, contractual negotiation or even switching to a competitor. Given the complexity of modern software, I don’t think companies should be held liable for software flaws because it is impossible to eliminate all flaws. Additionally, modern SDLCs and CI/CD pipelines are exceptionally complex, and this complexity can often result in failure. This is why BCP / DR and SRE are so important: they let you recover quickly when something goes wrong. Yes, CrowdStrike could have done better, but clearly Delta wasn’t even meeting industry standards. Instead of asking whether companies should be held liable for software flaws, a better question is: at what point does a company become so essential that it is, by default, critical infrastructure?

Navigating Hardware Supply Chain Security

Lately, I’ve been thinking a lot about hardware supply chain security and how the risks and controls differ from software supply chain security. As a CSO, one of your responsibilities is to ensure your supply chain is secure, yet the distributed nature of the global supply chain makes this a challenging endeavor. In this post I’ll explore how a CSO should think about the risks of hardware supply chain security, how to govern this problem and some techniques for implementing security assurance within your hardware supply chain.

What Is Hardware Supply Chain?

The hardware supply chain covers the manufacturing, assembly, distribution and logistics of physical systems. This includes the physical components and the underlying software that come together to make a functioning system. A real-world example could be something as complex as an entire server or as simple as a USB drive. Your company can be at the start of the supply chain sourcing and producing raw materials like copper and silicon, in the middle of the supply chain producing individual components like microchips, or at the end of the supply chain assembling and integrating components into an end product for customers.

What Are The Risks?

There are a lot of risks when it comes to the security of hardware supply chains. Hardware typically has longer lead times and a longer shelf life than software. This means compromises can be harder to detect (due to all the stops along the way) and can persist for a long time (decades, in cases like industrial control systems). It can be extremely difficult or impossible to mitigate a compromise in hardware without replacing the entire system (or taking downtime), which is costly to a business and potentially deadly to a mission critical system.

Physical or logical compromise can happen in two ways: interdiction and seeding. Both involve physically tampering with a hardware device, but they occur at different points in the supply chain. Seeding occurs during the physical manufacture of components and involves someone inserting something malicious (like a backdoor) into a design or component. Because the insertion happens early in the process, the compromise can persist for a long time if it is not detected before final assembly.

Interdiction happens later in the supply chain, when the finished product is being shipped from the manufacturer to the end customer. During interdiction the product is intercepted en route, opened, altered and then sent on to the end customer in a compromised state. The hope is that the recipient won’t notice the slight shipping delay or the tampering, giving the attacker anything from GPS location data to full remote access.

Governance

CSOs should take a comprehensive approach to manage the risks associated with hardware supply chain security that includes policies, processes, contractual language and technology.

Policies

CSOs should establish and maintain policies specifying the security requirements at every step of the hardware supply chain. This starts at the requirements gathering phase and includes design, sourcing, manufacturing, assembly and shipping. These policies should align to the objectives and risks of the overall business, with careful consideration for how to control risk at each step. For example, a policy could require independent validation and verification of your hardware design specification to make sure it doesn’t include malicious components or logic. Another could require that all personnel who physically manufacture components in your supply chain receive periodic background checks.

Processes

Designing and implementing secure processes can help manage the risks in your supply chain, and CSOs should be involved in the design and review of these processes. Processes can help detect compromises in your supply chain and can add or reduce friction where needed (depending on risk). For example, if your company is involved in national security programs, you may establish processes that verify and validate components prior to assembly. You may also want to establish robust processes and security controls around intellectual property (IP) and research and development (R&D). Controlling access to and dissemination of IP and R&D can make it more difficult to seed or interdict hardware components later on.

Contractual Language

One area CSOs should regularly review with their legal department is the contractual language used with the companies and suppliers in your supply chain. Contractual language can extend your security requirements to these third parties and can even allow your security team to audit and review their manufacturing processes to make sure they are secure.

Technology

The last piece of governance CSOs should invest in is technology, meaning the specific technology controls that ensure the physical and logical security of the manufacturing and assembly facilities your company operates. Technology can include badging systems, cameras, RFID tracking, GPS tracking, anti-tamper controls and even tools that help assess the security assurance of components and products. The technologies a CSO selects should complement and augment the entire security program, in addition to standard controls like physical security, network security, insider threat and RBAC.

Detecting Compromises

One aspect of hardware supply chain security that is arguably more challenging than software supply chain security is detecting compromise. With the proliferation of open source software and technologies like sandboxing, it is possible to review and understand how a software program behaves. It is much more difficult to do this at the hardware layer. There are some techniques I have come across while thinking about and researching this problem, and they all relate back to detecting whether a hardware component has been compromised or is not performing as expected.

Basic Techniques

One of the simpler techniques for detecting whether hardware has been modified is imaging. After the design and prototype are complete, you can image the finished product and then compare every product produced against this reference image. This can tell you if a product has had unauthorized components added or removed, but it won’t tell you if the internal logic has been compromised.
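As a toy illustration of image-based comparison, here is a sketch that diffs a captured image of a unit (optical or X-ray) against a golden reference using a simple pixel-difference metric. Real inspection systems perform image registration and far more sophisticated analysis; the thresholds and file paths here are hypothetical:

```python
import numpy as np
from PIL import Image

# Illustrative thresholds only; real optical/X-ray inspection is much more robust.
PER_PIXEL_CHANGE = 0.1   # how different a pixel must be to count as changed
DIFF_THRESHOLD = 0.02    # fraction of changed pixels allowed before flagging

def load_grayscale(path: str) -> np.ndarray:
    return np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0

def unit_matches_reference(reference_path: str, unit_path: str) -> bool:
    ref = load_grayscale(reference_path)
    unit = load_grayscale(unit_path)
    if ref.shape != unit.shape:
        return False  # images must be captured with identical geometry
    changed = np.abs(ref - unit) > PER_PIXEL_CHANGE
    return changed.mean() <= DIFF_THRESHOLD  # too many changed pixels -> flag the unit
```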

Another technique for detecting compromised components is similar to unit testing in software and is known as functional verification. In functional verification, individual components have their logic and sub-logic tested against known inputs and outputs to verify they are functioning properly. This may be impractical to do for every component when manufacturing at scale, so statistical sampling may be needed to gain probabilistic confidence that all of the components in a batch are good. The assumption here is that if your components pass functional verification or statistical sampling, the overall system has the appropriate level of integrity.
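A simple way to size such a sample: if at least some fraction of a large batch is bad, how many randomly chosen units must pass functional verification before you are confident the batch is clean? A minimal sketch of that calculation, assuming independent random sampling from a large batch:

```python
import math

def sample_size(defect_rate: float, confidence: float) -> int:
    """Number of units to functionally verify so that, if at least `defect_rate`
    of a large batch is bad, at least one bad unit is caught with probability
    `confidence`. Assumes random, independent sampling."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - defect_rate))

# Example: to detect a 1% compromise rate with 99% confidence,
# roughly 459 randomly selected units per batch need functional verification.
print(sample_size(defect_rate=0.01, confidence=0.99))
```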

To detect interdiction or other logistics compromises, companies can implement logistics controls such as unique serial numbers (down to the component level), tamper-evident seals, anti-tamper technology that renders the system inoperable if tampered with (or makes tampering impossible without destroying the device), and shipping thresholds that flag abnormal delays in transit.
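For the shipping-delay piece specifically, a minimal sketch of an abnormality check might look like the following, where transit times for a route are compared against their historical distribution (the data, route and threshold are hypothetical):

```python
from statistics import mean, stdev

# Flag any shipment whose transit time is well outside the historical norm
# for its route; flagged serial numbers get pulled aside for tamper inspection.
def flag_delayed_shipments(historical_hours: list[float],
                           current: dict[str, float],
                           sigmas: float = 3.0) -> list[str]:
    mu, sd = mean(historical_hours), stdev(historical_hours)
    limit = mu + sigmas * sd
    return [sn for sn, hours in current.items() if hours > limit]

suspect = flag_delayed_shipments(
    historical_hours=[70, 72, 68, 75, 71, 69, 74],  # past transit times for this route
    current={"SN-1001": 73.0, "SN-1002": 96.5},     # serial number -> transit hours
)
print(suspect)  # ['SN-1002']
```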

Advanced Techniques

More advanced techniques for detecting compromise include destructive testing. Similar to statistical sampling, destructive testing involves physically breaking apart a component to make sure nothing malicious has been inserted and that the component was physically manufactured and assembled as designed.

In addition to destructive testing, companies can create hardware signatures that capture expected patterns of behavior for how a system should physically behave. This is a more advanced form of functional testing in which multiple components, or even finished products, are analyzed together against known patterns of behavior to make sure they are functioning as designed and have not been compromised. Hardware technologies like Trusted Platform Modules (TPMs) can assist with this validation.

Continuing with functional behavior, a more advanced method of security assurance for hardware components is function masking and isolation. Function masking obscures a function so the component is more difficult to reverse engineer. Isolation limits how components can interact with other components and usually has to be done at the design level, effectively sandboxing components at the hardware level. Isolation could rely on a TPM to limit the functionality of components until the integrity of the system is verified, or it could simply limit how one component interacts with another.

Lastly, one of the most advanced techniques for detecting compromise is 2nd order analysis and validation. 2nd order analysis looks at the byproducts of a component while it is operating: power consumption, thermal signatures, electromagnetic emissions, acoustic properties and photonic (light) emissions. These 2nd order emissions can be analyzed to see if they are within expected limits; if they are not, the component may be compromised.
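As a simplified illustration of the “within expected limits” idea, the sketch below builds an envelope of expected power draw from known-good units and flags a unit whose trace strays outside it. Real side-channel analysis is far more sophisticated; the workload, sampling and thresholds here are invented for the example:

```python
import numpy as np

# Build an expected envelope from aligned power-draw traces of known-good units,
# then check whether a new unit's trace stays within that envelope.
def build_envelope(known_good_traces: np.ndarray, sigmas: float = 4.0):
    mu = known_good_traces.mean(axis=0)
    sd = known_good_traces.std(axis=0)
    return mu - sigmas * sd, mu + sigmas * sd

def within_limits(trace: np.ndarray, lower: np.ndarray, upper: np.ndarray,
                  allowed_violation_fraction: float = 0.01) -> bool:
    violations = (trace < lower) | (trace > upper)
    return violations.mean() <= allowed_violation_fraction

# Usage: traces are samples of power draw while running a fixed test workload.
good = np.random.normal(1.0, 0.02, size=(50, 1000))   # 50 known-good units
lower, upper = build_envelope(good)
suspect = np.random.normal(1.0, 0.02, size=1000)
suspect[200:260] += 0.5                                # simulated extra draw from hidden logic
print(within_limits(suspect, lower, upper))            # expected: False
```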

Wrapping Up

Hardware supply chain security is a complex space given the distributed nature of hardware supply chains and the variety of attack vectors spanning physical and logical realms. A comprehensive security program needs to weigh the risks of supply chain compromise against the risks and objectives of the business. For companies that operate in highly secure environments, investing in advanced techniques ranging from individual component testing to logistics security is absolutely critical and can help ensure your security program is effectively managing the risks to your supply chain.


Software Supply Chain Security Considerations

Over the past five years there has been increased scrutiny on the security of supply chains, and in particular software supply chains. The SolarWinds attack in 2020 brought this issue to the foreground as yet another requirement for a well-rounded security program, and it has since been codified into several security guidelines, such as the Biden Administration’s 2021 Executive Order and CISA’s software supply chain best practices. As businesses shift their software development practices to DevOps and cloud, CSOs need to make sure software supply chain security is one of the components that is measured, monitored and controlled as part of a well-rounded security program.

How Did We Get Here?

The use of open source software has increased over the past two decades, largely because it is cheaper and faster to use an existing piece of software than to spend time developing it yourself. This gives software development teams quicker time to market because they don’t have to reinvent software for common functions and can instead focus on developing intellectual property that creates distinct value for their company.

There is also an alleged implicit advantage to using open source software: the idea that open source software has more eyes on it and is therefore less prone to having malicious functions built into it and less prone to security vulnerabilities. This may hold for large open source projects with many contributors, but it falls short for smaller projects with fewer contributors and resources. Until the SolarWinds hack in 2020, however, the attitude that open source software is more secure was generally applied to all projects regardless of size and funding. As we have learned, this reasoning does not hold up and has allowed attackers to target companies through their software supply chain.

The challenge with open source software is that it is supposed to be a community-led effort: people use the software, but they are also supposed to contribute back to the project. As companies have embraced open source software, however, this two-way street has become heavily biased toward taking and using the software rather than contributing back. If corporations contributed back in proportion to their use of open source software, the maturity, security and quality of open source software would be drastically improved.

What Are the Risks?

There are several inherent risks involved in using open source software and they are as follows:

Can You Really Trust The Source?

How do you know the software you are pulling down from the internet doesn’t have a backdoor built into it? How do you know it is free from vulnerabilities? Is this piece of software developed and supported by an experienced group of contributors or is it maintained by one person in their basement during their spare time?

The point is, it is fairly easy to masquerade as a legitimate and well-supported open source project, yet it is difficult to actually validate the true source of the software.

What Is The Cadence For Vulnerability Fixes And Software Updates?

The size and scope of an open source project dictates how well it is supported. As we saw with Log4j, some projects were able to push updates very quickly, while smaller projects took time to resolve the full scope of the issue. Any company using open source software should consider how often a project is updated and set limits on the use of software that doesn’t receive regular and timely updates.
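As an example of enforcing such a limit, here is a rough sketch that checks how recently a Python dependency last published a release using the public PyPI JSON API. The one-year policy threshold is arbitrary, and “recently released” is only a crude proxy for “well maintained”:

```python
from datetime import datetime, timedelta, timezone
import requests

MAX_AGE = timedelta(days=365)  # illustrative policy threshold

def latest_release_date(package: str) -> datetime:
    # PyPI's JSON API lists the files of the latest release under "urls".
    data = requests.get(f"https://pypi.org/pypi/{package}/json", timeout=10).json()
    uploads = [f["upload_time_iso_8601"] for f in data["urls"]]
    return max(datetime.fromisoformat(u.replace("Z", "+00:00")) for u in uploads)

def violates_freshness_policy(package: str) -> bool:
    age = datetime.now(timezone.utc) - latest_release_date(package)
    return age > MAX_AGE

for pkg in ["requests", "some-made-up-package"]:  # second name is hypothetical
    try:
        print(pkg, "stale" if violates_freshness_policy(pkg) else "ok")
    except Exception as exc:
        print(pkg, "lookup failed:", exc)
```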

May Actually Require An Increase In Resources

There are ways for companies to manage the risk of using open source software. Assuming you can trust the source, you can always pull down the source code and compile the software yourself. You can even fork the project to include fixes or other improvements that suit your specific application. However, this takes resources. It may not take as many resources as writing the software from scratch, but maintaining the builds, version control and the general software development life cycle still needs to be properly resourced and supported.

Can Complicate External Stakeholder Management

Another issue with using open source software in your supply chain is external stakeholder management. The 2021 Biden Executive Order requires companies to provide a software bill of materials (SBOM) for software sold to the U.S. Government. This trend has also trickled down into third-party partner management between companies, where contractual terms are starting to ask for SBOMs, vulnerability disclosure timelines and other security practices related to software.

One major issue with this is that software can be listed as vulnerable even though there may be no way to exploit it. For example, a piece of software developed by a company may include an open source library with a known vulnerability, but there is no way to actually exploit that vulnerability in the product. This can cause issues with external stakeholders, regulators, auditors and others when they see a vulnerability listed. These stakeholders may request that the vulnerability be resolved, which can draw resources away from other priorities that carry higher risk.
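One way to handle this is to record an exploitability assessment alongside each finding, in the spirit of VEX (Vulnerability Exploitability eXchange), so stakeholders can distinguish actionable vulnerabilities from documented non-issues. A minimal sketch with hypothetical data:

```python
from dataclasses import dataclass

# Hypothetical structures illustrating VEX-style triage: a component can carry a
# known CVE yet be marked "not_affected" when the vulnerable code path is not
# reachable in this product, so only actionable findings demand remediation.
@dataclass
class Finding:
    component: str
    cve: str
    status: str         # "affected", "not_affected", "under_investigation"
    justification: str = ""

findings = [
    Finding("libexample 1.2.3", "CVE-2024-0001", "affected"),
    Finding("parser-lib 4.0.1", "CVE-2023-9999", "not_affected",
            "vulnerable function is never called by this product"),
]

actionable = [f for f in findings if f.status == "affected"]
documented = [f for f in findings if f.status == "not_affected"]

print("fix now:", [(f.component, f.cve) for f in actionable])
print("communicate with justification:",
      [(f.component, f.cve, f.justification) for f in documented])
```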

Standardization Is Hard

Finally, standardizing and controlling the use of open source software as part of the SDLC is advantageous from a security perspective, but exceptionally difficult from a practicality perspective. Unwinding the use of high risk or unapproved open source software can take a long time depending on how critical the software is to the application, and if developers have to recreate and maintain the software internally, that takes development time away from new features or product updates. Similarly, getting teams to adopt standard practices, such as only allowing software to be a certain number of revisions out of date, only allowing software from approved sources and preventing vulnerable software from being used, all takes time, but it will pay dividends in the long run, especially for external stakeholder management and the creation of SBOMs.
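As a sketch of what such standard practices could look like in code, here is a hypothetical policy check over a dependency manifest; the approved sources and the “no more than three releases behind” rule are example policies, not an industry standard:

```python
# Example policy values; every name and number here is illustrative.
APPROVED_SOURCES = {"pypi.org", "internal-mirror.example.com"}
MAX_RELEASES_BEHIND = 3

dependencies = [
    {"name": "requests", "source": "pypi.org", "releases_behind": 1},
    {"name": "oldlib",   "source": "random-cdn.example.net", "releases_behind": 7},
]

def policy_violations(dep: dict) -> list[str]:
    problems = []
    if dep["source"] not in APPROVED_SOURCES:
        problems.append(f"unapproved source: {dep['source']}")
    if dep["releases_behind"] > MAX_RELEASES_BEHIND:
        problems.append(f"{dep['releases_behind']} releases behind (max {MAX_RELEASES_BEHIND})")
    return problems

for dep in dependencies:
    issues = policy_violations(dep)
    if issues:
        print(dep["name"], "->", "; ".join(issues))
```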

Wrapping Up

Using open source software has distinct advantages, such as efficiency and time to market, but it carries real risks that can be used as an attack vector against a company or its customers. CSOs need to include software supply chain security as one of the pillars of their security program to identify, monitor and manage the risk of using open source software. Most importantly, a robust software supply chain security program should consider the most effective ways to manage risk while balancing external stakeholder expectations and without inadvertently driving up resource requirements.

Chip War Book Afterthoughts

I recently read Chip War by Chris Miller and found it to be a thought-provoking exploration of the global supply chain for semiconductors. Most interesting were the historical context and the economic analysis of the complexities of the current semiconductor supply chain, and how the United States has wielded this technology as an ambassador of democracy across the globe. The book is particularly interesting when considering the recent efforts by the U.S. administration to revitalize semiconductor manufacturing in the United States via the CHIPS Act. Even though the U.S. maintains control over this industry, that control is waning, which puts the U.S. at risk of losing military and economic superiority.

The US Leads With Cutting Edge Design & Research

One advantage the U.S. maintains is that it leads the way in the latest chip design and research. The latest chip architectures increase computing power by shrinking transistors to smaller and smaller sizes, roughly following Moore’s Law of doubling the number of transistors per chip every two years. In the late 1970s, the United States was quick to recognize the military and economic advantages provided by semiconductors. Almost overnight, bombs became more accurate and computing became more powerful, allowing decisions to be made faster and spawning an entirely new industry based on these chips. However, as the U.S. came to rely more and more on semiconductors, costs needed to come down. This was achieved by outsourcing labor to cheaper locations (mainly in Asia), which in turn made those countries reliant on U.S. demand for chips and allowed the United States to influence them to its advantage.

A Technology with Geo-Political Consequences

One side effect of outsourcing semiconductor manufacturing is that the supply chain quickly became dispersed across the globe. Leading research was conducted in the United States, specialized equipment was manufactured in Europe and cheap labor in Asia completed the package. Until recently, most of this supply chain was driven by the top chip companies such as AMD, Intel and Nvidia. However, other countries, such as China, have recognized the huge economic and military advantages offered by semiconductors and as a result have started chipping away (pun intended) at the United States’ control of the semiconductor supply chain.

The US Can’t Compete On Manufacturing Costs

Despite the passage of the CHIPS Act, the United States faces a significant battle to wrest chip manufacturing from the countries of Asia (mainly Taiwan). The cost of labor in the United States is significantly higher than in other countries. Additionally, countries such as Taiwan, South Korea, Japan, Vietnam and China have heavily subsidized chip manufacturing in order to maintain a foothold in the global supply chain. In order to compete, the United States will have to make an extreme effort, including heavy tax breaks and subsidies, to bring all aspects of manufacturing into the country. This will effectively turn into economic warfare on a global scale as the top chip manufacturing countries attempt to drive down costs in order to be the most attractive location for manufacturing.

Supply Chain Choke Points Are Controlled by the US and its Allies (For Now)

However, driving down costs won’t be easy. The highly specialized equipment required to manufacture chips needs to be refreshed every time there is a new breakthrough. The costs are tremendous and make it difficult to break into the industry. Instead, the U.S. has focused on maintaining control of particular choke points in the supply chain and even blocking acquisitions of strategic companies by foreign entities. The United States also exerts pressure on the countries within this global supply chain to maintain its advantage. Yet, as new powers rise (China) and seek to control their own supply chains, these choke points will dwindle. Additionally, as non-U.S. allies (frenemies?) gain market share in the chip supply chain, the U.S. and its allies need to consider the security of the chips they receive from these countries.

Final Thoughts

Chip War by Chris Miller is a fascinating look into the history and global supply chain of semiconductors. For the past 50 years the United States has maintained military and economic advantages over its rivals as a result of semiconductors, but that advantage has been waning over the past two decades. The CHIPS Act is recognition that the United States must claw back some of the globalization of the supply chain and bring critical parts of the industry back to the U.S. in order to maintain economic and military superiority in the future.