A CISO’s Analysis Of the CrowdStrike Global Outage

Overnight from July 18 to July 19, 2024, Windows systems running the CrowdStrike Falcon agent ceased functioning and displayed the blue screen of death (BSOD). As people woke up on the morning of July 19th, they discovered a wide-reaching global outage of the consumer services they rely on for their daily lives, such as healthcare, travel, fast food and even emergency services. The ramifications of this event will continue to be felt for at least the next week as businesses recover from the outage and investors react to the realization that global businesses are extremely fragile when it comes to technology and business operations.

Technical Details

An update by CrowdStrike (CS) to the C-00000291*.sys file, timestamped 04:09 UTC, was pushed to all customers running CS Falcon agents. The file was corrupt (reports indicate a null-byte header issue), and when Windows attempted to load it, the system crashed. Rebooting the impacted systems does not resolve the issue because of the way CS Falcon works. CS Falcon has access to the inner workings of the operating system (the kernel), including memory, drivers and registry entries, which allows CS to detect malicious software and activity. The CS Falcon agent is designed to receive updates automatically in order to keep the agent current with the latest detections. In this case, the update file was not properly tested and somehow made it through Quality Assurance and Quality Control before being pushed globally to all CS customers. Additionally, many CrowdStrike customers are clearly running CS Falcon on production systems without processes in place to stage updates to CS Falcon and minimize the impact of failed updates (more on this below).

Global Impact

This truly is a global outage, and the list of affected industries is far-reaching, attesting to the success of CS but also to the risks that can ripple through your software supply chain. As of Monday, Delta Air Lines is still experiencing flight cancellations and delays as a result of impacts to its pilot scheduling system. Lists of impacted companies have been published widely, but a short sample follows:

Travel – United, Delta, American, major airports

Banking and Trading – VISA, stock exchanges

Emergency & Security Services – Some 911 services and ADT

Cloud Providers – AWS, Azure

Consumer – Starbucks, McDonald's, FedEx

Once the immediate global impact subsides, there will be plenty of finger-pointing at CrowdStrike for failing to properly test an update, but what this event clearly shows is a lack of investment by some major global companies in site reliability engineering (SRE), business continuity planning (BCP), disaster recovery (DR), business impact analysis (BIA) and proper change control. If companies were truly investing in SRE, BCP, DR and BIA beyond a simple checkbox exercise, this failed update would have been a non-event. Businesses would have simply executed their BCP / DR plan and failed over, or immediately recovered their critical services to get back up and running (which some did). Or, if they were running proper change control alongside immutable infrastructure, they could have immediately rolled back to the last good version with minimal impact. Clearly, more work needs to be done by all of these companies to improve their plans, processes and execution when a disruptive event occurs.

Are global companies really allowing live updates to mission-critical software in production without proper testing? Better yet, production systems should be immutable, preventing any change to production unless it is updated in the CI/CD pipeline and then re-deployed. Failed updates became an issue almost two decades ago when Microsoft began Patch Tuesday. Companies quickly figured out they couldn't trust the quality of the patches and instead would test them in a staging environment that duplicates production. While this may have created a short window of vulnerability, it came with the advantages of stability and uninterrupted business operations.
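To make that staging discipline concrete, here is a minimal sketch of a ring-based rollout. The ring names, soak time and deployment hooks are entirely hypothetical placeholders; in practice you would wire these to whatever patch-management or deployment tooling you actually run.

```python
"""Minimal sketch of a ring-based (staged) update rollout.

All names and hooks below are hypothetical, for illustration only.
"""
import time

ROLLOUT_RINGS = ["lab", "staging", "canary_1pct", "production"]
SOAK_TIME_SECONDS = 4 * 60 * 60  # let each ring soak before promoting; tune to your policy


def deploy_update(ring: str, update_id: str) -> None:
    # Placeholder: push the update to the hosts in this ring.
    print(f"Deploying {update_id} to {ring}")


def ring_is_healthy(ring: str) -> bool:
    # Placeholder: check crash rates, boot loops, agent heartbeats, etc.
    return True


def staged_rollout(update_id: str) -> None:
    """Promote an update ring by ring, halting at the first sign of trouble."""
    for ring in ROLLOUT_RINGS:
        deploy_update(ring, update_id)
        time.sleep(SOAK_TIME_SECONDS)
        if not ring_is_healthy(ring):
            print(f"Halting rollout of {update_id}: {ring} is unhealthy")
            return
    print(f"{update_id} fully rolled out")


if __name__ == "__main__":
    staged_rollout("update-2024-07-19")
```

The point is not the specific tooling but the shape of the process: no update reaches production until earlier rings have soaked cleanly.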

Modern-day IT operations teams (under names like Platform Engineering or Site Reliability Engineering) now design production environments to be immutable and somewhat self-healing. All changes need to be made in code and then re-pushed through dev, test and staging environments to make sure proper QA and QC are followed. This minimizes the impact of failed code pushes and will also minimize disruption from failed patches and updates like this one. SRE also closely monitors production environments for latency thresholds, availability targets and other operational metrics. If the environment exceeds a specific threshold, it throws alerts and will attempt to self-heal by allocating more resources, or by rolling back to the previous known-good image.
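As a rough illustration of that monitor-and-roll-back pattern, here is a minimal sketch. The metric names, thresholds and rollback hook are all hypothetical placeholders; you would swap in your own monitoring and orchestration APIs.

```python
"""Minimal sketch of an SRE-style self-healing control loop.

The metrics source, alerting and rollback hooks are placeholders, not a real API.
"""
import time

P99_LATENCY_MS_THRESHOLD = 500   # example latency objective
ERROR_RATE_THRESHOLD = 0.02      # example error-rate objective (2%)
CHECK_INTERVAL_SECONDS = 30


def fetch_metrics() -> dict:
    # Placeholder: query your monitoring stack (Prometheus, CloudWatch, Datadog, ...).
    return {"p99_latency_ms": 120.0, "error_rate": 0.001}


def alert(message: str) -> None:
    # Placeholder: page the on-call engineer.
    print(f"ALERT: {message}")


def rollback_to_last_good_image() -> None:
    # Placeholder: tell the deployment system to redeploy the previous immutable image.
    print("Rolling back to last known-good image...")


def control_loop() -> None:
    """Watch operational metrics and roll back automatically on an SLO breach."""
    while True:
        metrics = fetch_metrics()
        if (metrics["p99_latency_ms"] > P99_LATENCY_MS_THRESHOLD
                or metrics["error_rate"] > ERROR_RATE_THRESHOLD):
            alert(f"SLO breach detected: {metrics}")
            rollback_to_last_good_image()
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    control_loop()
```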

Ramifications

Materiality

Setting aside the maturity of business and IT operations, there are some clear ramifications from this event. First, it had a global impact on a wide variety of businesses and services. Some of the biggest impacts were felt by publicly traded companies, and as a result these companies may need to make an 8-K filing with the SEC to report a material event to their business. Even though this wasn't a cybersecurity attack, it was still an event that disrupted business operations, and so companies will need to report the expected impact and loss accordingly. CrowdStrike in particular will likely need to make an 8-K filing, not only for the loss of stock value, but for the expected loss of revenue through lost customers, contractual concessions and other tangible impacts to its business. When I started this post on the Friday of the event, CS stock was down over 10%, and by Monday morning it was down almost 20%. The stock has started to recover, but that is clearly a material event to investors.

Greater Investment In BCP / DR & BIA

Recent events, such as this one and the Change Healthcare (UnitedHealth Group) ransomware attack, have clearly shown that some businesses are not investing properly in BCP / DR. They may have plans on paper, but those plans still need to be fully tested, including rapidly identifying service degradation and executing recovery operations as quickly as possible. The reality is this should have been a non-event, and any business that was impacted for longer than a few hours needs to consider additional investment in its BCP / DR plan to minimize the impact of future events. CISOs need to work with the rest of the C-Suite to review existing BCP / DR plans and update them accordingly based on the risk tolerance of the business and the desired recovery time objective (RTO) and recovery point objective (RPO).

Boards Need To Step Up

During an event like this one, boards need to take a step back and remember their primary purpose is to represent and protect investors. In this case, the sub-committees that govern technology, cybersecurity and risk should be asking hard questions about how to minimize the impact of future events like this and consider whether the existing investment in BCP / DR technology and processes is sufficient to offset a projected loss of business. This may include more frequent reporting on when BCP / DR plans were last properly tested and whether those plans properly account for all of the possible scenarios that could impact the business, such as ransomware, supply chain disruption or global events like this one. The board may also push the executive staff to accelerate plans to invest in and modernize IT operations to eliminate tech debt and adopt industry best practices such as immutable infrastructure or SRE. The board may also insist on a detailed analysis of supply chain risks, including plans to minimize single points of failure and limit the blast radius of future events.

Negative Outcomes

Unfortunately, this event is likely to cause a negative perception of cybersecurity in the short term for a few different reasons. First, people will be questioning the obvious business disruption. How is it that a single update from a global cybersecurity company is able to disrupt so much? Could this same update process serve as an attack vector? Reports are already indicating that malicious domains have been set up to look like the fix for this event but instead push malware. There are also malicious domains that have been created for phishing purposes, and the reality is that any company impacted by this event may also be vulnerable to ransomware, social engineering and other follow-on attacks.

Second, this event may cause a negative perception of automatic updates within IT operations groups. I personally believe this is the wrong reaction, but the reality is some businesses will turn off auto-updates, which will leave them more vulnerable to malware and other attacks.

What CISOs Should Do

With all this in mind, what should CISOs do to help the board, the C-Suite and the rest of the business navigate this event? Here are my suggestions:

First, review your contractual terms with third-party providers to understand contractually defined SLAs, liability, restitution and other clauses that can help protect your business when an event is caused by a third party. This should also include a risk analysis of your entire supply chain to determine single points of failure and how to protect your business appropriately.

Second, insist on increased investment in your BIA, BCP and DR plans, including designing for site reliability and random failures (chaos monkey style) so you can proactively identify and recover from disruption, and including a review of RTO and RPO (a minimal chaos-experiment sketch follows this list). If your BCP / DR plan is not where it needs to be, it may require investment in a multi-year technology transformation plan, including resolving legacy systems and tech debt. It may also require modernizing your SDLC to shift to CI/CD, with dev, test, staging and prod environments that are tightly controlled. The ultimate goal is to move to immutable infrastructure and IT operations best practices that allow your services to operate and recover without disruption. I've captured my thoughts on some of these best practices here.

Third, resist the temptation to overreact. The C-Suite and investors are going to ask some hard questions about your business, and they will suggest a wide range of solutions, such as turning off auto-updates, ripping out CS or even building your own solution. All of these suggestions come with clear tradeoffs in terms of risk and operational investment. Making a poor, reactive decision immediately after this event can harm the business more than it can help.

Finally, for mission-critical services consider shifting to a heterogeneous environment that statistically minimizes the impact of any one vendor. The concept is simple: if you need a security technology to protect your systems, consider purchasing from multiple vendors with similar capabilities so that the impact to your business operations is minimized if one of them has an issue. This obviously raises the complexity and operational cost of your environment and should only be used for mission-critical or highly sensitive services that absolutely need to minimize any risk to operations. However, this event does highlight the risks of consolidating on a single vendor, and you should conduct a risk analysis to determine the best course of action for your business and supply chain.
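As promised in the second recommendation above, here is a minimal sketch of a chaos experiment that checks recovery time against an RTO. The fault-injection and health-check hooks are hypothetical placeholders for your own chaos engineering and monitoring tooling, and the RTO value is just an example.

```python
"""Minimal sketch of a chaos experiment measured against a recovery time objective."""
import time

RTO_SECONDS = 15 * 60  # example RTO: 15 minutes


def inject_failure() -> None:
    # Placeholder: terminate an instance, block a dependency, fail over a region, etc.
    print("Injecting failure into a controlled slice of the environment...")


def service_is_healthy() -> bool:
    # Placeholder: probe a health endpoint or run a synthetic transaction.
    return True


def run_experiment() -> None:
    """Inject a failure, time the recovery and compare it to the RTO."""
    inject_failure()
    start = time.monotonic()
    while not service_is_healthy():
        time.sleep(10)
    recovery_seconds = time.monotonic() - start
    verdict = "PASS" if recovery_seconds <= RTO_SECONDS else "FAIL"
    print(f"Recovered in {recovery_seconds:.0f}s against an RTO of {RTO_SECONDS}s: {verdict}")


if __name__ == "__main__":
    run_experiment()
```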

Wrapping Up

For some companies this was a non-event. Once they realized there was an outage, they simply executed their recovery plans and were back online relatively quickly. For other companies, this event highlighted a lack of investment in IT operations fundamentals like BCP / DR or supply chain risk management. On the positive side, this wasn't ransomware or another form of cybersecurity attack, so recovery is relatively straightforward for most businesses. On the negative side, this event can have lasting consequences if businesses overreact and make poor decisions. As a CISO, I highly recommend you take advantage of this event to learn from your weaknesses and make plans to shore up aspects of your operations that were substandard.

If Data Is Our Most Valuable Asset, Why Aren’t We Treating It That Way?

There have been several high-profile data breaches and ransomware attacks in the news lately, and the common theme among all of them has been the disclosure (or threat of disclosure) of customer data. The after-effects of a data breach or ransomware attack are far-reaching and typically include loss of customer trust, refunds or credits to customer accounts, class action lawsuits, increased cyber insurance premiums, loss of cyber insurance coverage, increased regulatory oversight and fines. The total cost of these after-effects far outweighs the cost of implementing proactive security controls like proper business continuity planning and disaster recovery (BCP/DR) and data governance, which raises the question: if data is our most valuable asset, why aren't we treating it that way?

The Landscape Has Shifted

Over two decades ago, the rise of free consumer cloud services, like the ones provided by Google and Microsoft, ushered in the era of mass data collection in exchange for free services. Fast forward to today: the volume of data and the value of that data have skyrocketed as companies have shifted to become digital-first and mine that data for advertising and other business insights. The proliferation of AI has also set off a new data gold rush as companies strive to train their LLMs on bigger and bigger data sets. While the value of data has increased for companies, it has also become a lucrative target for threat actors in the form of data breaches and ransomware attacks.

The biggest problem with business models that monetize data is that security controls and data governance haven't kept pace with the value of the data. If your company has been around for more than a few years, chances are you have a lot of data, but data governance and data security have been an afterthought. The trouble with bolting on security controls and data governance after the fact is that it is hard to close Pandora's box once it has been opened. This is compounded by the fact that it is hard to put a quantitative value on data, and re-architecting data flows is seen as a cost with no return. The rest of the business may find it difficult to understand the need to re-architect its entire IT operation when there isn't an immediate and tangible business benefit.

Finally, increased global regulation is changing how data can be collected and governed. Data collection is shifting from requiring consumers to opt out to requiring them to explicitly opt in. This means consumers and users (and their associated data) will no longer be the presumptive product of these free services without their explicit consent. Increased regulation also typically comes with specific requirements for data security, data governance and even data sovereignty. Companies that don't have robust data security and data governance are already behind the curve.

False Sense Of Security

In addition to increased regulation and a shifting business landscape, the technology for protecting data really hasn't changed in the past three decades. Even so, few companies implement effective security controls on their data (as we continue to see in data breach notifications and ransomware attacks). The most common technologies used to protect data are encryption at rest and encryption in transit (TLS), but these are insufficient to protect data from anything except physical theft and network snooping (man-in-the-middle attacks). Both provide a false sense of security when it comes to data protection.

Furthermore, common regulatory compliance audits don't sufficiently specify protection of data throughout the data lifecycle beyond encryption at rest, encryption in transit and access controls. Passing these compliance audits can give a company a false sense of security that it is sufficiently protecting its data, when the opposite is true.

Embrace Best Practices

Businesses can get ahead of this problem and make data breaches and ransomware attacks a non-event by implementing effective data security controls and data governance, including BCP/DR. Here are some of my recommendations for protecting your most valuable asset:

Stop Storing and Working On Plain Text Data

This sounds simple, but it will require significant changes to business processes and technology. The premise is that the second data comes under your control it should be encrypted and never again stored or processed in plain text. This means data will be protected even if an attacker accesses the data store, but it also means the business will need to figure out how to modify its operations to work on encrypted data. Newer technologies such as homomorphic encryption have been introduced to solve these challenges, but even simpler approaches like tokenizing the data can be an effective solution. Businesses can go one step further and create a unique cryptographic key for every unique customer. This allows for simpler data governance, such as deleting a single customer's data by destroying that customer's key.
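Here is a minimal sketch of that per-customer key idea, assuming the open-source cryptography package. The in-memory key store and customer IDs are hypothetical stand-ins; in practice the keys would live in a KMS or HSM with tightly controlled access.

```python
"""Minimal sketch of per-customer encryption with key destruction (crypto-shredding).

Requires the `cryptography` package (pip install cryptography).
"""
from cryptography.fernet import Fernet

# Hypothetical key store: one key per customer (use a KMS/HSM in production).
customer_keys: dict[str, bytes] = {}


def encrypt_for_customer(customer_id: str, plaintext: bytes) -> bytes:
    """Encrypt data under the customer's key, creating the key on first use."""
    key = customer_keys.setdefault(customer_id, Fernet.generate_key())
    return Fernet(key).encrypt(plaintext)


def decrypt_for_customer(customer_id: str, ciphertext: bytes) -> bytes:
    """Decrypt only when strictly necessary, and only with that customer's key."""
    return Fernet(customer_keys[customer_id]).decrypt(ciphertext)


def forget_customer(customer_id: str) -> None:
    """Destroying the key renders any stored ciphertext for this customer useless."""
    customer_keys.pop(customer_id, None)


if __name__ == "__main__":
    record = encrypt_for_customer("cust-42", b"ssn=123-45-6789")
    print(decrypt_for_customer("cust-42", record))  # b'ssn=123-45-6789'
    forget_customer("cust-42")
    # Any remaining ciphertext for cust-42 is now unrecoverable.
```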

Be Ruthless With Data Governance

Storage is cheap and it is easy to collect data; as a result, companies are becoming digital data hoarders. However, to truly protect your business you need to ruthlessly govern your data. Data governance policies need to be established and technically implemented before any production data touches the business. These policies need to be reviewed regularly, and data should be purged the second it is no longer needed. A comprehensive data inventory should be a fundamental part of your security and privacy program so you know where the data is, who owns it and where it is in the data lifecycle.
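As a rough illustration, here is a minimal sketch of a retention purge job. The data categories, retention periods and record shape are hypothetical; the point is that retention rules live in code and are enforced automatically rather than sitting only in a policy document.

```python
"""Minimal sketch of a retention-policy purge job (hypothetical categories and periods)."""
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy, per data category.
RETENTION = {
    "support_tickets": timedelta(days=365),
    "web_analytics": timedelta(days=90),
    "marketing_leads": timedelta(days=180),
}


def purge_expired(records: list[dict]) -> list[dict]:
    """Return only the records still inside their retention window; flag the rest for purging."""
    now = datetime.now(timezone.utc)
    kept = []
    for record in records:
        max_age = RETENTION.get(record["category"])
        if max_age is None or now - record["created_at"] <= max_age:
            kept.append(record)
        else:
            print(f"Purging {record['id']} ({record['category']}): past retention")
    return kept


if __name__ == "__main__":
    sample = [
        {"id": "r1", "category": "web_analytics",
         "created_at": datetime.now(timezone.utc) - timedelta(days=200)},
        {"id": "r2", "category": "support_tickets",
         "created_at": datetime.now(timezone.utc) - timedelta(days=30)},
    ]
    print([r["id"] for r in purge_expired(sample)])  # ['r2']
```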

Ruthlessly governing data has a number of benefits for the business. First, it helps control data storage costs. Second, it limits the impact of a data breach or ransomware attack to the explicit time period for which you have kept data. Lastly, it can protect the business from liability and lawsuits by demonstrating that data is properly protected, governed and/or deleted. (You can't disclose what doesn't exist.)

Implement An Effective BCP/DR and BIA Program

Conducting a proper Business Impact Analysis (BIA) of your data should be table stakes for every business. Your BIA should cover what data you have, where it is and, most importantly, what would happen if that data weren't available. Building on top of the BIA should be a comprehensive BCP/DR plan that appropriately tiers and backs up data to support your uptime objectives. However, it seems like companies are still relying on untested BCP/DR plans or, worse, relying solely on a single cloud region for data availability.

Every BCP/DR plan should include a write once, read many (WORM) backup of critical data that is encrypted at the object or data layer. Create WORM backups to support your RTO and RPO and manage the backups according to your data governance plan. A WORM backup prevents a ransomware attack from encrypting the backup, and if the backup is exposed in a data breach it is meaningless to the attacker because the data is encrypted. BCP / DR plans should be regularly tested (up to full business failover), and security teams need to be involved in the creation of BCP/DR plans to make sure the data will have the required confidentiality, integrity and availability when needed.
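As one possible illustration, here is a minimal sketch of writing a WORM backup object using Amazon S3 Object Lock via boto3. The bucket name, key and retention period are hypothetical, and the bucket would need to exist with Object Lock and versioning already enabled; other storage platforms offer equivalent immutability features.

```python
"""Minimal sketch of writing a WORM (write once, read many) backup object to S3 Object Lock."""
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")

# Align retention with your RPO/RTO and data governance plan (example: 90 days).
retain_until = datetime.now(timezone.utc) + timedelta(days=90)

s3.put_object(
    Bucket="example-backup-bucket",            # hypothetical bucket with Object Lock enabled
    Key="backups/2024-07-19/critical-data.enc",
    Body=open("critical-data.enc", "rb"),      # backup already encrypted at the object/data layer
    ServerSideEncryption="aws:kms",            # additional encryption at rest
    ObjectLockMode="COMPLIANCE",               # cannot be shortened or deleted until the date below
    ObjectLockRetainUntilDate=retain_until,
)
```

In compliance mode, neither a ransomware operator nor a compromised administrator account can delete or overwrite the object before the retention date, which is the property that makes the backup trustworthy during recovery.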

Don’t Rely On Regulatory Compliance Activities As Your Sole Benchmark

My last recommendation for any business is this: just because you passed your compliance audit doesn't mean you are good to go from a data security and governance perspective. Compliance audits exist as standards for specific industries to establish a minimum bar for security. Compliance standards can be watered down by industry feedback, lobbying or legal challenges, and a well-designed security program should be more comprehensive than any compliance audit. Furthermore, compliance audits are typically tailored to specific products and services, with specific scopes and limited time frames. If you design your security program to properly manage the risks to the business, including data security and data governance, you should have no issue passing a compliance audit that assesses these aspects.

Wrapping Up

Every business needs to have proper data security and data governance as part of a comprehensive security program. Data should never be stored in plain text and it should be ruthlessly governed so it is deleted the second it is no longer needed. BCP/DR plans should be regularly tested to simulate data loss, ransomware attacks or other impacts to data and, while compliance audits are necessary, they should not be the sole benchmark for how you measure the effectiveness of your security program. Proper data protection and governance will make ransomware and data breaches a thing of the past, but this will only happen if businesses stop treating data as a commodity and start treating it as their most valuable asset.