A CISO’s Analysis of the CrowdStrike Global Outage

Overnight from July 18 to July 19, 2024, Windows systems running CrowdStrike ceased functioning and displayed the blue screen of death (BSOD). As people woke up on the morning of July 19th, they discovered a wide-reaching global outage of the consumer services they rely on for their daily lives, such as healthcare, travel, fast food and even emergency services. The ramifications of this event will continue to be felt for at least the next week as businesses recover from the outage and investors react to the realization that global businesses are extremely fragile when it comes to technology and business operations.

Technical Details

An update by CrowdStrike (CS) to the C-00000291*.sys file, dated 0409 UTC, was pushed to all customers running CS Falcon agents. The file was corrupt (early reports indicate a null-byte header issue), and when Windows attempted to load it the system crashed. Rebooting the impacted systems does not resolve the issue because of the way CS Falcon works. CS Falcon has access to the inner workings of the operating system (the kernel), including memory, drivers and registry entries, which allows CS to detect malicious software and activity. The CS Falcon agent is designed to receive updates automatically in order to keep it current with the latest detections. In this case, the update file was not properly tested and somehow made it through Quality Assurance and Quality Control before being pushed globally to all CS customers. Additionally, many CrowdStrike customers are clearly running CS Falcon on production systems without processes in place to stage Falcon updates and minimize the impact of a failed update (more on this below).
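
For illustration, the widely reported manual workaround was to boot each affected host into Safe Mode or the Windows Recovery Environment and delete the faulty channel file. A minimal sketch of that cleanup step, assuming the default Falcon install path, might look like this:

```python
# Hypothetical sketch of the widely reported manual workaround. Run from Safe
# Mode / WinRE with admin rights; the path assumes a default Falcon install.
from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_faulty_channel_files(driver_dir: Path = DRIVER_DIR) -> list[Path]:
    """Delete channel files matching C-00000291*.sys and return what was removed."""
    removed = []
    for channel_file in driver_dir.glob("C-00000291*.sys"):
        channel_file.unlink()          # remove the corrupt channel file
        removed.append(channel_file)
    return removed

if __name__ == "__main__":
    for path in remove_faulty_channel_files():
        print(f"Removed {path}")
```

At the scale of this outage, the hard part was not the deletion itself but the logistics of applying it by hand across a huge fleet, which is exactly why staged updates and recovery planning matter.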

Global Impact

This is truly a global outage, and the far-reaching list of affected industries attests to the success of CS, but also to the risks that can hit your software supply chain. As of Monday, Delta Air Lines is still experiencing flight cancellations and delays as a result of impacts to its pilot scheduling system. The list of impacted companies can be found here, here and here, but I’ll provide a short list as follows:

Travel – United, Delta, American, major airports

Banking and Trading – VISA, stock exchanges

Emergency & Security Services – Some 911 services and ADT

Cloud Providers – AWS, Azure

Consumer – Starbucks, McDonald’s, FedEx

Once the immediate global impact subsides, there will be plenty of finger-pointing at CrowdStrike for failing to properly test an update, but what this event clearly shows is a lack of investment by some major global companies in site reliability engineering (SRE), business continuity planning (BCP), disaster recovery (DR), business impact analysis (BIA) and proper change control. If companies were truly investing in SRE, BCP, DR and BIA beyond a simple checkbox exercise, this failed update would have been a non-event. Businesses would have simply executed their BCP / DR plan and failed over, or immediately recovered their critical services to get back up and running (which some did). Or, if they were running proper change control alongside immutable infrastructure, they could have immediately rolled back to the last good version with minimal impact. Clearly, more work needs to be done by all of these companies to improve their plans, processes and execution when a disruptive event occurs.

Are global companies really allowing live updates to mission-critical software in production without proper testing? Better yet, production systems should be immutable, preventing any change to production unless it is updated in the CI/CD pipeline and then re-deployed. Failed updates became an issue almost two decades ago when Microsoft began Patch Tuesday. Companies quickly figured out they couldn’t trust the quality of the patches and instead would test them in a staging environment that duplicates production. While this may have created a short window of vulnerability, it came with the advantages of stability and uninterrupted business operations.
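
To make the staging gate concrete, here is a minimal sketch of the pattern, with hypothetical deploy() and run_health_checks() functions standing in for whatever deployment and monitoring tooling you actually use:

```python
# Illustrative sketch of a staging gate: a patch only reaches production after
# soaking in a production-like staging environment and passing health checks.
# deploy() and run_health_checks() are hypothetical stand-ins for real tooling.
import time

SOAK_PERIOD_S = 24 * 3600   # let the patch run in staging for a day

def promote_patch(patch_id: str, deploy, run_health_checks) -> None:
    deploy(patch_id, environment="staging")
    time.sleep(SOAK_PERIOD_S)                      # soak window in staging
    if not run_health_checks(environment="staging"):
        raise RuntimeError(f"Patch {patch_id} failed staging checks; not promoting")
    deploy(patch_id, environment="production")     # only now touch production
```

The soak period is the trade-off described above: a short window where staging lags behind the latest patch, exchanged for stability in production.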

Modern-day IT operations teams (now often called Platform Engineering or Site Reliability Engineering) design production environments to be immutable and somewhat self-healing. All changes need to be made in code and then re-pushed through dev, test and staging environments to make sure proper QA and QC are followed. This minimizes the impact of failed code pushes and will also minimize disruption from failed patches and updates like this one. SRE also closely monitors production environments for latency thresholds, availability targets and other operational metrics. If the environment exceeds a specific threshold, it throws alerts and will attempt to self-heal by allocating more resources or by rolling back to the previous known-good image.
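
A simplified sketch of that monitor-and-roll-back loop is below; fetch_error_rate() and rollback_to() are hypothetical hooks into your own monitoring and deployment systems, not any specific vendor’s API:

```python
# Illustrative self-healing loop: poll an operational metric and roll back to
# the last known-good image if it breaches a threshold several times in a row.
import time

ERROR_RATE_THRESHOLD = 0.05   # e.g. 5% of requests failing
BREACH_LIMIT = 3              # consecutive breaches before acting

def watch_and_heal(fetch_error_rate, rollback_to, last_good_image, interval_s=60):
    breaches = 0
    while True:
        if fetch_error_rate() > ERROR_RATE_THRESHOLD:
            breaches += 1
            if breaches >= BREACH_LIMIT:
                rollback_to(last_good_image)   # redeploy the previous known-good build
                breaches = 0
        else:
            breaches = 0
        time.sleep(interval_s)
```

In practice this logic usually lives in your orchestration or deployment platform rather than a hand-rolled script, but the principle is the same: detect the breach automatically and fall back to a version that is known to work.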

Ramifications

Materiality

Setting aside the maturity of business and IT operations, there are some clear ramifications of this event. First, it had a global impact on a wide variety of businesses and services. Some of the biggest impacts were felt by publicly traded companies, and as a result these companies will need to make an 8-K filing with the SEC to report a material event to their business. Even though this wasn’t a cybersecurity attack, it was still an event that disrupted business operations, so companies will need to report the expected impact and loss accordingly. CrowdStrike in particular will need to make an 8-K filing, not only for the loss of stock value, but for the expected loss of revenue through lost customers, contractual concessions and other tangible impacts to its business. When I started this post on Friday, the day of the event, CS stock was down over 10%, and by Monday morning it was down almost 20%. The stock has started to recover, but that is clearly a material event to investors.

Greater Investment In BCP / DR & BIA

Recent events, such as this one and the UHC Change Healthcare ransomware attack, have clearly shown that some businesses are not investing properly in BCP / DR. They may have plans on paper, but those plans still need to be fully tested, including rapidly identifying service degradation and executing recovery operations as quickly as possible. The reality is this should have been a non-event, and any business that was impacted for longer than a few hours needs to consider additional investment in its BCP / DR plan to minimize the impact of future events. CISOs need to work with the rest of the C-Suite to review existing BCP / DR plans and update them accordingly based on the risk tolerance of the business and the desired RTO and RPO.

Boards Need To Step Up

During an event like this one, boards need to take a step back and remember their primary purpose is to represent and protect investors. In this case, the sub-committees that govern technology, cybersecurity and risk should be asking hard questions about how to minimize the impact of future events like this and whether the existing investment in BCP / DR technology and processes is sufficient to offset a projected loss of business. This may include more frequent reports on when BCP / DR plans were last properly tested and whether those plans account for all of the possible scenarios that could impact the business, such as ransomware, supply chain disruption or global events like this one. The board may also push the executive staff to accelerate plans to invest in and modernize IT operations to eliminate tech debt and adopt industry best practices such as immutable infrastructure or SRE. The board may also insist on a detailed analysis of the risks in the supply chain, including plans to minimize single points of failure and limit the blast radius of future events.

Negative Outcomes

Unfortunately, this event is likely to cause a negative perception of cybersecurity in the short term for a few different reasons. First, people will be questioning the obvious business disruption. How is it that a global cybersecurity company is able to disrupt so much with a single update? Could this same update process act as an attack vector? Reports already indicate that malicious domains have been set up to look like the fix for this event but instead push malware. Other malicious domains have been created for phishing purposes, and the reality is any company impacted by this event may also be vulnerable to ransomware, social engineering and other follow-on attacks.

Second, this event may cause a negative perception of automatic updates within IT operations groups. I personally believe this is the wrong reaction, but the reality is some businesses will turn off auto-updates, which will leave them more vulnerable to malware and other attacks.

What CISOs Should Do

With all this in mind, what should CISOs do to help the board, the C-Suite and the rest of the business navigate this event? Here are my suggestions:

First, review your contractual terms with third-party providers to understand contractually defined SLAs, liability, restitution and other clauses that can help protect your business when an event is caused by a third party. This should also include a risk analysis of your entire supply chain to identify single points of failure and determine how to protect your business appropriately.

Second, insist on increased investment in your BIA, BCP and DR plans, including designing for site reliability and random failures (think Chaos Monkey) so you can proactively identify and recover from disruption, and review your RTO and RPO. If your BCP / DR plan is not where it needs to be, it may require investment in a multi-year technology transformation plan, including retiring legacy systems and resolving tech debt. It may also require modernizing your SDLC to shift to CI/CD, with tightly controlled dev, test, staging and prod environments. The ultimate goal is to move to immutable infrastructure and IT operations best practices that allow your services to operate and recover without disruption. I’ve captured my thoughts on some of the best practices here.

Third, resist the temptation to overreact. The C-Suite and investors are going to ask some hard questions about your business, and they will suggest a wide range of solutions such as turning off auto-patching, ripping out CS or even building your own solution. All of these suggestions have clear tradeoffs in terms of risk and operational investment. Making a poor, reactive decision immediately after this event can harm the business more than it helps.

Finally, for mission-critical services, consider shifting to a heterogeneous environment that statistically minimizes the impact of any one vendor. The concept is simple: if you need a security technology to protect your systems, consider purchasing from multiple vendors with similar capabilities so that the impact on your business operations is minimized if one of them has an issue. This obviously raises the complexity and operational cost of your environment and should only be used for mission-critical or highly sensitive services that absolutely must minimize any risk to operations. However, this event does highlight the risks of consolidating to a single vendor, and you should conduct a risk analysis to determine the best course of action for your business and supply chain.

Wrapping Up

For some companies this was a non-event. Once they realized there was an outage, they simply executed their recovery plans and were back online relatively quickly. For other companies, this event highlighted a lack of investment in IT operations fundamentals like BCP / DR and supply chain risk management. On the positive side, this wasn’t ransomware or another form of cybersecurity attack, so recovery is relatively straightforward for most businesses. On the negative side, lasting damage is possible if businesses overreact and make poor decisions. As a CISO, I highly recommend you take advantage of this event to learn from your weaknesses and make plans to shore up the aspects of your operations that were substandard.

Defining Your Security Front Door

A key skill for any security program is partnering with and enabling the business to be successful. CISOs need to ensure their security teams are approachable, reasonable and, most importantly, able to balance the needs of the business against potential security risks. While security teams exist to help protect the business, they don’t own the business systems or processes, and as a result they need to adopt an advisory or consultative role with the rest of the business to ensure success.

With that in mind, the way the rest of the business finds and engages with the security team can create a good first impression or can set the tone for a difficult interaction. Think of a house that has great curb appeal – it feels inviting and gives the impression that the owners take good care of their property. The same concept exists for the security program, which I call the Security Front Door.

The Front Door Concept

The security front door defines how the rest of the business engages and interacts with the security team. The front door can be a Confluence page, a Slack channel with pinned information, or any other place that is easily discoverable and accessible. Your security front door should clearly lay out information and resources so the rest of the business can either self-serve or easily request and receive help when needed.

What Should Be In Your Front Door?

The front door for your security program should include ways to perform the actions most commonly requested of the security team. For example, you probably want really clear ways to do the following:

  • Report an incident
  • Request vulnerability remediation help
  • Request an exception
  • Request an architectural review
  • View dashboards
  • Discover documentation for policies and processes
  • Other – a general way to request help for anything else

The front door is not just a way to make a good first impression and enable the business; when set up correctly, it can actually offload work from the security team and help the business move faster.

Wrapping Up

The front door is a great way to engage with the business and help them move faster, find information and request assistance from the security team. When done correctly, it allows the rest of the business to self-serve and can actually offload the security team by reducing the volume of incoming requests. Setting up the security front door may require a lot of up-front work, but by understanding the rest of the business, its key pain points and its most commonly requested security asks, you can design a front door that is a win-win for everyone.

Opsec During Incidents

When I first got into Information Technology over 20 years ago, I started out in networking and data centers. When doing work, we always made sure to have another method of accessing our gear in case a configuration change went wrong and cut off access to the equipment or environment. Also, during my military service we would always discuss what would happen in a worst-case scenario and have contingency plans in case something went wrong. I’ve carried these principles over into my security career and they have served me well. Given the high likelihood that we are all operating in some type of compromised environment, I think it is worth covering how these principles can be used to maintain operational security during incidents.

Philosophy

Hope for the best, prepare for the worst

No battle plan withstands first contact with the enemy

Train how you fight

There are a lot of things you can do to prepare for a security incident, such as running tabletop exercises, preparing playbooks and tuning tool sets, but there is an important area that I think is often overlooked during preparation. There is an implicit assumption that all of the fancy tools and environments you have prepared will be available when the incident kicks off. If you are preparing for your incidents with this assumption, you could find your team in trouble. One key component of a realistic tabletop scenario is to remove capabilities and see how the team reacts. This could be something as simple as saying the incident commander is no longer reachable, or something more drastic like not being able to access your SIEM. The point is, the more worst-case situations you can prepare for, the better off you will be during a real incident. Here are a few things you should consider to maintain operational security and the ability to respond.

Access To Playbooks

Playbooks are a critical tool for responding to an incident. Incidents can be highly stressful and playbooks exist to help your responders remember all of the steps, access methods, processes, tools, etc. they need to successfully respond. What would you do if an incident kicked off and you couldn’t access your playbooks? Does your team have these playbooks stored in an alternative environment? Do they sync them to an offline, physical storage device that is stored in a safe or passed between people on call?

Playbooks not only need to be accessible via alternate methods, they also need to be protected. What if the attacker had access to your playbooks? I’m willing to bet your playbooks contain a detailed list of everything your team will do to respond, and that can be a useful blueprint for an attacker to follow to avoid detection or expand their footprint. I highly recommend encrypting your playbooks and vaulting the credentials so you can monitor and control access as needed.
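
As a simple illustration, a playbook can be encrypted at rest with a vaulted key using the cryptography package; a minimal sketch, assuming the key itself lives in your secrets vault rather than on disk, might look like this:

```python
# Minimal sketch of encrypting a playbook at rest with the "cryptography"
# package (pip install cryptography). The key should be stored and retrieved
# through your credential vault, never alongside the encrypted file.
from cryptography.fernet import Fernet

def encrypt_playbook(path: str, key: bytes) -> None:
    cipher = Fernet(key)
    with open(path, "rb") as f:
        plaintext = f.read()
    with open(path + ".enc", "wb") as f:
        f.write(cipher.encrypt(plaintext))

def decrypt_playbook(enc_path: str, key: bytes) -> bytes:
    cipher = Fernet(key)
    with open(enc_path, "rb") as f:
        return cipher.decrypt(f.read())

if __name__ == "__main__":
    key = Fernet.generate_key()          # in practice, fetch this from the vault
    encrypt_playbook("ir-playbook.pdf", key)
```

Because responders have to pull the key from the vault to read the playbook, every access is logged and can be monitored, which is exactly the control described above.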

Access To Tooling

Tooling is also essential during the incident response process. Tools help teams investigate, gather evidence and ultimately recover from the incident. Access to tools and scripts can help with a number of activities, such as identifying hashing algorithms, identifying password encryption types, scanning for open ports or scanning for vulnerabilities. Tools that are useful during incident response should be stored in an image or container that is updated regularly and kept in an offline environment. This helps the incident response team in a few ways. First, it makes sure all of the tools called for in the playbook are available even if systems and network access are down. Second, it avoids having to pull unvetted tools from the internet. In a stressful situation like an incident it is possible to overlook small details that could make your situation worse, such as downloading a tool that contains a trojan or backdoor. Lastly, it is important to control the flow of information during an incident. If your team is using online tools as part of the investigation, they could be leaking information to the internet that reveals details about the incident, which could not only derail your investigation but also harm your company’s brand and reputation.
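
As an example of the kind of small, self-contained tooling worth pre-staging, the sketch below hashes evidence files using only the Python standard library, so nothing needs to be downloaded mid-incident (the ./evidence directory is just a placeholder):

```python
# Illustrative offline tooling: hash evidence files with the standard library
# so their integrity can be verified later, with no internet access required.
import hashlib
from pathlib import Path

def hash_file(path: Path, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)        # stream the file so large images fit in memory
    return digest.hexdigest()

if __name__ == "__main__":
    for evidence in sorted(Path("./evidence").glob("*")):
        if evidence.is_file():
            print(f"{hash_file(evidence)}  {evidence.name}")
```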

Another consideration for tooling is how you plan to preserve evidence if you are under active compromise. Pre-staged environments that are isolated and only used for incidents can be useful. Similarly, tooling to image machines, quarantine networks or even encrypt files for transfer to an alternate cloud or storage provider can all come in handy during an incident.

Access To Communications

Communication during an incident is key. I like to set up a standing war room where people can come and go as needed. I am there to make decisions and help direct resources where they can be most useful. In the chaos of an incident it is important to make sure you have primary, secondary and sometimes even tertiary communication methods available. What happens if Zoom or WebEx is down and you can’t stand up a virtual war room? Do you have access to a call sheet so you can get hold of everyone and stand up a conference call? What would you do if your primary chat method, such as Slack or Teams, wasn’t available or was compromised? Do you have Signal or Wickr set up so you can still chat with your team?

Additionally, I highly recommend you copy legal on your incident communications so they can advise you accordingly (see my post on Legal Privilege). I also recommend you encrypt your communications, which can help protect them if you are operating in a compromised environment.

Access To People

Lastly, but probably most important, is access to your people. People are essential when responding to an incident, and running tabletop exercises to prepare them is critical. Cross-training should also happen regularly in case an expert in a specific skill isn’t readily available. Ultimately, the more training and cross-training your team has, the better prepared you will be for an incident.

Conducting tabletop drills and running through failure scenarios can be fun as well as educational. The next time you run a tabletop exercise, take out your top person and see how the team reacts. Put people on the spot in different roles such as investigation, evidence gathering, recovery, interfacing with engineering, etc. If someone struggles with a role during the tabletop, you can flag that area for additional training.

Leadership is also essential during an incident, and I’ve written blog posts about this in the past. Try rotating people through different leadership positions so they can get a feel for what it takes. This can also help them understand and anticipate what is needed from them during an incident.

Wrapping Up

Incidents can be exciting and stressful at the same time. The thoroughness and frequency of your training will dictate how well your team responds during an incident. Most importantly, planning for various worst-case scenarios helps ensure your team is able to successfully respond, communicate and lead the rest of the business through an incident.

What Happens Post-Incident?

A security incident is one of the most exciting, stressful and true tests of a CSO’s ability to lead during a crisis. Unfortunately, it is inevitable that CSOs will experience several security incidents during their tenure. This could be something as small as a configuration error, or as large as a full-on public data breach. I also think CSOs should assume they are operating in some state of compromise, such as a malware infection or an attacker with complete remote access. Given the volatility of enterprise environments and the inevitability of some sort of security incident, the question becomes: what happens afterwards? Just because you are no longer under active attack doesn’t mean your work is done. As a CSO, here are some things you should consider after an incident:

Retro / Post Mortem

First, I highly recommend conducting a retro or post-mortem on the incident. This is a blameless session to discuss what happened, why it happened and, most importantly, what the team learned from the incident and what they are collectively going to do differently. During this exercise the CSO should plan to ask questions and listen a lot. I find being the note-taker is exceptionally helpful because it is difficult for me to talk while taking notes. I want to hear opinions and ideas from the team on what happened and how we are going to improve. The results of this retro will likely kick off follow-on activities such as increased training, requests for investment, development of new processes, creation of new detections or even additional automation. The point is, it is a time to learn and rebuild.

Investment

Never let an incident go to waste. Instead of viewing the incident as a failure, I choose to look at it as an opportunity for growth. It is easy to point the finger after the fact, but no environment is 100% perfect and secure. Therefore, after the retro is completed, it is important to capitalize on the incident and ask for additional investment to address the improvement areas raised in the retro. This could mean additional personnel to respond to events, additional training to fill knowledge gaps, a new tool or technology to improve detection and response, or new processes to improve the response itself. As the CSO, you need to distill the retro suggestions into an actionable plan that focuses on the risks along with areas for improvement and investment.

Increased Regulatory Requirements

Unfortunately, there are consequences for having a security incident. For publicly traded companies this can come in the form of increased regulatory requirements, such as being required to have an independent third party audit your security practice on a periodic basis. It could also mean fines, or reporting the incident as material in your upcoming SEC filings. You may even get inquiries from various regulatory bodies asking what happened and what you are doing to prevent a recurrence.

Legal Ramifications

Similar to the regulatory requirements, your company may face increased legal pressure as a result of the security incident. This could come from a number of sources: lawsuits from outside entities (such as a consumer group or class action), increased contractual obligations from customers who are now concerned about your security practices, law enforcement investigations, and the cost of outside counsel with expertise in your specific business area who are helping your company limit the damage.

Financial Implications

This is probably the biggest area for a company to navigate after an incident. Financial implications can manifest in a number of ways. Here are a few:

Increased Cyber Insurance Costs

As a result of the incident, your company may face an increase in cyber insurance premiums. These premiums may even become so expensive that your company can no longer afford to pay them, or your company could be deemed uninsurable.

Customer Impact

Depending on your business, this could be something simple like contracts not being renewed and loss of new business, or something more material like having to notify all of your consumers, pay for credit monitoring and compensate your loyal customers in some way to retain their business.

Market Impact

This is a broad area, but post-incident your company could face a decrease in stock price. It may even be more difficult for your company to secure lines of credit because of the business risk. The severity and financial impact of the incident could potentially put your company at risk of an M&A takeover, or may require the business to declare bankruptcy and look for a buyer with enough capital to weather the storm.

Budget Impact

Again, this is a broad area, but whenever I have an incident I try to keep track of the number of people involved, the number of hours it took to deal with it, and the opportunity cost of dealing with it versus doing something else I had planned. As an example, the opportunity cost could be something like “as a result of this incident we had to delay project x for 3 months.” All of this will have an impact on timing, budgets and manpower.
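
A back-of-the-envelope calculation along those lines might look like the sketch below; every figure is a placeholder, so substitute your own headcount, hours and loaded rates:

```python
# Rough, illustrative incident cost estimate. All numbers are placeholders.
responders = 12                   # people pulled into the response
hours_each = 40                   # average hours each person spent
loaded_hourly_rate = 150          # fully loaded cost per person-hour (USD)

direct_labor = responders * hours_each * loaded_hourly_rate

# Opportunity cost of slipping "project x" by 3 months, estimated separately.
delayed_project_value = 250_000

total_estimate = direct_labor + delayed_project_value
print(f"Direct response labor: ${direct_labor:,}")    # $72,000
print(f"Estimated total cost:  ${total_estimate:,}")  # $322,000
```

Even a rough number like this is far more useful in a budget conversation with the CFO than “the incident took a lot of time.”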

Can You Calculate The True Cost Of An Incident?

Unless your company keeps track of everyone’s time and is able to create a time-code entry for each specific incident, I find it unlikely that we will ever know the true cost of an incident. There are many direct costs, but also many indirect costs that are hard to estimate. The ramifications of an incident may stretch on for years and may change hands from one CSO to the next. However, I do think a CSO should have enough grounding in business principles to be able to estimate the cost of an incident. This can be useful when gaining support from your C-Suite peers, when presenting to the board or when making investment requests to the CFO. Having context for all of the ramifications of an incident, along with potential areas of growth and improvement, can be a valuable story to tell.

Leadership During An Incident

At some point in your CSO career you are going to have to deal with and lead through an incident. Here are some things I have found helpful.

Know Your Role

Unless you work at a very small company, I argue your role is not to be hands-on-keyboard during an incident. You shouldn’t be looking up hashes, checking logs, etc. Your role is to coordinate resources, focus efforts and cognitively offload your team by taking key decisions off their plate. You need to lead people through this chaotic event.

Declaring An Actual Incident

This may vary depending on company size and type, but in general the CSO should not be the one to declare a security incident. The CSO (and their representatives) can certainly advise and recommend, but declaring an incident carries legal, regulatory and business ramifications, so the decision should be made by a combination of the Chief Legal Counsel and some representation of the C-Suite (CEO, CTO, etc.). Once an incident is declared, your company will most likely need to disclose it on SEC forms, and customers may need to be notified. All of this could impact your company’s reputation, stock price and customer goodwill.

Use A War Room

A war room is simply a place dedicated to this function, where everyone can gather for updates, questions, etc. If you are physically in the office, it is a dedicated conference room with privacy from onlookers. If you have a virtual team, it is a Zoom, Teams or WebEx meeting that gets created and shared with the people who need to know.

The CSO’s role is to keep the war room active and focused. Once the war room is created and the right people join, everyone should discuss what happened, what is impacted and what the course of action should be. Document this somewhere and pin it to the appropriate channels. If people join and start asking basic questions, send them away to read the existing documentation first. If people want to have a detailed technical discussion, send them to a breakout room. The point is to keep the main room clear for making decisions and directing resources.

Bridge The Gap

Your role during an incident is twofold: 1) communicate to other leaders within the company about what happened so you can get the appropriate support to resolve the incident, and 2) direct the appropriate resources to focus on resolving the incident quickly, while following the appropriate chain of custody, legal requirements, customer notifications, etc.

Communicating To Executive Leadership and the Board

Keep it short and sweet so they can respond as needed. The purpose of this email is to inform them so they can give you the support you and your team need. Make sure to invoke legal privilege and keep the audience small (I discuss this in my post about Legal Privilege).

I use the following email template when communicating about an incident.

Subject: PRIVILEGED – Security Incident In [Product/Service X]

A security incident was detected at [Date / Time] in [product x] resulting in [data breach/ransomware/etc.] At this time the cause of the incident is suspected to be [x]. Customer impact is [low/medium/high/critical].

The security team and impacted product team are actively working to resolve the incident by [doing x]. This resolution is expected [at date / time x].

For any questions please reach out to me directly or join the war room [here].

Next update to this audience in [x time period].

Communicating To Responders

Your job here is to get the team any resources they need, offload them from decisions and then get out of their way. It is also important to buffer them from distractions and protect them from burnout by enforcing handoffs and reminding people to take breaks. It is easy for your team to get caught up in the excitement and sacrifice their personal well-being. Learn to recognize the signs of fatigue and have resource contingency plans in place so you can shift resources as needed to keep the overall investigation and response on track.

Designate someone to help coordinate logistics like meeting times, capturing notes, etc. Capture action items, who owns each one, and when the next update or expected completion is due.

Have Backup Plans And Practice Using Them

Hope for the best and prepare for the worst. Can your incident response team still function if your messaging service is down? What if your paging program doesn’t work, or you can’t stand up a virtual war room? Your incident response playbooks should include fallback plans for out-of-band communications in the event of a total disruption of productivity services at your company. Practice using these during tabletop exercises so everyone knows the protocol for falling back on them when needed.

Wrapping Up

Incidents are both exciting and stressful. It is up to the CSO to lead from the front and provide guidance to their team, executive leadership and the rest of the organization. CSOs need to buffer their teams so they can focus on the task at hand, while protecting them from burnout. CSOs also need to remember that the conduct and response of the organization could be recalled in court someday, so following appropriate evidence collection, notification guidelines and legal best practices is a must.