The CrowdStrike outage affected millions of people and systems, and it taught us some valuable lessons for weathering the next storm. To their credit, CrowdStrike found the issue quickly, turned around a fixed version of their update, and gave affected customers guidance and information on recovering. You can tell a lot about a business by how it responds to adversity and disaster. Yes, they should have had more testing and controls in place, and yes, the effects were felt broadly, but the response was an example for companies looking for what "good looks like."
Here are some of the lessons I took away from the outage and from what companies were reporting. These are generalized; I will publish more specifics in the near future.
Preparation and Planning
Comprehensive DR Plans:
Ensure that disaster recovery (DR) plans cover a wide range of scenarios, including service outages. Regularly update and test these plans.
Redundancy and Failover:
Implement robust redundancy and failover mechanisms to minimize the impact of outages. This includes geographical diversity of data centers and automated failover processes.
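To make the failover idea concrete, here is a minimal sketch in Python. It assumes you keep a priority-ordered list of endpoints (for example, one per data center) and have some health check you can call; the endpoint names and the `is_healthy` callable are hypothetical placeholders, not part of any particular product.

```python
from typing import Callable, Optional, Sequence


def pick_active_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as an unhealthy endpoint.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical endpoints in two regions; the primary is simulated as down.
    sites = ["primary.us-east.example", "standby.us-west.example"]
    down = {"primary.us-east.example"}
    print(pick_active_endpoint(sites, lambda e: e not in down))
```

In practice the health check would be a real probe (HTTP, TCP, or a synthetic transaction), and the decision would usually live in a load balancer or DNS layer rather than application code.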
Detection and Response
Early Detection Systems:
Employ monitoring tools that detect issues early. This can involve anomaly detection systems that flag unusual patterns in network traffic or system performance.
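As an illustration of what "flagging unusual patterns" can mean, the sketch below compares each new sample with the mean and standard deviation of the samples just before it. The window size, the threshold, and the sample values are assumptions chosen for the example; real monitoring tools use more sophisticated models.

```python
from statistics import mean, stdev
from typing import List, Sequence


def flag_anomalies(
    samples: Sequence[float],
    window: int = 5,
    threshold: float = 3.0,
) -> List[int]:
    """Return indices of samples that deviate sharply from the preceding window."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all counts as unusual.
            if samples[i] != mu:
                flagged.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with one spike at index 6.
    latencies = [100, 102, 99, 101, 100, 103, 500, 101]
    print(flag_anomalies(latencies))
```

A simple check like this catches sudden spikes, but it will miss slow drifts, which is one reason to pair it with fixed thresholds and alerting on error rates.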
Clear Communication Protocols:
Establish clear internal and external communication protocols to keep all stakeholders informed during an incident. This includes having pre-drafted messages for different types of incidents.
Incident Management
Incident Response Team:
Maintain a dedicated and well-trained incident response (IR) team that can quickly mobilize during an outage. Regular training and simulations are crucial.
Root Cause Analysis:
Perform thorough root cause analysis post-incident to understand what went wrong and how to prevent it in the future.
Customer Impact and Communication
Transparency:
Be transparent with customers about the nature of the outage, its impact, and the steps being taken to resolve it. Transparency builds trust.
Customer Support:
Enhance customer support capabilities during an outage. This includes providing regular updates and having a clear channel for customer inquiries.
Continuous Improvement
Post-Incident Review:
Conduct a detailed post-incident review to identify what went well and what didn’t. Use this review to update IR and DR plans.
Feedback Loop:
Create a feedback loop where lessons learned from incidents are incorporated into ongoing training and system improvements.
Technology and Tools
Modernize Infrastructure:
Use modern, resilient infrastructure and cloud services that offer built-in redundancy and disaster recovery options.
Automation:
Leverage automation in both incident detection and response to reduce the time to mitigate issues.
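One common way to automate response is to map each type of alert to a predefined runbook and escalate to a person when no runbook exists. The sketch below shows that pattern only; the alert kinds, the field names, and the handler functions are hypothetical, and the handlers just return a description instead of touching real systems.

```python
from typing import Callable, Dict

Runbook = Callable[[dict], str]


def restart_service(alert: dict) -> str:
    # Placeholder: a real runbook would call your orchestration tooling here.
    return f"restart requested for {alert['service']}"


def rollback_update(alert: dict) -> str:
    # Placeholder: a real runbook would trigger a rollback to the last good version.
    return f"rollback requested for {alert['service']}"


RUNBOOKS: Dict[str, Runbook] = {
    "service_down": restart_service,
    "bad_update": rollback_update,
}


def respond(alert: dict, runbooks: Dict[str, Runbook] = RUNBOOKS) -> str:
    """Run the matching runbook, or escalate when no automated response is defined."""
    handler = runbooks.get(alert.get("kind"))
    if handler is None:
        return f"escalate to on-call: no runbook for {alert.get('kind')!r}"
    return handler(alert)


if __name__ == "__main__":
    print(respond({"kind": "bad_update", "service": "endpoint-agent"}))
    print(respond({"kind": "disk_full", "service": "db01"}))
```

Keeping the escalation path as the default matters: automation should shorten the response to known problems without hiding unknown ones.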
Security Integration
Integrate Security in DR Plans:
Ensure that security considerations are integrated into DR planning. This includes protecting backup data and ensuring that recovery processes are secure.
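One small, concrete piece of protecting backup data is confirming that a backup file has not changed since it was written. The sketch below does that by comparing the file's SHA-256 digest with a digest recorded at backup time; the function names are my own, and it assumes the recorded digest is stored somewhere separate from the backup itself.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path: str, expected_hex: str) -> bool:
    """Return True if the file's digest matches the digest recorded at backup time."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

A matching digest shows the file is unchanged, not that it can be restored, so this complements regular restore tests and does not replace them.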
Threat Intelligence:
Use threat intelligence to anticipate and prepare for potential threats that could lead to outages.
A follow-up with specific examples will be published in the near future, covering specific issues and methods for preventing or avoiding them.
© William Tulaba / All Rights Reserved / Information Security
Natick, MA 01760