CrowdStrike Incident Shows Importance of Test Strategy, Phased Rollout, and Learning from Failure
Hey, I am Klaus Haeuptle! Welcome to this edition of the Engineering Ecosystem newsletter, in which I write about a variety of software engineering and architecture topics such as clean code, test automation, decision-making, technical debt, large-scale refactoring, culture, sustainability, cost and performance, generative AI, and more.
A recent CrowdStrike incident caused one of the biggest IT outages in history, with a severe negative impact on organizations across the globe. Delta Air Lines alone claims to have lost 500 million US dollars after having to cancel 5,000 flights in three days. CrowdStrike has now published its Root Cause Analysis, which investigates the defect that led Windows machines to display the blue screen of death. The report also provides insights into the testing process and into what needs to be improved.
The incident was caused by a configuration update that was rolled out to all customers at once. The update was not tested properly and triggered a crash at the Windows kernel level, affecting 8.5 million Windows machines.
On Friday, July 19, 2024 at 04:09 UTC, as part of regular operations, CrowdStrike released a content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques. The crashes were due to a defect in the Rapid Response Content, which went undetected during validation checks. When the content was loaded by the Falcon sensor, this caused an out-of-bounds memory read, leading to Windows crashes (BSOD).
(Source: CrowdStrike Root Cause Analysis)
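To make the failure mode concrete, here is a minimal sketch in Python of a bounds-checked field read. Python raises an exception instead of reading stray memory, so this only mirrors the logic and not the kernel-level behavior. The names read_field, evaluate, and ContentError are hypothetical and are not taken from the Falcon sensor.

```python
from typing import List, Optional


class ContentError(Exception):
    """Raised when content references data that was not supplied (hypothetical)."""


def read_field(inputs: List[str], index: int) -> str:
    """Return inputs[index], rejecting any index outside the supplied inputs."""
    # Negative indexes are rejected too, because Python would silently wrap them.
    if not 0 <= index < len(inputs):
        raise ContentError(
            f"content asks for field {index}, but only {len(inputs)} were supplied"
        )
    return inputs[index]


def evaluate(inputs: List[str], wanted_index: int) -> Optional[str]:
    """Skip a faulty content entry instead of letting the error take down the host."""
    try:
        return read_field(inputs, wanted_index)
    except ContentError:
        return None  # fail safe: ignore the entry and keep running
```

The design choice worth noting is that the check lives in the component that consumes the content, so a faulty update degrades detection for one entry rather than crashing the machine.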
How CrowdStrike prevents this from happening again
CrowdStrike plans to introduce several changes to prevent similar incidents from happening in the future. These changes include:
Local developer testing
Content update and rollback testing
Stress testing, fuzzing and fault injection
Stability testing
Content interface testing
Additional validation checks for configuration changes
Enhanced error handling
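Two of the items above, validation checks and fuzzing, can be sketched together in a few lines of Python. This is an illustrative toy and not CrowdStrike's tooling: EXPECTED_FIELDS, validate_content, consume, and fuzz are hypothetical names, and the entry format (a list of non-empty strings) is an assumption made for the example.

```python
import random
from typing import List

EXPECTED_FIELDS = 3  # hypothetical number of fields the consumer reads


def validate_content(entry: List[str]) -> bool:
    """Accept an entry only if it has exactly the fields the consumer will read."""
    return len(entry) == EXPECTED_FIELDS and all(
        isinstance(field, str) and field for field in entry
    )


def consume(entry: List[str]) -> str:
    """Stand-in for the consumer: reads every expected field by index."""
    return "|".join(entry[i] for i in range(EXPECTED_FIELDS))


def fuzz(rounds: int = 1000, seed: int = 0) -> int:
    """Feed random entries through validate-then-consume and count accepted ones.

    Any exception escaping consume() for a validated entry points to a gap
    between what the validator accepts and what the consumer can handle.
    """
    rng = random.Random(seed)  # fixed seed keeps a failing run reproducible
    accepted = 0
    for _ in range(rounds):
        length = rng.randint(0, 5)
        entry = [rng.choice(["", "a", "*", "xyz"]) for _ in range(length)]
        if validate_content(entry):
            consume(entry)  # must not raise for validated input
            accepted += 1
    return accepted
```

The point of the fuzz loop is that validator and consumer are exercised together, which is how a mismatch between the two would surface before release.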
Further changes are around phased rollouts:
Implement a staggered deployment strategy in which updates are gradually deployed to larger portions of the customer base, starting with a canary deployment.
Improve monitoring and collect feedback during deployment to guide the phased rollout.
Provide customers with greater control over the delivery of updates by allowing granular selection of when and where these updates are deployed.
Provide content update details via release notes, which customers can subscribe to.
In addition, CrowdStrike plans to have multiple independent third-party security code reviews conducted.
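The staggered deployment described above can be sketched as a loop over rollout stages with a health gate between them. This is a minimal sketch under stated assumptions: the stage fractions, the failure budget, and the deploy and is_healthy callbacks are all hypothetical placeholders, not details from the CrowdStrike report.

```python
from typing import Callable, List

# Hypothetical stages: share of the fleet that has the update after each stage.
STAGES = [0.01, 0.10, 0.50, 1.00]


def phased_rollout(
    fleet: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> bool:
    """Deploy stage by stage and stop once a stage exceeds the failure budget.

    Returns True if the whole fleet was updated, False if the rollout was halted.
    """
    done = 0
    for fraction in STAGES:
        target = max(1, int(len(fleet) * fraction))
        batch = fleet[done:target]  # hosts that are new in this stage
        for host in batch:
            deploy(host)
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            return False  # halt: later stages never receive the update
        done = target
    return True
```

With a defective update, only the canary batch is affected before the rollout stops, which is the difference between a contained incident and a global one.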
How Microsoft plans to prevent this from happening again
Microsoft helped customers recover from the incident. Another question is how Microsoft plans to prevent such incidents in the future. Why can a third-party company run code at kernel level that is able to crash the operating system? The perspective from the Wikipedia article on CrowdStrike is also very interesting:
Microsoft blamed a 2009 antitrust agreement with the European Union that they said forced them to sustain low-level kernel access to third-party developers. The document does not explicitly state that Microsoft has to provide kernel-level access, but says Microsoft must provide access to the same APIs used by its own security products. The EU rejected the allegations. The European Commission spokesperson told Euronews that "Microsoft is free to decide on its business model. It is for Microsoft to adapt its security infrastructure to respond to threats in line with EU competition law. Additionally, consumers are free to benefit from competition and choose between different cybersecurity providers."
The spokesperson also said that "the incident was not limited to the European Union and that Microsoft has never raised any concerns about security with the Commission either before or after the incident."
In Linux, it is possible to use eBPF instead of kernel modules to program this type of software.
Since macOS Catalina (2019), this type of software can use the Endpoint Security Framework instead of a kernel extension, and this approach has been gradually enforced.
This highlights the complexity and underscores the importance of a detailed incident review from Microsoft's perspective. Could similar architecture changes be made to Windows to prevent such incidents in the future? Microsoft has published a blog post on Windows Security Best Practices for Integrating and Managing Security Tools. This blog post is a first step with insights into how Microsoft is working to prevent such incidents in the future. One element is to strengthen the usage of Rust in the Windows Kernel to improve memory safety.
Conclusion
It is surprising that a company like CrowdStrike, which is known for its cybersecurity solutions and is developing software with such criticality, did not have a proper test strategy in place to prevent such an incident. On the other hand, it is great that they publish their incident review and root cause analysis in such detail.
This shows how important it is to invest in a test strategy that fits the criticality of the software, as well as in phased rollouts, production-like test environments, and learning from failure. Windows, in turn, could learn from other operating systems such as Linux and macOS and change its architecture to reduce the potential impact of third-party software.
Resources
CrowdStrike Update: Latest News, Lessons Learned from a Retired Microsoft Engineer
Windows Security Best Practices for Integrating and Managing Security Tools