What Should Apple Users Take Away from the CrowdStrike Debacle?

Interesting watching fake news get created in real time. The source of the “Southwest uses Win 3.1” claim was a random guy on Twitter making a joke, which Yahoo then picked up, and then the GovTech site got it from Yahoo. If you follow the GovTech links to the Yahoo page, you find this one:

About which he later notes: “To be clear, I was trolling last night…Yahoo News is quoting me as a source. This is getting out of control.”


A major factor is that many government agencies and service providers do not run systems modern enough to use CrowdStrike software, or operating systems that would be affected by the CrowdStrike update crisis.

One recent example, among many:

and, also in Northern California:


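Exactly. CrowdStrike was able to do this because Microsoft permits third-party developers to run code inside the Windows kernel (as kernel-mode drivers); Apple has moved security vendors away from kernel extensions and toward user-space system extensions.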


Thanks for that information and for the links.

To summarize, Microsoft is explaining that the European Commission is culpable for the CrowdStrike outage.


In February 2024, CrowdStrike announced layoffs in the USA and moved most tech jobs to low-cost India. They proudly announced this via a press release. In a stark reminder that history doesn’t repeat, but it does echo, George Kurtz, who is CEO of CrowdStrike, was the CTO of McAfee when it famously launched The Great McAfee XP Bricking Fiasco of 2010. Coincidence or simply confirmation that dross floats to the top?


Apparently prior CrowdStrike updates brought Linux systems down.


Bugs and mistakes have no connection to nationality or countries. Any time humans are involved with something, no matter the geographic location, errors will occur.


What I saw mentioned somewhere, but which doesn’t seem to be mentioned much, is that CrowdStrike has also done good in the past, not only for Windows users but also for Mac users, by (as I understand it) discovering and reporting security vulnerabilities in macOS etc. For example:

" AppleVA

Available for: macOS Sonoma

Impact: Processing a file may lead to unexpected app termination or arbitrary code execution

Description: The issue was addressed with improved memory handling.

CVE-2024-27829: Amir Bazine and Karsten König of CrowdStrike Counter Adversary Operations, and Pwn2car working with Trend Micro’s Zero Day Initiative"

etc


Here’s CrowdStrike’s public explanation of what happened and how they intend to prevent it in the future with more testing (duh!) and staged rollouts (double-duh!) with customer control and release notes.


My summary of the long posting…

CrowdStrike posted a Preliminary Post Incident Review.

It basically says that they don’t test Rapid Response Content (the channel file update that was pushed). What’s supposed to happen is:

  1. The Content Validator (in the cloud) is supposed to perform validation checks on the file before it is published
  2. The Content Interpreter (on the machine) is supposed to “gracefully handle exceptions from potentially problematic content”

What actually happened was:

  1. Due to a bug in the Content Validator, the bad content data passed validation
  2. The bad content data caused an out-of-bounds memory read in the Content Interpreter
  3. The out-of-bounds memory read triggered an exception
  4. The “unexpected exception could not be gracefully handled”…
  5. …resulting in a BSOD
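For anyone who wants to see the shape of that failure chain, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the function names, the idea that content is a list of field indices, and the field counts are all hypothetical and are not taken from CrowdStrike's code or its review. It only shows how a validator that checks against the wrong limit can pass content that then reads out of bounds, and how the interpreter could handle that gracefully instead of crashing.

```python
from typing import List, Optional


def buggy_validate(rule_indices: List[int], declared_fields: int) -> bool:
    """Hypothetical validator: checks indices against the *declared* number
    of fields, which may be larger than what the interpreter really gets."""
    return all(0 <= i < declared_fields for i in rule_indices)


def interpret(rule_indices: List[int], inputs: List[str]) -> List[str]:
    """Hypothetical interpreter with no bounds check of its own.
    An out-of-range index raises IndexError (in kernel-mode C code the
    equivalent out-of-bounds read can crash the whole machine)."""
    return [inputs[i] for i in rule_indices]


def interpret_gracefully(rule_indices: List[int],
                         inputs: List[str]) -> Optional[List[str]]:
    """Same interpreter, but bad content is rejected instead of crashing."""
    try:
        return interpret(rule_indices, inputs)
    except IndexError:
        return None  # discard the bad content and keep running


if __name__ == "__main__":
    inputs = ["a", "b", "c", "d"]   # interpreter really receives 4 fields
    bad_content = [0, 4]            # index 4 does not exist
    print(buggy_validate(bad_content, declared_fields=5))  # validator says OK
    print(interpret_gracefully(bad_content, inputs))       # rejected, no crash
```

The point of the sketch is that the two safeguards are independent: even if the validator is wrong, an interpreter that checks its own reads would have turned the bad file into a rejected update rather than a crash.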

The way they plan to prevent this from happening again is to do the things that they should have been doing all along, such as:

  • Test Rapid Response Content before deployment
  • Validate harder
  • Improve error handling
  • Stagger deployment instead of everywhere all at once
  • Start with a “canary deployment” (to a machine whose sole purpose is to see if it goes wrong)
  • Monitor whether the deployment causes problems
  • Allow customers to control when the Rapid Response Content is deployed
  • Document what they’re releasing
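The staggered and canary items in that list boil down to a simple loop. Below is a rough Python sketch under my own assumptions: the `staged_rollout` function, the ring layout, and the `deploy` callback are hypothetical names invented for this example and say nothing about how CrowdStrike actually implements its rollout. The idea is to deploy to one ring of machines at a time, canary first, and stop as soon as a ring reports too many failures.

```python
from typing import Callable, Dict, List, Optional, Union

Result = Dict[str, Union[Optional[int], List[str]]]


def staged_rollout(rings: List[List[str]],
                   deploy: Callable[[str], bool],
                   max_failure_rate: float = 0.0) -> Result:
    """Deploy ring by ring; halt when a ring's failure rate is too high.

    rings  -- hypothetical groups of host names, canary ring first
    deploy -- hypothetical callback returning True if the host is healthy
              after receiving the update
    """
    deployed: List[str] = []
    for ring_number, ring in enumerate(rings):
        failures = 0
        for host in ring:
            healthy = deploy(host)
            deployed.append(host)
            if not healthy:
                failures += 1
        # Monitor the ring before moving on to the next, larger one.
        if ring and failures / len(ring) > max_failure_rate:
            return {"halted_at_ring": ring_number, "deployed": deployed}
    return {"halted_at_ring": None, "deployed": deployed}


if __name__ == "__main__":
    rings = [["canary"], ["a", "b"], ["c", "d", "e"]]
    # A bad update that breaks every host it touches stops at the canary.
    print(staged_rollout(rings, deploy=lambda host: False))
```

With a bad update, only the canary machine is touched; everything in the later rings keeps running the old content.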

A full Root Cause Analysis is forthcoming.


So the usual “who/what is to blame” in big events, followed by analyses and promises to do better… then relaxing over time due to the need for reducing costs and increasing speed/output, or perhaps the simple human response to periods of non-crisis.

This was a big public-facing issue that got a lot of press… and it didn’t surprise me in the slightest. However, blaming the EU or thinking macOS is immune to this or any other angle is missing the real point: Single-point-of-failure design requires greater testing and controls to avoid critical / widespread crises.

Beyond that, regulators often levy only meager fines (if any at all) on a corporation that commits a grievous error inflicting damage on other parties in the form of lost time, money and resources. Pay and move on. Nothing learned. The penalty is the cost of doing business.

Anyone remember the February AT&T wireless failure from a botched update?

The FCC’s Public Safety and Homeland Security Bureau analyzed network outage reports and written responses submitted by AT&T and interviewed AT&T employees. The bureau’s report said:

The Bureau finds that the extensive scope and duration of this outage was the result of several factors, all attributable to AT&T Mobility, including a configuration error, a lack of adherence to AT&T Mobility's internal procedures, a lack of peer review, a failure to adequately test after installation, inadequate laboratory testing, insufficient safeguards and controls to ensure approval of changes affecting the core network, a lack of controls to mitigate the effects of the outage once it began, and a variety of system issues that prolonged the outage once the configuration error had been remedied.

This was more than just a bad patch. It was a systemic failure. The Ars Technica article also mentions a similar Verizon outage from December that lasted only a couple of hours in certain states and stemmed from a similar lack of process compliance.

I have no love for AT&T, having witnessed firsthand their lumbering (dis)organization made up of too many parts that often do not communicate or work with each other on important things like updates. Two examples:

A large regional AT&T team that manages fiber backbone & Enterprise connections has described to me how they periodically have a day where hundreds of new tickets (work orders) appear and waste hours of their time when the tickets turn out to be re-opened issues from the past that had been closed. This sometimes occurs after updates pushed from another AT&T division. Their manager has explained this to the higher-ups and the source departments, even begging them to at least inform the team when an update is pushed so they can more quickly determine if it is going to be one of those days… And for years AT&T has changed almost nothing about the process, and has never given them notice of an update.

I could relate more of these stories (like spending 2 days on the phone talking to 12-14 departments/divisions to get a client’s Yahoo email saved when they cancelled an old “Business” DSL/phone account… all it took was someone un-checking a box on their screen), but I think it really comes down to:

Does the threat of penalty dissuade a business from doing or NOT doing something?

Has anyone followed the recent Boeing saga?

Dave Plummer (veteran Microsoft system software developer) shared videos explaining what happened from a technical standpoint. For the benefit of those who may find this of interest:


I think this is a pretty exhaustive list of who’s/what’s been blamed and why the blame is misplaced:

Children are taught from an early age that it’s risky to put all your eggs in the same basket. Yet corporate IT around the globe has done that for years, if not decades. They all love Windows and MS Office and they always end up flocking to the same corporate security solutions. Then, when that same basket they’re all holding develops a crack and all the eggs fall out, the stunned gasping and pearl clutching starts. But why? It’s the entirely expected outcome of making the same simple mistake over and over again. The same mistake every child is taught not to make. You’d think these billion-dollar departments full of supposedly smart people wouldn’t trip over such a mundane obstacle. Yet here we are. The same crowd that otherwise misses no opportunity to display virtue by calling for diversity is stunned when the complete lack of diversity leads to meltdown. Shocker.


IMHO the question isn’t what should Apple users take away from the CrowdStrike debacle, but rather, what should every user and corporate IT/Security manager take away from it?

Yeah, the CrowdStrike bug did not affect Mac/Linux users. But it also didn’t affect Windows users who did not install CrowdStrike. It affected primarily corporate users whose corporate IT/Security officers selected the CrowdStrike solution and enabled automatic updates. So another group of users not affected by the debacle is corporate users who had CrowdStrike installed, but whose IT/Security officers had disabled auto-update.

The problem lies with over-centralization of corporate security. Corporate Mac users were saved this time, but if a Mac user hands over control of their Mac to corporate IT, sooner or later disaster will strike. A disaster that affects Macs too. It’s just a matter of time. Be that CrowdStrike, BitLocker, Netskope, or another centrally managed security solution. IT/Security officers love the feeling of being able to control everything and fight off naughty users who dare plug a USB stick into their laptops. Obviously, they all have good intentions in mind. They mean well. They want to protect corporate assets. Even if that leads to taking all Windows machines off the grid for hours. A friend of mine shared with me, in real time, how the CrowdStrike share price dropped during the debacle and I replied: “Forget about THEIR share price. Look at OUR sales. Look at the airlines that could not get a plane off the ground. Look at the banks that could not transact. We’re talking billions. Not just one company!”

I use a Mac for my corporate work, but it’s a BYOD so I did not install CrowdStrike or BitLocker or whatever our IT wanted me to install. I was exempt. I am participating in our Netskope SASE pilot and hate this product, not because it does a bad job, but because it blocks too many sites, the AI engine isn’t that intelligent, and I have to ask our IT to manually configure the Netskope gateways to allow those sites through. The good thing is that I can disable it when I want, and revert to our good old VPN when I need to connect remotely to our corporate intranet. Am I really happy with it? Not at all! Netskope is invasive. It installed itself on my Mac in such a manner that it affects all users. I do keep a personal user on my Mac for non-corporate stuff and I find that Netskope item at the top of my screen even if I log in as that non-corporate user. I can turn it off, but it’s on by default. Annoying. Invasive. Unwanted.

Where am I going with this? I leave it up to IT and Security officers to do their homework and rethink their security strategies. But for us mortals, the simple users, my advice is: stay away from those corporate-controlled security solutions and keep your devices secure using user-end products.

I told that friend of mine from the CrowdStrike share story (who also offered quite a few useful tips that helped a lot of my colleagues restore their machines, but is also a die-hard fan of tight centralized IT control of our lives) that the day they force me to surrender admin of my Mac to corporate IT will be the day I announce my early retirement.


Apparently CrowdStrike is offering its partners a $10 Uber Eats gift card as an apology.


Interesting. In the UK goods sold must be ‘fit for purpose’, regardless of what the manufacturer’s lawyers might say. This is one of the things I’ve always thought explains why products are more expensive in Europe than the US.


My takeaway is that the concept of auto-updates is fundamentally flawed. The idea that someone is able to change my machine at their convenience, without me checking what they are going to do or finding the right time, is nuts, IMHO. I hope this incident means I don’t have to explain why, and this wasn’t even a malicious supply-chain attack.

A number of others and I have had an ongoing discussion on GitHub with Brave about that, which is grand fun. Blah blah blah ‘but security and urgency’ is their reasoning for jumping with both feet and eyes closed.


Not only does Windows have many, many millions more users than Apple does, but most Windows hardware and software products also usually cost a lot less than Apple’s.

This CrowdStrike disaster is one example that cheaper is not necessarily better. CrowdStrike’s services have not caused any problems for its Apple users.

Microsoft’s Incident Response is a very technical explanation of the CrowdStrike failure.