What Should Apple Users Take Away from the CrowdStrike Debacle?

Priceless.

1 Like

Dave Plummer’s videos came to the same conclusion. The CS kernel driver tried to dereference a bogus memory pointer, which caused the crash. And it seems directly related to the fact that one of CS’s data files was all-zeros - almost certainly a mistake.
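To make that failure mode concrete, here is a minimal C sketch of the bug class (the struct layout and field names are invented for illustration; this is not CrowdStrike's actual code): a parser that trusts a value read from the data file as if it were a valid pointer. With an all-zeros file that value is 0, and the dereference crashes - a segfault in user space, a bugcheck/BSOD inside a kernel driver.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical channel-file layout -- illustrative only, not CrowdStrike's. */
struct channel_header {
    uint32_t magic;            /* expected file signature                  */
    uint32_t rule_count;       /* number of detection rules                */
    uint64_t rule_table_addr;  /* value the parser treats as a pointer (!) */
};

/* The bug class: trusting a field from the file as if it were a valid
 * pointer. An all-zeros file yields NULL; dereferencing it here is a
 * segfault, but inside a kernel driver it is a bugcheck (BSOD). */
static uint32_t first_rule_unchecked(const uint8_t *file)
{
    const struct channel_header *hdr = (const struct channel_header *)file;
    const uint32_t *rules = (const uint32_t *)(uintptr_t)hdr->rule_table_addr;

    return rules[0];   /* all-zeros file => rules == NULL => crash here */
}

int main(void)
{
    uint8_t zeros[4096];

    memset(zeros, 0, sizeof zeros);                  /* the all-zeros file */
    printf("first rule: %u\n", first_rule_unchecked(zeros));  /* never prints */
    return 0;
}
```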

So there’s some shared blame to go around here:

  • CS’s kernel driver should be validating data files as part of loading them, before using their content - especially since (it seems) the content of these files affects the behavior of the (audited, approved, and signed) kernel driver software. A minimal sketch of that kind of check follows this list.

  • CS should be testing their releases. If they had tested this update on an in-house computer, they would almost certainly have seen the crash and fixed it before it got out to the rest of the world. It takes less than an hour to do a test like this. It is going to take weeks for the rest of the world to clean up their mess.

  • Microsoft should probably update the WHQL program to require similar auditing and signing of data files that affect kernel drivers in this way, even if doing so delays the release of a security update. CS has proven that you can’t trust vendors like this.

  • The laws in the EU need to change. The Mac version of CS doesn’t have this problem because it uses Apple’s security framework instead of a kernel driver.

    Microsoft allegedly developed something similar for Windows but European lawmakers decided that they couldn’t use it because it was going to be restricted to only approved security companies, which would be illegal anti-competitive behavior. Maybe this mess will convince them to reconsider their decision? Probably not, but one can always hope.

    Sadly, I think that the EU may actually respond by outlawing Apple’s API, in the name of “fairness”. They seem to enjoy carefully analyzing every problem and mandating the dumbest possible of all “solutions”.
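Returning to the first bullet above, here is a hedged C sketch of the kind of up-front validation a driver could do before acting on a channel file. The header layout, magic number, and limits are all invented for the example; the point is only that every field gets checked (ideally along with a cryptographic signature) before anything in the file is used to compute a pointer or drive a jump.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Invented header layout and limits, for illustration only. */
#define CHANNEL_MAGIC    0x43464C31u   /* made-up file signature */
#define CHANNEL_VERSION  1u
#define MAX_RULES        4096u

struct channel_header {
    uint32_t magic;
    uint32_t version;
    uint32_t rule_count;
    uint32_t rule_offset;   /* byte offset of the rule table within the file */
};

/* Reject the file before any of its contents steer driver behavior. */
static bool channel_file_is_sane(const uint8_t *file, size_t len)
{
    const struct channel_header *hdr = (const struct channel_header *)file;

    if (len < sizeof *hdr)
        return false;                   /* truncated or empty file     */
    if (hdr->magic != CHANNEL_MAGIC)
        return false;                   /* catches an all-zeros file   */
    if (hdr->version != CHANNEL_VERSION)
        return false;                   /* unknown format revision     */
    if (hdr->rule_count == 0 || hdr->rule_count > MAX_RULES)
        return false;                   /* implausible rule count      */
    if (hdr->rule_offset < sizeof *hdr || hdr->rule_offset > len ||
        (len - hdr->rule_offset) / sizeof(uint32_t) < hdr->rule_count)
        return false;                   /* rule table must fit in file */
    return true;   /* a hash or signature check over the body belongs here too */
}

int main(void)
{
    uint8_t zeros[4096];

    memset(zeros, 0, sizeof zeros);
    printf("all-zeros file accepted? %s\n",
           channel_file_is_sane(zeros, sizeof zeros) ? "yes" : "no");
    return 0;
}
```

With a check like this in place, an all-zeros channel file would simply be rejected at load time instead of taking the machine down.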

1 Like

I have been calling for liability for bad software and licensing for software engineers for -decades- (in part to establish and in part to limit liability - the same way it works for civil engineers). The computer professional societies have opposed this, even though they would be instrumental in establishing licensing terms and procedures. I remember one exchange with someone who in most other matters I really respected. “You mean licensing, like barbers?” “No. I mean licensing like doctors and civil engineers.”

But CrowdStrike blaming their -test software- for what were clearly failures in design, coding, and release management by its developers is truly appalling. Microsoft, though, has trained us all to believe “software failure is inevitable.”

We do know ways to produce better software. Better, less error-prone programming languages are one approach, as is the use of formal methods, assertions, etc. You don’t necessarily have to prove the entire program correct to gain substantial benefits from adopting these techniques.
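As a small, hedged illustration of the assertions point (invented structures, not anyone's real code): a compile-time check that the parser's assumptions about a record layout still hold, plus a run-time check of an indexing invariant. In kernel code you would fail the load gracefully rather than call assert(), but the discipline is the same.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Invented structure, for illustration only. */
struct rule_entry {
    uint32_t id;
    uint32_t action;
};

/* Compile-time: the build fails if the struct no longer matches the
 * 8-byte on-disk record size the parser assumes. */
_Static_assert(sizeof(struct rule_entry) == 8,
               "rule_entry must match the on-disk record size");

static uint32_t rule_action(const struct rule_entry *table,
                            uint32_t count, uint32_t index)
{
    /* Run-time: state the invariant instead of silently indexing out of
     * bounds. (A kernel driver would fail the load gracefully instead.) */
    assert(table != NULL && index < count);
    return table[index].action;
}

int main(void)
{
    struct rule_entry table[] = { { .id = 1, .action = 42 } };

    printf("action = %u\n", rule_action(table, 1, 0));
    return 0;
}
```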

3 Likes

ALL of the suggested improvements CrowdStrike is promising to make should have been in place long ago. This is not rocket science: these best practices were not rigorously followed, and that’s why the outage occurred.

In June, CrowdStrike sent a different faulty channel update to Linux; RHEL, Debian, and Ubuntu systems all had kernel panics. That was almost 30 days before they did the same thing to Windows! Note that an ordinary loadable kernel module runs with full kernel privileges, so it absolutely can crash the Linux kernel; only sandboxed mechanisms such as eBPF are designed to prevent that kind of crash.
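For what it's worth, nothing in the kernel verifies a loadable module the way the eBPF verifier checks eBPF programs before they run, so a module bug takes the whole kernel with it. A deliberately broken demo module (do not load this on a machine you care about) makes the point:

```c
/* crashdemo.c -- deliberately broken; do NOT load this on a machine you
 * care about. It shows that a loadable module can take the kernel down. */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("NULL dereference at init, for illustration only");

static int __init crashdemo_init(void)
{
    int *p = NULL;

    pr_info("crashdemo: dereferencing NULL in kernel context\n");
    /* This is an oops at minimum; with panic_on_oops set (common on
     * production servers) it is a full kernel panic. Nothing checks this
     * code ahead of time the way the eBPF verifier checks eBPF programs. */
    return *p;
}

static void __exit crashdemo_exit(void)
{
}

module_init(crashdemo_init);
module_exit(crashdemo_exit);
```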

One could make a legal argument based on negligence and that would supersede any EULA legal junk. If the courts accept such a lawsuit, CrowdStrike is going to be in serious financial trouble. The CEO likely will be replaced.

The Falcon software receives rapid response updates that are like virus definitions on steroids; this faulty update was one of that style, and rapid response is a major selling-point feature of Falcon. Most of the instructions to the Falcon sensor are delivered with software upgrades of the sensor itself, while the rapid response updates are intended to provide protection against an in-the-wild attack that is actively being exploited. This particular update was meant to protect against bad actors abusing a Named Pipes developer feature in Windows. So the CrowdStrike issue was not a software update, but it does affect the software: when Falcon reads this faulty content at boot, it crashes the Windows kernel, causing the BSOD and an automatic reboot.

Removing the faulty content fixes the problem, but doing so on an enterprise laptop means you need the BitLocker Recovery Key, and possibly the local administrator password if you have to use Safe Mode. If you can get into the WinRE Recovery Environment, you can go to Troubleshooting → Advanced Options → Command Prompt, which runs as the SYSTEM god-mode user account. That meant manual intervention by IT staff, who either walked each user through the fix remotely on the phone or did it hands-on, possibly using a bootable flash drive.

In our case, half our PCs were installing Windows Updates when CrowdStrike crashed Windows halfway through. Most were recovered, but around a half-dozen laptops were bricked to the point of needing to be swapped to get the user back online and working.

Servers were easier to fix because there’s no BitLocker. VMs had to have their virtual disk disconnected and mounted on a different VM, where you deleted the offending CrowdStrike Falcon C-00000291*.sys file, then dismounted the disk and remounted it on the original VM. For physical servers we could reach via BMC (lights-out / IPMI / etc.), we could boot into WinRE, delete the file, and reboot. We were able to automate both server fixes, and all the servers were working early Friday morning.

But yeah… we conscripted any IT staff with a pulse who could follow directions and help someone else do the same over the phone. We didn’t even log tickets for the end-user fixes; we just shared a spreadsheet and went as fast as possible. Our offshore IT workers covered after hours, and we prioritized critical staff and fixed them first.

Having spent late Thursday last week through Monday fixing tens of thousands of servers and laptops, I can say that we are not pleased with CrowdStrike. Despite the mess, CrowdStrike is normally really good software, and it does a far better job than any other EDR security tool. I would expect a license discount of significant proportions for the next year or two. We were fortunate to get out of this mess with only two business days of downtime plus the weekend. We brought up customer-facing solutions first, then the VDI (virtual servers / workstations), and only then did we tackle the end-user workstations, mostly laptops. I was certainly glad to have a Mac for work when my own PC started experiencing the BSOD; it enabled me to fix it myself, since I have access to the BitLocker Recovery Keys and Microsoft LAPS to look up the randomized local administrator password.

The EU mandated that Microsoft not block third-party access to the kernel, on the grounds that Microsoft’s own access to private APIs and the low-level kernel gives it an unfair advantage when competing with 3rd-party developers. Apple, however, did revoke kernel access while adding alternative APIs and system extensions. Microsoft needs to do the same thing. They also need to make the case that “fairness” shouldn’t apply to kernel ring 0, and that only Microsoft should have kernel-level access the same way only Apple has kernel-level access on the Mac. There’s a big difference between giving developers full, unrestricted kernel access and granting them only the access they need while sandboxing the 3rd-party code so it can’t cause a kernel panic.
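For contrast, this is roughly what the sandboxed model looks like on the Mac today: a user-space Endpoint Security client subscribes to events and answers allow/deny, and a bug in it kills only that process, not the kernel. A minimal hedged sketch (error handling is omitted, and a real client needs Apple's endpoint-security entitlement and must be linked against the EndpointSecurity framework):

```c
#include <EndpointSecurity/EndpointSecurity.h>
#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void)
{
    es_client_t *client = NULL;

    /* The handler runs in user space; a bug here cannot panic the kernel. */
    es_new_client_result_t res = es_new_client(&client,
        ^(es_client_t *c, const es_message_t *msg) {
            if (msg->event_type == ES_EVENT_TYPE_AUTH_EXEC) {
                /* A real EDR product would inspect the process image here. */
                es_respond_auth_result(c, msg, ES_AUTH_RESULT_ALLOW, false);
            }
        });
    if (res != ES_NEW_CLIENT_RESULT_SUCCESS) {
        fprintf(stderr, "es_new_client failed: %d\n", (int)res);
        return 1;
    }

    es_event_type_t events[] = { ES_EVENT_TYPE_AUTH_EXEC };
    es_subscribe(client, events, 1);

    dispatch_main();   /* service events until the process is killed */
}
```

Something equivalent from Microsoft, a supported user-mode interface rich enough that EDR vendors no longer need a kernel driver at all, is exactly what this incident argues for.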

1 Like

The July Microsoft Windows Update may result in an unexpected BitLocker Recovery Key prompt and this has nothing to do with CrowdStrike causing Windows kernel panics (BSOD).

Not everyone, but some people had CrowdStrike fixed only to reboot and get another BitLocker Recovery Key prompt. We rotate the recovery keys because they are single-use: once you use a recovery key, the machine boots, the MBAM client phones home, rotates the key to a new one, and escrows it back into Azure AD / MBAM.

I saw this more than a few times after fixing CrowdStrike, though thankfully not on every machine. Some PCs were halfway through the Windows Updates when Falcon started the BSOD boot loop. Once we got them into Safe Mode, the updates would uninstall, reboot, and uninstall some more, which might mean entering the BitLocker Recovery Key 2-3 times. Once you deleted the C-00000291*.sys Falcon content update and rebooted, the machine would be fine. But the next time the user rebooted, those Windows Updates were re-installed, and it might still come back with a BitLocker Recovery Key prompt. Calling the Help Desk and entering the newest recovery code fixes it, and it won’t happen again.

Microsoft isn’t free and clear: 1) they should block 3rd-party developer access to the operating system kernel, and 2) they should provide a sandboxed, safe API for security-related apps to use. Unsafe actions such as writing to protected memory space should not be allowed. There should be plenty of advance warning about the revocation of kernel access, and hopefully by Win 12 it will all be worked out.

Ever since Apple revoked access to kernel extensions, I’ve not witnessed a single kernel panic in macOS across many thousands of Macs.

1 Like

Only CrowdStrike customers were impacted, but that includes many of the top corporations in the world. The Falcon enterprise license is $184.99 per computer, with other tiers negotiable depending on how many computers you have.

CrowdStrike is a security research think-tank, and they consult with entities that have been compromised by malware and hackers; they investigated major breaches like the Sony Pictures hack. If a novel attack is discovered, they can create a rapid response channel update and push it out to all their customers in a very short period of time, providing proactive protection against an in-the-wild threat that is actively being exploited.

The 291 channel update was intended to defend against hackers abusing a developer interface known as Named Pipes. Somehow that configuration was malformed, and it wasn’t caught by their normal testing methodologies. They are adding several checks to their pipeline before a channel update is published to Falcon sensor clients, and they are also going to update the client to do its own sanity checks when reading a channel update.

The hex zeros seen in the file appear to be a side effect of Windows crashing and the file being reset to all zeros; the content that triggered the first crash was different. On subsequent boots, Falcon read that faulty file and, based on the amount of data in it, jumped to a portion of memory it wasn’t supposed to touch and attempted to alter what was there. Doing so caused the operating system kernel to panic, triggering the BSOD at every boot. Windows then reboots after writing the BSOD details to the logs, and it may keep some DMP files for debugging. It immediately crashes again on every boot when the Falcon sensor loads that bad content update.

Microsoft has blamed the EU, but it’s not clear that’s the case. It isn’t hard to believe that Microsoft is blaming “regulation” when it’s actually Microsoft’s fault that Windows has a poor security architecture.

1 Like

More confirmation that the rumor about Southwest still using Windows 3.1 is a hoax:

UPDATE July 31, 2024 12:14 PM

Additional sources confirm that stories about Southwest using Windows 3.1 are misinformation.

An OS News article links to the tweet by Artem Russakovskii that was the hoax’s origin. The article also links to a Dallas Morning News article, titled “What’s the problem with Southwest Airlines scheduling system?”, that claims Southwest used obsolete versions of Windows.

https://www.osnews.com/story/140301/no-southwest-airlines-is-not-still-using-windows-3-1/

ABC reported that neither Southwest nor Alaska uses CrowdStrike.

2 Likes

Joking aside, if you ever glance behind the counter at all kinds of businesses (including airlines and auto repair), you’ll find that quite a lot are still running an IBM 3270 terminal emulator that is (presumably) connecting to an app running on a mainframe somewhere.

For apps like this, there is no need whatsoever for a modern computer. An 8088 PC running MS-DOS can run a 3270 terminal emulator just as well as a brand new PC or Mac running Windows or macOS.

Which brings to mind an interesting thought: for all those businesses that are still using 3270 emulation, why bother running a full-featured operating system like Windows (or macOS)? Why not instead get the simplest hardware your IT department can maintain (the cheapest PC made, a Raspberry Pi, or another similarly small system) running an operating system stripped down to the minimum features needed to run the 3270 emulator?

Linux can be stripped down that far, to the point where you can boot it from a small read-only microSD card. That would be very secure, simply because there isn’t much software to attack and a reboot would reset everything.

2 Likes

It’s a great question – my suspicion is that a lot of businesses don’t want to go through the annoyance of trying to run something even vaguely custom. They’d much rather buy something “off the shelf” (so to speak) with lots of promises from salespeople.

The allied question is why upgrade when you have a solution that works? We hear stories about organizations still running off floppy disks etc, but if it works why take the risk and go through the effort of upgrading it? If it becomes unrepairable or the supplies aren’t available, then sure. But otherwise? Especially with something that’s not connected to the Internet and thus without serious security risks…

But really, I’m just posting to get to the point where I can add one of my favorite news stories of the last two decades – the business in Texas still doing (as of 2010!) their payroll on an IBM 402, a machine released in 1948!

1 Like

I can only imagine what it’s like to be a young person getting a job in an IT department for a business that still relies on a mainframe app. “Kid, pull up your VT100 and let me show you some COBOL.” :slight_smile:

1 Like

A fun trick to play on the newbies is to switch their TN3270 client into APL mode when they’re away from their desk.

4 Likes

That was me when I first started out at a large corporation circa 1997. Think Model 90 IBM PC with a tiny hard disk, Windows 3.11 / NetWare, and a Rumba 3270 emulator. They didn’t even have Microsoft Mail; most people used email on the mainframe, where you had to uuencode attachments. Lots of people still had IBM 3270 dumb terminals, and it was all token ring networking.

I was hired to cover the field office support while the entire team flew out to each office, upgraded their servers and replaced all the computers with Gateway computers running NT 3.51 with the early version of Microsoft Mail that became Outlook a bit later on.

There are thin clients with a small computer, like a Raspberry Pi, attached to the back of the monitor that pretty much just runs the VMware client. All our VDIs were impacted, and we had to patch them since they were all running the CrowdStrike Falcon sensor. Because the VDIs were not encrypted with BitLocker, they were a bit easier to fix, and one of our VDI engineers managed to find a way to automate the fix. The VDIs are stored in Azure but managed with VMware vSphere infrastructure. We were seriously thinking about ditching VMware when Broadcom bought them, but someone at Broadcom saw the writing on the wall and spun off the VDI portion of the business to a new entity. Since Broadcom jacked up all their license fees, we are removing all their products from our infrastructure.

I bet there are still AS/400 systems chugging along even today…

The lawyers are circling CrowdStrike.

https://www.reuters.com/legal/delta-sues-crowdstrike-over-software-update-that-prompted-mass-flight-2024-10-25/