So the usual “who/what is to blame” in big events, followed by analyses and promises to do better… then relaxing over time due to the need for reducing costs and increasing speed/output, or perhaps the simple human response to periods of non-crisis.
This was a big public-facing issue that got a lot of press… and it didn’t surprise me in the slightest. However, blaming the EU, thinking macOS is immune to this, or taking any other angle misses the real point: single-point-of-failure design requires greater testing and controls to avoid critical, widespread crises.
Beyond that, regulators often levy only meager fines (if any at all) on a corporation that commits a grievous error inflicting damage on other parties in the form of lost time, money and resources. Pay and move on. Nothing learned. The penalty is just a cost of doing business.
Anyone remember the February AT&T wireless failure from a botched update?
The FCC Public Safety and Homeland Security Bureau analyzed network outage reports and written responses submitted by AT&T and interviewed AT&T employees. The bureau’s report said:
The Bureau finds that the extensive scope and duration of this outage was the result of several factors, all attributable to AT&T Mobility, including a configuration error, a lack of adherence to AT&T Mobility's internal procedures, a lack of peer review, a failure to adequately test after installation, inadequate laboratory testing, insufficient safeguards and controls to ensure approval of changes affecting the core network, a lack of controls to mitigate the effects of the outage once it began, and a variety of system issues that prolonged the outage once the configuration error had been remedied.
This was more than just a bad patch. It was a systemic failure. The Ars Technica article also mentions a similar Verizon outage from December that only lasted a couple of hours in certain states due to a similar lack of process compliance.
I have no love for AT&T, having witnessed firsthand their lumbering (dis)organization made up of too many parts that often do not communicate or work with each other on important things like updates. Two examples:
A large regional AT&T team that manages fiber backbone & Enterprise connections has described to me how they periodically have a day where hundreds of new tickets (work orders) appear and waste hours of their time, because the tickets turn out to be re-opened issues from the past that had already been closed. This sometimes occurs after updates pushed from another AT&T division. Their manager has explained this to the higher-ups and the source departments, even begging them to at least inform the team when an update is pushed so they can more quickly determine if it is going to be one of those days… And for years AT&T has changed almost nothing about the process and has never given them notice of an update.
I could relate more of these stories (like spending 2 days on the phone talking to 12-14 departments/divisions to get a client’s Yahoo email saved when they cancelled an old “Business” DSL/phone account… all it took was someone un-checking a box on their screen), but I think it really comes down to:
Does the threat of penalty dissuade a business from doing or NOT doing something?
Has anyone followed the recent Boeing saga?