• Delta Air Lines CEO Ed Bastian said the massive IT outage earlier this month that stranded thousands of customers will cost it $500 million.
  • The airline canceled more than 4,000 flights in the wake of the outage, which was caused by a botched CrowdStrike software update and took thousands of Microsoft systems around the world offline.
  • Bastian, speaking from Paris, told CNBC’s “Squawk Box” on Wednesday that the carrier would seek damages from the disruptions, adding, “We have no choice.”
  • Riskable
    link
    English
    231 month ago

    Yeah… Maybe don’t put all your IT eggs in one basket next time.

    Delta is the one that chose to use Crowdstrike on so many critical systems therefore the fault still lies with Delta.

    Every big company thinks that when they outsource a solution or buy software they’re getting out of some responsibility. They’re not. When that 3rd party causes a critical failure the proverbial finger still points at the company that chose to use the 3rd party.

    The shareholders of Delta should hold this guy responsible for this failure. They shouldn’t let him get away with blaming Crowdstrike.

      • Th4tGuyII
        link
        fedilink
        91 month ago

        I think what @[email protected] was saying is you shouldn’t have multiple mission critical systems all using the same 3rd party services. Have a mix of at least two, so if one 3rd party service goes down not everything goes down with it

        • partial_accumen
          link
          fedilink
          111 month ago

          That sounds easy to say, but in execution it would be massively complicated. Modern enterprises are littered with 3rd party services all over the place. The alternative is writing and maintaining your own solution in house, which is an incredibly heavy lift to cover the entirety of all services needed in the enterprise. Most large enterprises are resources starved as is, and this suggestion of having redundancy for any 3rd party service that touches mission critical workloads would probably increase burden and costs by at least 50%. I don’t see that happening in commercial companies.

          • Th4tGuyII
            link
            fedilink
            61 month ago

            As far as the companies go, their lack of resources is an entirely self-inflicted problem, because they’re won’t invest in increasing those resources, like more IT infrastructure and staff. It’s the same as many companies that keep terrible backups of their data (if any) when they’re not bound to by the law, because they simply don’t want to pay for it, even though it could very well save them from ruin.

            The crowdstrike incident was as bad as it was exactly because loads of companies had their eggs in one basket. Those that didn’t recovered much quicker. Redundancy is the lesson to take from this that none of them will learn.

            • partial_accumen
              link
              fedilink
              21 month ago

              As far as the companies go, their lack of resources is an entirely self-inflicted problem, because they’re won’t invest in increasing those resources, like more IT infrastructure and staff.

              Play that out to its logical conclusion.

              • Our example airline suddenly doubles or triples its IT budget.
              • The increased costs don’t actually increase profit it merely increases resiliency
              • Other airlines don’t do this.
              • Our example airline has to increase ticket prices or fees to cover the increased IT spending.
              • Other airlines don’t do this.
              • Customers start predominantly flying the other airlines with their cheaper fares.
              • Our example airline goes out of business, or gets acquired by one of the other airlines

              The end result is all operating airlines are back to the prior stance.

              • @[email protected]
                link
                fedilink
                61 month ago

                Two big assumptions here.

                First, multiple business systems are already being supported, and the OS only incidentally. Assuming double or triple IT costs is very unlikely, but feel free to post evidence to the contrary.

                Second, a tight coupling between costs and prices. Anyone that’s been paying attention to gouging and shrinkflation of the past few years of record profits, or the doomsaying virtually anywhere the minimum wage has increased and businesses haven’t been annihilated, would know this is nonsense.

                • partial_accumen
                  link
                  fedilink
                  11 month ago

                  First, multiple business systems are already being supported, and the OS only incidentally. Assuming double or triple IT costs is very unlikely, but feel free to post evidence to the contrary.

                  The suggestion the poster made was that ALL 3rd party services need to have an additional counterpart for redundancy. So we’re not just talking about a second AV vendor. We have to duplicate ALL 3rd party services running on or supporting critical workloads to meet what that poster is suggesting.

                  • inventory agents
                  • OS patching
                  • security vulnerability scanning
                  • file and DB level backup
                  • monitoring and alerting
                  • remote access management
                  • PAM management
                  • secrets management
                  • config managment

                  …the list goes on.

                  Anyone that’s been paying attention to gouging and shrinkflation of the past few years of record profits, or the doomsaying virtually anywhere the minimum wage has increased and businesses haven’t been annihilated, would know this is nonsense.

                  You’re suggesting the companies simply take less profits? Those company’s board of directors will get annihilated by shareholders. The board would be voted out with their IT improvement plans, and replace with those that would return to profitability.

                  • @[email protected]
                    link
                    fedilink
                    21 month ago

                    Even load-balancing multiple servers in a homogenous network, where patches are only deployed in phases is better (and a best practice) than what, to outside observers, appears to have been everything going down due to a mass update everywhere, all at once.

              • @[email protected]
                link
                fedilink
                11 month ago

                customers start predominantly flying the other airlines with cheaper fares

                I was with you till this part, except with the way flying is set up in this country, there’s very little competition between airlines. They’ve essentially set themselves up with airports/hubs so if an airline is down for a day, that’s kinda it unless you want to switch to a different airport.

                • partial_accumen
                  link
                  fedilink
                  11 month ago

                  In the USA besides very small cities, this isn’t my experience. My flights out of my home airport are spread across 5 or 6 airlines. My city doesn’t even break into the top ten largest in the nation. As far as domestic destinations, There are usually 3 to 5 airlines available as choices.

              • @[email protected]
                link
                fedilink
                11 month ago

                There is an argument to be made that they IT team and infrastructure isn’t supposed to be an ongoing expense or revenue generation. It’s insurance against catastrophe. And if you wanna pivot to something profit generating then you can reassign them to improve UX or other client impacting things that can result in revenue gain. For example notification systems for flight delays are absolute garbage IMO. I land, I check in my flights app and it doesn’t show any changes to when my flight is departing, I load google and those changes are right there. Or they could add maps for every airport they operate a flight from to their apps. They could streamline the process for booking a replacement flight when your incoming flight is delayed or you missed a connecting flight (i had to walk up to a desk, wait in a queue with dozens of other people for half an hour just to be stampped with a new boarding pass and moved along). They could add an actual notification system for when boarding starts (my turkish air flight at one airport didnt have an intercom so i didnt know it was boarding and missed the fligbt). All of these are just examples but my point is theres an inherent shortsightedness in assuming an investment in IT, especially for a company that deals primairly with interconnectivity, is wasted. This is the reason everything is so sh*tty for users. Companies prefer minimising costs to maximising value to the user even if the latter can generate long term revenue and increase user retention.

              • @[email protected]
                link
                fedilink
                01 month ago

                Our example airline has to increase ticket prices or fees to cover the increased IT spending.

                Or they could just cut already excessive executive bonuses…

                • partial_accumen
                  link
                  fedilink
                  11 month ago

                  You know they’re not going to do that, so how useful is it to suggest that? If we just want to talk about pie-in-the-sky fixes then sure, but at the end of that we’ll likely have nationalized airlines, which that isn’t happening either.

                  So are we talking about fantasy or things that can actually happen?

                  • @[email protected]
                    link
                    fedilink
                    21 month ago

                    No, we’re talking about things that should happen and things that should be called out every time.

                    Not just throwing up our hands and going “welp, they won’t willingly do it so there’s nothing we can do” like you seem to be doing.

        • @[email protected]
          link
          fedilink
          61 month ago

          In this case, it’s a local third party tool and they thought they could control to cadence of updates. There was no reason to think there was anything particularly unstable about the situation.

          This is closer to saying that half of your servers should be Linux and half should be windows in case one has a bug.

          Crowdstrike bypassed user controls on updates.
          The normal responsible course of action is to deploy an update to a small test environment, test to make sure it doesn’t break anything, and then slowly deploy it to more places while watching for unexpected errors.
          Crowdstrike shotgunned it to every system at once without monitoring, with grossly inadequate testing, and entirely bypassed any user configurable setting to avoid or opt out of the update.

          I was much more willing to put the blame on the organizers that had the outages for failing to follow best practices before I learned that they way the update was pushed would have entirely bypassed any of those safeguards.

          It’s unreasonable to say that an organization needs to run multiple copies of every service with different fundamental infrastructure choices for each in case one magics itself broken.

          • kbin_space_program
            link
            fedilink
            51 month ago

            Crowdstrike also bypassed Microsoft’s driver signing as part of their update process, just to make the updates release faster.

            That MS is getting any flak for this is just shit journalism.

      • Riskable
        link
        English
        31 month ago

        Adding another reply since I went on a bit of a rant in my other one… You’re actually missing the point I was trying to make: No matter what solution you choose it’s still your fault for choosing it. There are a zillion mitigations and “back up plans” that can be used when you feel like you have no choice but to use a dangerous 3rd party tool (e.g. one that installs kernel modules). Delta obviously didn’t do any of that due diligence.

        • @[email protected]
          link
          fedilink
          31 month ago

          Kernel module is basically the only way to implement this type of security software. That’s the only thing that has system wide access to realtime filesystem and network events.

          Yes, they’re ultimately liable to their customers because that’s how liability works, but it’s really hard to argue that they’re at fault for picking a standard piece of software from a leading vendor that functions roughly the same as every piece of software in this space for every platform functions, which then bypassed all configurations they could make to control updates, grabbed a corrupted update and crashed the computer.
          It’s like saying it’s the drivers fault the brakes on their Toyota failed and they crashed into someone. Yes, they crashed and so their insurance is going to have to cover it, but you don’t get angry at the driver for purchasing a common car in good condition and having it break in a way they can’t control.

          What mitigations should they have had? All computer systems are mostly third party tools. Your OS is a third party tool. Your programming language is a third party tool. Webserver, database, loadbalancer, caching server: all third party tools. Hardware drivers? Usually third party, but USB has made a lot of things more generic.

          If your package manager decides to ignore your configuration and update your kernel to something mangled and reboot, your computer is going to crash and it’ll stay down until you can get in there to tell it to stop booting the mangled kernel.

          • Riskable
            link
            English
            11 month ago

            It is absolutely not the only way to implement EDR. Linux has eBPF which is what Crowdstrike and other tools use on Linux instead of a kernel module. A kernel module is only necessary on Windows because Windows doesn’t provide the necessary functionality.

            Mitigating factors: Use (and take) regular snapshots and test them. My company had all our virtual desktops restored within half an hour on that day. If you don’t think Windows Volume Shadow Copy is capable or actually useful for that in the real world then you’re making my argument for me! LOL

            Another option is to use systems (like Linux) that let you monitor these sorts of EDR things while remaining super locked down. You can run EDR tools on immutable Linux systems! You can’t do that on Windows because (of backwards compatibility!) that OS can’t run properly in an immutable share.

            Windows was not made to be secure like that. It’s security contexts are just hacks upon hacks. Far too many things need admin rights (or more privileges!) just to function on a basic level.

            OSes like Linux were built to deal with these sorts of things. Linux, specifically, has gone though so many stages of evolution it makes Windows look like a dinosaur that barely survived the asteroid impact somehow.

            • @[email protected]
              link
              fedilink
              31 month ago

              eBPF, the kernel level tool? Because you need to be in the kernel to have that level of access, which is what I was saying? The one with a bug that crowd strike hit that caused Linux servers to KP?
              Yes, I said “kernel module” when I should have said “software executing in a kernel context”. That’s on me.

              By the way, eBPF? Third party software by most metrics. Developed and maintained by Facebook, Cisco, Microsoft, Google and friends. Also available on windows, albeit not as deeply integrated due to the layers of cruft you mention.

              I’m glad you were able to recover your VMs quickly. How quickly were you able to recover your non-virtualized devices, like laptops, desktops or that poor AD server that no one likes?
              Airlines need more than just servers to operate. They also need laptops for various ground crew, terminals for the gate crew and ticketing agents, desktops for the people in offices outside the airport who manage “stuff” needed to keep an airline running.

              You seem to be much more interested in talking about Linux being better than windows, which is a statement I agree with, but it’s quite different from your original point that “Delta is at fault because they used third party tools”.

              My point was that it’s unreasonable to say that Delta should have known better than to use a third party tool, while recommending Linux (not written by Delta), whose ecosystem is almost entirely composed of different third parties that you need to trust, either via system software (webserver), holding your critical data (database), kernel code (network card makers usually add support by making a kernel patch), or entire architectural subsystems (eBPF was written by a company that sells services that use it, and a good chunk of the security system was the NSA).

              None of that bothers me. I just don’t get how it doesn’t bother you if you don’t trust well regarded vendors in kernel space to have those same vendors making kernel patches.

      • Riskable
        link
        English
        21 month ago

        If I were in charge I wouldn’t put anything critical on Windows. Not only because it’s total garbage from a security standpoint but it’s also garbage from a stability standpoint. It’s always had these sorts of problems and it always will because Microsoft absolutely refuses to break backwards compatibility and that’s precisely what they’d have to do in order to move forward into the realm of, “modern OS”. Things like NTFS and the way file locking works would need to go. Everything being executable by default would need to end and so, so much more low-level stuff that would break like everything.

        Aside about stability: You just cannot keep Windows up and running for long before you have to reboot due to the way file locking works (nearly all updates can’t apply until the process owning them “lets go”, as it were and that process usually involves kernel stuff… due to security hacks they’ve added on since WinNT 3.5 LOL). You can’t make it immutable. You can’t lock it down in any effective way without disabling your ability to monitor it properly (e.g. with EDR tools). It just wasn’t made for that… It’s a desktop operating system. Meant for ONE user using it at a time (and one main application/service, really). Trying to turn it into a server that runs many processes simultaneously under different security contexts is just not what it was meant to do. The only reason why that kinda sort of works is because of hacks upon hacks upon hacks and very careful engineering around a seemingly endless array of stupid limitations that are a core part of the OS.

        • kbin_space_program
          link
          fedilink
          61 month ago

          Please go read up on how this error happened.

          This is not a backwards compatibility thing, or on Microsoft at all, despite the flaws you accurately point out. For that matter the entire architecture of modern PCs is a weird hodgepodge of new systems tacked onto older ones.

          1. Crowdstrike’s signed driver was set to load at boot, edit: by Crowdstrike.
          2. Crowdstrike’s signed driver was running unsigned code at the kernel level and it crashed. It crashed because the code was trying to read a pointer from the corrupt file data, and it had no protection at all against a bad file.

          Just to reiterate: It loaded up a file and read from it at the kernel level without any checks that the file was valid.

          1. As it should, windows treats any crash at the kernel level as a critical issue. and bluescreens the system to protect it.

          The entire fix is to boot into safe mode and delete the corrupt update file crowdstrike sent.

        • @[email protected]
          link
          fedilink
          English
          3
          edit-2
          1 month ago

          I enjoy hating on Windows as much as the next guy who installed Linux on their laptop once, but the bottom line is 90 percent of businesses use it because it does work.

          Blaming the people who made the decision to purchase arguably the most popular EDR solution on the planet and use it (those bastards!) does nothing but show a lack of understanding how any business related IT decisions work.

      • @[email protected]
        link
        fedilink
        11 month ago

        Alternatively, they could have taken Crowdstrike’s offer of layered rollouts, but Delta declined this and wanted all updates immediately to all devices.