• @0x0
    1 month ago

    On Wednesday, CrowdStrike released a report outlining the initial results of its investigation into the incident, which involved a file that helps CrowdStrike’s security platform look for signs of malicious hacking on customer devices.

    The company routinely tests its software updates before pushing them out to customers, CrowdStrike said in the report. But on July 19, a bug in CrowdStrike’s cloud-based testing system — specifically, the part that runs validation checks on new updates prior to release — ended up allowing the software to be pushed out “despite containing problematic content data.”

    When Windows devices using CrowdStrike’s cybersecurity tools tried to access the flawed file, it caused an “out-of-bounds memory read” that “could not be gracefully handled, resulting in a Windows operating system crash,” CrowdStrike said.

    Couldn’t it, though? 🤔

    And CrowdStrike said it also plans to move to a staggered approach to releasing content updates so that not everyone receives the same update at once, and to give customers more fine-grained control over when the updates are installed.

    I thought they were already supposed to be doing this?

    • @[email protected]
      1 month ago

      The fact that they weren’t already doing staggered releases is mind-boggling. I work for a company with a minuscule fraction of CrowdStrike’s user base / value, and even we do staggered releases.
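
      For anyone wondering what a staggered release can look like in practice, here is a minimal sketch. It assumes a made-up `WAVES` schedule and hashes a device ID into a stable bucket so that each wave is a superset of the one before it. None of these names come from CrowdStrike; they are purely illustrative.

```python
import hashlib

# Hypothetical schedule: cumulative percentage of the fleet covered by each wave.
WAVES = [1, 10, 50, 100]


def bucket(device_id: str) -> int:
    """Map a device ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_receive(device_id: str, wave: int) -> bool:
    """Return True if the device is covered by the given wave (0-based index)."""
    return bucket(device_id) < WAVES[wave]


def devices_in_wave(device_ids, wave):
    """Return the devices that are eligible for the update at this wave."""
    return [d for d in device_ids if should_receive(d, wave)]
```

      Because the bucket depends only on the device ID, a machine that got the update in an early wave stays included in every later wave, and the rollout can be paused between waves if something looks wrong.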

      • @[email protected]
        1 month ago

        They do have staggered releases, but it’s a bit more complicated. The client that you run does have versioning and you can choose to lag behind the current build, but this was a bad definition update. Most people want the latest definition to protect themselves from zero days. The whole thing is complicated and a bit wonky, but the real issue here is CrowdStrike’s kernel driver not validating the content of the definition before loading it.
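
        To make "validate before loading" concrete, here is a rough sketch in Python rather than kernel code. The `Definition` format, where each rule names the index of an input field it inspects, is an assumption made up for illustration and is not CrowdStrike's actual file format. The idea is just to reject a definition that points outside the available data, and to keep the last known-good one instead.

```python
from dataclasses import dataclass


class InvalidDefinition(Exception):
    """Raised when a definition refers to data that does not exist."""


@dataclass
class Definition:
    # Hypothetical format: each entry is the index of an input field to inspect.
    field_indices: list


def validate(definition: Definition, num_inputs: int) -> None:
    """Reject the definition up front instead of failing in the middle of a read."""
    if not definition.field_indices:
        raise InvalidDefinition("definition contains no rules")
    for idx in definition.field_indices:
        # Negative indices are rejected too; Python would otherwise wrap them.
        if not 0 <= idx < num_inputs:
            raise InvalidDefinition(f"field index {idx} is out of range")


def load(definition: Definition, inputs: list) -> list:
    """Validate first, then read only fields that are known to exist."""
    validate(definition, len(inputs))
    return [inputs[i] for i in definition.field_indices]


def load_or_keep(new: Definition, current: Definition, inputs: list):
    """Try the new definition; fall back to the current one if it is invalid."""
    try:
        return load(new, inputs), new
    except InvalidDefinition:
        return load(current, inputs), current
```

        With three input fields, a definition that asks for index 3 gets rejected and the previous definition stays active, which is the "graceful" outcome the parent comment is asking about.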

        • @[email protected]
          1 month ago

          Makes sense that it was a definitions update that caused this, and I get why that’s not something you’d want to lag behind on like you could with the agent. (Putting aside that one of the selling points of next-gen AV/EDR tools is that they’re less reliant on definitions updates compared to traditional AV.) It’s just a bit wild that there isn’t more testing in place.

          It’s like we’re always walking this fine line between “security at all costs” vs “stability, convenience, etc”. By pushing definitions as quickly as possible, you improve security, but you’re taking some level of risk too. In some alternate universe, CS didn’t push definitions quickly enough, and a bunch of companies got hit with a zero-day. I’d say it’s an impossible situation sometimes, but if I had to choose between an outage and a data breach, I’m choosing the outage every time.

    • @[email protected]
      1 month ago

      Couldn’t it, though? 🤔

      IANAD and AFAIU, not in kernel mode. Things like trying to read nonexistent memory in kernel mode are supposed to crash the system, because continuing could be worse.

      • @0x0
        1 month ago

        I meant couldn’t they test for a NULL pointer.

        • @[email protected]
          1 month ago

          They could, and clearly they should have done that, but hindsight is 20/20. Software is complex and there are a lot of places where invalid data could come in.

    • @cheddar
      1 month ago

      The company routinely tests its software updates before pushing them out to customers, CrowdStrike said in the report. But on July 19, a bug in CrowdStrike’s cloud-based testing system — specifically, the part that runs validation checks on new updates prior to release — ended up allowing the software to be pushed out “despite containing problematic content data.”

      It is time to write tests for tests!

      • @[email protected]
        1 month ago

        My thoughts are to have a set of machines that have to run the update for a while. The update only moves forward if all of them pass; if any single machine fails, it halts any further rollout.
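
        A minimal sketch of that gate, with everything hypothetical: `deploy` and `is_healthy` are placeholder callbacks you would supply yourself, and `is_healthy` is assumed to be called only after the canaries have run the update for the soak period.

```python
from typing import Callable, List


def canary_gate(canaries: List[str], is_healthy: Callable[[str], bool]) -> bool:
    """Return True only if there is at least one canary and every one is healthy."""
    results = [is_healthy(machine) for machine in canaries]
    return bool(results) and all(results)


def rollout(
    update: str,
    canaries: List[str],
    fleet: List[str],
    deploy: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Deploy to canaries first; continue to the fleet only if the gate passes."""
    for machine in canaries:
        deploy(machine, update)
    if not canary_gate(canaries, is_healthy):
        return False  # a single failing canary halts any further rollout
    for machine in fleet:
        deploy(machine, update)
    return True
```

        An empty canary list counts as a failure here on purpose, so a misconfigured pipeline cannot skip the gate by accident.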

    • @[email protected]
      1 month ago

      a bug in CrowdStrike’s cloud-based testing system

      Always blame the tests. There are so many dark patterns in this industry, including blaming QA for being the last group to touch a release, that I never believe “it’s the tests”.

      There’s usually something more systemic going on when something like this is missed by project management and developers: maybe they have a blind spot and assume it will never happen, maybe there’s a lack of communication or planning, maybe they outsourced testing to the cheapest offshore providers, or maybe everyone is under huge time pressure. But no, “it’s the tests”.

      OK, maybe I’m not impartial, but when I’m doing a root cause analysis on how something like this got out, my employer expects a better answer than “it’s the tests”.