One Human Error from Business Disruption at a National Scale?

Those listening to the podcast episodes month over month may notice a theme emerging: identifying and protecting a path to operational resilience is typically what matters most to an organization. For the second year in a row, the Caffeinated Risk Summer show coincided with a widespread outage at a major Canadian business. On July 8th, 2022, Rogers Communications reported a national network outage that left millions without cell or internet service and thousands of retailers unable to accept Interac payments. On June 25th, 2023, Suncor Energy Inc. issued a press release confirming a cyber security incident. The release was understandably light on details beyond customer record safety, but ensuing speculation pegged the impact at millions.

While many in the Calgary I.T. community know each other, details on the exact cause of the Suncor incident remain, as they should, tightly held, so this post focuses on the publicly observable outcomes. The Rogers and Suncor incidents are similar in timing (early summer) and in impact (payment card system availability), and potentially in initial cause: human error. While Rogers admitted its network outage was due to a mistake in planned upgrade procedures, we have no insight into the actual cause of the Suncor incident, nor shall we speculate. Instead, we can look at published data trends and government intelligence to complete the threat model. As Jack Jones and Jack Freund maintained in their seminal risk management text, “we often have more data than we think”.

Cyberthreat Defense Report
Infographic – cyberthreat report highlights

The 2023 Cyberthreat Defense Report was the basis for the Summer Show podcast, and it is worth noting that the top two obstacles to defending against cyberthreats were human factors. The Canadian Centre for Cyber Security lists numerous attack surface areas vulnerable to cyber threats, including cyber crime. Cyber crime goes by various names, such as phishing, ransomware, social engineering, and business email compromise, but the common element is a human inside the organization using the organization’s technology to engage with an adversarial force.

While some organizational leaders have been quick to assess human error as a staffing or skills issue, opting for ever more training and in some cases even threats of dismissal, hopefully we are turning a corner on this legacy and rethinking our approach. ESRM takes a mission-first approach to security prioritization, centred on business engagement, and the Idaho National Labs CCE model challenges us to look at each of those mission-impacting scenarios, identify how cyber elements could play a part in disruption, and reengineer around them. I am clearly a CCE fan, mentioning it on multiple episodes, buying copies of the book for my detection engineering teammates and sharing the program link with all unsuspecting folks who ask me about organizational resilience or operational technology security, but never mistake enthusiasm for the truth without testing. Whether it is a software design flaw, a process design flaw, or simply a stress-induced cognitive error, I believe we need to accept that human error will occur at some point in the system and design systems accordingly. The challenge, of course, is that we cannot predict exactly where or how such errors will appear; therefore we need a different approach than “prevent everything” and “don’t screw up or you’re fired”.

The CCE book uses the term “hope and hygiene” for a failed security model often played out as compliance exercises, vulnerability scanning and simplistic user awareness training. Paraphrasing here: while such actions are important, they neglect the time-tested reality that at some point in the future a cyber-related failure will happen, and the organization should be able to recover. The “all roads lead to Rome” idiom applied to resilience also shows up in the DevOps camps, very well summarized by luminary Mark Russinovich in a 2020 Microsoft blog post, and in an offhand quote I overheard at an industry security summit this past winter, whose source shall remain anonymous due to subject sensitivity and my memory.

“Take a look at your network diagrams and all your maps of stuff. Close your eyes, put your thumb on something and say ‘XXXX now owns that’, and think through how you are going to get operations restored”
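For readers who like to prototype, the thumb exercise can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the component names and the `DEPENDENTS` map are hypothetical placeholders rather than anyone's real architecture, and `blast_radius` and `thumb_exercise` are illustrative helper names. It picks one component at random, treats it as lost to an adversary, and lists everything downstream that a restoration plan would have to cover.

```python
import random
from collections import deque

# Hypothetical dependency map: each key is a component, each value lists
# the components that directly depend on it. All names are illustrative.
DEPENDENTS = {
    "active_directory": ["file_shares", "vpn", "payment_gateway"],
    "wan_link": ["vpn", "point_of_sale"],
    "payment_gateway": ["point_of_sale"],
    "vpn": ["remote_support"],
    "file_shares": [],
    "point_of_sale": [],
    "remote_support": [],
}


def blast_radius(compromised, dependents):
    """Return every component downstream of the compromised one."""
    seen = set()
    queue = deque([compromised])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen and child != compromised:
                seen.add(child)
                queue.append(child)
    return seen


def thumb_exercise(dependents, rng=random):
    """Put a 'thumb' on one random component and list what it takes down."""
    target = rng.choice(sorted(dependents))
    return target, sorted(blast_radius(target, dependents))


if __name__ == "__main__":
    target, impacted = thumb_exercise(DEPENDENTS)
    print(f"Adversary now owns: {target}")
    print("Plan restoration for:", ", ".join(impacted) or "nothing downstream")
```

The output is only as good as the map behind it, so the real work remains the conversation about what depends on what and how each piece would be rebuilt.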

The digitalization genie is out of the bottle and we are increasingly dependent on interconnected supply chains, automation, cyber-physical and virtualization systems for almost every aspect of our daily lives. This interconnectedness creates a list of vulnerabilities that grows almost exponentially, most of which will never come to pass. Identifying the key intersections of cyber element failure and cascading impact therefore becomes the brave new world security professionals must lead our organizations into. I am offering some awkward conversation starters, not as an affront to past leadership decisions but as a chance to improve each of our security programs in meaningful ways going forward, before we too fall victim.

  1. Much of our defense posture relies on Active Directory controls and privileged account protection measures. How would we rebuild if we lost control of the corporate domain?
  2. What would we do if an adversary re-encrypted all our backup systems and destroyed our active accounts databases?
  3. We have ensured more than 95% of our workstations and servers are running a top-tier endpoint detection and response product. What would we do if an adversary were able to unhook that process?
  4. What if there is a mistake in the next release of our custom system that we don’t pick up in UAT? How much could we stand to lose?
  5. How long can we operate if our main WAN provider is unavailable for more than 8 hours?
  6. How can we respond if an adversary takes control of our automated software installation platform to distribute their malware?

Admittedly, these will not be easy conversations, and every organization will need to do its own analysis. That said, let’s end this post on an optimistic note: nothing is impossible once we are committed, even if that commitment is to accepting a loss. Consider the following:

  • There are many skilled and capable people working in our ICT departments,
  • Cyber education is now mainstream, not a dark art,
  • Hardware and software quality is higher than it has ever been while cost is going in the other direction,
  • Organizations are investing in cyber security,
  • Rogers did repair their nationwide outage in a couple of days,
  • Interac did invest in network resilience,
  • Petro-Canada point of sale services were restored in less than six days.
