Risk! Engineers Talk Governance

Major IT Outage - Criticality & Resilience

Richard Robinson & Gaye Francis, Episode 31

In this special Risk! Engineers Talk Governance episode, due diligence engineers Richard Robinson and Gaye Francis discuss the recent IT outage that affected systems worldwide and the importance of criticality and resilience. 

Main take-aways include:

  • The outage was predictable and foreseeable. It was a critical incident.
  • You can't gold plate a single system so that it won't fail. 
  • The importance of redundancy – ensuring you've got another independent system that doesn't rely on the primary system.
  • Engineering organisations are required to look at the credible worst-case scenarios that can happen, and to put controls in place to make sure they're managed.
  • You need to positively demonstrate due diligence about what you're doing. And due diligence means that you take into account the credible critical issues. 
  • When doing these reviews, you have to talk with the senior decision makers, who understand the business’ critical requirements.

 

If you’d like to find out more about Richard and Gaye’s work, head to https://www.r2a.com.au.

The book they refer to is Engineering Due Diligence and can be purchased online at https://www.r2a.com.au/store/p/r2a-engineering-due-diligence-textbook.

 

Megan (Producer) (00:00):

Welcome to Risk! Engineers Talk Governance. Following Friday's major IT outage that affected systems around the world, we thought we'd bring you this special episode with due diligence engineers Richard Robinson and Gaye Francis, discussing the importance of criticality and resilience.

(00:20):

If you enjoy the episode and our podcast, help spread our work by sharing, rating and subscribing. And watch out for season four, which will be coming soon. Enjoy the episode.

Gaye Francis (00:32):

Hi Richard. Welcome to a special podcast session today.

Richard Robinson (00:36):

Yeah, it's quite remarkable we're actually here (in Melbourne) because we were in the air when that recent IT debacle occurred.

Gaye Francis (00:41):

We were. So we thought we'd do a podcast on the recent IT failure that affected so many systems in Australia and around the world and just the way that it relates to criticality and resilience.

Richard Robinson (00:58):

Yeah, and basically this is something we talk about all the time, and our book (Engineering Due Diligence, the R2A text) actually speaks about it at some length. We understand the professor of engineering at RMIT was basically going through it again, because people were asking if it was just an accident, and we were sort of going: not really, it was predictable and should have been managed, and they could and should have done more than they did.

Gaye Francis (01:17):

So even though it was a low likelihood event, the consequences associated with it were huge and that can be seen around the world with all the implications that it had. So again, it's that criticality argument rather than risk argument. So yes, it may not have been likely, but if it did happen, the consequences were huge.

Richard Robinson (01:39):

And an awful lot of this from our point of view comes down to the finance side of things, because organisations are always trying to optimise return. And so what they try to do is push for the greatest availability at the least cost. Now the problem with that, of course, is that if you're relying on a single system, you can't gold plate it so that it won't fail. So if you're down to one system and it fails, then you must have a backup.

(02:03):

Now, we see this all the time in the electrical business, because the economic regulators for the energy business sort of say, look, we want you to have cheap power. And so what they do is drive the headroom out of the network, so that yes, you are using your assets at optimum utility, you're getting the maximum value out of what you've spent on the resources. But there's no resilience. In the electrical business they talk about N minus one failures, and the intention is that if a single failure happens, like one transmission line fails, the network shouldn't care; it should just take over. But that means you've got to have surplus capacity to achieve that outcome. And obviously that's not an economic thing to do.
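
To make the N minus one idea concrete, here's a minimal Python sketch (the figures, units and function name are hypothetical, not from the episode): a network is N-1 secure if demand is still met after the loss of any single element.

    # Hypothetical N-1 contingency check (illustrative sketch only).
    def survives_n_minus_1(capacities_mw, demand_mw):
        """True if demand is still met after losing any one line."""
        total = sum(capacities_mw)
        return all(total - c >= demand_mw for c in capacities_mw)

    # Three 500 MW lines serving 900 MW of demand: any one line can fail.
    print(survives_n_minus_1([500, 500, 500], 900))  # True: N-1 secure
    # Remove the headroom (two lines only) and a single failure now bites.
    print(survives_n_minus_1([500, 500], 900))       # False: not N-1 secure

That surplus third line is exactly the "uneconomic" headroom Richard describes the regulators driving out.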

Gaye Francis (02:43):

So we're talking about not only resilience, where we sort of talk about the hardening of the network and some of that gold plating stuff, but it's also the redundancy, isn't it? Making sure that you've got another independent system that doesn't rely on the primary system, so that in the event the primary does fail, you've got some sort of backup.

Richard Robinson (03:03):

Correct. And there's lots of different ways to do it. And as we hammer in the book, you can't gold plate a single system; you just can't stop all single points of failure. The only way to get high availability at low cost is to have two redundant systems in parallel. Now they don't have to be fully redundant, with all the other things that you do, but if you don't actually analyse the system like that (and that's what an engineering process will do for you), then you really don't know what's going to happen when it all goes wrong.
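
Richard's "two redundant systems in parallel" point is simple arithmetic: if the two systems fail independently, the pair is down only when both are down. A minimal Python sketch, with hypothetical availability figures:

    # Availability of two independent systems in parallel (figures illustrative).
    def parallel_availability(a1, a2):
        """Probability that at least one of two independent systems is up."""
        return 1 - (1 - a1) * (1 - a2)

    single = 0.99                                 # one system, 99% available
    pair = parallel_availability(single, single)  # 0.9999, i.e. 99.99%
    print(f"single: {single:.2%}, redundant pair: {pair:.2%}")

The caveat is genuine independence: a common-mode failure, such as the same faulty update pushed to both systems, defeats the arithmetic entirely.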

Gaye Francis (03:31):

So I think there's a couple of lessons learned out of this IT debacle from last week. The first is that it was absolutely foreseeable; it wasn't an accident. Engineering organisations are required to look at those credible worst case scenarios that can happen, and what you can do and put in place to make sure that they're managed. So I think it was a bit of a cop out if they're saying in the news that it was an accident and couldn't have been foreseen, because systems fail all the time.

(04:04):

I think one of the other interesting things we talked about around the office this morning was how, as a society, we've come to rely on technology and electronic everything, and people didn't know what to do. I think that threw it into more chaos. Richard and I were actually in New Zealand last week, and you had to have an electronic ticket to get on the plane. Paper tickets are becoming rarer and rarer, so if you didn't have an iPhone or a Samsung or something... And then you're juggling your phone and your passport and your baggage and all that sort of stuff.

Richard Robinson (04:42):

I watched you do that. It wasn't a great success.

Gaye Francis (04:45):

I felt like the little old lady, but that's okay. What I was saying is technology's supposed to make things easier, but sometimes it doesn't, when you have to have access to all those different things on your phone at the same time.

Richard Robinson (05:01):

Well, I mean the way it's gone, it's like the IT giants, there's about five of them out there these days, and they've just dominated everything. But as a small organisation, we've actually gone to a lot of trouble to disconnect ourselves from those things. So we are not online all the time, and it's the only way we can guarantee security, because basically, if you want to be secure, don't be online. I mean, it's like trying to stop your kids getting online; you might not be successful in the long run, but basically, if you don't want your information out there, don't be online.

(05:31):

That's the whole point of the WHS legislation: you eliminate if you can, and if you can't eliminate, then you reduce. People aren't testing the elimination option like they ought to be. And I don't fully understand quite why that's the case.

Gaye Francis (05:49):

I've had an absolute brain blank.

Richard Robinson (05:52):

Well, that's not constructive, Gaye! So where would you like to land this thing? I think our point about this is that you need to positively demonstrate due diligence about what you're doing. And due diligence means that you take into account the credible critical issues. I mean, it might be reasonable to say a meteorite strike can turn up, but we're not going to worry about it because there's not an awful lot we can do about it if it does happen.

Gaye Francis (06:14):

So what you're saying is you look at the credible worst case scenarios, but then also what controls you can put in place. And I think that's what people are missing: they're not actually looking for those additional things that can be done.

Richard Robinson (06:27):

Well, you might remember those reviews we did for the bank a while back. There was another big consulting firm doing a bottom-up availability assessment of the bank's backup systems, and we came in top-down. And I remember talking to the senior guy at the backup centre and asking: is there any particular time when it's more important for the bank to be online than other times during the day? He said, oh yes, there's two times. There's one at about midnight where we balance up between all the other banks. It's all done very quickly, but if we're not online for that particular half hour or so, our books are out of balance for the whole of the next day. He said that causes an awful lot of trouble. So for that window, we definitely want to be online.

(07:08):

He said, the rest of the time we write all the data to two different sites, and we can rerun the whole thing. Maybe people get bumped off for half an hour or an hour or something like that, but the system will recover, because we've got two genuinely independent sites, two genuinely independent processes, and we write the data to both sites. It's not a problem. I mean, there will be some inconvenience, but we have redundancy, he said. But at that particular window, for that half hour in the day, we don't have redundancy; we either hit it or we miss it.

Gaye Francis (07:38):

And we must be online for it. I think that's another interesting point, isn't it? When you're doing these sorts of reviews, you really do have to appeal to the people that live and die by it, and they have to understand their business and its critical requirements to be able to say what's the credible, critical, worst case scenario sort of stuff.

Richard Robinson (07:59):

And that's the senior decision makers. Now sometimes they don't fully understand just what the criticalities are, and perhaps are sometimes a little bit upset to find people like us asking these sorts of questions. But if you don't ask them, you really don't find out. And what was funny about that was this other firm, which once upon a time I used to work for, was doing this bottom-up review, and when we were comparing notes, it wasn't that either of us was wrong; both provide insight. But which was the more important one for that particular task? It was the criticality question, not the average availability.

Gaye Francis (08:36):

I think even in some of the work that we've done recently, that context is all important, isn't it? We were doing a review last week, and actually setting the context of where they wanted to go was the most important part, because that then gives you insight into the criticality. And often you've got to expand the context to a higher level to get that understanding.

Richard Robinson (09:00):

Correct.

Gaye Francis (09:01):

So I guess as a wrap up, from our viewpoint the IT incident last week was definitely not an accident. It was predictable and foreseeable; in our terms, it was a critical incident. Yes, it may not have been likely, but the consequences were huge, and you would've expected a large organisation to have backup processes. And even if the source itself didn't have backup processes, the people that relied on the system should have understood that criticality as well and had systems in place to deal with it.

Richard Robinson (09:45):

Indeed.

Gaye Francis (09:46):

So thanks for joining us, we'll see you next time.

 
