Can we avoid a repeat of Friday's CrowdStrike crash chaos?

Mass IT outage hits companies and infrastructure around the world

Jetstar passengers in Melbourne face crashed check-in computers Source: AAP / JAMES ROSS/EPA

Australians are being warned it could take up to two weeks to fully resolve the disruption to computer systems affected by the global tech outage. Microsoft says up to 8.5 million devices using its operating system were affected by the outage, triggered by a faulty software update from CrowdStrike. But what are the implications for a world that has become so reliant on its computers?


TRANSCRIPT

(SBS World News theme)

“Good evening and welcome to SBS World News, I’m Anton Enus. Well, it’s been an extraordinary day for broadcasters here and around the world as well as for institutions across the nation.”

Anton Enus says the SBS World News bulletin he presented on the evening of the outage (Friday 19 July) was unlike anything he'd done in four decades of broadcasting, as normally reliable computer systems failed - not just for SBS, but for many broadcasters around the world.

Just what happened on the afternoon of the 19th of July is now widely known.

An automatic update by United States software firm CrowdStrike crashed millions of computers around the world, bringing chaos to transport, retail, media and many other areas.

Microsoft says less than two per cent of computers running Windows software were affected - so why was the trouble so widespread?

Associate Professor Mark Gregory is from the school of engineering at RMIT University:

"Those 1.8 per cent are typically enterprise customer computers, and those enterprise customers are the ones that affect much of our lives. They're the banks, the airlines, the critical services and other organisations. And what we know is that the update went wrong. Those computers all crashed."

The outage was caused by a single update automatically rolled out by CrowdStrike, whose security software is used by many large organisations to block malware and cyber attacks.

The fault caused Windows computers to display the so-called 'Blue Screen of Death' and left them trapped in what the experts call a 'recovery boot loop'.

It also hit Microsoft's Azure Cloud, one of the major suppliers of cloud computing, and failures there led to additional breakdowns around the world.

Dr David Glance is the director of the Centre for Software and Security Practice at the University of Western Australia.

He told SBS News a significant part of the problem is that users were happy to let CrowdStrike take care of security without much in the way of checks and balances.

"I think everybody had become complacent in believing that we could trust the companies like CrowdStrike to do the right thing. And clearly this was a massive failing on their part. And this is where I am curious to see what litigation follows because it's just pure negligence on their part to actually release something without the appropriate testing."

Tom Worthington is an honorary lecturer in the School of Computing at the Australian National University.

He also says he's surprised CrowdStrike hadn't picked up that there was an issue in their security update before it was rolled out to millions of computers around the world.

"That's a normal software development practice that we teach to undergraduate students: you don't go from fixing it to releasing it to all your customers. You go through a number of test stages, but there's no way you can eliminate every possible problem with the software. These things will happen from time to time, but you've got to make sure it doesn't take out everything."

Mr Worthington says if someone's entire business is depending on one particular software product working, then they need to make sure they have alternatives.

Associate Professor Gregory says the incident is a consequence of slack security practices.

"It's highlighted the idea of critical single points of failure, but it's also highlighted poor engineering processes and inadequate testing before updates are being pushed out by these major corporations. The idea that this has happened in 2024 is frightening because there's the potential for significant damage to occur, not just to things like airlines and to banks, but also within hospitals and to critical care equipment and systems."

Dr Glance says organisations often rely solely on single providers for software, so if something goes wrong with that software, there's no backup.

He says the more ubiquitous a software product is, the more likely it is that any fault will have major consequences.

"One of the considerations when people are looking for software to use, and it doesn't matter what software it is going for, the most popular and the most widely used is not necessarily always going to be a good choice in future, because of this very problem, and to have variety and also as we said earlier, potentially taking precautions against just automated updates from people, and trusting third parties."

Big companies were happy to use CrowdStrike because standardisation brings a number of benefits: systems are compatible across industry; they run efficiently; and large numbers of staff know how to work with the software.

But as Friday the 19th proved, a single problem can cascade around the world in a matter of minutes.

Tom Worthington agrees it's unwise to rely on a sole provider, however well known it is.

"Because in the main, the systems have worked so reliably, we've fallen into a false sense of security. We've said, okay, well if you just buy from the major supplier everybody uses, it'll be fine. But if everybody buys from them and then you have a problem, everybody has a problem."

Tom Worthington says businesses should always make sure they have some kind of alternative system to fall back on.

He says the key is to be prepared.

"What people need to do is see, do they have a backup system using different software that can be used - so that for a small business, it might be as simple as can you run your business on your smartphone if your cash register/desktop computer isn't working? It might be a manual system you have to go through. For a big business, do you have multiple communications networks from different vendors? Do you have backups of the software? So if there's suddenly a problem, can you roll back to a version that you know was working? And things like do you have contact details for your staff on the weekend and have you paid them to be available to quickly come in and fix these problems?"

In 2023, the US Federal Trade Commission fielded a call for public comments about business practices involving cloud computing.

Microsoft insisted that the cloud market was 'highly dynamic and competitive'.

But rival company Google insisted that Microsoft's licensing restrictions effectively prevented customers from choosing any cloud provider other than Microsoft's own.

Mark Gregory says he has been calling for the Australian Government to impose minimum performance standards on the tech corporations.

"Australia is a third-world country when it comes to technology and telecommunications and IT. We need to move ourselves away from this third-world thinking. What we have is legislation and regulation that is fit for the 20th century. It's not fit for the current century. We have thinking within government that corporations will do right by the nation when we know that they don't."

And Dr Glance has a warning for all:

"No, I don't think this is an isolated event and I think that people may get better at protecting themselves against it, but I think they just need to get smarter about how they do their IT and cybersecurity infrastructure."

