What happens when the IT infrastructure is too big to fail?

Today’s Microsoft outages, linked to a CrowdStrike update, show the immense risk we face if we put all our eggs into one huge world-spanning basket.

Some colleagues initially suggested this was part of a co-ordinated attack on Microsoft’s infrastructure. That now looks unlikely, but it was a reasonable first guess given the company’s ongoing issues with persistent state-sponsored hackers.

However, rather than a massive hacking event, the cause of the outage is merely IT administration and patching problems – which, in fact, account for the majority of Microsoft Azure and M365 outages, though rarely with such a widespread effect. The irony of this particular incident is that, this time, the issues are not down to Microsoft’s own activity but to a security update for CrowdStrike Falcon, which is installed on a high proportion of Windows desktops and servers.

It is therefore not so much a case of Microsoft shooting itself in the foot (again), but of a close and trusted friend doing it for them this time. I doubt that distinction will give Redmond much comfort.

That a ‘protective measure gone wrong’ has brought such instant chaos to so many countries and industry sectors might surprise many people, but the reality is that public cloud infrastructure is both highly complex and surprisingly fragile.

Issues first appeared on Azure’s updates page yesterday evening, with an outage in the Azure US Central region at around 10pm UTC. It is not entirely clear that this is the same problem we are currently seeing, since those Azure problems were later reported as fixed by 6:30am UK time today. That was not long before the UK started to wake up, get online and find that overnight we appear to have dodged a rather large bullet. The Far East and even parts of Europe, which operate in time zones ahead of us, have not fared quite so well, and multiple airlines, airports, transport services, banks and financial processing services have been affected.

Even in the UK, disruption has been reported to trains, the NHS, financial services and a range of commercial services, as well as a puzzling and very public interruption to Sky News broadcasts for some hours. At the time of writing, however, the Azure update page has begun to report that the issues principally lie with the virtual machines themselves. Microsoft recommends that businesses restore from backups taken before 7pm UTC on 18th July.

This tends to confirm that the issue stems from an automated patch or deployment made after that time, which was then able to cascade out to virtually every Microsoft region worldwide – with only Mexico, Central Spain, and China not showing disruption.

In addition, the US government appears to have been spared this time round, potentially because it uses different IT infrastructure – while its cloud may carry the ‘Azure’ name, it’s not the one the rest of the world (and the UK government) uses.

Risk to UK public services

Computer Weekly recently reported Microsoft’s disclosure that, despite assurances made over many years that its services are 100% hosted, operated and supported from within the UK, they are in fact not.

The real concern for UK citizens should be that, over the past 10 years, the UK government has moved core services directly onto Microsoft cloud platforms that are not dedicated to government use, or even located 100% in the UK – they are the same services available to literally any Microsoft customer anywhere in the world.

This means that the UK public sector has no special terms, no specific security protections and, more importantly, no priority of service over the corner shop with an annual M365 subscription.

Police, 999 services, health, and indeed the very fabric of our public society all sit on the Microsoft cloud or have degrees of dependency upon it. After all, cloud services share some connections, which explains the limited reports of AWS and Google Cloud issues today as well – those are almost certainly associated with their Microsoft-connected feeds, or Windows devices.

It’s important to recognise that the Azure and M365 platforms were never designed for the type of services the previous government has used the Microsoft cloud for. In fact, Microsoft’s terms of service warn against relying on the availability of the platform, and strictly prohibit its use for high-value processing, where disruption could result in harm to individuals or significant financial loss.

Despite this, under the cloud-first agenda of the last administration, IT leaders across His Majesty’s Government have run headlong to the Microsoft cloud and have done little to no due diligence to confirm it’s actually suitable for their needs.

This clear disconnect might excuse Microsoft of responsibility, had it not been all too happy to allow critical national infrastructure (CNI) services to be onboarded. Whether the new government continues that practice remains to be seen, but at least one of its newly announced measures from the King’s Speech, an obligation to report cyber issues, would help in our current position.

It’s very unlikely that we will ever properly understand the nature, scale and impact of this incident because there is little incentive and no imperative to report that information. That’s increasingly a problem since our national liabilities and risk exposure are impossible to determine without that information. Right now we really don’t know what information we hold in the cloud, or what cloud it’s in.

Whilst the last government might have liked to think ‘aggregation’ could be ignored, we’ve just found out today that having all your eggs in one basket might be a bad idea.

As a country we are exposed as never before, and this is a heads-up we’d be wise to pay attention to. Whilst I hate to be the bearer of bad news, this is another possible area of crisis the new government needs to prioritise.
