To quote the infamous line from Paul Newman’s 1967 classic, Cool Hand Luke: “What we’ve got here is failure to communicate.”
On March 15th, 2021, Microsoft users globally found themselves unable to use a wide range of services including Teams, Exchange, M365, and even Xbox. The problem was not with those services themselves but with the service that lets them authenticate users and devices: Azure Active Directory.
Azure Active Directory (Azure AD) is an authentication service that validates users and objects in order to grant access to other applications and services. For example, a user logs into a user account managed by Azure AD. Once the user is successfully validated, the system grants access to any application that account has been given permission to use.
If Azure AD cannot authenticate users or other objects, communication between them is blocked and access is denied.
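The dependency described above can be illustrated with a small sketch. This is a toy model, not Azure AD's actual API: the DirectoryService class, its fields, and the can_access method are all hypothetical names invented for illustration. It only shows why healthy applications become unreachable when the identity service behind them is down.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryService:
    """Hypothetical stand-in for an identity provider such as Azure AD."""
    credentials: dict = field(default_factory=dict)   # account -> secret (toy only)
    permissions: dict = field(default_factory=dict)   # account -> set of app names
    available: bool = True                            # False simulates an outage

    def authenticate(self, account: str, secret: str) -> bool:
        # If the directory itself is down, nobody can be validated.
        if not self.available:
            raise ConnectionError("directory service unreachable")
        return self.credentials.get(account) == secret

    def can_access(self, account: str, secret: str, app: str) -> bool:
        # Step 1: validate the user. Step 2: check the app permission.
        try:
            if not self.authenticate(account, secret):
                return False
        except ConnectionError:
            # The application may be healthy, but access is still denied.
            return False
        return app in self.permissions.get(account, set())
```

In this sketch, flipping available to False denies every request, even for users with valid credentials and permissions, which mirrors what users of Teams, Exchange, and Xbox experienced.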
Microsoft summarized the root cause of the outage on March 15th as “an error [that] occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key.”[1]
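The class of bug Microsoft describes can be sketched in a few lines. Microsoft has not published its automation code, so everything here is an assumption made for illustration: the SigningKey type, the past_schedule and retain fields, and both function names are hypothetical. The sketch simply contrasts a cleanup routine that ignores a "retain" flag with one that honors it.

```python
from dataclasses import dataclass


@dataclass
class SigningKey:
    key_id: str
    past_schedule: bool    # True if the time-based schedule considers the key unused
    retain: bool = False   # True if an operator asked to keep the key longer


def keys_to_remove_buggy(keys):
    # Mirrors the described failure mode: the "retain" flag is never checked.
    return [k.key_id for k in keys if k.past_schedule]


def keys_to_remove(keys):
    # Intended behavior: a retained key is kept even when past its schedule.
    return [k.key_id for k in keys if k.past_schedule and not k.retain]


keys = [
    SigningKey("old", past_schedule=True),
    SigningKey("migration", past_schedule=True, retain=True),
    SigningKey("current", past_schedule=False),
]
print(keys_to_remove_buggy(keys))  # the retained migration key is included
print(keys_to_remove(keys))        # only the genuinely unused key is listed
```

Once a key that is still needed is removed, tokens signed with it can no longer be verified, which is consistent with the widespread authentication failures described above.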
Microsoft successfully rolled back recent updates to its Azure AD services, but not before some customers felt the pain for almost 14 hours.
The impact appears to have been focused on Microsoft applications and services. There were no reports of outages from customers using Azure AD for tenant-based domain authentication, where customers use Azure AD to authenticate access to their internal systems.
Unfortunately, customers have no way to mitigate internal Microsoft outages themselves, but the incident does serve as a warning to those who use Microsoft AD as a primary or secondary method of authentication.
Microsoft Active Directory is based on a distributed model in which multiple Domain Controllers synchronize user accounts to validate users. To reduce the risk of customer tenant Microsoft AD services becoming inaccessible, companies should always deploy a secondary Domain Controller outside Microsoft’s environment, either on-site or in another hyperscale cloud such as AWS.
Microsoft has reported several outages at some level over the last year. If you’re concerned about possible outages affecting your clients, contact your Solution Engineer to review suppliers that can help set up Azure AD disaster recovery options.
[1] “Azure Status History,” Azure.Com, last modified 2021, accessed March 16, 2021, https://status.azure.com/en-us/status/history/.