Microsoft says it has resolved a global outage tied to Azure AD that left many customers of its cloud services without access to their business applications and platforms overnight.
According to Downdetector’s outage tracking data, on March 15, Microsoft users faced technical issues with the company’s Microsoft 365 online productivity suite. Shortly after the incident, Microsoft updated its health status page, confirming that customers of its various cloud-based applications may be facing problems while accessing the business applications.
The company also stated that services relying on its cloud-based identity and access management service, Azure Active Directory (AAD), may be affected. The status report further stated that users of its cloud platform Azure, business applications suite Dynamics 365 and Microsoft Managed Desktop service also experienced access problems.
“Service outages due to expired keys and certificates are becoming more common. Unlike recent outages, this one impacted authentication at the Azure AD level, throwing more than a dozen Microsoft services offline. Identifying the root issue didn’t take long, but the process of restoring every individual service took 14 hours,” says Chris Hickman, Chief Security Officer at Keyfactor, a provider of cloud-first PKI-as-a-Service and crypto-agility solutions.
During the incident, Microsoft kept users updated on its progress through its social media channels. The company also acknowledged that a recent update to its authentication system was the key factor behind the disruption for users across the globe.
The root of Microsoft’s Azure incident lay in its key rotation. Microsoft explained that it had deliberately kept a key from expiring because the key was needed to carry out an Azure AD migration. However, Microsoft’s automated process ignored the key’s retained state, which caused tokens signed with that key to be distrusted and resulted in service disruptions.
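The failure mode described above can be illustrated with a minimal sketch. It is not Microsoft’s actual rotation system, which is proprietary; the `SigningKey` type, the `retain` flag and both function names are hypothetical, chosen only to show how a cleanup job that skips the retained-state check would retire a key that is still needed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SigningKey:
    key_id: str
    expires_at: datetime
    retain: bool = False  # hypothetical flag: keep this key trusted past its expiry


def keys_to_retire(keys, now):
    """Intended behaviour: retire only keys that are expired AND not retained."""
    return [k for k in keys if k.expires_at <= now and not k.retain]


def keys_to_retire_ignoring_retain(keys, now):
    """Faulty variant (assumed for illustration): the retain flag is never checked."""
    return [k for k in keys if k.expires_at <= now]


if __name__ == "__main__":
    now = datetime(2021, 3, 15, tzinfo=timezone.utc)
    keys = [
        SigningKey("old-key", datetime(2021, 3, 1, tzinfo=timezone.utc)),
        SigningKey("migration-key", datetime(2021, 3, 1, tzinfo=timezone.utc), retain=True),
        SigningKey("current-key", datetime(2021, 6, 1, tzinfo=timezone.utc)),
    ]
    print([k.key_id for k in keys_to_retire(keys, now)])
    print([k.key_id for k in keys_to_retire_ignoring_retain(keys, now)])
```

In this sketch the faulty variant also retires `migration-key`, so any token signed with it would no longer validate, which mirrors the kind of distrust described in the incident.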
Around 11 pm on March 15, the company rolled out a ‘mitigation worldwide’ to address the issues, with complete ‘remediation’ within an hour of its deployment. Microsoft later rolled back operations to a prior state.
“Keys and certificates are important for security and need to be rotated and replaced regularly. But as is the case with many outages, if keys and certificates aren’t managed properly, they create operational risk,” adds Chris Hickman.
At 11:19 pm on March 15, Microsoft posted on its Twitter account that service health across its various Microsoft 365 services had improved, and that it would take steps to resolve isolated residual impact for customers still affected. On March 16, the company published an update on Twitter: “Our monitoring indicates that the majority of the services have been fully recovered.”
However, this wasn’t the first time that Microsoft had an incident with its Azure platform. In September 2020, users experienced outages tied to its Azure AD service. Another outage a few years ago caused 2.5 hours of downtime.
Looking at the pattern, it is clear that Microsoft and other companies need to manage their keys efficiently. If not, such outages can heavily impact an organization in the short term. “Microsoft has its own proprietary and automated system for key rotation,” says Chris Hickman. “This outage reiterates that whether your business is large or small, keys and certificates are hard to manage.”
“All keys and certificates expire, and even auto-enrolled certificates sometimes fail to renew, which can be difficult to catch until an end-user is impacted. Most companies don’t have the same level of resourcing that Microsoft does. The takeaway here is that companies need to have automated processes and a key and certificate management solution in place to help monitor and manage expiring certificates before they lead to service interruptions,” concluded Chris Hickman.
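As a rough illustration of the monitoring Hickman describes, the sketch below flags certificates that have expired or will expire within a chosen window. It assumes an inventory of certificates is already available; the `CertRecord` type, the `expiring_within` function and the sample names are hypothetical and do not correspond to any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    name: str
    not_after: datetime  # expiry timestamp taken from the certificate


def expiring_within(certs, now, window):
    """Return certs already expired or expiring within `window`, soonest first."""
    cutoff = now + window
    due = [c for c in certs if c.not_after <= cutoff]
    return sorted(due, key=lambda c: c.not_after)


if __name__ == "__main__":
    now = datetime(2021, 3, 15, tzinfo=timezone.utc)
    inventory = [
        CertRecord("api-gateway", now + timedelta(days=90)),
        CertRecord("login-service", now + timedelta(days=10)),
        CertRecord("legacy-vpn", now - timedelta(days=2)),  # already expired
    ]
    # Alert on anything due within 30 days (the threshold is an assumption).
    for cert in expiring_within(inventory, now, timedelta(days=30)):
        print(cert.name, cert.not_after.date())
```

Running a check like this on a schedule, and alerting on the result, is one simple way to catch a failed renewal before an end user does.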