3 Engineers Fight Server Crisis: Azure Data Center Outage

Three engineers battle a server crisis during Azure's data center outage. Discover their efforts in restoring services and minimizing disruptions.

System outages happen frequently, but one lasting more than 24 hours is uncommon.

Recently, a sudden interruption in Microsoft Azure services in Sydney, Australia, resulted in users being unable to access Azure, Microsoft 365, and Power Platform services for over 24 hours. Microsoft subsequently released a preliminary analysis report, sparking public attention.

The report attributed the cause to a voltage drop that knocked part of a data center's cooling capacity in one availability zone offline. Because the cooling system could not operate properly, temperatures rose, forcing the data center to shut infrastructure down automatically.

At the same time, Microsoft acknowledged that the cooling system could have been started manually. However, a shortage of on-duty engineers at the large data center hindered manual activation. According to IT news reports, only three engineers were on duty at the time.

This incident not only sounded an alarm for cloud service companies operating data centers but also raised the question of how many IT engineers are enough to staff a data center.

01

The Data Center's Chiller Units Failed, and a Team of Only Three Was Too Small to Restart Them Manually

To recap the entire event, let’s start with a severe thunderstorm that occurred in Sydney, Australia, on August 30th.

According to the Australian weather service WeatherZone, the city experienced approximately 22,000 lightning strikes in just three hours that day, and the storm caused power outages affecting around 30,000 people.

According to Microsoft's post-incident report, the trouble began on August 30, 2023, at 08:41 UTC, when a voltage drop in the Australia East region caused some cooling equipment in a data center within one availability zone to go offline.

Microsoft explained that the affected data center had a cooling system consisting of seven chiller units, with five of them in operation and two on standby (N+2) before the voltage drop event.

When the voltage drop occurred, all five operating chiller units failed, and only one of the two standby units came online.
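
To make the redundancy math concrete, here is a minimal sketch of the N+2 arrangement with assumed per-chiller capacity and heat-load figures (none of these numbers come from Microsoft's report): five running units cover the load, but a single standby unit cannot.

```python
# Hypothetical illustration of N+2 chiller redundancy (numbers are assumed,
# not from Microsoft's report): 5 active + 2 standby units, where any 5
# units can carry the full heat load.

HEAT_LOAD_KW = 5000          # assumed total facility heat load
CHILLER_CAPACITY_KW = 1000   # assumed capacity of a single chiller unit

def cooling_is_sufficient(running_units: int) -> bool:
    """Return True if the running chillers can absorb the full heat load."""
    return running_units * CHILLER_CAPACITY_KW >= HEAT_LOAD_KW

# Normal operation: 5 of 7 units running (N+2 redundancy).
print(cooling_is_sufficient(5))   # True  -> load is covered

# After the voltage drop: the 5 active units fault and only 1 standby starts.
print(cooling_is_sufficient(1))   # False -> chilled water loop heats up
```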

An hour after the power outage, on-site engineers reached the rooftop cooling unit area and immediately initiated the documented Emergency Operating Procedure (EOP) to attempt a restart of the chiller units but were unsuccessful.

The failed restart may have been due in part to a shortage of on-site personnel. Microsoft stated, “Due to the scale of the data center campus and insufficient staffing of the night team to promptly restart the chiller units, only three personnel were present on-site at that time.”

02

Shutting Down Servers to Restore Cooling

Of course, insufficient on-site staffing was not the only factor; the chiller units themselves also malfunctioned.

Preliminary investigations revealed that when the five operating chiller units failed, they did not automatically restart because the respective pumps did not receive signals from the chiller units.

“This is crucial because it’s the key to the successful restart of the chiller units,” Microsoft stated in the report, “We are working with our OEM supplier to investigate why the chiller units did not command their pumps to start.”

As a result, the malfunctioning chiller units couldn’t be manually restarted, causing the temperature in the chilled water loop to exceed the threshold.

This condition persisted until 11:34 that day, when components of the affected data center began issuing infrastructure overheating warnings, which triggered a shutdown of selected compute, network, and storage infrastructure. This was in line with the internal design intent: shutting down this infrastructure protects data durability and hardware integrity.
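
The automatic shutdown described above is essentially a threshold check on the chilled-water-loop temperature. The sketch below illustrates the idea with assumed warning/shutdown thresholds and a hypothetical shutdown ordering; it is not Microsoft's actual control logic.

```python
# Minimal sketch of threshold-triggered protective shutdown (assumed
# thresholds and ordering; not Microsoft's actual control logic).

WARNING_THRESHOLD_C = 30.0    # assumed: issue overheating warnings
SHUTDOWN_THRESHOLD_C = 35.0   # assumed: begin shutting down infrastructure

# Hypothetical shutdown order chosen to protect data durability:
SHUTDOWN_ORDER = ["compute", "network", "storage"]

def react_to_loop_temperature(temp_c: float) -> list[str]:
    """Return the infrastructure classes to power down for a given
    chilled-water-loop temperature."""
    if temp_c >= SHUTDOWN_THRESHOLD_C:
        return SHUTDOWN_ORDER          # protect data and hardware integrity
    if temp_c >= WARNING_THRESHOLD_C:
        print("infrastructure overheating warning")
    return []                          # keep everything running

print(react_to_loop_temperature(32.0))  # warning only, nothing shut down
print(react_to_loop_temperature(36.5))  # ['compute', 'network', 'storage']
```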

With no other option, the on-duty engineers at Microsoft that night had to shut down two of the affected servers.

03

The Road to Server Recovery was Fraught with Numerous Challenges

Fortunately, the practice of shutting down servers to reduce the heat load proved to be effective. “This successfully lowered the temperature of the chilled water loop below the required threshold and restored cooling capacity,” Microsoft stated in the report.

However, not everything went smoothly during the recovery process.

According to the preliminary post-incident report, power was restored to all hardware by 15:10 that day, after which the storage infrastructure began coming back online. As the underlying compute and storage scale units recovered, the dependent Azure services also began to recover, though some encountered issues during reactivation.

From a storage perspective, seven tenants were affected: five standard storage tenants and two premium storage tenants. Although all storage data is kept in multiple copies across different storage servers, in some cases failures and delays across several affected servers left every copy unavailable.
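
Why does losing several servers take data offline even when it exists in multiple copies? The sketch below shows the basic availability condition: a piece of data stays readable only while at least one of the servers holding a copy is healthy. The replica layout and server names are hypothetical, not Azure's actual placement scheme.

```python
# Minimal sketch: data stays readable only while at least one replica lives
# on a healthy storage server (hypothetical layout; not Azure's placement logic).

# Map of data extent -> storage servers holding a copy (assumed example).
replicas = {
    "extent-A": ["srv-01", "srv-02", "srv-03"],
    "extent-B": ["srv-02", "srv-04", "srv-05"],
}

# Servers that failed or are lagging after the thermal shutdown (assumed).
unhealthy_servers = {"srv-02", "srv-04", "srv-05"}

def is_readable(extent: str) -> bool:
    """An extent is readable if any of its replica servers is healthy."""
    return any(srv not in unhealthy_servers for srv in replicas[extent])

print(is_readable("extent-A"))  # True  -> one healthy copy remains
print(is_readable("extent-B"))  # False -> every copy is unavailable
```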

Three main factors contributed to the delayed restoration of storage infrastructure functionality:

  • First, some storage hardware was damaged due to the earlier temperature increase in the data center, requiring extensive troubleshooting. Since the storage nodes themselves were not online, diagnostics couldn’t identify the faults, so the on-site data center team had to manually remove and reinstall components one by one to determine which specific components were preventing the startup of each node.
  • Second, multiple components needed to be replaced to successfully recover the data and restore the affected nodes. To fully recover the data, some original/faulty components had to be temporarily reinstalled on various servers.
  • Third, the automated recovery tooling also misbehaved: it erroneously approved outdated requests and marked some healthy nodes as unhealthy, slowing down the storage recovery process (a common safeguard against this failure mode is sketched after this list).
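
One common safeguard against the automation failure described in the third point is to reject state-change requests whose underlying health observation is too old. The sketch below illustrates that pattern with an assumed freshness window; it is not Microsoft's actual recovery tooling.

```python
# Illustrative safeguard against acting on stale automation requests.
# (Hypothetical pattern; not Microsoft's actual recovery tooling.)

import time

MAX_REQUEST_AGE_S = 300  # assumed freshness window: 5 minutes

def should_apply(request: dict, now: float | None = None) -> bool:
    """Apply a 'mark node unhealthy' request only if its observation is fresh."""
    now = time.time() if now is None else now
    age = now - request["observed_at"]
    if age > MAX_REQUEST_AGE_S:
        # The node may have recovered since the observation was taken.
        return False
    return True

stale_request = {"node": "storage-node-17", "action": "mark_unhealthy",
                 "observed_at": time.time() - 3600}   # observed an hour ago
print(should_apply(stale_request))  # False -> do not touch a possibly healthy node
```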

Additionally, from a SQL perspective, even once power was fully restored to the affected data center, the service's recovery was gated by the recovery of its dependencies, primarily Azure standard storage. Many general-purpose databases remained unavailable until those storage services had recovered.

Meanwhile, Microsoft noted that the affected tenant environment, which hosts over 250,000 SQL databases, exhibited a variety of failure patterns: some databases were entirely unavailable, others experienced intermittent connectivity issues, and some remained fully operational.

This made it challenging to determine which customers were still affected. When on-site engineers then attempted to migrate databases out of the affected environment, they had to rely on SQL tooling that had not been sufficiently tested.
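
Triaging that many databases essentially means probing each one repeatedly and classifying it as healthy, intermittent, or unavailable. The sketch below shows that kind of classification with a hypothetical probe function and assumed thresholds; it is not Microsoft's internal tooling.

```python
# Minimal triage sketch: classify each database by probing connectivity a few
# times. probe() is a hypothetical placeholder, and the thresholds are assumed;
# this is not Microsoft's internal tooling.

import random

def probe(database: str) -> bool:
    """Placeholder for a real connection attempt; randomized here for illustration."""
    return random.random() > 0.4

def classify(database: str, attempts: int = 5) -> str:
    successes = sum(probe(database) for _ in range(attempts))
    if successes == attempts:
        return "healthy"
    if successes == 0:
        return "unavailable"
    return "intermittent"        # candidate for migration out of the environment

for db in ["orders-db", "billing-db", "telemetry-db"]:
    print(db, classify(db))
```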

Amidst these various challenges, the downtime kept extending. According to the preliminary analysis report, the outage that began on August 30 was not fully resolved until 06:40 on September 1, when all standard storage tenants had been successfully recovered.

04

How to Reduce the Likelihood and Impact of Such Events

This report represents Microsoft's initial analysis, published within three days of the incident; a full post-incident report is expected within 14 days.

Based on this preliminary analysis, Microsoft has drawn the following insights and lessons learned from the perspective of data center power/cooling:

  1. Insufficient Night Team Staffing: Given the size of the data center campus, the night shift was not staffed adequately to restart the chiller units promptly. Microsoft has temporarily increased the night team size until the root cause is fully understood and appropriate mitigations are in place. (Earlier reports indicated an increase from three to seven team members.)
  2. Slow Execution of EOP for Chiller Restart: For events of such significant impact, the execution of the Emergency Operating Procedure (EOP) for restarting the chiller units was slow. In response, Microsoft is exploring ways to enhance existing automation to improve adaptability to various voltage drop event types.
  3. Load Prioritization for Chiller Unit Restart: Microsoft is evaluating how to prioritize the load across different subsets of chiller units so that the most heavily loaded subsets are restarted first (see the sketch after this list). Operational runbooks for sequencing workload failovers and equipment shutdowns can help establish these priority orders. Microsoft is also working to improve the reporting of chilled water temperatures so that threshold-based failover/shutdown decisions can be made more promptly.
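
The load-based prioritization described in point 3 boils down to restarting the most heavily loaded chiller subsets first, so the largest amount of cooling capacity returns per restart attempt. A minimal sketch of that ordering, with hypothetical unit names and load figures:

```python
# Minimal sketch of load-based chiller restart prioritization (hypothetical
# unit names and load figures; not Microsoft's actual procedure).

# Chiller subsets and the heat load (kW) each one serves (assumed values).
chiller_load_kw = {
    "chiller-1": 850,
    "chiller-2": 1200,
    "chiller-3": 400,
    "chiller-4": 1100,
    "chiller-5": 650,
}

# Restart the subsets carrying the highest load first, so the most cooling
# capacity comes back online per restart attempt.
restart_order = sorted(chiller_load_kw, key=chiller_load_kw.get, reverse=True)
print(restart_order)  # ['chiller-2', 'chiller-4', 'chiller-1', 'chiller-5', 'chiller-3']
```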

These actions reflect Microsoft’s commitment to learning from this incident and implementing measures to prevent or better manage similar events in the future.

05

How Many Operations Engineers Does a Data Center Need?

After this incident, although Microsoft said it would increase the team size, many netizens did not see it as a staffing problem.

Netizen @dijit said:

“I think people have really forgotten how to operate a data center. Many people think operations is extremely difficult and requires thousands of employees. I know AWS/GCP/Azure like to charge us as if we need to hire a large army of system administrators, but in fact, day-to-day DC operations do not require that many people. Hardware failures are much rarer than you think, and you can resolve them without panic.”

Some users also bluntly said, “Perhaps the management’s idea is just to ‘recruit more people and spend money for peace of mind.’ However, this moment of seeking inner peace will never come because the cloud is becoming more and more complex every year, and it’s hard to keep up.”

What do you think about this?

