Most people think of datacenters as large buildings filled with hundreds or thousands of cabinets, servers, and network infrastructure. In hyperscale datacenters, that scales up to multiple large buildings, each with multiple data halls, each with thousands of servers. The focus of this blog is not the servers that deliver compute, storage, and applications to users. I’m focused on the electrical power, air conditioning, fans, and other electro-mechanical infrastructure that give those servers a clean, cool environment and a reliable source of power.
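The modern datacenter has several OT systems that are separate from the customer-owned or customer-facing servers and network infrastructure. These include the controls for electrical power (such as UPS units and transfer switches) and for cooling (such as chillers, air conditioning, and fans).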
All of the OT systems above support the production servers and network infrastructure that provide customer services. Because a single OT system may support an entire data hall, building, or campus, its failure or misoperation can cause significant and widespread business interruption, as well as equipment damage.
Here are some examples of impacts that can occur, or have occurred, as a result of failures in datacenter OT systems.
1) Azure Incident VN11-JD8: after a voltage dip, the programming of the chillers prevented them from restarting automatically. A thermal runaway event followed, and the facility was forced to shut down for almost 24 hours.
2) Azure Incident VVTQ-J98: in a similar event, a power sag caused the chillers to fail their restart, which led to a thermal runaway event and forced a shutdown of the facility for half a day.
3) Azure Incident 2LZ0-3DG: the programming of the transfer switches prevented both the primary and backup UPS from correctly restoring power after a disruption, leading to a total momentary outage and restart.
Thank you, Microsoft, for disclosing the root causes so that others can learn from them; Microsoft is one of the few companies that provide such information publicly. In all three cases, the primary cause was a minor programming error in the original design of the OT control system that was exposed only under the right conditions.
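To make the chiller scenario more concrete, here is a minimal Python sketch of how a single restart setting can decide whether cooling comes back after a voltage dip. It is purely illustrative and is not the logic from any of the incidents above: the `Chiller` class, the `auto_restart` flag, the 90% trip threshold, and the three-sample stability window are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class ChillerState(Enum):
    RUNNING = "running"
    TRIPPED = "tripped"


@dataclass
class Chiller:
    """Hypothetical chiller controller; names and thresholds are illustrative."""
    auto_restart: bool                # may the controller clear a trip by itself?
    min_voltage_pct: float = 90.0     # assumed trip threshold (% of nominal)
    stable_ticks_required: int = 3    # assumed healthy samples needed before restart
    state: ChillerState = ChillerState.RUNNING
    _stable_ticks: int = 0

    def step(self, voltage_pct: float) -> ChillerState:
        """Process one voltage sample and return the resulting state."""
        if voltage_pct < self.min_voltage_pct:
            # Undervoltage: trip and reset the stability counter.
            self.state = ChillerState.TRIPPED
            self._stable_ticks = 0
        elif self.state is ChillerState.TRIPPED:
            # Voltage is healthy again: count stable samples.
            self._stable_ticks += 1
            if self.auto_restart and self._stable_ticks >= self.stable_ticks_required:
                self.state = ChillerState.RUNNING
        return self.state


def simulate(chiller: Chiller, voltages: list[float]) -> list[str]:
    """Feed a sequence of voltage samples to the chiller and record each state."""
    return [chiller.step(v).value for v in voltages]


if __name__ == "__main__":
    dip = [100, 70, 100, 100, 100, 100]  # one short sag, then normal voltage
    print(simulate(Chiller(auto_restart=True), dip))
    print(simulate(Chiller(auto_restart=False), dip))
```

With the same one-sample sag, the chiller configured with `auto_restart=True` returns to service once the voltage has been stable for three samples, while the one configured with `auto_restart=False` stays tripped until someone intervenes. In a real facility the heat load keeps rising during that time, which is how a small configuration detail can turn into a thermal runaway.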
It is important to remind the reader that anything that is programmed to behave in a safe and reliable manner can also be re-programmed maliciously to cause downtime or premature physical equipment failure.
The OT infrastructure in a datacenter must be protected against both cyber and physical security threats. DeNexus has developed the DeRISK platform for Datacenters to help quantify the potential financial impact of a cyber or physical security event that causes downtime, equipment damage, and other losses.
If you want to learn more, get in touch with our team, or understand how the above is put to use to quantify and manage cyber risks at 250+ industrial sites monitored by DeNexus, you can contact us at https://www.denexus.io/contact.