More than a year ago, DeNexus was asked to evaluate the cybersecurity risk that a hyperscale datacenter faces from its industrial control systems (ICS) and operational technology (OT).
Most people think of datacenters as large buildings filled with hundreds or thousands of cabinets, servers, and network infrastructure. Hyperscale datacenters scale this up to multiple large buildings, each with multiple data halls, each with thousands of servers. The focus of this blog is not the servers that deliver compute, storage, and applications to users. I’m focused on the electrical power, air conditioning, fans, and other electro-mechanical infrastructure that ensure those servers have a clean, cool environment and a reliable source of power.
The modern datacenter has several OT systems, separate from its customer-owned or customer-facing servers and network infrastructure. These include:
- Air Handling – airflow brings chilled air into the data hall and removes heated air from the environment. OT equipment includes fresh air intake, air pressure balancing, heat exchangers, filters, variable speed fans, and their associated temperature, flow, humidity, and other sensors.
- Water Handling – water is chilled and used in heat exchangers to cool the air for use in the data hall. OT equipment includes chilled water, algae control, and pH control systems, pumps, tanks, chillers, and their associated piping, temperature, pressure, and flow sensors, and metering of freshwater and wastewater.
- Building Management Systems (BMS) – provide visibility and control of the systems above, ensuring an optimal environment of clean air, temperature, and humidity for production servers.
- Battery and Generator Backup Power – provides an uninterruptible power supply (UPS) through batteries for short-term power quality issues, as well as longer sustained power from onsite diesel generators if electrical utility power is lost.
- Electrical Power Management System (EPMS) – controls the reliable flow of electricity into the facility, its redundancy, and energy saving processes to reduce energy costs. OT equipment includes multiple medium voltage (above 1 kV) utility sources, transformers, switchgear, breakers, and transfer switches, as well as the low voltage (under 1 kV) distribution, and their associated protective relays, voltage meters, watt-hour consumption meters, and more.
- Physical Access Control Systems (PACS) – control authorization and access into the physical rooms and zones throughout the facility. OT systems include motion sensors, electronic door latches, magnetic locks, badge readers, PIN and biometric readers, as well as their associated wiring, control units, and user access database server.
- Datacenter Infrastructure Management (DCIM) – provides advanced orchestration of the various systems above including air, water, temperature, pressure, electrical flow, physical access, and more.
All of the OT systems above support the production servers and network infrastructure that provide customer services. Because they are infrastructure components that can support entire data halls, buildings, or campuses, their failure has the potential to cause significant and widespread business interruption, as well as equipment damage.
Here are some example impacts that can occur, or have occurred, as a result of datacenter OT systems.
1) Azure Incident VN11-JD8 – after a voltage dip, the programming of the chillers prevented them from automatically restarting. A thermal runaway event occurred and the facility was forced to shut down for almost 24 hours.
2) A similar event occurred with Azure Incident VVTQ-J98, where a power sag caused chillers to fail their restart, which led to a thermal runaway event and forced the facility to shut down for half a day.
3) Azure Incident 2LZ0-3DG – the programming of transfer switches prevented both the primary and backup UPS from correctly restoring power after a disruption event, leading to a momentary total outage and restart.
Thank you, Microsoft, for disclosing the root causes for others to learn from; it is one of the few companies that provide such information publicly. In all these cases, the outages were primarily due to minor programming errors in the original design of the OT control systems that were exposed under the right conditions.
It is important to remind the reader that anything that is programmed to behave in a safe and reliable manner can also be re-programmed maliciously to cause downtime or premature physical equipment failure.
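To make that point concrete, here is a toy sketch of the chiller-restart scenario described in the incidents above. It is not a model of any real chiller controller or of the Azure facilities: the `ChillerConfig` type, the `simulate_data_hall` function, and every temperature, rate, and timing value are hypothetical and chosen only for illustration. It shows how a single programmed setting decides whether a brief voltage dip is a non-event or a thermal shutdown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChillerConfig:
    """Hypothetical controller settings; not taken from any real product."""
    auto_restart: bool       # restart automatically after a voltage dip?
    restart_delay_min: int   # minutes until cooling resumes after the dip


def simulate_data_hall(
    config: ChillerConfig,
    dip_at_min: int = 5,               # minute at which the voltage dip trips the chiller
    horizon_min: int = 60,             # length of the simulation in minutes
    start_temp_c: float = 24.0,        # assumed normal data hall temperature
    heat_rate_c_per_min: float = 0.5,  # assumed warming rate with no cooling
    cool_rate_c_per_min: float = 1.0,  # assumed cooling rate once the chiller runs
    shutdown_temp_c: float = 40.0,     # assumed thermal shutdown threshold
) -> Optional[int]:
    """Return the minute at which servers shut down, or None if they never do."""
    temp = start_temp_c
    chiller_running = True
    restart_at: Optional[int] = None

    for minute in range(horizon_min):
        if minute == dip_at_min:
            # The dip trips the chiller; only auto-restart schedules a recovery.
            chiller_running = False
            if config.auto_restart:
                restart_at = minute + config.restart_delay_min

        if restart_at is not None and minute >= restart_at:
            chiller_running = True

        if chiller_running:
            temp = max(start_temp_c, temp - cool_rate_c_per_min)
        else:
            temp += heat_rate_c_per_min

        if temp >= shutdown_temp_c:
            return minute  # thermal runaway reached the shutdown threshold

    return None


safe = simulate_data_hall(ChillerConfig(auto_restart=True, restart_delay_min=10))
unsafe = simulate_data_hall(ChillerConfig(auto_restart=False, restart_delay_min=10))
print(safe)    # None: the hall warms briefly, then recovers
print(unsafe)  # 36: the hall reaches the shutdown threshold at minute 36
```

The only difference between the two runs is the `auto_restart` flag. Whether that value is wrong because of a design oversight or because someone changed it on purpose, the physical outcome is the same.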
The OT infrastructure in a datacenter must be protected against both cyber and physical threats. DeNexus has developed the DeRISK platform for Datacenters to help quantify the potential financial impact of a cyber or physical security event causing downtime, equipment damage, and other losses.
If you want to learn more, get in touch with our team, or understand how the above is put to use to quantify and manage cyber risks at 250+ industrial sites monitored by DeNexus, you can contact us at https://www.denexus.io/contact.