Data Center Criticality Levels
September 01, 2006
Appeared in Energy & Power Management
No matter the nature of the business, the threat of possible downtime is a concern across virtually every enterprise. While each unexpected downtime event has, at minimum, some indirect effect on the critical mission of all organizations, the potential costs related to a downtime event can vary significantly from business sector to business sector.
For example, a typical call center can often recover quickly from an unplanned interruption, in minutes or hours. The cost of the downtime event may therefore amount to no more than the lost hours of production time, added overtime, and some failed IT components. Conversely, a financial institution that experiences the same unplanned interruption of IT services may take days or even weeks to recover. The cost of that same brief downtime event can be devastating in terms of direct financial losses, as well as indirect costs associated with client dissatisfaction, brand damage, and potential loss of future business. Because the negative effects of downtime vary so dramatically from one organization to the next, it is increasingly important to account for these variances when classifying the availability and reliability of each critical facility that serves a particular business entity.
More than 10 years ago, the Uptime Institute developed a tier system to measure data center reliability. The system used four tiers, with Tier IV representing the highest level of reliability. It laid a foundation of common terminology for discussing critical facility reliability. Despite the widespread acceptance of this system, the arrival of high-density computing has altered many of the requirements for mission-critical data centers. As modern data centers come to rely on more complex and sensitive infrastructures, the need for an updated and more comprehensive classification system has arisen.
In response to this growing need, Syska Hennessy Group conceived a Criticality Levels system, which itemizes and evaluates a broad list of components that are crucial to the reliability of the critical facility. This concept embraces a broad array of infrastructure systems and related elements to arrive at a meaningful and relevant assessment of facility reliability levels. The Criticality Level determination process incorporates a wide range of relevant factors including HVAC and electrical systems, facility security, IT infrastructure maintenance and operations, and disaster preparedness among others. All of these elements will ultimately come to impact the true reliability of the facility.
This new classification system represents a natural evolution of how the industry has previously addressed critical facility reliability. It also aims to open a dialogue about what the true requirements of critical facility reliability are today and how to incorporate them adequately into planning assessments. Armed with a more comprehensive classification system, facility managers and corporate facility leaders will be better equipped to articulate to senior management, contractors, and consultants the elements and best practices to implement and follow. This will help ensure the optimum amount of redundancy and reliability for the facility’s critical mission, while serving the needs of the organization’s client base.
The Criticality Level concept captures the differences in expected availability and reliability among sites that are designed, constructed, commissioned, maintained, and operated at different priority levels. A facility’s overall criticality is a subjective assessment based on several key parameters. It evaluates not only how robustly and redundantly the facility is designed, constructed, and commissioned, but also how well it is maintained and operated. To accomplish this, the facility is broken down into a number of key components, subsystems, and processes. These elements are analyzed based on industry sector experience and then compared with global best practices. As a result, each component, configuration, subsystem, integrated system, or process is evaluated with respect to the areas in which the industry is currently experiencing potential or actual downtime incidents.
Components Affecting Criticality
Years ago, power failures were the primary contributor to unexpected downtime. However, as the critical facility has evolved, the industry has addressed power usage and backup to the point where power failures have largely been mitigated as a core concern. For the new generation of data centers, a number of new factors have emerged that now require significant attention.
Overheating is perhaps the most significant among these factors. While slimmer, faster servers facilitate the delivery of higher density data centers, their propensity to consume greater amounts of energy and give off extraordinarily high levels of heat has stoked the demand for more efficient cooling solutions. In order to sufficiently address these new concerns, it is crucial to evaluate the critical facility’s design in terms of its ability to be highly redundant, simple, and flexible. This assessment must be done in order to circumvent potential technological error and deal effectively with maintenance challenges.
Balance is also an important component of achieving maximum reliability. For example, without balance across all key parameters of the critical facility, a highly robust, redundant, and expensive MEP support system may suffer the same failure rate as a simple and inexpensive system if the operating staff is not properly trained and equipped with good procedures, or if budgets do not allow for proper maintenance and re-commissioning. In order to best capture this balance, a number of components must be evaluated with regard to their influence on a facility’s Criticality Level. Some of the key components include:
- Capacities including utility, standby and UPS power, and cooling. A facility with excess installed capacity that can readily accept an increase in critical load is more reliable than the same facility operated near, at, or in excess of its redundant design load.
- Physical expansion capability. If a facility will eventually need to expand its support equipment capacities, its initial designs should be implemented with future growth and expansion in mind. Depending on the facility’s Criticality Level, its expansion should have little impact on ongoing IT operations and should have minimal effect on current and future availability. While this is a logical strategy, it is also a tremendous challenge because the facility designer must predict the IT support needs for as many as 10 years in the future.
- Load power and heat density. Higher density in watts per square foot or watts per rack increases a facility’s complexity. Raised floor and ceiling heights have become limitations for high power densities. High load densities also make component failures less forgiving, especially those involving cooling and air distribution. As another example, chilled water that is employed above, alongside, or inside of server racks for high-density loads introduces a new potential point of failure. Leak detection that shuts down all localized cooling and/or shuts down power to a cabinet of servers may be implemented, but this creates new challenges for IT system operations and redundancies.
- Size of the critical portion and scalability of facility infrastructure. While it is common knowledge that larger facilities are more complex than smaller ones, the ability to scale equipment accordingly in these facilities is often limited. This frequently leaves very large facilities with numerous small support systems, which in effect become several smaller facilities under one roof that are less efficient and less reliable than a single larger one.
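The capacity and density items in the list above lend themselves to a simple numerical check. The following Python sketch is illustrative only: the CapacityCheck record, its field names, and the 90 percent "near" threshold are assumptions made for this example and are not part of the Criticality Levels system.

```python
from dataclasses import dataclass


@dataclass
class CapacityCheck:
    """Hypothetical record; the field names are illustrative assumptions."""
    critical_load_kw: float          # measured critical (IT) load
    redundant_design_load_kw: float  # load the facility can carry with redundancy intact
    critical_area_sqft: float        # floor area of the critical space

    def utilization(self) -> float:
        """Fraction of the redundant design load currently in use."""
        return self.critical_load_kw / self.redundant_design_load_kw

    def headroom_kw(self) -> float:
        """Additional critical load that can be accepted without losing redundancy."""
        return self.redundant_design_load_kw - self.critical_load_kw

    def density_w_per_sqft(self) -> float:
        """Load power density in watts per square foot."""
        return self.critical_load_kw * 1000.0 / self.critical_area_sqft


def classify_loading(check: CapacityCheck, near_threshold: float = 0.9) -> str:
    """Label the loading condition; the 0.9 threshold is an assumed value."""
    u = check.utilization()
    if u > 1.0:
        return "in excess of redundant design load"
    if u >= near_threshold:
        return "near or at redundant design load"
    return "headroom available"
```

With these assumed inputs, a facility carrying 400 kW against a 500 kW redundant design load over 5,000 square feet would report 100 kW of headroom and a density of 80 watts per square foot. An actual assessment would weigh many more factors than this single ratio.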
The Criticality Levels
C1 describes a basic critical facility supporting local office processes that are routine and are not backed up regularly. The loss of availability is roughly equivalent to the loss of local productivity, and the facility can achieve simple and rapid recovery from unplanned downtime.
C2 facilities support critical business processes that are both local and remote. The data/telecom support may be more critical than the process it supports, such as a call center, or equal to or less critical than the process it supports, such as a trading floor. The loss of availability for a C2 facility could widely affect productivity, with full recovery after momentary unplanned downtime potentially taking hours. Maintenance downtime can be regularly scheduled.
C3 encompasses back-up corporate facilities supporting and/or including critical business processes. The loss of availability widely affects productivity and directly affects customers, with full recovery after momentary unplanned downtime taking hours or even days. Maintenance during low-risk windows can be scheduled monthly or quarterly.
C4 facilities are primary corporate facilities that support and/or include critical business processes. As with C3 facilities, the loss of availability widely affects productivity and directly affects customers, with full recovery after momentary unplanned downtime taking hours or even days. Online maintenance with moderate-risk windows can be scheduled monthly or quarterly with blackout periods, and maintenance shutdowns are extremely difficult to schedule.
C5 describes a primary corporate facility supporting and/or including core business processes. The loss of availability directly translates to the facility’s bottom line, with full recovery after momentary unplanned downtime taking days to possibly weeks. Online maintenance with low-risk windows can be scheduled quarterly or annually with blackout periods. Maintenance shutdown cannot be scheduled.
C6 facilities are primary, large corporate data centers supporting and/or including core business processes, and they typically operate as a network of remote data centers that work together. The loss of availability can have widespread consequences that affect national security and public safety. Full recovery after momentary downtime can take weeks to months, and all maintenance must be performed online and must be extremely low-risk.
Levels C7-C10 currently do not exist. These future levels constitute a key element of the Criticality Levels in that they consciously recognize the continuing growth and evolution of the critical facility industry and our need to communicate effectively with each other about these facilities. The classification system also allows for the evaluation of critical facilities beyond a C10 designation.
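For readers who want to work with the level definitions programmatically, the descriptions above can be condensed into a small lookup table. The Python sketch below is one possible encoding, not an official schema: the LevelSummary type, the describe function, and the wording of each string are assumptions that paraphrase the text. C1 carries no maintenance entry because none is stated for it.

```python
from typing import NamedTuple, Optional


class LevelSummary(NamedTuple):
    """Hypothetical summary record paraphrasing the level descriptions."""
    facility: str
    recovery: str
    maintenance: Optional[str]  # None where no maintenance guidance is stated


LEVELS = {
    1: LevelSummary("basic facility supporting routine local office processes",
                    "simple and rapid", None),
    2: LevelSummary("facility supporting local and remote critical business processes",
                    "hours", "downtime can be regularly scheduled"),
    3: LevelSummary("back-up corporate facility for critical business processes",
                    "hours or even days",
                    "low-risk windows, monthly or quarterly"),
    4: LevelSummary("primary corporate facility for critical business processes",
                    "hours or even days",
                    "online, moderate-risk windows, monthly or quarterly with blackout periods"),
    5: LevelSummary("primary corporate facility for core business processes",
                    "days to possibly weeks",
                    "online, low-risk windows, quarterly or annually with blackout periods"),
    6: LevelSummary("large primary data centers, typically a cooperating network",
                    "weeks to months",
                    "online only, extremely low-risk"),
}


def describe(level: int) -> LevelSummary:
    """Return the summary for C1 through C6; higher levels are not yet defined."""
    if level < 1:
        raise ValueError(f"C{level} is not a valid Criticality Level")
    if level not in LEVELS:
        raise LookupError(f"C{level} is reserved for future use and not yet defined")
    return LEVELS[level]
```

Raising a distinct error for C7 and above mirrors the point that those levels are anticipated but not yet specified, so a caller can tell an undefined future level apart from an invalid one.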
Defining the Criticality Level of a specific facility then enables the facility designer to classify the importance and level of quality for an array of external components. This includes components such as standby power, UPS configuration, IT configuration, EPO systems, security, and IT redundancy. For example, when evaluating the security component, a C1 facility would require only a locked room, while a C6 facility would require a highly secure and controlled environment featuring surveillance cameras, 24-hour security personnel and biometric access control. By comprehensively assessing these components in relation to the Criticality Level of the facility, designers can plan the facility with the needed flexibility, redundancy, and security for the organization’s critical mission.
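The security example above can be sketched as a sparse requirements table that is compared against what a facility actually has installed. This Python fragment is a hedged illustration: the SECURITY_REQUIREMENTS table and the missing_security_measures function are hypothetical names, and only the C1 and C6 entries are filled in because those are the only two cases described. Requirements for C2 through C5 are deliberately left out rather than guessed.

```python
# Only the two endpoints described in the text are recorded here.
# Entries for C2 through C5 are intentionally absent (not stated in the text).
SECURITY_REQUIREMENTS = {
    1: frozenset({"locked room"}),
    6: frozenset({
        "surveillance cameras",
        "24-hour security personnel",
        "biometric access control",
    }),
}


def missing_security_measures(level, installed):
    """Return the required measures for the level that are not yet installed."""
    if level not in SECURITY_REQUIREMENTS:
        raise KeyError(f"no security requirement recorded for C{level}")
    return SECURITY_REQUIREMENTS[level] - set(installed)
```

For instance, a C6 facility that has only surveillance cameras in place would be reported as missing 24-hour security personnel and biometric access control. The same pattern could be extended to other components such as standby power or UPS configuration once their per-level requirements are defined.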
As previously discussed, flexibility is a key element in the delivery of a reliable and expandable critical facility. It is also one of the cornerstones of a professional’s ability to communicate about such facilities using the Criticality Level definitions. The definitions are useful for the design of a broad array of critical facilities, each with its own specific function. They also allow for further levels of criticality beyond what is currently defined, in light of new innovations and the continued importance of data centers in today’s information-driven business environment.
Jerry Burkhardt and Richard W. Dennis, P.E.
Jerry Burkhardt, P.E., vice president of commissioning services at Syska Hennessy Group, has over 21 years of specialized experience in critical and live environments. He has dedicated the past 17 years of his career to critical facility consulting and has been actively involved in the planning, design, commissioning, and operations of data centers, trading floors, and call centers. His data center experience has focused on both new buildings and existing facility upgrades to increase uptime reliability and integrate new technology. Burkhardt currently leads a team on the West Coast dedicated to commissioning and operations of critical support systems.
Richard W. Dennis, P.E., vice president of national commissioning services for Syska Hennessy Group, has over 30 years of professional engineering experience. Dennis is responsible for managing, overseeing, and growing Syska Hennessy’s commissioning activities and capabilities nationwide. He has an extensive engineering background, which includes the management of design, construction, and facility operations. He has successfully managed projects including critical system upgrades and commissioning activities for many Fortune 500 companies and government agencies.