Peter M. Curtis

Maintaining Mission Critical Systems in a 24/7 Environment



that are motivated to plug into the Information Age require reliability and flexibility, whether they are large Fortune 500 corporations or small companies serving global customers. This is the reality of conducting business today. Whatever their line of business, many organizations have realized that 24/7 operation is imperative. An hour of downtime can wreak havoc on project schedules or cause the loss of critical information, resulting in hours lost re‐keying electronic data, not to mention the potential loss of millions of dollars.

      Twenty‐five years ago, the facilities manager (FM) was responsible for the integrity of the building. As long as the electrical equipment worked 95% of the time, the FM was doing a good job. When there was downtime, the cause was usually a computer fault. As technology improved on both the hardware and software fronts, information technology (IT) departments began to design their hardware and software systems with redundancy, including dual‐corded equipment (either the A or the B power source can carry the full IT equipment load). As a result of IT’s efforts, computer systems have become so reliable that they are down only during scheduled upgrades.

      Minimizing unplanned downtime reduces risk, but unfortunately, the most common approach is reactive: spending time and resources to repair a faulty piece of equipment after it has failed. Strategic planning can identify internal risks and provide a prioritized plan for reliability improvements. Only when both ends fully understand the potential risk of outages, including recovery time, can they fund and implement an effective plan. Because the costs associated with reliability enhancement are significant, sound decisions can be made only by quantifying the performance benefits and weighing the options against their respective risks.

      Planning and careful implementation will minimize disruptions while making the business case to fund capital improvements and maintenance strategies. When the business case for additional redundancies, consultants, and ongoing training reaches the boardroom, the entire organization can be galvanized to prevent catastrophic data losses, damage to capital equipment, and even danger to life safety.

      Figure 3.1 “Seven steps” is a continuous cycle of evaluation, implementation, preparation, and maintenance

      (Source: Courtesy of PMC Group One, LLC)

% Uptime/Reliability Level    Downtime Per Year
99%                           87.6 hours
99.9%                         8.76 hours
99.99%                        52 minutes
99.999%                       5.25 minutes
99.9999%                      32 seconds
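
The downtime figures in the table follow from simple arithmetic: the fraction of the year that is not uptime, multiplied by the hours in a year. The sketch below reproduces that calculation in Python. It assumes a 365‐day year (8,760 hours), and the function name is illustrative rather than taken from the book. The computed values match the table to within rounding.

```python
# Hours in a non-leap year; assumed here as the basis for the table's figures.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


# Print the implied downtime for each reliability level listed in the table.
for uptime in (99.0, 99.9, 99.99, 99.999, 99.9999):
    hours = downtime_hours_per_year(uptime)
    print(
        f"{uptime}% uptime: {hours:.4f} hours "
        f"= {hours * 60:.2f} minutes = {hours * 3600:.1f} seconds"
    )
```

Note that this arithmetic counts only the time the system is unavailable; it says nothing about how long recovery takes afterward.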

      In order to design a building with the appropriate level of reliability, a company must first assess the cost of downtime and determine its associated risk tolerance. Because recovery time is now a significant component of downtime, downtime can no longer be equated with simple power availability, whether measured as one nine (90%) or six nines (99.9999%). Today, recovery typically takes many times longer than the outage itself, since operations have become much more complex. A shut‐down IT infrastructure backbone must be restored in a specific sequence so that IT equipment comes back online quickly and with limited communication conflicts. Simply turning the IT equipment back on does not work with today’s complex IT systems. Is a 32‐second outage really only 32 seconds? Or is it perhaps 2 hours or 2 days? The real question is: How long does it take to fully recover from the 32‐second outage and return to normal operational status? Although measuring in terms of nines has its limitations, it remains a useful measure. For a 24/7 facility:

      In new 24/7 facilities, it is imperative not only to design and integrate the most reliable systems, but also to keep them simple. When there is a problem, the facilities manager is under enormous pressure to isolate the faulty system without disrupting any critical electrical loads and does not have the luxury of time for complex switching procedures during a critical event. An overly complex system can be a quick recipe for failure via human error if the key personnel who understand how the system functions are unavailable. When designing a critical facility, it is important that the building design does not outsmart the facilities manager. Companies can also maximize profits and minimize cost by using the simplest design approach possible or by integrating automatic, “self‐healing” controls that recover from a failure. One prevalent example is the current use of Static Transfer Switches (STSs), discussed in a later chapter. The STS automatically switches critical equipment from one power source to another within milliseconds.

      (Source: Data from Information Technology Intelligence Consulting).

Industry                  Average Cost per Hour in 2017
Energy                    $22,321,000
Brokerage                 $9,300,000
Media                     $9,000,000
Manufacturing             $8,500,000
Health Care               $6,900,000
Retail                    $6,600,000
Telecommunications        $4,800,000
Credit Card Operations    $3,100,000
Human Life                “Priceless”

      * Prepared by a disaster‐planning consultant of Contingency Planning Research
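
To see why recovery time matters more than the raw length of an outage, the hourly figures in the table can be combined with an outage duration and a recovery duration. The sketch below is a rough illustration only. It assumes that cost accrues linearly at the table's hourly rate during both the outage and the recovery period, which is a simplification; the function name and the 2‐hour recovery scenario are hypothetical.

```python
# Average cost per hour of downtime in US dollars, copied from the table (2017 figures).
COST_PER_HOUR = {
    "Energy": 22_321_000,
    "Brokerage": 9_300_000,
    "Media": 9_000_000,
    "Manufacturing": 8_500_000,
    "Health Care": 6_900_000,
    "Retail": 6_600_000,
    "Telecommunications": 4_800_000,
    "Credit Card Operations": 3_100_000,
}


def outage_cost(industry: str, outage_seconds: float, recovery_hours: float) -> float:
    """Estimate the cost of an outage plus its recovery period.

    Assumption: cost accrues linearly at the industry's hourly rate for the
    whole time until normal operation resumes.
    """
    total_hours = outage_seconds / 3600 + recovery_hours
    return COST_PER_HOUR[industry] * total_hours


# Hypothetical scenario: a 32-second outage, with and without 2 hours of recovery.
power_only = outage_cost("Brokerage", outage_seconds=32, recovery_hours=0)
with_recovery = outage_cost("Brokerage", outage_seconds=32, recovery_hours=2)
print(f"Outage alone:         ${power_only:,.0f}")
print(f"Outage plus recovery: ${with_recovery:,.0f}")
```

Under these assumptions, the recovery period rather than the 32‐second interruption accounts for nearly all of the estimated cost.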

      Imagine that you are the manager