Peter M. Curtis

Maintaining Mission Critical Systems in a 24/7 Environment



de‐certification.

      Technology is driving itself faster than ever. Large investments are made in new technologies to keep up to date with advancements, yet industries are still faced with operational challenges. One possible reason is the limited training provided to employees operating the mission‐critical equipment. Employee certification is crucial not only to keep up with advanced technology but also to promote quick emergency response and situational awareness. In the last few years, technologies have been developed to solve the technical problem of linkage and interaction of equipment, but these solutions fall short without well‐trained personnel. How can we confirm that an employee meets the complex requirements of the facility to ensure high levels of reliability?

      The key is to benchmark the facility on a routine basis with the goal of identifying performance deviations from the original design specifications. Done properly, this provides an early warning mechanism that allows potential failures to be addressed and corrected before they occur. Once deficiencies are identified, and before any corrective action can be taken, a Method of Operation (MOP) must be written. The MOP will clearly stipulate step‐by‐step procedures and conditions, including who is to be present, the documentation required, phasing of work, and the state in which the system is to be placed after the work is completed. The MOP will greatly minimize errors and potential system downtime by identifying the responsibility of vendors, contractors, the owner, the testing entity, and anyone else involved. In addition, a program of ongoing operational staff training and procedures is important to deal with emergencies outside of the regular maintenance program.

      The most important aspect of benchmarking is that it is a process driven by the participants whose goal is to improve their organization. It is a process through which participants learn about successful practices in other organizations and then draw on those cases to develop solutions most suitable for their own organizations. True process benchmarking identifies the “hows” and “whys” of performance gaps and helps organizations learn and understand how to perform with higher standards of practice. Keep in mind that you can’t improve if you don’t measure and benchmark.

      What are some attributes of mission‐critical engineers? Well, mission‐critical engineers are never complacent; they are always organized and prepared, are always creative, and are always looking to improve. They are always observing their surroundings with all their senses, always looking for deficiencies and always ready to take action. A mission‐critical engineer doesn't stop after the first try. Mission‐critical engineers understand the importance of their positions and how their employers impact the public. They entered this industry to contribute to society. They are ethical, share their knowledge, and strive to motivate others.

      I've been a mission‐critical engineer for close to 30 years and am still puzzled by some things. We all know what an investment of $500 million buys. We invest this money because we think we are buying reliability and business resiliency. After this kind of investment, we are enamored with the infrastructure, and we feel confident that it will perform as designed when called upon.

      Are we falling short in fields that require this type of intolerance for error? As we are already aware, human error causes approximately 60 percent of all downtime experienced by mission‐critical facilities. This number is far too high. Today there are a growing number of DCIM tools that can help reduce downtime, but we are just beginning to scratch the surface in moving toward a significant reduction in downtime. We are still many years away from that goal of 'zero downtime.' There have been many recent examples of human error that have caused fatalities:

       The crash of Air France Flight 447 that killed 228 people due to a lack of pilot training in surprise situations.

       The head‐on collision of a Metrolink train near Chatsworth, CA, which was probably caused by an engineer who was texting; 25 people were killed and 135 injured.

       The actions of the Costa Concordia captain before and after the collision that led to the death of 32 passengers.

       The crash of Colgan Flight 3407, operated under Continental Airlines, which killed 49 people in the suburbs of Buffalo.

      Either character flaws or a lack of training played a role in each of these disasters. All could have been avoided if the right people had been in these positions.

      Beyond these man‐made disasters, we have natural disasters that are even more difficult to cope with. In the wake of Superstorm Sandy, we are once again reminded of how vulnerable our country's infrastructure is and how large‐scale disasters and catastrophes can produce extended downtime.

      Sandy left millions without power in the tri‐state area, causing untold chaos and the worst gasoline shortages since the 1970s. There are many ways to defend against these disruptions, from ensuring that refineries have appropriate standby power to building microgrids designed to support the critical infrastructure vital to how we live digitally today. How can we expeditiously improve? The critical infrastructure of our country is not something to be left so unprotected. It deserves to be as robust as any mission‐critical industry in this country, given its importance to health and safety as well as our financial system.

      We are the new mission control, and we need to take a page out of the nuclear, aviation, and first‐responder industries to bridge the gap from a 60 percent human error rate to a statistic that approaches zero. There is a lot of collaboration and work to do. How do we make this industry a profession? How do we develop the right character? How do we ensure continuous improvement? Having a college degree or mastering a trade is only part of the equation. What programs do we need to develop in our industry?

      Every day, industries are becoming increasingly dependent on continuous business operations. As a result, companies need to understand the level of reliability that they can supply to their customers and evaluate how this can be either improved or maintained. The following chapters will reinforce the concept that reliability and resiliency are dependent on