Maintaining Mission Critical Systems in a 24/7 Environment. Peter M. Curtis.

de‐certification.

Technology is driving itself faster than ever. Large investments are made in new technologies to keep up to date with advancements, yet industries are still faced with operational challenges. One possible reason is the limited training provided to employees operating the mission critical equipment. Employee certification is crucial not only to keep up with advanced technology but also to promote quick emergency response and situational awareness. In the last few years, technologies have been developed to solve the technical problem of linking equipment and making it interact, but technology alone cannot deliver reliability without well‐trained personnel. How can we confirm that the employee meets the complex requirements of the facility to ensure high levels of reliability?

1.11 Standards and Benchmarking

The past decade has seen wrenching change for many organizations. As firms and institutions have looked for ways to survive and remain profitable, a simple but powerful change strategy called “benchmarking” has become popular. The underlying rationale for the benchmarking process is that learning by example and from best‐practice cases is the most effective means of understanding the principles and the specifics of effective practices. Recovery and redundancy together cannot provide sufficient resiliency if they can be disrupted by a single unpredictable event. A mission critical data center must be able to endure hazards of nature, such as earthquakes, tornadoes, floods, and other natural disasters, as well as human‐made events. Great care should be taken to protect critical functions so that downtime is minimized. Standards should be established with guidelines and mandatory requirements for continuity of business applications. Procedures should be developed for the systematic sharing of safety‐ and performance‐related material, best practices, and standards.

The key is to benchmark the facility on a routine basis with the goal of identifying performance deviations from the original design specifications. Done properly, this will provide an early warning mechanism that allows potential failures to be addressed and corrected before they occur. Once deficiencies are identified, and before any corrective action can be taken, a Method of Operation (MOP) must be written. The MOP will clearly stipulate step‐by‐step procedures and conditions, including who is to be present, the documentation required, phasing of work, and the state in which the system is to be placed after the work is completed. The MOP will greatly minimize errors and potential system downtime by identifying the responsibility of vendors, contractors, the owner, the testing entity, and anyone else involved. In addition, a program of ongoing operational staff training and procedures is important to deal with emergencies outside of the regular maintenance program.
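The routine comparison of measured performance against original design specifications can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the parameter names, design values, tolerances, and readings are hypothetical and are not taken from the book or from any real facility. The sketch only shows the idea of flagging a reading that drifts outside an allowed band around its design value, and of treating a missing measurement as something to investigate rather than as a pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignSpec:
    """Design value for one monitored parameter (all values hypothetical)."""
    name: str
    design_value: float   # must be nonzero; used as the baseline for percent deviation
    tolerance_pct: float  # allowed deviation from design, in percent


def find_deviations(specs, readings):
    """Return (name, deviation_pct) pairs for parameters that need attention.

    A parameter with no reading is reported with deviation None so that a
    missing measurement is never silently treated as a pass.
    """
    flagged = []
    for spec in specs:
        measured = readings.get(spec.name)
        if measured is None:
            flagged.append((spec.name, None))
            continue
        deviation_pct = (measured - spec.design_value) / spec.design_value * 100.0
        if abs(deviation_pct) > spec.tolerance_pct:
            flagged.append((spec.name, round(deviation_pct, 1)))
    return flagged


# Hypothetical design specifications and one round of benchmark readings.
specs = [
    DesignSpec("ups_output_voltage_v", 480.0, 2.0),
    DesignSpec("chilled_water_supply_temp_f", 45.0, 5.0),
    DesignSpec("generator_start_time_s", 10.0, 20.0),
]
readings = {
    "ups_output_voltage_v": 478.0,
    "chilled_water_supply_temp_f": 49.5,
    # no reading recorded for generator_start_time_s
}

deviations = find_deviations(specs, readings)
for name, pct in deviations:
    status = "no reading" if pct is None else f"{pct:+.1f}% from design"
    print(f"{name}: {status}")
```

In this sketch the chilled water temperature is flagged because it sits outside its tolerance band, and the generator start time is flagged because it was never measured. In practice, each flagged item would be a candidate for a written MOP before any corrective work begins.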

The most important aspect of benchmarking is that it is a process driven by the participants whose goal is to improve their organization. It is a process through which participants learn about successful practices in other organizations and then draw on those cases to develop solutions most suitable for their own organizations. True process benchmarking identifies the “hows” and “whys” behind performance gaps and helps organizations learn and understand how to perform with higher standards of practice. Keep in mind that you can’t improve if you don’t measure and benchmark.

1.12 What is a Mission Critical Engineer

What are some attributes of mission‐critical engineers? Well, mission‐critical engineers are never complacent; they are always organized and prepared, are always creative, and are always looking to improve. They are always observing their surroundings with all their senses, always looking for deficiencies and always ready to take action. A mission‐critical engineer doesn't stop after the first try. Mission critical engineers understand the importance of their positions and how their employers impact the public. They entered this industry to contribute to society. They are ethical, share their knowledge, and strive to motivate others.

I've been a mission‐critical engineer for close to 30 years and am still puzzled by some things. We all know what an investment of $500 million buys. We invest this money because we think we are buying reliability and business resiliency. After this kind of investment, we are enamored with the infrastructure, and we feel confident that it will perform as designed when called upon.

Among the industries that have zero tolerance for error, the ones that stand out are aviation, rail, nuclear power plants, and, of course, NASA. You can call these industries “mission control” type industries, where error can lead to catastrophes, cascading failures, and loss of life, money, and reputation.

Are we falling short in fields that require this type of intolerance for error? As we are already aware, human error causes approximately 60 percent of all downtime experienced by mission‐critical facilities. This number is far too high. Today there are a growing number of DCIM tools that can help reduce downtime, but we are just beginning to scratch the surface in moving toward a significant reduction in downtime. We are still many years away from that goal of 'zero downtime.' There have been many recent examples of human error that have caused fatalities:

The crash of Air France Flight 447 that killed 228 people due to a lack of pilot training in surprise situations.

The head‐on collision of a Metrolink train near Chatsworth, CA, which was probably caused by an engineer who was texting and in which 25 people were killed and 135 injured.

The actions of the Costa Concordia captain before and after the collision that led to the death of 32 passengers.

The crash of Colgan Flight 3407, operated under Continental Airlines, which killed 49 people in the suburbs of Buffalo.

Either character flaws or a lack of training played a role in each of these disasters. All could have been avoided if the right people had been in these positions.

Beyond these man‐made disasters, we have natural disasters that are even more difficult to cope with. In the wake of Superstorm Sandy, we are once again reminded of how vulnerable our country's infrastructure is and how large‐scale disasters and catastrophes can produce extended downtime.

Sandy left millions without power in the tri‐state area, causing untold chaos and the worst gasoline shortages since the 1970s. There are many ways to defend against these disruptions, from ensuring that refineries have appropriate standby power to building microgrids designed to support the critical infrastructure vital to sustaining how we live digitally today. How can we expeditiously improve? The critical infrastructure of our country should not be left so unprotected. It deserves to be as robust as any mission‐critical industry in this country, given its importance to health and safety as well as to our financial system.

The issues surrounding Superstorm Sandy and the associated impact on transportation (auto, air, and rail) crippled Manhattan and some of New York City's suburbs for days. Although everybody did the best they could under the circumstances and the first responders deserve accolades, there is no doubt that the effects could have been mitigated with better disaster planning and associated coordination and an inventory of the right assets. The transformation of this industry must start with the workers. But they need the right tools to be successful, and this is where management comes in. The engineers and technicians are the foundations of success for this industry. Where do we get them? How do we train them?

We are the new mission control, and we need to take a page out of the nuclear, aviation, and first‐responder industries to bridge the gap from a 60 percent human‐error rate to a statistic that approaches zero. There is a lot of collaboration and work to do. How do we make this industry a profession? How do we develop the right character? How do we ensure continuous improvement? Having a college degree or mastering a trade is only part of the equation. What programs do we need to develop in our industry?

1.13 Conclusion

Every day, industries are becoming increasingly dependent on continuous business operations. As a result, companies need to understand the level of reliability that they can supply to their customers and evaluate how this can either be improved or maintained. The following chapters will reinforce the concept that reliability and resiliency are dependent on
