Upgraded approaches for "safe" and "reliable" COTS designs

Technology buyers and users are so ingrained with the concept of “safe and reliable” operation that it is easy to perceive those terms as being interchangeable. But while some safety-critical or mission-critical applications – such as avionics or rail transit – do need to perform under extreme temperature, shock, vibration, or exposure to dust and moisture, it is important to design and evaluate system safety and reliability independently. Fortunately, evolving COTS hardware and software architectures and design strategies are making it easier for systems developers to incorporate both “safe” and “reliable” performance into their embedded computing systems.

Defining safety terms and requirements

When it comes to evaluating equipment safety, not all failures are equal. A “safe” system is one with a determined error behavior that can be accommodated in the application without a negative result.

The two basic modes of error behavior are fail-safe and fail-operational. A fail-safe application in a medical or an industrial-process environment is significantly different from the fail-operational demands of an avionics or nuclear power application. For example, in certain industrial or medical systems where human supervision is available to respond to problems, it is acceptable for systems to be designed as “fail-safe” (i.e. eliminate danger by shutting down when they fail).

But in applications where immediate shut-down is unacceptable – such as nuclear control systems or aircraft flight control – systems must be fault-tolerant or “fail-operational” (i.e. continue to function even if specific components fail).

Each industry has its own major safety standards relating to the specific requirements of its applications. These include DO-178B and DO-254 for airborne systems, the EN50126/50128/50129 standards for railway applications, and IEC 61508 for industrial applications. The IEC standard even specifies separate Safety Integrity Level (SIL) ratings for systems operating in low-demand mode versus those operating in high-demand or continuous mode, and also has offshoot standards relating to sector-specific industrial applications.

Different standards and terminology exist to define the reliability of components and systems. For example, the semiconductor vendor typically deals in Failures In Time (FIT), while equipment manufacturers use the term Mean Time Between Failure. However, whichever terminology is used, it is important to calculate the cumulative effect of each individual component’s rate of failure in order to arrive at a reliability rating for the overall system (Figure 1).

**Figure 1:** The overall safety of a particular design must take into account both safe faults and dangerous faults. COTS equipment with redundant components and ECC facilitates safe and reliable designs.

Designing for safety-critical applications

Complex safety-critical architectures involve not only the hardware and software, but also the whole development process and even the tools used. On top of that, system designers are constantly pressured to achieve an optimum balance among the competing demands of cost, space/weight constraints, power consumption, system compatibility, and life cycle.

Fortunately, as COTS hardware continues to evolve for embedded system applications, new solutions are being developed to address ever higher levels of performance – for safety in mission-critical/life-critical applications as well as for reliability in hostile/harsh operating environments. These include efficient SBCs with integrated triple-redundant processing and extensive Built-In Test Equipment (BITE) features.

Safety-critical SBC case study

A 6U VMEbus SBC can be designed with multiple hardware and software features to provide triple-redundant functionality to minimize opportunities for common-cause errors. This includes using 3x redundant PowerPC processors, 3x redundant 512 MB DDR RAM, and 3x redundant Power Supply Units (PSUs) on a single board, for example. Such an SBC can also use an FPGA design to manage the redundant processors and memory in a way that can take advantage of software designed for a single CPU board, thereby minimizing development time, effort, and cost.

A 2-out-of-3 majority-rule voting regimen can be implemented via the FPGA, so that if one processor/memory/power-supply channel experiences a fault, the system follows the result of the remaining two processor channels in agreement. The board’s three processors could feature clock speeds up to 900 MHz in lock-step mode with software assisted resynchronization; this assures that each CPU performs the same function at the same time. The triple-redundant memory would automatically correct hardware faults and Single Event Upsets (SEUs) caused by cosmic radiation.

Basing the SBC on FPGA technology would also deliver the flexibility to manage I/O functions for an RS-232 interface and other I/O such as UARTs, an I2C bus, and PMC slots, without altering the board’s layout. These features would enhance long-term availability, support extended-temperature operation, and provide redundant flash memory with ECC to compensate for SEUs such as those experienced in avionics applications. Integrated BITE features could add further value for monitoring operation and detecting latency errors before they lead to a system error. One board that provides these reliability capabilities is the A602 PowerPC SBC from MEN Mikro Elektronik (Figure 2).

**Figure 2:** MEN Mikro Elektronik’s A602 PowerPC-based 6U VME SBC

Expanding ruggedization strategies for long-term reliability

While there are proven techniques to mitigate environmental concerns, it is usually better to take a proactive approach that minimizes problems through the initial design. Thermal performance in addition to shock and vibration are paramount considerations.

Thermal performance

To reduce the need for add-on thermal management strategies, a developer might first look for low-power hardware that can minimize heat generation and the need for remedial thermal designs. For high-performance applications, this can mean using multicore processors with large caches.

Multicore processors can provide excellent performance without the high clock speed and high power consumption normally required to achieve comparable performance in a single-core CPU. Large caches further enhance efficiency by keeping frequently used data in low-latency memory. The net result is less heat and greater performance per watt plus longer battery life, especially useful in mobile applications.

Where cooling is needed, convection cooling – guiding airflow along the surface to be cooled – is often the easiest method. While mechanical setup can be simple, cooling fans do have a limited lifetime – even fans with claims of over 100,000 hours of MTBF. And the constant flow of air can introduce dust and moisture into a system, although conformal coatings can be used in those instances to improve the degree of component protection.

By contrast, sealed conduction-cooled designs provide greater flexibility than convection, particularly for extreme conditions up to +85 °C. Conduction cooling uses direct thermal contact to draw heat from the active electronics to the outer wall of the enclosure, turning the entire enclosure into a heat radiator. Suitable measures must be taken to maximize efficient thermal transfer between the electronic component being cooled and the enclosure wall.

The most practical cooling method depends on the unique operating requirements of the application. That is why it makes sense to work with a vendor who understands the application and is able to qualify systems to the appropriate standards. An experienced vendor can also provide features beyond basic cooling, such as board management controllers with integrated watchdogs that can supervise the processor and board to deliver added thermal protection.

Shock and vibration

In designs subjected to jarring effects resulting from the environmental rigors found in mobile, railway, and avionics applications, specifying boards with soldered components can provide a higher degree of reliability according to the applicable DIN, EN, or IEC industry standards. Another contributing factor to robust shock and vibration performance is the selection of connector designs with good mechanical stability. And on top of those precautionary designs, using rigorous tri-axial vibration testing combining rapid high- and low-temperature changes with repetitive shock and vibration simultaneously in all three axes provides a higher degree of confidence than standard xyz-axis vibration testing that only evaluates one axis at a time. This last factor is common in Highly Accelerated Life Test (HALT) and Highly Accelerated Stress Screening (HASS) testing.

The final analysis: Leverage suppliers for added value

Beyond the previously mentioned design and evaluation criteria, there are advantages to teaming up with experienced suppliers working under certified lab conditions. Those with in-house capabilities for environmental, EMC, and high-voltage testing can provide preliminary qualification of safe or rugged designs. These include extreme temperature and humidity testing as well as standard EMC tests including ESD, burst, and radio disturbance. Suppliers with HALT and HASS capabilities can also help identify potential weaknesses in electronic and mechanical designs.

It is also advantageous to work with a supplier who can monitor the manufacturing process through Automatic Optical Inspection (AOI) and provide 3D inspection during component placement and solder-process monitoring to prevent components from being overheated, in addition to automatic 100 percent function testing at the end of the process. Traceability of all components from incoming goods through dispatch of the finished product is another desirable trait, since parts traceability can help pinpoint solutions if problems do occur in the field. This is a must for railway, avionics, automotive, and medical applications. Finally, vendors with obsolescence management programs can also help ensure continuity of supply to keep a safe and reliable design viable for a long and productive system life cycle.

Barbara Schmitz has served as Chief Marketing Officer at MEN Mikro Elektronik GmbH since 1992. She graduated from the University of Erlangen-Nürnberg. Later, she studied business economics in a correspondence course at the Bad Harzburg business school and followed an apprenticeship in Marketing and Communications in Nuremberg. Barbara can be contacted at [email protected].

MEN Mikro Elektronik GmbH US Tel# 215-542-9575 Germany Tel# 49 (911) 99 33 5151 www.men.de