“Failure is not an option” has become synonymous with mission-critical systems and applications. Taken at face value, the statement describes an ideal in which nothing ever fails. In a complex system where any component could be the crucial link, however, being aware of and responsive to a subsystem fault can be the difference between a successful operation and a disaster.
The decision to execute a critical mission should be made only if the complete system is operating correctly. Built-In Self Test (BIST) has become integral to supporting the “failure is not an option” requirement in critical systems, including weapons, avionics, and “unattended” military applications such as remote radar and ground sensor networks.
In the past, BIST was typically reserved for integrated circuits and single board computers, but the emphasis has since shifted toward incorporating BIST into entire systems. Embedded systems typically employ myriad subcomponents, each of which can include its own detailed BIST, and the results of these individual tests combine into a BIST for the whole system. The following discussion presents examples for an I/O subsystem, but the same concept applies to any subsystem: individual subsystem-level BISTs combine to form the BIST of the integrated system. Amir explores different BIST methods and their advantages in meeting size, weight, and power (SWaP) requirements while increasing Mean Time Between Failures (MTBF) and reducing hardware/software overhead and the resulting system cost.
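As a rough illustration of how subsystem results might roll up into a system-level BIST, the Python sketch below models each unit as a node whose status is the worst of its own self-test result and its children's. The `Subsystem` and `Health` names, and the worst-case roll-up policy itself, are assumptions made for illustration; a real system may weight or filter results differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical BIST result levels, ordered so max() selects the worst."""
    PASS = 0
    DEGRADED = 1
    FAIL = 2


@dataclass
class Subsystem:
    """A hypothetical LRU, board, or module with its own BIST result."""
    name: str
    status: Health = Health.PASS  # result of this unit's own self-test
    children: list[Subsystem] = field(default_factory=list)

    def bist(self) -> Health:
        """Roll up: report the worst of this unit's result and its children's."""
        return max([self.status] + [child.bist() for child in self.children])


io_board = Subsystem("I/O board", children=[
    Subsystem("D/A module"),
    Subsystem("Synchro module", status=Health.DEGRADED),
])
system = Subsystem("System", children=[io_board, Subsystem("Power supply")])
print(system.bist().name)  # DEGRADED
```

A worst-case roll-up is the simplest policy; it means a single degraded module is visible at the top level without the system software having to poll every board individually.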
BIST: Form, function
BIST is defined differently across the subsystems of an embedded application. The most common broad definition is “self-test”: the ability of built-in mechanisms to report the status, health, and operational level of individual components or Line Replaceable Units (LRUs) to the overall system. Some companies have embraced the customer- and industry-driven need for BIST in their board-level products. For the typical systems designer and integrator, however, BIST coverage is defined in varying ways, and the associated costs must be weighed and balanced. Engineers are often nagged by two questions: How do I check that my I/O design is working properly, and how serious are the consequences of a failure? The problem most designers face is not deciding whether BIST is important, but understanding the different types of BIST, weighing their advantages, and predicting the implementation cost of a particular BIST scheme.
LRUs or component products typically employ two BIST methods. The first is an “offline” test-and-diagnostics routine embedded in the product. The offline method is usually invoked at power-up (Power-On Self Test, or POST). Afterward, the operator may be able to initiate tests and exercise some rudimentary control over measurement modules; for output signal modules, an onboard measurement circuit provides direct feedback. For example, an I/O subsystem typically incorporates multiple D/A channels for driving various sensors. An onboard measurement circuit that reads back each D/A’s actual output can be incorporated and controlled during POST. The same circuit can also be used offline to command outputs and report the measured result immediately via a status register or interrupt.
This type of POST BIST testing typically needs to be benign enough not to adversely affect connected devices, because there is no guarantee of what might be connected to the I/O pins. Any POST or offline BIST testing must completely isolate the external interface, and the self-test design itself must not jeopardize connected devices. Offline testing ensures that the LRU is operating correctly during the power-on initialization (boot) process, but it only “guarantees” operation immediately after power-on. Some POST BIST routines are therefore written so they can be re-invoked after power-on, allowing the test cycle to be initiated at other times during system operation.
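One way to enforce the isolation rule while still allowing re-invocation is to wrap every self-test in a guard that disconnects the external interface first and always restores it afterward, even if the test fails partway through. The Python sketch below assumes a hypothetical `OutputRelay` that connects a channel to its connector pin; the same `run_bist()` entry point (also hypothetical) could be called at power-up or later on operator command.

```python
from contextlib import contextmanager


class OutputRelay:
    """Hypothetical relay connecting a channel to its external I/O pin."""

    def __init__(self) -> None:
        self.connected = True


@contextmanager
def isolated(relay: OutputRelay):
    """Disconnect the external interface for the duration of a self-test."""
    was_connected = relay.connected
    relay.connected = False
    try:
        yield relay
    finally:
        relay.connected = was_connected  # restore even if the test raises


def run_bist(relay: OutputRelay, test):
    """Re-invocable BIST entry point: usable at POST or on operator command."""
    with isolated(relay):
        return test()


relay = OutputRelay()
print(run_bist(relay, lambda: not relay.connected))  # True: test ran isolated
print(relay.connected)                               # True: interface restored
```

Putting the isolation in one guard, rather than in each test routine, makes it harder for a later test to forget the disconnect step.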
The second BIST method, which is in greater demand and a more distinctive approach, is an “online” test-and-diagnostics routine that provides direct, immediate status while the cards (or subsystems) are in normal operation. This technique monitors specific channels or actions in real time while the system remains online and functioning normally. Without the need for redundant or external monitoring, the card continuously performs an internal BIST during operation. Error and out-of-tolerance conditions are monitored continuously, and a status register or interrupt can be enabled so that the system automatically takes appropriate action. For example, in a servo I/O control loop, a synchro device might be attached to the positioning system for a targeting device. If an out-of-tolerance synchro channel reading is detected during normal operation, a status flag and interrupt are generated, warning the targeting designator and bringing a backup system or channel online to complete the mission. (See Table 1 for a comparison of BIST methods.)
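The online behavior in the synchro example can be sketched as a check that runs every control cycle while the channel stays in service. This Python sketch assumes the card exposes two readings per channel, the angle it reports to the host and an independent angle from its internal BIST path; `SynchroChannel`, `OnlineMonitor`, the 0.1° tolerance, and the callback standing in for the status flag and interrupt are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SynchroChannel:
    """Hypothetical synchro channel with two readings of the same shaft angle."""
    name: str
    reported_deg: float = 0.0  # angle delivered to the control loop
    bist_deg: float = 0.0      # angle measured by the card's internal BIST path


def angle_error(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (359 vs 1 -> 2)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


class OnlineMonitor:
    """Checks the active channel each cycle and fails over to a backup on a fault."""

    def __init__(self, primary: SynchroChannel, backup: Optional[SynchroChannel],
                 tol_deg: float, on_fault: Callable[[str, float], None]) -> None:
        self.active = primary
        self.backup = backup
        self.tol_deg = tol_deg
        self.on_fault = on_fault  # stands in for the status flag and interrupt

    def check(self) -> SynchroChannel:
        err = angle_error(self.active.reported_deg, self.active.bist_deg)
        if err > self.tol_deg:
            self.on_fault(self.active.name, err)
            if self.backup is not None:
                # Bring the backup channel online; no further spare remains.
                self.active, self.backup = self.backup, None
        return self.active


faults = []
primary = SynchroChannel("CH1", reported_deg=45.0, bist_deg=45.02)
backup = SynchroChannel("CH2", reported_deg=45.0, bist_deg=45.0)
monitor = OnlineMonitor(primary, backup, tol_deg=0.1,
                        on_fault=lambda name, err: faults.append(name))
monitor.check()          # within tolerance: CH1 stays active
primary.bist_deg = 47.0  # simulate an out-of-tolerance condition
print(monitor.check().name, faults)  # CH2 ['CH1']
```

The angle comparison wraps at 360° so that readings on either side of zero are not mistaken for a large error.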
In a mission-critical military application, the targeting device might be the turret or launch platform of a fighting vehicle. In an industrial application, it might be the control arm of a painting device, where detecting the fault could avert a collision between the paint gun and the chassis. Safety and reliability are therefore key design features in any mission-critical system. However, SWaP reductions are also important, and system designers and integrators are always watching for them. (Figure 1, courtesy of the U.S. Army, shows the THAAD missile.)
System considerations: BIST and SWaP
SWaP impact is a major consideration for systems designers and integrators. Volumetric space is saved by reducing the card or subsystem “count.” The ability to specify custom off-the-shelf technology with BIST features is key: the system designer or integrator can then meet most, if not all, system I/O requirements on one card. Because BIST is included on each module type, only one channel is needed per function. Since the module handles the BIST requirements on that single channel, there is no need to wire in or connect a redundant channel. This reduces the overall card count compared with systems that employ a redundant hardware complement to test the online channels.
Overall system weight is reduced by using multifunction designs and on-module built-in testing. Hand in hand with the space reduction, a lower card count invariably reduces overall system weight. Weight is also saved through reduced cabling: the fewer channels that must be wired into a redundant monitoring or control application, the less external harnessing is required.
The space and weight reductions achieved by minimizing the hardware or card count also yield immediate power savings. Eliminating redundant cards lowers the subsystem’s power consumption, which in turn reduces the load on the larger system’s power supply and adds further savings in space and weight (Figure 2).
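A back-of-the-envelope comparison makes the card-count argument concrete. The Python sketch below assumes a hypothetical card with 8 channels, 0.5 kg, and 15 W, and compares an I/O subsystem that needs a redundant monitoring channel for every function against one that relies on onboard BIST. Cabling and power-supply overhead are deliberately left out, so real savings would differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CardSpec:
    """Hypothetical I/O card figures used only for the comparison."""
    channels: int
    weight_kg: float
    power_w: float


def swap_totals(functions: int, card: CardSpec, redundant_monitoring: bool):
    """Return (cards, weight_kg, power_w) for an I/O subsystem."""
    # Redundant monitoring needs a second channel for every monitored function.
    channels = functions * (2 if redundant_monitoring else 1)
    cards = math.ceil(channels / card.channels)
    return cards, cards * card.weight_kg, cards * card.power_w


spec = CardSpec(channels=8, weight_kg=0.5, power_w=15.0)
print(swap_totals(24, spec, redundant_monitoring=True))   # (6, 3.0, 90.0)
print(swap_totals(24, spec, redundant_monitoring=False))  # (3, 1.5, 45.0)
```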
BIST increases MTBF
Incorporating onboard BIST also yields a large advantage in MTBF, as well as in overall system-level maintenance, logistics, and bottom-line project cost. Because the BIST hardware is designed in at the card/module level, system-level MTBF increases significantly: eliminating redundant test hardware reduces the component count and can lower power dissipation and operating temperature, both of which directly affect MTBF. Logistically, spares and new-build requirements scale directly with the size of the overall I/O subsystem, so a smaller subsystem also means fewer spares to stock.
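The MTBF argument can be checked with the standard series-system approximation: treating any card failure as a system failure and assuming constant failure rates, the system failure rate is the sum of the card failure rates, and system MTBF is its reciprocal. The card MTBF figures below are hypothetical; the sketch assumes that adding BIST circuitry lowers each card's MTBF somewhat (100,000 h to 90,000 h) but removes half the cards.

```python
def series_mtbf(card_mtbf_hours):
    """System MTBF for cards in series with constant failure rates.

    lambda_i = 1 / MTBF_i, lambda_sys = sum(lambda_i), MTBF_sys = 1 / lambda_sys.
    """
    return 1.0 / sum(1.0 / m for m in card_mtbf_hours)


redundant = [100_000.0] * 6  # hypothetical: 3 I/O cards + 3 monitoring cards
onboard = [90_000.0] * 3     # hypothetical: 3 I/O cards with onboard BIST

print(round(series_mtbf(redundant)))  # 16667
print(round(series_mtbf(onboard)))    # 30000
```

Under these assumptions the smaller card count outweighs the slightly higher per-card failure rate; with different real figures the margin could shrink or grow.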
BIST reduces software/hardware overhead
Onboard BIST has an important advantage that hardware-driven systems designers sometimes overlook: it greatly reduces software overhead. Redundant hardware-supported BIST requires at least two control loops: a main loop to handle the direct I/O requirements and a secondary, possibly embedded, loop to handle the redundant reads for comparison and BIST decision making. In a real-time embedded system, where a control loop may demand tight timing and fast feedback, this constant reading and polling of redundant hardware might not even be feasible because of the added latency and slower response.
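The difference in software work per control cycle can be sketched by counting register reads. In this Python sketch, lists stand in for hardware reads: the redundant approach reads every primary and monitor channel and compares them in software, while the onboard approach reads the primary channels plus one BIST status word. The one-bit-per-channel status-word layout is a hypothetical assumption, and read count is only a rough proxy for bus and CPU load.

```python
def redundant_cycle(primary, monitor, tol):
    """Software compares each channel with its redundant twin.

    Returns (values, faulty_channels, register_reads).
    """
    values, faulty = [], []
    for ch, (a, b) in enumerate(zip(primary, monitor)):
        values.append(a)
        if abs(a - b) > tol:  # BIST decision made in host software
            faulty.append(ch)
    return values, faulty, 2 * len(primary)


def onboard_cycle(primary, bist_status_word):
    """The card has already checked each channel; bit n set means channel n faulted.

    Returns (values, faulty_channels, register_reads).
    """
    faulty = [ch for ch in range(len(primary)) if (bist_status_word >> ch) & 1]
    return list(primary), faulty, len(primary) + 1


print(redundant_cycle([1.0, 2.0, 3.0], [1.0, 2.5, 3.0], tol=0.1))
# ([1.0, 2.0, 3.0], [1], 6)
print(onboard_cycle([1.0, 2.0, 3.0], 0b010))
# ([1.0, 2.0, 3.0], [1], 4)
```

Both versions find the same fault, but the redundant version doubles the data reads and keeps the comparison logic inside the host's timing budget.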
Additionally, onboard BIST, when properly used, can be interrupt driven. Different interrupts can be assigned different software vectors, so that a detected critical fault triggers a shutdown or other immediate action, while a noncritical fault might simply illuminate an indicator.
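In software, the separate vectors amount to a dispatch table keyed by fault severity. The Python sketch below uses callbacks to stand in for interrupt service routines; `BistInterruptController`, `Severity`, and the handler actions are hypothetical, and a real implementation would register ISRs through the board vendor's driver or the RTOS rather than a dictionary.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    NONCRITICAL = "noncritical"


class BistInterruptController:
    """Software stand-in for one interrupt vector per BIST fault severity."""

    def __init__(self) -> None:
        self._vectors = {}

    def register(self, severity: Severity, handler) -> None:
        self._vectors[severity] = handler

    def raise_irq(self, severity: Severity, channel: int) -> None:
        handler = self._vectors.get(severity)
        if handler is None:
            raise RuntimeError(f"no vector registered for {severity.value} faults")
        handler(channel)


log = []
ctl = BistInterruptController()
ctl.register(Severity.CRITICAL, lambda ch: log.append(("shutdown", ch)))
ctl.register(Severity.NONCRITICAL, lambda ch: log.append(("indicator on", ch)))
ctl.raise_irq(Severity.NONCRITICAL, 4)
ctl.raise_irq(Severity.CRITICAL, 2)
print(log)  # [('indicator on', 4), ('shutdown', 2)]
```

Raising an error for an unregistered severity makes a missing handler visible during integration instead of silently dropping a fault.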
In an avionics example, BIST faults on specific channels can be handled differently depending on what each channel controls or senses. On a flight’s final approach, for instance, many conditions, checks, and sensing and control indications require immediate action. A BIST fault on the channel sensing “landing gear down” would require immediate verification or a response to clear the error condition. If the BIST fault were tied to “lavatory door closed,” it could be deemed noncritical and might only require the flight attendant to verify that the door is physically latched. Either way, detecting a fault condition in real time, distinguishing between critical and noncritical functions, and taking the appropriate action for each function type ensures the overall system’s safety and reliability.
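The per-channel decision in this example can be captured as a policy table that maps each sensed function to the response its BIST fault should trigger. The channel names and actions below are hypothetical, taken loosely from the landing-gear and lavatory-door example; treating unlisted channels as critical is a conservative design choice assumed here, not a requirement from any avionics standard.

```python
from enum import Enum


class Action(Enum):
    IMMEDIATE_RESPONSE = "alert crew; verify immediately"
    CREW_CHECK = "log fault; request manual check"


# Hypothetical per-channel policy based on the final-approach example.
CHANNEL_POLICY = {
    "landing_gear_down": Action.IMMEDIATE_RESPONSE,
    "lavatory_door_closed": Action.CREW_CHECK,
}


def action_for_fault(channel: str, policy=CHANNEL_POLICY) -> Action:
    """Look up the response for a faulted channel; unknown channels are treated as critical."""
    return policy.get(channel, Action.IMMEDIATE_RESPONSE)


for ch in ("landing_gear_down", "lavatory_door_closed", "unmapped_sensor"):
    print(ch, "->", action_for_fault(ch).value)
```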
BIST is the superlative choice
Onboard online (background) BIST at the LRU or board level will generally be the most cost-effective test method and have the least impact on hardware and software integration. Online BIST reduces system-level design effort by eliminating added redundant hardware and software overhead. It lowers the required board count by reducing the need for duplicate test hardware, increases MTBF, and reduces SWaP along with the associated system and logistics costs. By specifying and incorporating online BIST within each LRU of an overall system, a more comprehensive system BIST can be implemented. Because each LRU is tested specifically against the design criteria of its subsystem, the overall system achieves greater test coverage and fault detection: a system design with greater safety and reliability at lower cost. Incorporating online BIST at the LRU (or board) level is a win-win strategy for accomplishing the mission.
Amir Shafy is a senior applications engineer for North Atlantic Industries (NAI), an independent supplier of embedded I/O boards, power supplies, and motion simulation and measurement instruments for the military, aerospace, and homeland security industries. Amir has held integration and design engineering positions for rugged MIL peripherals, and embedded and portable computing. He holds a Bachelor’s degree in Electrical Engineering and Technology (BSEET) from the State University of New York (SUNY) at Farmingdale.
North Atlantic Industries
631-567-1100