Computers have failed since the day the very first one was switched on. Components burned out, circuits shorted or opened, solder joints failed, pins were bent, and metals reacted with each other. These and countless other failure modes have plagued the computer industry from the very first circuit to today. Learning how to compensate for failure, understand failure mechanisms, and predict computer failure has become a full profession in itself.
Failure rate predictions are used by logistics, systems, and reliability engineers for a myriad of purposes, including reliability analysis, cost trade studies, availability analysis, spares planning, redundancy modeling, scheduled maintenance planning, product warranties, and guarantees.
Reliability predictions are very important to the management of a product life cycle. Among other uses, reliability predictions:
- Help assess the effect of product reliability on the maintenance activity and on the quantity of spare units required for acceptable field performance of any particular system. Reliability prediction can be used to establish the number of spares needed and to predict the frequency of expected unit-level maintenance.
- Provide necessary input to system-level reliability models. System-level reliability models can be used to predict frequency of system outages in steady-state, frequency of system outages during early life, expected downtime per year, and system availability.
- Provide necessary input to unit and system-level life cycle cost analyses. Life cycle cost studies determine the cost of a product over its entire life. Knowing how often units and systems fail during the first year of operation and in later years helps establish total life cycle cost estimates.
- Assist in deciding which product to purchase from a list of competing products. All else being equal, reliability predictions can be the deciding factor, so it is essential that the predictions for competing products be based on a common procedure.
- Can be used to set factory test standards for products requiring a reliability test. Reliability predictions help determine how often the system should fail, making it possible to determine if adequate testing is being performed.
- Are needed as input to the analysis of complex systems such as weapon systems and complex control systems. It is necessary to know how often different parts of the system will fail, even when components are redundant.
- Can be used in design trade-off studies. For example, a supplier could look at a design with many simple devices and compare it to a design with fewer devices that are newer but more complex. The unit with fewer devices is usually more reliable.
- Can be used to set achievable in-service performance standards against which to judge actual performance and stimulate action. Feedback can then be used to adjust testing procedures.
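As a rough illustration of how a predicted failure rate feeds spares planning and availability estimates, the following Python sketch assumes a constant failure rate, so that fleet failures follow a Poisson distribution. The function names, the 95 percent confidence target, and the fleet figures are hypothetical and do not come from any standard.

```python
import math


def expected_failures(mtbf_hours, units, hours):
    """Expected fleet failures over a period, assuming a constant failure rate."""
    return units * hours / mtbf_hours


def spares_needed(mean_failures, confidence=0.95):
    """Smallest spare count s with P(failures <= s) >= confidence (Poisson)."""
    term = math.exp(-mean_failures)  # probability of exactly zero failures
    cumulative = term
    spares = 0
    while cumulative < confidence:
        spares += 1
        term *= mean_failures / spares  # Poisson recurrence: P(k) from P(k-1)
        cumulative += term
    return spares


def steady_state_availability(mtbf_hours, mttr_hours):
    """Classic steady-state availability: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical fleet: 20 units, predicted MTBF of 50,000 hours,
# one year (8,760 hours) of continuous operation, 4-hour mean repair time.
mean = expected_failures(50_000, 20, 8_760)  # about 3.5 failures per year
print("expected failures:", mean)
print("spares for 95% confidence:", spares_needed(mean, 0.95))
print("availability:", steady_state_availability(50_000, 4))
```

The Poisson assumption only holds while the failure rate is roughly constant; early-life and wear-out failures would call for a different model.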
Reliability prediction methods
Accurate prediction of the reliability of electronic products requires knowledge of the components, the design, the manufacturing process, and the expected operating conditions. Once a prototype of the product is available, lab tests can be used to obtain more accurate reliability predictions. Several different approaches have been developed to predict the reliability of electronic systems and components, each with its own advantages and disadvantages. Among these approaches, three main categories are often used within government and industry: empirical (standards-based), physics of failure, and life testing.
Empirical prediction methods are based on models developed from statistical curve fitting of historical failure data, which may have been collected in the field, in-house, or from manufacturers. These methods tend to present good estimates of reliability for similar or slightly modified parts. Some parameters in the curve function can be modified by integrating existing engineering knowledge. The assumption is made that system or equipment failure causes are inherently linked to components whose failures are independent of each other. There are many different empirical methods that have been created for specific applications. Table 1 lists some of the commonly used prediction standards.
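The independence assumption is what lets empirical methods simply add component failure rates together. The sketch below shows that arithmetic for a series system with constant failure rates expressed in FIT (failures per billion device-hours). The part list and FIT values are hypothetical placeholders, not figures taken from any handbook, and real standards typically apply additional quality, environment, and stress factors to each part.

```python
import math

FIT_HOURS = 1e9  # one FIT is one failure per 10^9 device-hours

# Hypothetical bill of materials: part type -> (quantity, failure rate in FIT)
EXAMPLE_PARTS = {
    "capacitor": (40, 2.0),
    "resistor": (100, 0.5),
    "microcontroller": (1, 50.0),
    "connector": (4, 5.0),
}


def system_failure_rate_fit(parts):
    """Sum quantity * failure rate over all parts (independent, series system)."""
    return sum(qty * fit for qty, fit in parts.values())


def mtbf_hours(total_fit):
    """MTBF is the reciprocal of a constant failure rate."""
    return FIT_HOURS / total_fit


def reliability(total_fit, hours):
    """Probability of surviving `hours` under a constant failure rate."""
    return math.exp(-total_fit * hours / FIT_HOURS)


total = system_failure_rate_fit(EXAMPLE_PARTS)  # 80 + 50 + 50 + 20 = 200 FIT
print("system failure rate (FIT):", total)
print("MTBF (hours):", mtbf_hours(total))
print("one-year reliability:", reliability(total, 8_760))
```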
A physics of failure (PoF) approach is based on understanding the failure mechanism and applying a physics of failure model to the data. PoF analysis is a methodology for identifying and characterizing the physical processes and mechanisms that cause failures in electronic components. Computer models integrating deterministic formulas from physics and chemistry are the foundation of PoF.
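The text above does not name a specific model, but one widely used deterministic relationship for thermally activated failure mechanisms is the Arrhenius equation. As a minimal sketch, the function below computes the acceleration factor between a use temperature and a stress temperature. The 0.7 eV activation energy and the two temperatures are assumed values chosen purely for illustration; in practice the activation energy depends on the failure mechanism being modeled.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Ratio of the reaction rate at the stress temperature to that at use temperature."""
    use_k = use_temp_c + 273.15        # convert Celsius to Kelvin
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


# Assumed example: 0.7 eV mechanism, 55 C in the field, 125 C in the oven.
af = arrhenius_acceleration_factor(0.7, 55.0, 125.0)
print("acceleration factor:", af)
# Under this model, 1,000 hours at the stress temperature stands in for
# roughly 1,000 * af hours at the use temperature.
print("equivalent field hours for 1,000 test hours:", 1_000 * af)
```

A single Arrhenius term covers only one temperature-driven mechanism; a fuller PoF analysis would combine separate models for each relevant mechanism.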
With the life testing method, a test is conducted on a sufficiently large sample of units operating under normal usage conditions. Times-to-failure are recorded and then analyzed with an appropriate statistical distribution in order to estimate reliability metrics. Because testing under normal conditions can take too long, operating conditions are often accelerated and amplified to compress lifetime wear and tear into a manageable test time measured in days or weeks. This kind of testing and analysis is often called Life Data Analysis, Weibull Analysis, or Highly Accelerated Life Test (HALT). Some time-to-failure data from life testing may be incorporated into some of the empirical prediction standards (e.g., Bellcore/Telcordia Method II) and may also be necessary to estimate the parameters for some of the physics of failure models.
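To make the analysis step concrete, here is a minimal sketch of fitting a two-parameter Weibull distribution to recorded times-to-failure using median rank regression with Bernard's approximation. It assumes complete data (every unit ran to failure, no censoring) and regresses the linearized probability on log time; the failure times are hypothetical, and dedicated reliability tools would normally also handle suspensions and confidence bounds.

```python
import math


def median_ranks(n):
    """Bernard's approximation to the median rank of each ordered failure."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def fit_weibull(times):
    """Estimate Weibull shape (beta) and scale (eta) by least squares.

    Uses the linearization ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta).
    """
    ordered = sorted(times)
    n = len(ordered)
    xs = [math.log(t) for t in ordered]
    ys = [math.log(-math.log(1.0 - f)) for f in median_ranks(n)]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the fitted line
    intercept = y_mean - beta * x_mean
    eta = math.exp(-intercept / beta)     # characteristic life
    return beta, eta


def weibull_reliability(t, beta, eta):
    """Probability that a unit survives beyond time t."""
    return math.exp(-((t / eta) ** beta))


# Hypothetical times-to-failure in hours from a six-unit life test.
failure_hours = [410.0, 620.0, 790.0, 950.0, 1180.0, 1460.0]
beta, eta = fit_weibull(failure_hours)
print("shape beta:", beta)
print("scale eta (hours):", eta)
print("reliability at 500 hours:", weibull_reliability(500.0, beta, eta))
```

A fitted shape parameter above 1 suggests wear-out behavior, below 1 suggests early-life failures, and a value near 1 corresponds to the constant failure rate assumed by many empirical methods.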
Failure of the methods
The old methods of predicting reliability in electronics have begun to fail us. MIL-HDBK-217 has been the cornerstone of reliability prediction for decades, but it is rapidly becoming irrelevant and unreliable as we venture into the realm of nanometer-geometry semiconductors and their failure modes. The uncertain future of long-established methods has many in the industry seeking alternatives.
On the component supplier side of the equation, semiconductor suppliers saw such increases in component reliability and operational lifetimes that they slowly began dropping MIL-STD-883B testing, and nearly all have dropped their lines of mil-spec parts. They have moved their focus to commercial-grade parts, where the unit volumes are much higher. The purchasing power of the military markets has become insignificant to the point where there is no longer any leverage. In response, system builders took the commercial-grade devices, sent them out to testing labs, and found that a large majority of them would, in fact, operate reliably at extended temperature ranges and environmental conditions. Field data gleaned over the years has improved much of the empirical data behind the complex algorithms used for reliability prediction.
A new set of problems has arisen with smaller die geometries. Previous semiconductor generations were showing operational lifetimes of 10-15 years or more. However, empirical evidence is now showing that nanometer-geometry integrated circuits (ICs) are wearing out in just 3-5 years. The small-geometry parts are simply wearing out faster, and the commercial users really don’t care. They would rather see consumers replace their smart devices every two or three years, so the shorter life cycles play into their product strategies. For multibillion-dollar weapons platforms, however, those life cycles just don’t fit the model.
Reliability Community to the rescue
Several years ago VITA members saw the need for improving the consistency and traceability of reliability prediction (MTBF) data for electronic devices used in defense and aerospace applications.
The Reliability Community working group was formed to investigate and develop industry standards to address electronics failure rate prediction and assessment.
The community comprises representatives from electronics suppliers, system integrators, and the Department of Defense (DoD). The majority of the work is driven by the user community that depends so heavily on solid reliability data. BAE Systems, Bechtel, Boeing, General Dynamics, Harris, Lockheed Martin, Honeywell, Northrop Grumman, and Raytheon are some of the demand-side contributors to the work done by the Reliability Community. These members have developed community of practice documents that define electronics failure rate prediction methodologies and standards. The efforts have produced a series of documents that have been ratified by ANSI and VITA. Where applicable, these standards provide adjustment factors to existing standards.
The Reliability Community addresses the limitations of existing prediction practices with a series of subsidiary specifications that contain the industry “best practices” for performing electronics failure rate predictions. The members recognize that there are many industry reliability methods, each with a custodian and acceptable practices for calculating electronics failure rate predictions. If such a method is identified as requiring additional standards for use by electronics module suppliers, a new subsidiary specification will be considered by the working group.
ANSI/VITA 51.0 Reliability Prediction and its subsidiary specification – ANSI/VITA 51.1 Reliability Prediction: MIL-HDBK-217 – define consistency and repeatability for mean time between failure (MTBF) calculations (see Figure 1). The intention is to supplement MIL-HDBK-217.
ANSI/VITA 51.2 PoF Reliability Predictions defines standard methods for using physics of failure in reliability prediction.
ANSI/VITA 51.3 Qualification and Environmental Stress Screening in Support of Reliability Predictions provides information on how qualification levels and environmental stress screening (ESS) influence reliability.
Current status
Work continues on all of these specifications. Adjustments have to be made for new components and new thinking on failures. As new electronics technology is developed, new methods will be developed, documented, and added to future releases of these standards and subsidiary specifications.
The VITA Reliability Community invites participation in both the development and implementation of the documents. The Reliability Community maintains a LinkedIn group, which is open to anyone with an interest in reliability predictions: www.linkedin.com/groups?home=&gid=4701272
For more information or to get involved in development of these specifications, visit www.vita.com/reliability.