The complexity of FPGA and ASIC designs is becoming an issue for many military, aerospace, and other safety-critical industries. Integration of numerous components into a single System-on-Chip (SoC) is rapidly occurring in this segment. What was once a discrete component is now merely a block in a bigger system, all within a single chip. Each of these blocks has its own function and likely its own clocking requirements. Because of this, the number of independent (and asynchronous) clock domains is on the rise. A 2004 study[1] showed that the average number of clock domains on a single device was between 5 and 10, and was projected to be greater than 15 by 2006. Today, this number is likely higher. This means that the probability of bugs due to clock domain crossing (CDC) issues is also growing substantially.
One of the most pervasive and insidious CDC problems is metastability, which occurs any time a register that receives a signal from another clock domain temporarily goes into an indeterminate state. Metastability is dangerous because it causes devices to fail intermittently in ways that are extremely difficult to diagnose, and such failures are unacceptable in safety-critical systems. However, if the problem is well understood, various techniques on both the design and verification sides can address metastability, ensuring that safety-critical designs do not fail because of it. Automated tools designed specifically for CDC verification overcome the shortcomings of manual approaches and ensure comprehensive coverage, and resolution, of the metastability problem.
DO-254 and safety-critical design
Document RTCA/DO-254, Design Assurance Guidance for Airborne Electronic Hardware (referred to herein as "DO-254"), is intended to ensure hardware reliability for the purpose of flight safety. DO-254 defines a process that hardware vendors must follow to get their hardware certified for use in avionics systems. Certification authorities worldwide enforce its use for all in-flight hardware (that is, FPGA or ASIC designs).
Verification is an important aspect of DO-254. Not only must Complex Electronic Hardware (CEH) designs be verified to ensure they meet the system requirements, but also any hardware-specific aspects of a design that might impact proper operation must be verified. One hardware verification method that DO-254 mentions is Design Margin Analysis, which is defined as any method that "verifies that the design implementation satisfies its functional requirements given the variability of components."[2] CDC fits within this category of analysis, as the variability of clock timing between independent domains can impact device function.
The problem with clock domain crossings
A CDC signal is one that originates in one clock domain and is sampled by a register in another. Any ASIC or FPGA with more than one clock has CDC signals and must be designed and verified to ensure that it is free of CDC issues. Metastability is the term used to describe what happens in digital circuits when the clock and data inputs of a flip-flop change values at approximately the same time. This leads to the flip-flop output oscillating and not settling to a value within the appropriate delay window, as shown in Figure 1. In this case, the output of the flip-flop is said to have "gone metastable." This situation happens in every design containing multiple asynchronous clocks.
Failures due to metastability generally go undetected during simulation (which tests a chip’s logic functions) and static timing (which tests for timing within a single clock domain) because these verification methodologies do not consider potential bugs from CDC. Without explicit testing, these types of bugs are only caught in the actual hardware device in the lab or in the field. Catching them in the lab is expensive and time consuming, often prohibitively so. For DO-254 and other safety-critical projects, catching faulty operation in the field could have catastrophic results.
Designers are generally aware of the metastability problem and try to implement their designs to isolate the outputs of the metastable registers such that the metastable value cannot propagate into the rest of the design. For example, savvy designers add synchronizers between clock domains, create protocols for transferring data between domains, and try to avoid situations where data from multiple clock domains reconverge.
However, as shown in Figure 2, it is quite easy to leave out needed synchronizers, or to place one incorrectly so that it does not work as expected. Even careful manual code reviews typically miss these problems. For example, reconvergence issues, one of the most common sources of metastability-related failures, are almost impossible to find through manual code reviews. Multiple signals are said to reconverge when they are transferred from one clock domain to another and are used together to perform some logical function. If the receiving logic assumes a fixed timing relationship between these signals, CDC reconvergence errors can occur, because each synchronizer may delay its signal by a different number of cycles. The effects of CDC issues can be highly data-dependent and may only exhibit themselves in unusual situations when a combination of particular data values crosses the CDC boundary while the design is in a specific vulnerable state.
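The reconvergence hazard described above can be sketched with a small Python model. It is an illustration under stated assumptions, not a description of any particular design: two control signals change on the same source clock edge, each passes through its own two-flop synchronizer, and a metastable first flop may resolve to the old value, adding one destination cycle. The names `SYNC_LATENCY`, `arrival_cycle`, and `skew` are hypothetical.

```python
from itertools import product

# Assumption: a two-flop synchronizer nominally adds two destination cycles.
SYNC_LATENCY = 2


def arrival_cycle(extra_cycle: int) -> int:
    """Destination cycle in which a synchronized signal shows its new value.

    extra_cycle is 1 when the first flop went metastable and resolved to the
    old value, so the change is only captured on the following clock edge.
    """
    return SYNC_LATENCY + extra_cycle


def skew(extra_a: int, extra_b: int) -> int:
    """Cycles by which two separately synchronized signals disagree."""
    return abs(arrival_cycle(extra_a) - arrival_cycle(extra_b))


# Enumerate every way the two synchronizers can resolve.
outcomes = {(ea, eb): skew(ea, eb) for ea, eb in product((0, 1), repeat=2)}

for (ea, eb), cycles in outcomes.items():
    status = "aligned" if cycles == 0 else f"skewed by {cycles} cycle"
    print(f"extra delay a={ea}, b={eb}: {status}")
```

In two of the four cases the signals arrive one cycle apart, so logic that assumes they always change together sees an inconsistent combination for one cycle.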
Designing to eliminate metastability
Most companies have best practice requirements for designs containing multiple clock domains. These requirements often include placing specific clock synchronization schemes manually, naming CDC signals so they can be identified and reviewed, and ensuring the design is constructed so CDC signals are restricted to only subsections of the design.
Unfortunately, even with strict design regulations, things can and do go wrong. For instance, a designer might fail to realize a signal is coming from a different clock domain and, thus, unintentionally violate a requirement. Conversely, a designer might realize a signal is coming from a different clock domain but choose a synchronization scheme inappropriate for the design (for example, using individual synchronizer bits to synchronize a data bus, which results in corrupted data values in the real hardware). This is especially common when designs are reused and changes are made to an existing design. In this situation, it is easy for a designer to use signals without realizing they come from another domain, combine CDC signals into a state machine without considering the advancing/receding nature of synchronized control signals, and so on.
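To make the data bus example concrete, the following Python sketch enumerates the values a receiver might capture when each bit of a bus has its own synchronizer. It assumes that every changing bit can independently show either its old or its new value in the sampling cycle. The comparison with Gray coding is an added illustration of a commonly used alternative, not a scheme the text above prescribes, and the function names are hypothetical.

```python
from itertools import product


def possible_samples(old: int, new: int, width: int) -> set:
    """Values a receiver may capture when each bit is synchronized on its own.

    Assumption: every bit that differs between old and new independently
    resolves to either its old or its new value in the sampling cycle.
    """
    changing = [i for i in range(width) if ((old ^ new) >> i) & 1]
    results = set()
    for choice in product((0, 1), repeat=len(changing)):
        value = old
        for bit, take_new in zip(changing, choice):
            if take_new:
                value ^= 1 << bit  # flip this bit to its new value
        results.add(value)
    return results


def to_gray(n: int) -> int:
    """Standard binary-to-Gray conversion: adjacent codes differ in one bit."""
    return n ^ (n >> 1)


# Counter stepping from 3 to 4: in plain binary all three bits change.
binary_samples = possible_samples(0b011, 0b100, 3)
# The same step in Gray code changes a single bit.
gray_samples = possible_samples(to_gray(3), to_gray(4), 3)

print("binary:", sorted(binary_samples))
print("gray:  ", sorted(gray_samples))
```

With plain binary encoding the receiver can capture any of the eight 3-bit values, most of which were never sent. With the Gray-coded step it can only capture the old or the new code.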
Verifying DO-254 safety-critical designs
After design has been completed, the next step is DO-254 verification, which can be executed manually or via an automated CDC verification tool.
Manual verification issues
Manual verification methods can be quite problematic. For example, while an extensive manual code review could find structural issues, it would be tedious, time consuming, and error prone. In addition, manual reviews typically cannot ensure that transfer protocols are used correctly and rarely address reconvergence issues. Further, the reviewer can:
- Miss a signal altogether
- Identify a signal, but fail to realize it crosses to a different clock domain because of multiple fan-outs, a misread sending or receiving clock signal name, and so on
- Correctly identify all CDC signals, but fail to realize there is no synchronizer in place
- Identify all CDC signals and ensure all synchronizers are in place, but not realize a synchronizer is incorrect (for example, synchronizing multiple bits of a data bus using independent synchronizers)
- Identify that all CDC signals and synchronizers are correct, but miss the fact that combinational logic placed in or around the synchronizer invalidates the timing requirements of the synchronizer
To complicate matters, DO-254 projects require that verification and design be done independently. This independence is valuable in many regards, but it works against safety-critical, multi-clock designs: verification engineers generally are not as well versed in the design as the designers themselves and often do not recognize CDC issues.
Automating CDC verification
Even if the verification team does recognize the problems associated with CDC, verifying metastability effects by hand is extremely difficult and error prone, as there can be literally hundreds or even thousands of CDC signals. Therefore, companies benefit greatly by using an automated tool designed specifically for CDC verification to bridge the knowledge gap between design and verification teams and to ensure comprehensive coverage of these issues.
A comprehensive CDC verification tool, such as the Mentor Graphics 0-In CDC tool, must include the following three capabilities.
1. Perform a structural analysis. This is most effectively done on the RTL code, to identify and analyze all signals crossing clock domains and determine whether their synchronization schemes are present and correct. The analysis should:
- Identify the clock structure used in the design (clock domains, clock gating, dividers, and so on)
- Identify all CDC signals in the design
- Determine the synchronization scheme (if any) used on the signals
- Check that each synchronization structure is implemented and used correctly
While the process is automated, the user has the ability to guide the tool by providing additional information on clock groups, preferred synchronization types, exceptions, and many other options. If a problem is identified in the structural synchronization, it can be debugged using an interactive environment specifically designed to simplify working with CDC signals.
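The structural checks listed above can be sketched on a toy netlist in Python. This is a deliberately simplified illustration, not how 0-In CDC or any other tool works internally: the `Reg` record, the register names, and the rule that a crossing counts as synchronized only when it lands on a plain two-flop chain are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Reg:
    """Hypothetical netlist entry: a register, its clock, and its drivers."""
    name: str
    clock: str
    drivers: Tuple[str, ...] = ()


def find_crossings(regs: List[Reg]) -> List[Tuple[Reg, Reg]]:
    """Return (source, destination) pairs whose clocks differ."""
    by_name = {r.name: r for r in regs}
    return [(by_name[d], r) for r in regs for d in r.drivers
            if by_name[d].clock != r.clock]


def is_two_flop_synchronized(dst: Reg, regs: List[Reg]) -> bool:
    """Assumed rule: the destination flop has a single driver and feeds only
    one further flop in the same domain, which has no other drivers."""
    loads = [r for r in regs if dst.name in r.drivers]
    return (len(dst.drivers) == 1 and len(loads) == 1
            and loads[0].clock == dst.clock and len(loads[0].drivers) == 1)


design = [
    Reg("tx_req", "clk_a"),
    Reg("tx_data", "clk_a"),
    Reg("sync1", "clk_b", ("tx_req",)),
    Reg("sync2", "clk_b", ("sync1",)),
    Reg("fsm", "clk_b", ("sync2",)),
    Reg("rx_data", "clk_b", ("tx_data", "fsm")),  # crossing with no synchronizer
]

report = [(src.name, dst.name, is_two_flop_synchronized(dst, design))
          for src, dst in find_crossings(design)]

for src, dst, ok in report:
    print(f"{src} -> {dst}: {'synchronized' if ok else 'MISSING SYNCHRONIZER'}")
```

Here the request path is reported as synchronized, while `tx_data` reaches `rx_data` directly and is flagged. A real tool also has to recognize clock gating, dividers, and many other synchronizer styles.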
2. Verify transfer protocols. The automated CDC verification tool assures that the synchronization schemes are used correctly, by monitoring and verifying that protocols are being followed during simulation. This is accomplished via the use of advanced assertion-based verification techniques. Using the information extracted from the design during structural verification, CDC protocol monitors are automatically created. The monitors contain assertions that check whether the appropriate CDC protocol is followed at all times, regardless of the current clock relationship.
A violation does not have to cause the simulation to fail. Any violation during simulation is automatically captured by the automated CDC verification tool, allowing the designer to easily debug the problem. Additionally, the monitors capture critical coverage information, allowing the verification team to quantify the quality of the CDC verification. When running multiple simulations, the results can be merged, exposing any coverage holes in the CDC testing.
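A protocol monitor of this kind can be sketched in Python as follows. The protocol is a hypothetical one chosen for illustration (the data bus must stay stable while a `valid` flag is asserted), and the class and field names are assumptions, not the monitors that any tool generates. The sketch records violations instead of stopping the run, and counts completed transfers as a simple coverage measure that can be merged across runs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StableDataMonitor:
    """Checks that data does not change while valid stays asserted."""
    violations: List[int] = field(default_factory=list)  # cycles with errors
    transfers_seen: int = 0                               # coverage counter
    _prev: Optional[Tuple[int, int]] = None

    def sample(self, cycle: int, valid: int, data: int) -> None:
        if self._prev is not None:
            prev_valid, prev_data = self._prev
            if prev_valid and valid and data != prev_data:
                self.violations.append(cycle)  # record it, keep simulating
            if prev_valid and not valid:
                self.transfers_seen += 1       # one transfer completed
        self._prev = (valid, data)


def run(trace: List[Tuple[int, int]]) -> StableDataMonitor:
    """Feed one simulation trace of (valid, data) pairs to a new monitor."""
    monitor = StableDataMonitor()
    for cycle, (valid, data) in enumerate(trace):
        monitor.sample(cycle, valid, data)
    return monitor


run1 = run([(0, 0), (1, 5), (1, 5), (0, 5), (1, 7), (1, 9), (0, 9)])
run2 = run([(0, 0), (1, 3), (0, 3)])

# Merge results from several simulations into one coverage figure.
total_transfers = run1.transfers_seen + run2.transfers_seen

print("run1 violations at cycles:", run1.violations)
print("transfers covered across runs:", total_transfers)
```

In the first trace the data changes from 7 to 9 while `valid` is still high, so cycle 5 is logged as a violation and the run continues to completion.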
3. Globally check for reconvergence. This is most effectively done by injecting the effects of potential metastability into the simulation environment and determining how the design will react. 0-In CDC includes the CDC-FX technology, which integrates with existing simulation environments to introduce metastability effects in order to reproduce the same variable delay behavior that would occur in the final hardware. These metastability injectors are automatically placed into existing RTL simulations and on all CDC paths – even those that do not use structured synchronizers. These injectors pseudo-randomly inject the effects of metastability into CDC signals only at appropriate times when metastability could occur in hardware. If the design is not tolerant of these effects, due to reconvergent CDC paths, a functional error will be stimulated, which can be debugged using traditional techniques.
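The idea of metastability injection can be sketched in Python. This is a simplified model and not the CDC-FX implementation: it assumes two mutually exclusive control bits that swap on every cycle, and an injector that, only when a bit is changing, pseudo-randomly lets the receiver capture either the old or the new value. All names are hypothetical.

```python
import random


def capture(old: int, new: int, rng: random.Random, inject: bool) -> int:
    """Value captured by the receiving flop for one bit.

    Assumption: metastability is only possible when the bit is changing, and
    it resolves pseudo-randomly to the old or the new value.
    """
    if inject and old != new:
        return rng.choice((old, new))
    return new


def count_errors(cycles: int, seed: int, inject: bool) -> int:
    """Count cycles in which the receiver does not see exactly one bit set."""
    rng = random.Random(seed)   # seeded so that a failing run can be replayed
    state = (1, 0)              # source-side bits, always mutually exclusive
    errors = 0
    for _ in range(cycles):
        nxt = (state[1], state[0])          # both bits change together
        a = capture(state[0], nxt[0], rng, inject)
        b = capture(state[1], nxt[1], rng, inject)
        if a + b != 1:                      # receiver assumes one-hot
            errors += 1
        state = nxt
    return errors


without_injection = count_errors(200, seed=1, inject=False)
with_injection = count_errors(200, seed=1, inject=True)

print("errors without injection:", without_injection)
print("errors with injection:   ", with_injection)
```

Without injection the simulation never exposes the reconvergence bug, because both bits always appear to change together. With injection the same stimulus produces functional errors that can be replayed from the seed and debugged.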
Reducing the risk of metastability
The number of independent clock domains is on the rise, resulting in a higher risk of metastability. Metastability is a dangerous occurrence in safety-critical designs. Designers typically understand the problem, but addressing all aspects of metastability manually is very difficult. Thus, thorough verification specifically targeted at the CDC problem is imperative to ensure safe device operation. Automated CDC verification has been available for years and has been proven on thousands of industry designs. The problem of undetected metastability bugs can therefore be solved, but only if it is understood and addressed during verification. Every multi-clock, safety-critical design, especially one subject to DO-254 compliance, should run automated CDC checking, rather than relying on less effective manual verification, as part of a thorough verification process.
References
[1] 2004 IC/ASIC Functional Verification Study, Collett International Research. Used with permission.
[2] RTCA/DO-254, "Design Assurance Guidance for Airborne Electronic Hardware," section 6.3.
Mentor Graphics Corporation
503-685-1768
www.mentor.com