What we do and don’t believe at VITA

The two questions I’m asked regularly are:

Why did we avoid putting live-insertion mechanisms on VITA’s architectures?
How does the new V-58 Line Replaceable Unit (LRU) specification relate to the liquid-cooling specifications (V-48, V-50)?

No live insertion at VITA: It’s an over-hyped buzzword and an obvious myth

Live insertion (removing a board from the backplane and putting in another board while the system is powered on) is just one element of a fault-tolerant system model. VITA and its members defined five phases of the fault-tolerant model during the Futurebus standards efforts in the early 1990s:

Fault detection
Fault identification
Fault isolation
Fault correction (for example, removing the bad board and inserting a new one without powering down)
System realignment (for example, recognizing the new resources on the newly inserted board and bringing them into the system map)

At this point, you can see where live insertion fits into the fault-tolerant model, and you can also see that the definition of powered on becomes critical to what is actually happening inside the machine. If the power is on and the data buses are operating (that is, data is being transferred across the backplane), the machine is operating under the restrictions of the fault-tolerant model, and the other four phases of that model must be in place. That means mountains of very specific, tuned software to handle all the phases mentioned and to ensure that the machine keeps operating in spite of any technical problems with the electronics. Not since the Stratus machines have we seen any computers operate like this. You cannot live-insert a board into a running backplane without all five elements of the fault-tolerant computing model, especially if the backplane uses a bus.
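The all-or-nothing nature of that requirement can be sketched in a few lines of Python. This is only an illustration: the `Phase` enum and the helper names `missing_phases` and `can_live_insert` are hypothetical, made up for this example, and do not come from any VITA or Futurebus document.

```python
from enum import Enum


class Phase(Enum):
    """The five phases of the fault-tolerant model, in order."""
    FAULT_DETECTION = 1
    FAULT_IDENTIFICATION = 2
    FAULT_ISOLATION = 3
    FAULT_CORRECTION = 4      # the board swap itself
    SYSTEM_REALIGNMENT = 5    # bring the new board into the system map


def missing_phases(implemented):
    """Return the phases a system still lacks, in model order."""
    return [phase for phase in Phase if phase not in implemented]


def can_live_insert(implemented):
    """Live insertion on a running backplane needs all five phases."""
    return not missing_phases(implemented)


# A system that only has the hardware to swap boards under power:
hardware_only = {Phase.FAULT_CORRECTION}
print(can_live_insert(hardware_only))
print([phase.name for phase in missing_phases(hardware_only)])
```

The point of the sketch is that the swap hardware covers one phase out of five; the other four are software.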

Now, if the machine is powered on and the buses are quiesced (no data being passed), you can remove bad boards and insert new boards by simply controlling the power feeds to that particular board slot. This is configuration maintenance, not live insertion. From my perspective, live indicates that the machine is operating, running its software, and passing data on the backplane. Configuration maintenance just allows you to remove and replace cards by controlling the power to the board without rebooting all the software each time, and the machine is not running application code during this process. However, if you remove and insert cards without rebooting the software, the software already in the machine must be realigned (matched) with the resources and functions on any new board inserted. This means that phase five (realignment) must be present: there must be some software on the machine that identifies the new resources, finds the appropriate software to run those functions on the new card (drivers, interrupt handlers, register address maps, and others), and puts that software in place. Consequently, live insertion is a terribly misused term. What people usually mean is really configuration maintenance.
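A minimal sketch of that distinction, and of the realignment step, might look like the following. Everything here is hypothetical and chosen for illustration: the function names, the `driver_registry` dictionary, and the resource names are assumptions, and the label for the powered-off case is my own, since that case needs no special name.

```python
def classify_swap(powered_on, bus_active):
    """Name the kind of board swap from the machine's state."""
    if not powered_on:
        return "power-down replacement"      # assumed label for this case
    if bus_active:
        return "live insertion"              # needs all five phases
    return "configuration maintenance"       # buses quiesced, slot power switched


def realign(system_map, slot, board_resources, driver_registry):
    """Phase five: match software to the resources on the new board.

    Raises LookupError if any resource has no known driver, because the
    board cannot be brought into the system map in that case.
    """
    unknown = [r for r in board_resources if r not in driver_registry]
    if unknown:
        raise LookupError(f"no driver for: {unknown}")
    updated = dict(system_map)               # leave the caller's map untouched
    updated[slot] = {r: driver_registry[r] for r in board_resources}
    return updated


# Hypothetical example: a new board with two resources goes into slot 3.
registry = {"serial_port": "serial_drv", "dma_engine": "dma_drv"}
print(classify_swap(powered_on=True, bus_active=False))
print(realign({}, 3, ["serial_port", "dma_engine"], registry))
```

Even in the quiesced case, `realign` has real work to do, which is why configuration maintenance still requires phase five.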

With processor chips and I/O chips becoming obsolete in a matter of months these days due to Restriction of the use of certain Hazardous Substances (RoHS) and technology advances, you could not possibly get the fault-tolerant software to stabilize and be reliable. If the hardware is moving around every 18 months, that means the software is moving around every 18 months. Live insertion, as hyped today, is a myth. That is why none of the new VITA standards cover any live insertion or fault-tolerant techniques. There is no need for it. Plus, even if you did include the hardware to allow extraction and insertion of cards under power, the customers and applications we have at VITA do not want to pay for it. Besides, with serial fabrics now on the market, redundancy (multiple machines running together in a fail-over mode) is a much easier and more cost-effective solution. The myth of live insertion and its benefits has been overtaken by reality and technology advancement (particularly serial fabric networks replacing buses).

Liquid cooling and LRUs: A perfect fit

The VITA Standards Organization (VSO) has been the leading standards group developing techniques for liquid cooling of high-density hot electronics. Fans (forced-air cooling) can remove about 1 to 1.5 W per square inch of board surface area. Liquid cooling can remove 200 W or more per square inch right now, and more in the future. It is clear that with more efficient liquid cooling, you can pack more hot silicon on a given-sized PCB. That reduces space and weight, two critical requirements of VITA members and the applications that VITA’s new standards target. Over time, electronics always get smaller, faster, and cheaper. They also get hotter and are tougher to cool.
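To put those per-square-inch figures in perspective, here is a rough back-of-the-envelope calculation in Python. The 50-square-inch board area is a hypothetical round number picked for easy arithmetic, not the size of any particular VITA form factor, and the constant and function names are illustrative.

```python
# Cooling capacity per square inch of board surface, from the figures above.
AIR_W_PER_SQ_IN = (1.0, 1.5)     # forced-air range, low to high
LIQUID_W_PER_SQ_IN = 200.0       # liquid cooling, lower bound today


def coolable_watts(area_sq_in, w_per_sq_in):
    """Heat load (W) that a board of the given area can shed."""
    return area_sq_in * w_per_sq_in


board_area = 50.0  # square inches; assumed for illustration only

air_low = coolable_watts(board_area, AIR_W_PER_SQ_IN[0])
air_high = coolable_watts(board_area, AIR_W_PER_SQ_IN[1])
liquid = coolable_watts(board_area, LIQUID_W_PER_SQ_IN)

print(f"forced air: {air_low:.0f} to {air_high:.0f} W")
print(f"liquid:     {liquid:.0f} W or more")
print(f"ratio:      at least {liquid / air_high:.0f}x")
```

On that assumed board, forced air handles 50 to 75 W while liquid cooling handles 10,000 W or more, a factor of more than 100 even against the best forced-air figure.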

If you are liquid cooling a big rack full of cards, you can run all the pipes and valves to each board in the backplane, but that gets complicated and expensive. You can also cool the chassis itself using cold walls (liquid coolant flowing through the endplates of the chassis), and use conduction-cooling techniques to transfer the heat from the cards to the cold walls. This works well in many of VITA’s applications, and our members have announced a number of products that do this. If all the cards in such a system are using a bus (VME) to communicate, this is probably the way to go.

But not all cards in the rack need to be liquid cooled. Some can operate very efficiently with forced-air or conduction cooling, so liquid cooling every card in a rack is overkill and adds cost to the system. Additionally, we have high-speed serial fabrics as the foundation of many new VITA standards. That allows us to group cards and processors into functional nodes and subsystems on a network. Some high-performance computing nodes will require liquid cooling. Other groups of cards will run perfectly with conduction or forced-air cooling. To take advantage of this new ability, we need a packaging concept other than the Air Transport Radio (ATR) box or the 19-inch rack. Hence, we have V-58, Advanced Electronic Packaging, or LRUs.

An LRU is slightly larger than a shoebox. All the PCBs inside the small box are connected to an internal backplane, and the box plugs into a metal frame. The connectors on the rear of the box join the network as a node, mate with the power connections (120 Vac, 240 Vac, 400 Hz AC, 12 to 18 Vdc, 24 Vdc, 48 Vdc, and so on), and mate with the appropriate cooling system (liquid, air, or conduction-cooling surfaces). Now we can densely pack all the electronics into a small space, and we can cool them very efficiently and cost effectively.

However, LRUs offer further benefits. Imagine that one of the LRUs in a rack has failed and has a bad board inside. We can pull that LRU out while the rest of the system is running (with power on), insert a new LRU, and start it up. (It is just a node on a serial fabric network.) This process takes only a few minutes, and the system is running perfectly again. This concept is commonly referred to as two-level maintenance. The repair technicians concentrate on getting the system up and running in less than 30 minutes, and they never touch a PCB or troubleshoot to the board or chip level. Taking the LRU apart and finding the bad chip is done at a central repair depot, not in the field. And all this can be done by people with very little training (10 minutes maximum) and no tools (not even a screwdriver), so even the most complex electronic systems can return to service quickly.

So, now you know why VITA does not do any live-insertion standards: They are not needed. Also, you know how LRUs (V-58) and the new cooling standards (V-48 and V-50) fit together to solve thermal problems. Next time, I will explain why an octagonal mezzanine card form factor makes much more sense than a rectangular four-sided card.