Moving into embedded supercomputing

The Massachusetts Institute of Technology (MIT) and other universities have recently announced new algorithms for data-driven applications. Examples include advanced Fast Fourier Transforms (FFTs), SWARM algorithms for Unmanned Aerial Vehicles (UAVs) and Unmanned Underwater Vehicles (UUVs), algorithms for extracting “fat tail” data in radar/sonar applications, and new beam-forming algorithms for SIGnals INTelligence (SIGINT). These new algorithms need advanced supercomputing architectures such as VPX.

Start with a hypercube

At the May VITA Standards Organization (VSO) meetings, a proposal to add profiles for 4-dimensional and 6-dimensional hypercubes to the VITA 65 (OpenVPX) specification was accepted. A 3D cube tops out at 8 nodes, so when you start hooking together more than 8 CPUs, you must think about computer architectures in more than 3 dimensions. The first architecture beyond three dimensions is the 4D hypercube, or tesseract. Many of the new algorithm-driven applications could require more than 8 CPUs, so the fourth dimension is a good place to start.

In the early 1980s, David May and Robert Milne of Inmos developed a new microprocessor chip, the Transputer. They hooked 16 processors together into a 4D hypercube architecture using the chip's serial links. The machine ran well, but the links were far too slow, and each of the processors was data starved. Even with the multigigabit fabrics available today, processors in a hypercube can still be data starved, depending on the data-sharing patterns between the nodes.

In any n-dimensional hypercube, the worst-case number of hops (how many links the data must traverse before it arrives at its destination) is the number of dimensions of the architecture (n). In a 4D hypercube, the worst-case number of hops is 4 (Figure 1). To overcome this hop latency, you must put the applications that share the most data on the CPU nodes that are closest to each other (parsing).
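The hop rule is easy to check numerically. The sketch below is illustrative only and assumes a conventional scheme in which each of the 16 nodes has a 4-bit address and two nodes are linked when their addresses differ in exactly one bit. Under that assumption the hop count between two nodes is the number of differing address bits. The function names, the task names, and the traffic figures are all hypothetical.

```python
def hops(a, b):
    """Minimum hops between two hypercube nodes: the number of differing address bits."""
    return bin(a ^ b).count("1")


def neighbors(node, n):
    """Nodes directly linked to `node` in an n-D hypercube: flip each address bit once."""
    return [node ^ (1 << d) for d in range(n)]


def route(src, dst, n):
    """One shortest path: fix the differing address bits from lowest to highest."""
    path = [src]
    cur = src
    for d in range(n):
        if (cur ^ dst) & (1 << d):
            cur ^= 1 << d
            path.append(cur)
    return path


def worst_case_hops(n):
    """Largest hop count over all node pairs of an n-D hypercube."""
    nodes = range(2 ** n)
    return max(hops(a, b) for a in nodes for b in nodes)


def traffic_cost(placement, traffic):
    """Sum of (data volume x hops) over all communicating task pairs."""
    return sum(vol * hops(placement[s], placement[t])
               for (s, t), vol in traffic.items())


# Hypothetical workload: the FFT stage feeds the beamformer heavily, the logger lightly.
traffic = {("fft", "beamform"): 100, ("fft", "logger"): 1}
near = {"fft": 0b0000, "beamform": 0b0001, "logger": 0b1111}
far = {"fft": 0b0000, "beamform": 0b1111, "logger": 0b0001}

print(worst_case_hops(4))          # 4
print(route(0b0000, 0b1111, 4))    # [0, 1, 3, 7, 15]
print(traffic_cost(near, traffic)) # 104
print(traffic_cost(far, traffic))  # 401
```

With these made-up numbers, placing the two heavy talkers on adjacent nodes cuts the weighted hop cost from 401 to 104, which is the point of putting the applications that share the most data closest together.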

**Figure 1:** 4D hypercube, 4 nodes with 4 processors each. The shortest path between modules is never more than four links. Image courtesy of VITA.

Hooking up 16 processors

N-dimensional architectures keep the number of links on each node small. The number of full-duplex links required per node is also the number of dimensions of the architecture (n). So, for a 4D hypercube (16 processors), each CPU board needs 4 bidirectional links. Compare that to the worst-case 2D architecture, a fully connected mesh of 16 CPUs: the number of links per node is (N-1), where N is the number of nodes, or 15 bidirectional links. The 15 links burn a lot more power, consume huge numbers of connector pins, and take up board space that could be used for memory and other functions.

If you drop back to a 3D cube with 8 CPUs, the same rules apply. The number of bidirectional links for a 3D cube is 3 per node, and the worst-case number of hops is 3. Compare that to 8 CPUs in a mesh: 7 bidirectional links per node. When you start hooking together more than 8 CPUs, you must go to n-dimensional architectures to minimize the board space for the link chips, reduce power consumption, and minimize the number of connector pins.
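The link arithmetic can be tabulated for the cube sizes mentioned in this column. This is a small sketch, not part of any VITA specification. It assumes a hypercube of dimension d has 2^d nodes with d links each, and that "mesh" means every node is wired directly to every other node. The function names are made up for illustration.

```python
def hypercube_nodes(dim):
    """Number of nodes in a hypercube of the given dimension."""
    return 2 ** dim


def hypercube_links_per_node(dim):
    """Each node has one bidirectional link per dimension."""
    return dim


def hypercube_total_links(dim):
    """Every node has `dim` links and each link is shared by two nodes."""
    return dim * 2 ** (dim - 1)


def full_mesh_links_per_node(nodes):
    """In a fully connected mesh, each node links to every other node."""
    return nodes - 1


def full_mesh_total_links(nodes):
    """Number of distinct node pairs."""
    return nodes * (nodes - 1) // 2


for dim in (3, 4, 6):
    nodes = hypercube_nodes(dim)
    print(f"{dim}D hypercube, {nodes} nodes: "
          f"{hypercube_links_per_node(dim)} links/node, "
          f"{hypercube_total_links(dim)} links total; "
          f"full mesh: {full_mesh_links_per_node(nodes)} links/node, "
          f"{full_mesh_total_links(nodes)} links total")
```

For 16 nodes this gives 4 links per node and 32 links in total for the hypercube, against 15 links per node and 120 in total for the fully connected mesh. At 64 nodes (the 6D case) the gap widens to 6 versus 63 links per node.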

Protocol kills

You can build some effective low-latency supercomputing architectures using the Publish-Subscribe (P-S) model. In a P-S architecture, you can use the switches available today for fabrics such as InfiniBand, Ethernet, Serial RapidIO, and PCI Express.

The switches have a function called “broadcast” in which any node can send data to the switch, and that data will be sent to all the other nodes. In other words, the data is “published.” The other nodes can examine the packet header (“snooping”) and take the data they want; a node that does so is a “subscriber.” Using the “broadcast” function avoids the heavy protocol stack overhead commonly found in the fabric chips. Many military applications might already be using the switch-chip broadcast function, implementing the P-S model in their new VPX systems.
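The broadcast-and-snoop pattern can be mimicked in a few lines of software. The sketch below is a toy, in-process simulation and does not use any real fabric or switch API. The `BroadcastSwitch` and `Node` classes, the dictionary used as a packet, and the topic names are all assumptions made for illustration.

```python
class BroadcastSwitch:
    """Toy stand-in for a switch's broadcast function (not a real fabric API)."""

    def __init__(self):
        self.ports = []

    def attach(self, node):
        self.ports.append(node)

    def broadcast(self, sender, packet):
        # Deliver the packet to every attached node except the sender.
        for node in self.ports:
            if node is not sender:
                node.receive(packet)


class Node:
    """A CPU node that can publish, and that snoops headers for its topics."""

    def __init__(self, name, switch, subscriptions=()):
        self.name = name
        self.switch = switch
        self.subscriptions = set(subscriptions)
        self.inbox = []
        switch.attach(self)

    def publish(self, topic, payload):
        # The "header" here is just a topic string; real headers differ by fabric.
        self.switch.broadcast(self, {"topic": topic, "payload": payload})

    def receive(self, packet):
        # Snoop the header and keep the data only if this node subscribes to it.
        if packet["topic"] in self.subscriptions:
            self.inbox.append(packet["payload"])


switch = BroadcastSwitch()
radar = Node("radar", switch)
tracker = Node("tracker", switch, ["radar.returns"])
recorder = Node("recorder", switch, ["radar.returns", "sonar.pings"])
display = Node("display", switch, ["sonar.pings"])

radar.publish("radar.returns", [1, 2, 3])
print(tracker.inbox)   # [[1, 2, 3]]
print(recorder.inbox)  # [[1, 2, 3]]
print(display.inbox)   # []
```

The publisher never addresses a specific receiver or waits for a handshake. Each node decides from the header alone whether to keep the data, which is where the latency saving over a full protocol stack would come from.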

Nothing new under the sun

We have been doing something like the P-S model, on a smaller scale, with VME boards. Several companies have sold “reflective memory” cards for years, which implement an elementary P-S model. These boards are used to build data recorders for military systems and other data-intensive applications that require low-latency connections. Rather than publishing the data to multiple subscribers, the reflective-memory links send the data from one board to another (point-to-point).

So, get up to speed on n-dimensional architectures, hypercubes, and P-S models. The algorithm jockeys are driving us to multiprocessor VPX-based supercomputing systems at a rapid pace.

For more information, contact Ray at [email protected].