Massachusetts Institute of Technology (MIT) and other universities recently announced new algorithms for data-driven applications. Examples include advanced Fast Fourier Transforms (FFTs), swarm algorithms for Unmanned Aerial Vehicles (UAVs) and Unmanned Underwater Vehicles (UUVs), algorithms for extracting "fat tail" data in radar/sonar applications, and new beamforming algorithms for SIGnals INTelligence (SIGINT). These new algorithms need advanced supercomputing architectures such as VPX.
Start with a hypercube
At the May VITA Standards Organization (VSO) meetings, a proposal to add profiles for 4-dimensional and 6-dimensional hypercubes to the VITA 65 (OpenVPX) specification was accepted. When you start hooking together 8 or more CPUs, you must think about computer architectures in more than 3 dimensions. The first 4D architecture is a hypercube, a tesseract. Many of the new algorithm-driven applications could require more than 8 CPUs, so the fourth dimension is a good place to start.
In the early 1980s, David May and Robert Milne of Inmos developed a new microprocessor chip, the Transputer. They hooked 16 processors together, using its slow serial links, into a 4D hypercube architecture. The machine ran great, but the data links were far too slow: each processor was data starved. Even with the multigigabit fabrics available today, processors in a hypercube can still be data starved, depending on the data-sharing patterns between the nodes.
In any n-dimensional architecture, the worst-case number of hops (how many nodes the data must pass through before it arrives at its destination) is the number of dimensions of the architecture (n). In a 4D hypercube, the worst-case number of hops is 4 (Figure 1). To overcome this hop latency, you must place the applications that share the most data on the CPU nodes that are closest to each other (partitioning).
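A quick sketch of why the worst case is n (this is standard hypercube routing, not code from the article): give each node an n-bit ID and link the nodes whose IDs differ in exactly one bit. The minimum hop count between two nodes is then the Hamming distance between their IDs, which peaks at n.

```python
def hops(src: int, dst: int) -> int:
    """Minimum hops between two hypercube nodes = Hamming distance of their IDs."""
    return bin(src ^ dst).count("1")

def worst_case_hops(n: int) -> int:
    """Worst-case hop count over all node pairs in an n-dimensional hypercube."""
    return max(hops(a, b) for a in range(2 ** n) for b in range(2 ** n))

# Opposite corners of a 4D hypercube differ in all 4 bits: 4 hops.
print(hops(0b0000, 0b1111))   # 4
print(worst_case_hops(4))     # 4
```

This also shows why placement matters: applications that share heavy traffic should sit on nodes whose IDs differ in only one bit, so their data moves in a single hop.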

Hooking up 16 processors
The n-dimensional architectures minimize the number of links on each node. The number of full-duplex links required per node is again the number of dimensions of the architecture (n), so for a 4D hypercube (16 processors), each CPU board needs 4 bidirectional links. Compare that to the worst-case 2D architecture, a fully connected mesh of 16 CPUs: the number of links per node is (n-1), where (n) is now the number of nodes, or 15 bidirectional links. Those 15 links burn far more power, consume huge numbers of connector pins, and take up board space that could be used for memory and other functions.
If you drop back to a 3D cube with 8 CPUs, the same rules apply. The number of bidirectional links for a 3D cube is 3 per node, and the worst-case number of hops is 3. Compare that to 8 CPUs in a fully connected mesh: 7 bidirectional links per node. When you start hooking together more than 8 CPUs, you must go to n-dimensional architectures to minimize the board space for the link chips, reduce power consumption, and minimize the number of connector pins.
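The link counts above follow directly from the two topology rules, and a tiny sketch (illustrative only, not from the article) makes the tradeoff easy to tabulate for any node count:

```python
def hypercube_links_per_node(dims: int) -> int:
    """A hypercube node has one link per dimension."""
    return dims

def mesh_links_per_node(nodes: int) -> int:
    """A fully connected mesh node links directly to every other node."""
    return nodes - 1

# Reproduce the article's two cases: 8 CPUs (3D cube) and 16 CPUs (4D hypercube).
for cpus, dims in [(8, 3), (16, 4)]:
    print(f"{cpus} CPUs: hypercube needs {hypercube_links_per_node(dims)} "
          f"links/node, full mesh needs {mesh_links_per_node(cpus)}")
```

The gap widens fast: at 32 CPUs a 5D hypercube still needs only 5 links per node, while a full mesh would need 31.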
Protocol kills
You can build some effective low-latency supercomputing architectures using the Publish-Subscribe (PS) model. In a PS architecture, you can use the switches available today for fabrics such as InfiniBand, Ethernet, Serial RapidIO, and PCI Express.
The switches have a function called "broadcast," in which any node can send data to the switch and the switch forwards that data to all the other nodes; in other words, the data is "published." The other nodes can examine the packet header ("snooping") and take the data they need (each such node is a "subscriber"). Using the broadcast function avoids the heavy protocol-stack overhead commonly found in the fabric chips. Many military applications might already be using the switch-chip broadcast function, implementing the PS model in their new VPX systems.
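A minimal sketch of that broadcast-and-snoop pattern (the class and method names here are illustrative, not part of any fabric API): the switch forwards every packet to all attached nodes, and each node inspects the header and keeps only the topics it has subscribed to.

```python
class Node:
    """A fabric endpoint that snoops broadcast packets for subscribed topics."""
    def __init__(self, name, subscriptions):
        self.name = name
        self.subscriptions = set(subscriptions)
        self.received = []

    def snoop(self, header, payload):
        # Examine the packet header; take the data only if subscribed.
        if header in self.subscriptions:
            self.received.append((header, payload))

class Switch:
    """A switch whose broadcast function sends each packet to every other node."""
    def __init__(self):
        self.nodes = []

    def attach(self, node):
        self.nodes.append(node)

    def broadcast(self, sender, header, payload):
        for node in self.nodes:
            if node is not sender:          # the publisher does not snoop itself
                node.snoop(header, payload)

switch = Switch()
pub = Node("publisher", [])
sub = Node("tracker", ["radar"])
other = Node("recorder", ["sonar"])
for n in (pub, sub, other):
    switch.attach(n)

switch.broadcast(pub, "radar", "target track data")
print(sub.received)    # the radar subscriber took the data
print(other.received)  # the sonar-only node ignored it
```

The point of the pattern is that the publisher sends once, regardless of how many subscribers exist, which is what keeps latency low compared with per-destination protocol transactions.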
Nothing new under the sun
We have been doing something like this PS model, on a smaller scale, with VME boards for years. Several companies have sold "reflective memory" cards that implement an elementary PS model. These boards are used to build data recorders for military systems and other data-intensive applications that require low-latency connections. Rather than publishing the data to multiple subscribers, however, the reflective-memory links send the data from one board to another (point-to-point).
So, get up to speed on n-dimensional architectures, hypercubes, and PS models. The algorithm jockeys are driving us to multiprocessor VPX-based supercomputing systems at a rapid pace.

For more information, contact Ray at [email protected].