In the most recent episode, we talked about building embedded supercomputers with VPX using meshes, switches, and 4D/6D hypercubes. Now, let’s take a look at Torus architectures. A Torus is built from rings set at right angles to one another. It looks like a square (x, y) or a cube (x, y, z), with the nodes along each dimension connected by rings. The smallest Torus you can build is a square (4 nodes). Just connect each row of nodes in the x dimension with a ring, and each column of nodes in the y dimension with another ring. What you get are 2 rings in the x dimension crossing 2 rings in the y dimension, with a node at every crossing.
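To make the "rings at right angles" idea concrete, here is a minimal sketch of how a node's neighbors fall out of simple wraparound arithmetic. The function names and the coordinate-tuple representation are assumptions made for illustration, not part of any VPX standard or vendor API.

```python
from itertools import product


def torus_nodes(dims):
    """List every node coordinate for a torus with the given ring sizes."""
    return list(product(*(range(size) for size in dims)))


def torus_neighbors(node, dims):
    """Return the nodes one hop away from `node`.

    `dims` gives the ring size in each dimension, e.g. (2, 2) for the
    4-node square or (2, 2, 2) for the 8-node cube.
    """
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            # The modulo is what closes the line of nodes into a ring.
            coord[axis] = (coord[axis] + step) % size
            neighbors.add(tuple(coord))
    neighbors.discard(node)  # only matters for a degenerate ring of size 1
    return neighbors
```

On the 4-node square, `torus_neighbors((0, 0), (2, 2))` gives the two nodes `(1, 0)` and `(0, 1)`. On a 4 x 4 Torus the same corner node reaches `(1, 0)`, `(3, 0)`, `(0, 1)`, and `(0, 3)`, which shows the wraparound from one edge of the square to the other.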
If we move to 3D, we basically have a cube (8 nodes). As you can see in Figure 1, rings connect the nodes in the x, y, and z dimensions, and every node sits on one ring in each of the 3 dimensions. Each node has an input and an output link to each ring, or 6 links per node. A traditional 3-dimensional cube architecture also has 6 links (3 bidirectional links per node), but as you scale a hypercube to the next dimension, you must add more links to every node. A Torus grows by making its rings longer instead, so each node still needs only 6 links no matter how many nodes you add. That is why a Torus architecture uses fewer data links per node than a hypercube as it scales. Even with fewer links, a Torus is very survivable. If a ring breaks, you still have multiple paths to get the data to its destination through the remaining operational rings. The new path might add some latency, but the machine will still run even with some failed interconnects.
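Two claims in this paragraph are easy to check with a short sketch: the links-per-node arithmetic, and the detour around a broken link. The function names are hypothetical. The sketch counts one bidirectional link as 2 one-way links (matching the "6 links" figure above) and treats a failed link as dead in both directions, which is a simplifying assumption.

```python
from collections import deque


def torus_links_per_node(dimensions):
    """One input and one output link per ring, one ring per dimension."""
    return 2 * dimensions


def hypercube_links_per_node(node_count):
    """One bidirectional link per dimension, counted as 2 one-way links."""
    if node_count < 2 or node_count & (node_count - 1):
        raise ValueError("hypercube size must be a power of two")
    dimension = node_count.bit_length() - 1  # exact log2 of a power of two
    return 2 * dimension


def hops(src, dst, dims, failed=frozenset()):
    """Shortest hop count from src to dst on a torus, avoiding failed links.

    `failed` holds frozenset({node_a, node_b}) pairs.
    Returns None if dst cannot be reached.
    """
    queue = deque([(src, 0)])
    seen = {src}
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for axis, size in enumerate(dims):
            for step in (-1, 1):
                nxt = list(node)
                nxt[axis] = (nxt[axis] + step) % size  # ring wraparound
                nxt = tuple(nxt)
                if nxt in seen or frozenset((node, nxt)) in failed:
                    continue
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

At 8 nodes both functions return 6. At 4,096 nodes a 16 x 16 x 16 Torus still needs 6 links per node, while `hypercube_links_per_node(4096)` returns 24. For the survivability claim, on a 4 x 4 x 4 Torus the hop from `(0, 0, 0)` to `(1, 0, 0)` takes 1 link normally and 3 links if that direct link is marked as failed: more latency, but the data still arrives.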
While small Torus machines can be built with VPX (4, 8, and 16 nodes), the architecture is primarily used to build monster supercomputers with thousands of nodes. A Torus is incredibly scalable: just expand the diagram here in all 3 dimensions (x, y, and z) and you can see that the cube grows rapidly in size. The Cray XT3, IBM Blue Gene/L, SeaMicro/AMD, and other massively parallel machines use Torus architectures. But there are serious computer science problems associated with a Torus: livelocks, deadlocks, race conditions, and infinite loops where the data never gets to its destination are just a few examples. The makers of these monster machines place routing algorithms in hardware on each node to resolve these problems and keep the machine from locking up.
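As an illustration of what a per-node routing rule can look like, here is a minimal sketch of dimension-order routing, a well-known textbook technique: finish all the hops in x, then y, then z, always taking the shorter way around each ring. This is not the algorithm of any machine named above, and the function name is an assumption. Because the path is fixed and finite, a packet cannot wander forever, which addresses the livelock and infinite-loop cases. Deadlock is a separate matter: on a Torus the wraparound links generally call for additional hardware measures such as virtual channels, which this sketch does not model.

```python
def dimension_order_route(src, dst, dims):
    """Return the list of nodes visited from src to dst, inclusive.

    Dimensions are corrected one at a time, in index order. Within each
    ring the packet takes whichever direction needs fewer hops.
    """
    path = [tuple(src)]
    current = list(src)
    for axis, size in enumerate(dims):
        forward = (dst[axis] - current[axis]) % size  # hops in the +1 direction
        backward = size - forward                     # hops in the -1 direction
        step = 1 if forward <= backward else -1
        for _ in range(min(forward, backward)):
            current[axis] = (current[axis] + step) % size
            path.append(tuple(current))
    return path
```

On a 4 x 4 x 4 Torus, routing from `(0, 0, 0)` to `(3, 1, 0)` visits `(0, 0, 0)`, `(3, 0, 0)`, and `(3, 1, 0)`: one wraparound hop backward in x, then one hop forward in y.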
The folks at Mellanox put their InfiniBand switches at each node and hang multiple CPUs or storage devices off each node switch. This resolves some of the software problems and creates a hybrid switch/mesh/Torus architecture that scales even faster and more efficiently than a normal Torus consisting of CPUs connected to right-angle rings. There are many variations on a theme using the basic Torus as the foundation.
So, if you need to hook up 2 or 3 CPUs, a mesh works nicely. For 4 to 6 CPUs, a switch works best and is cost-effective. For 8 to 16 CPUs, you can use a Torus cube or a hypercube, depending on the application demands and the budget. What we see today are VPX machines built mostly with small meshes and switches. But as we move into High Performance Embedded Supercomputing (HPES) applications, you’ll see more esoteric architectures like hypercubes and Toruses.