The key performance metric for military DSP systems is the speed of floating-point arithmetic, typically expressed in GFLOPS (billions of floating-point operations per second). In recent history, these DSP systems were commonly built using Texas Instruments 320C40 and 320C6701k and Analog Devices SHARC dedicated DSP processors, which were followed by several generations of PowerPC/Power Architecture processors with AltiVec. All of these processors offered good floating-point performance per watt, and all were available from vendors with a track record of supporting military embedded customers. Now, with the introduction of Intel’s 2nd Generation Core i7-2715QE quad-core processor, the design of x86-based embedded military DSP systems and high-performance SBCs takes a significant leap forward.
Intel refers to its product introduction cadence as the “Tick-Tock” model. A “tick” is when Intel delivers a new silicon process technology with increased transistor density, bringing enhanced performance and energy efficiency to a smaller version of an existing microarchitecture. The 2nd Generation Intel Core i7 is a “tock,” which is when an entirely new microarchitecture is introduced on an existing semiconductor process technology. Using the 32 nm process introduced with the Westmere generation, the 2nd Generation Core i7 (previously code-named “Sandy Bridge”) features many architectural improvements (especially in the cache subsystem) that lead to improved performance per clock cycle. It is the nature of microprocessor design that revised architectures typically provide incremental performance improvements. However, the 2nd Generation Core i7 delivers a major leap forward in signal processing capability, thanks to the new 256-bit wide Intel Advanced Vector Extensions (AVX) floating-point instruction set, which supersedes the earlier 128-bit Streaming SIMD Extensions (SSE) instructions.
While the new Core i7 brings many advantages for DSP system designs, SBCs used in conjunction with Core i7-based DSP engines also benefit. SBCs can now take advantage of the first-ever support for Serial RapidIO on Intel Architecture, the result of an upcoming PCIe2-to-Serial RapidIO2 bridge chip from IDT that will provide a common communications path and improve interoperability in a complete system. The new Intel processor also supports 16 lanes of Gen2 PCIe for full-bandwidth communications across high-performance processor cards. Intel’s Hyper-Threading Technology runs two execution threads on each core, enabling greater utilization of the execution units and providing improved power efficiency. Published reports show performance increases of 7 to 34 percent due to Hyper-Threading alone.
The AVX 256-bit difference
Prior to the introduction of Intel’s new 256-bit AVX, developers of military DSP systems typically turned to 128-bit AltiVec-enabled CPUs for vectorized signal processing functions. In the past few years, development of new AltiVec-enabled processors slowed significantly, leaving DSP system developers with limited options. In the meantime, Intel continued to invest in its own high-performance vectorized processing solution with continual enhancements to Intel Streaming SIMD Extensions, the 128-bit wide predecessor to AVX, capable of operating on four 32-bit floating-point values simultaneously. Intel SSE also featured support for double-precision floating point, a feature not available in AltiVec. In Intel’s earlier multicore processors, each core was provided with its own SSE unit, so raw floating-point performance scaled with the number of cores. In the new Core i7, Intel has upgraded SSE with AVX, doubling the vector width to 256 bits.
This doubling of vector width is a significant milestone in DSP system design. DSP algorithms used in critical military applications such as radar, SIGINT, and image processing depend on the precision of floating-point numbers combined with processing speed. With AVX, the new Core i7 doubles the peak floating-point performance available with SSE. In actual FFT kernels, AVX has been benchmarked at up to 1.8x the speed of SSE (Figure 1). The AVX instruction set was also designed to support future extensions, which hints at wider implementations to come.
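The lane arithmetic behind these figures can be sketched in a few lines of Python. This is illustrative arithmetic only, not SIMD code: the helper names and the 1,024-element vector length are assumptions chosen for the example, while the register widths and the 1.8x FFT figure come from the text above.

```python
# Illustrative arithmetic only; this models lane counts, not real SIMD hardware.
FLOAT32_BITS = 32  # single-precision element size


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits


def vector_ops_needed(n_elements: int, register_bits: int, element_bits: int) -> int:
    """Vector instructions needed for one element-wise pass (ceiling division)."""
    per_op = lanes(register_bits, element_bits)
    return -(-n_elements // per_op)


sse_lanes = lanes(128, FLOAT32_BITS)  # 4 single-precision values per SSE register
avx_lanes = lanes(256, FLOAT32_BITS)  # 8 single-precision values per AVX register

# Hypothetical workload: one element-wise pass over 1,024 floats.
sse_ops = vector_ops_needed(1024, 128, FLOAT32_BITS)  # 256 instructions
avx_ops = vector_ops_needed(1024, 256, FLOAT32_BITS)  # 128 instructions

# The benchmarked FFT speedup as a fraction of the theoretical 2x.
peak_speedup = avx_lanes / sse_lanes
measured_fft_speedup = 1.8
fraction_of_peak = measured_fft_speedup / peak_speedup

print(f"SSE lanes: {sse_lanes}, AVX lanes: {avx_lanes}")
print(f"Measured FFT speedup is {fraction_of_peak:.0%} of the theoretical peak")
```

On these assumptions, the 1.8x FFT result corresponds to about 90 percent of the theoretical doubling; the remainder is plausibly lost to memory traffic and data shuffling, which do not scale with register width.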
Serial RapidIO onboard
Serial RapidIO is the preferred fabric for the processor-to-processor communications required by demanding military DSP systems because of its reliable packet transmission and its ability to deliver low, predictable latencies. These benefits of RapidIO messaging are ideal for the large peer-to-peer clusters of processors typically used in complex signal processing applications. With the Intel 2nd Generation Core i7, Serial RapidIO is supported on Intel architecture-based OpenVPX/VITA 65 embedded boards for the first time, with an easy, cost-effective interconnect provided by IDT’s upcoming PCI Express (PCIe) Gen2-to-Serial RapidIO protocol conversion bridge chip.
Before this newest generation of Core i7, the lack of Serial RapidIO support on Intel platforms severely limited the viability of Intel architecture in DSP multiprocessor system designs. Fabric options for Intel have included InfiniBand and Gigabit/10 Gigabit Ethernet, which are not embraced in military applications because of their non-industrial temperature silicon and relatively high power consumption. For SBCs, where the requirement is typically a single processor communicating with I/O, these fabrics have been sufficient, but designers of would-be Intel-based military DSP systems were deprived of the option to build around Serial RapidIO, the multiprocessor fabric of choice.
IDT’s PCIe-to-Serial RapidIO bridge and new Gen2 Serial RapidIO switches will enable system designers to build Intel architecture-based processing engines with much more fabric bandwidth than that offered by any other currently available technology. The upcoming IDT bridge product supports 5 Gbps interfaces on both its PCIe and Serial RapidIO ports. Because the bridge is small and consumes little power, system designers can add bandwidth by using multiple PCIe2-to-Serial RapidIO2 bridges connected directly to the processors or via a PCIe switch. This performance scales at the system level with the new Gen2 Serial RapidIO, which will deliver double the backplane bandwidth provided by the already fast 3.125 Gbps Gen1 Serial RapidIO technology. A 19" rack OpenVPX processing system will be able to deploy 1.2 terabits per second of fabric bandwidth. The Intel/Serial RapidIO combination is also suited for SWaP-constrained systems, as designers can maximize the power available for actual computing knowing that Serial RapidIO fabric technology provides the best bandwidth per watt.
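The link-rate figures above can be put side by side with a short Python sketch. The lane rates come from the text; the 8b/10b line coding (8 payload bits per 10 transmitted bits, used by both PCIe Gen2 and Serial RapidIO at these rates) and the four-lane port width are assumptions added for illustration, and the function names are hypothetical.

```python
# Illustrative link-rate arithmetic; lane rates are taken from the article.
GEN1_SRIO_LANE_GBAUD = 3.125   # Gen1 Serial RapidIO lane rate
BRIDGE_LANE_GBAUD = 5.0        # IDT bridge lane rate (PCIe and Serial RapidIO sides)

# Assumption: 8b/10b line coding carries 8 payload bits per 10 line bits.
PAYLOAD_BITS, LINE_BITS = 8, 10


def payload_gbps(lane_gbaud: float, lanes: int = 1) -> float:
    """Data rate after 8b/10b coding overhead, before packet overhead."""
    return lane_gbaud * lanes * PAYLOAD_BITS / LINE_BITS


# "Double the backplane bandwidth" of Gen1 implies this Gen2 lane rate.
gen2_srio_lane_gbaud = 2 * GEN1_SRIO_LANE_GBAUD          # 6.25

gen1_lane_payload = payload_gbps(GEN1_SRIO_LANE_GBAUD)   # 2.5 Gbps
gen2_lane_payload = payload_gbps(gen2_srio_lane_gbaud)   # 5.0 Gbps

# Assumption: a x4 port on the bridge, a common but not guaranteed width.
bridge_x4_payload = payload_gbps(BRIDGE_LANE_GBAUD, lanes=4)  # 16.0 Gbps

print(f"Gen1 lane payload: {gen1_lane_payload} Gbps")
print(f"Gen2 lane payload: {gen2_lane_payload} Gbps")
print(f"Bridge x4 payload: {bridge_x4_payload} Gbps per direction")
```

These figures exclude packet headers and flow-control overhead, so sustained application throughput would be somewhat lower.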
Serial RapidIO bridges implemented in FPGAs don’t support high-performance messaging, a feature that maps directly to higher-level software APIs such as MPI. IDT’s new bridge product will support the two main Serial RapidIO transfer modes: messaging and memory-mapped transfers. Another benefit of the IDT silicon is the inclusion of DMA engines that speed data movement while offloading the host processor. Intel processors typically don’t have DMA engines on-chip, but depend instead on the peripheral chip to move data. Without a DMA engine, moving data can demand a large share of the host processor’s attention, with the result that a multicore processor might have one of its cores (and the associated power) largely consumed by moving data in software.
Another advantage of Serial RapidIO for SWaP-constrained military systems is its ability to support both distributed switch and centralized switch architectures. Distributed switch systems (an example is the VITA 65 BPK6-CEN05-11.2.5-n backplane profile) can make use of the Serial RapidIO switch local to each card and thus avoid the need for a separate switch card. For example, if the system were using a ½ ATR Short enclosure (four 1" slots), this capability saves 25 percent of the space and a considerable amount of power. For large systems, centralized switch architectures are often preferred, and Serial RapidIO is equally adept at this approach.
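The 25 percent figure follows directly from the slot count, as this small Python sketch shows. The function name is hypothetical, and the sketch assumes a centralized design would dedicate exactly one slot to the switch card.

```python
def slot_savings(total_slots: int, switch_slots_avoided: int = 1) -> float:
    """Fraction of chassis slots freed by not fitting a dedicated switch card."""
    return switch_slots_avoided / total_slots


# The four-slot 1/2 ATR Short enclosure from the example above.
half_atr_savings = slot_savings(total_slots=4)

print(f"Slots freed in a four-slot chassis: {half_atr_savings:.0%}")
```

The same calculation shows why the benefit shrinks in larger chassis, where one switch slot is a smaller share of the total and a centralized switch becomes easier to justify.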
An example of a high-performance DSP engine designed to take full advantage of Intel’s latest Core i7 is the new CHAMP-AV8 from Curtiss-Wright Controls Embedded Computing (Figure 2). The CHAMP-AV8 is a rugged, high-performance OpenVPX DSP engine based on the Intel Core i7-2715QE. This dual Core i7 DSP engine is rated at up to 269 GFLOPS. It also supports the IDT Gen2 PCIe-to-Serial RapidIO bridge product, effectively tripling the bandwidth of first-generation VPX products with up to 240 Gbps of fabric performance.
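As a rough cross-check, the 269 GFLOPS rating is consistent with a simple peak-throughput estimate. The sketch below rests on assumptions that are not stated in the article: a 2.1 GHz base clock for the Core i7-2715QE, and one 8-wide single-precision AVX add plus one 8-wide AVX multiply issued per core per cycle. The variable names are hypothetical.

```python
# Peak single-precision estimate for a dual-processor board.
# Assumptions (not from the article): 2.1 GHz base clock, and per core per
# cycle one 8-wide AVX add plus one 8-wide AVX multiply.
PROCESSORS = 2            # dual Core i7 (from the article)
CORES_PER_PROCESSOR = 4   # quad-core (from the article)
CLOCK_GHZ = 2.1           # assumed base clock
AVX_LANES_FP32 = 8        # 256 bits / 32 bits per float
VECTOR_OPS_PER_CYCLE = 2  # assumed: one add and one multiply

flops_per_cycle_per_core = AVX_LANES_FP32 * VECTOR_OPS_PER_CYCLE  # 16
peak_gflops = (
    PROCESSORS * CORES_PER_PROCESSOR * CLOCK_GHZ * flops_per_cycle_per_core
)

print(f"Estimated peak: {peak_gflops:.1f} GFLOPS")
```

Under these assumptions the estimate comes to roughly 268.8 GFLOPS, which rounds to the quoted figure. This is a theoretical peak; sustained performance on real kernels will be lower.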
Curtiss-Wright Controls Embedded Computing
703-779-7800
www.cwcembedded.com