The high-performance embedded computing landscape is a rapidly changing one. As recently as two years ago, the processor of choice for compute-intensive signal- and image-processing applications was PowerPC with AltiVec. Today the choices are many and diverse: PowerPC with or without AltiVec, Intel Architecture with SSE or AVX, GPGPU with CUDA or OpenCL, Tilera devices, ARM devices with SIMD coupled with DSPs, and more. These processors invariably allow multithreading, offering the promise of superior performance.
For the application developer, the way forward is clear – use standards-based libraries such as VSIPL and VSIPL++ or higher-level abstractions such as model-based design that exploit such libraries under the hood by linking to optimized runtime libraries.
For COTS board developers, the challenge is how to optimize libraries for all these disparate architectures without resorting to coding methods that would be economically unviable. The problem is that multiple library APIs must be optimized for multiple processor architectures, and must run under multiple revisions of multiple operating systems. The implication of this is that code generation becomes a huge task, and quality testing becomes even more challenging.
The impact on integrators is that the introduction of libraries might lag the introduction of a new processor by months to years, and the quality of the implementation might be suspect as hand coding many hundreds of functions can take several man-years. Here, we explore some of the techniques and technologies that can be exploited to expedite the generation of optimized libraries, yielding better performance.
Compilers play a key role
In the past, it was the preserve of dedicated and costly compilers to take standard C code and vectorize it. This is the process of parsing a code module, identifying loops that are candidates for transformation into vector form, performing dependency analysis to verify that the transformation is guaranteed to produce correct results, and finally generating substitute code that calls vector library functions or inserts vector language macros. For instance, the loop depicted in Figure 1 can, in the right circumstances, be replaced by a library call.
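Figure 1 is not reproduced here, but the kind of loop in question is a simple element-wise operation; the sketch below is an assumption of that form, with vendor_vadd() standing in for a hypothetical optimized vector-add routine from the vendor's library.

    /* Candidate loop: element-wise addition of two vectors */
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }

    /* Equivalent call to a hypothetical optimized library routine */
    vendor_vadd(a, b, c, n);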
These days, if such a library function has been created with code optimized for a SIMD architecture, this simple transformation can yield order-of-magnitude increases in performance with no change to the original code base. Other compilers convert the loop code directly into C primitives (intrinsics) that map to the SIMD instruction set in use (Figure 2).
Note that the SIMD code has been “strip-mined.” SIMD engines have a fixed width that allows a certain number of data elements to be processed in parallel. In this example, the vector pipeline is 128 bits wide, or four 32-bit single-precision floating-point values. As the vector width is four elements, a loop is needed to iteratively process the complete data set. (Code that would be required for alignment and boundary conditions is excluded for clarity.) This looping to fit an arbitrarily sized vector to the fixed pipeline width is known as “strip-mining.”
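As an illustration of the strip-mined form (the actual Figure 2 code is not shown here), the sketch below uses the x86 SSE intrinsics; it assumes n is a multiple of four and that the arrays are 16-byte aligned, since alignment and remainder handling are excluded as noted above.

    #include <xmmintrin.h>  /* SSE intrinsics: 128-bit registers, four 32-bit floats */

    void vadd_sse(const float *a, const float *b, float *c, int n)
    {
        /* Strip-mine: process four floats per iteration to match the
           128-bit SIMD pipeline width. Assumes n % 4 == 0 and 16-byte
           aligned pointers; boundary and alignment code omitted. */
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);
            __m128 vb = _mm_load_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);
            _mm_store_ps(&c[i], vc);
        }
    }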
Additionally, one of the most flexible features of the C language is the use of pointers to access memory. It is this very flexibility that is one of the biggest impediments to automatic vectorization. Consider rewriting the loop from Figure 2 as a subroutine, as shown in Figure 3.
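The subroutine form in Figure 3 is presumably along these lines, with the arrays now arriving through pointer arguments whose relationship the compiler cannot see:

    void vadd(const float *a, const float *b, float *c, int n)
    {
        /* To the compiler, a, b and c are just pointers that might alias */
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }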
As the compiler has no way to verify that the arrays a, b, and c do not overlap at all, it is not possible to ascertain that this loop is safe to vectorize, as there might be hidden data dependencies. A compiler is likely to reject this as a candidate for transformation.
All is not lost, however. Firstly, if the programmer can categorically state that overlaps will never occur, the programmer can tell the compiler this extra detail. The usual mechanism is via a compiler pragma, which is a directive to the compiler. For example, for some compilers, inserting #pragma ivdep in the Figure 3 subroutine will allow it to be vectorized. Secondly, some compilers will produce code that does a runtime check of the vectors to see if they overlap, and will branch accordingly. If the vectors overlap, it will execute scalar code; if not, it will execute vector SIMD code. Executing scalar code nullifies the benefit of vectorization – so no performance increase is achieved.
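As a sketch of the first approach, the programmer's no-overlap guarantee can be expressed with a directive placed immediately before the loop; the exact spelling is compiler-specific, and the C99 restrict qualifier is a portable alternative.

    void vadd(const float *a, const float *b, float *c, int n)
    {
        /* Tells the compiler to ignore potential dependencies in the loop
           below. The spelling varies by compiler (e.g. #pragma GCC ivdep
           for gcc). Declaring the parameters with the C99 restrict
           qualifier makes the same no-overlap promise portably. */
    #pragma ivdep
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }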
In recent years, automatic vectorization has become available in commonplace compilers such as gcc and Intel C. While this will produce reasonably performing code, there are still some tricks of the trade that an experienced engineer can apply to increase performance. Sometimes compilers insert code to ensure correct execution in all circumstances. An engineer who knows the boundaries of operation that will actually be encountered can sometimes remove some of this redundant code for more efficient execution.
In addition, many processor architectures include hardware to pre-fetch data from memory to the caches, usually under the control of hint operations that can be inserted in the code. The use of these hints is somewhat of a black art, as injudicious use can hurt performance. But an experienced coder knows how best to employ them to schedule the pre-fetches far enough ahead that the data is in cache when needed but not too far ahead as to incur cast-out or cache thrashing. The former is where the data that is needed was placed in cache ahead of time, but intermediate execution demanded enough other data that it has been bounced back out to main memory by the cache management algorithms before it can be consumed. This can lead to the latter, cache thrashing, where more time is spent loading and reloading the data to the caches than consuming it.
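As an illustration of how such hints appear in source code, gcc and similar compilers expose them through the __builtin_prefetch() intrinsic; the prefetch distance below (PF_AHEAD) is purely illustrative and would in practice be tuned for the particular architecture and memory system.

    #define PF_AHEAD 64  /* illustrative prefetch distance, in elements */

    void vadd_prefetch(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++) {
            /* Hint the hardware to start fetching data needed later. Too
               small a distance and the data is not in cache in time; too
               large and it may be cast out again before it is consumed.
               Production code would issue one hint per cache line rather
               than per element. */
            __builtin_prefetch(&a[i + PF_AHEAD], 0, 1);
            __builtin_prefetch(&b[i + PF_AHEAD], 0, 1);
            c[i] = a[i] + b[i];
        }
    }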
Additionally, for many years, multithreading (the execution of multiple units of processing) has been available via time-slicing on a single processor. Most current-generation processors support multithreading on multiple cores, virtual cores, or a combination of both, giving true concurrent execution. One common way to exploit this architecture is OpenMP, an API comprising compiler directives, runtime library routines, and environment variables. Many compilers, such as gcc and Intel C, have built-in support. As a simplistic example, the vadd routine depicted in Figure 3 could be multithreaded as shown in Figure 4, with the simple addition of a compiler pragma that allows the loop to be run across multiple threads (the thread count being controlled by a function call or environment variable).
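Figure 4 is not reproduced here, but the OpenMP form of the routine would look something like the sketch below; the thread count is set at run time via omp_set_num_threads() or the OMP_NUM_THREADS environment variable.

    #include <omp.h>

    void vadd_omp(const float *a, const float *b, float *c, int n)
    {
        /* Split the iteration space across the available threads */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }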
Some vendors choose to use POSIX threads (pthreads) to write multithreaded math code. Sometimes this results from a lack of OpenMP support in some operating systems, and sometimes it is driven by a desire for finer-grained control of the algorithm decomposition.
Even when using OpenMP, programmers must be aware of the traditional issues of data locality and so on. It might be necessary to exert control over core affinity, or how threads are allocated and reallocated to physical or virtual cores. Whether shared resources like L2 cache are available between certain threads can have a large, although difficult to quantify, impact on performance.
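As an illustration of that finer-grained control, the Linux-specific sketch below decomposes the same vector add across explicit pthreads and pins each worker to a core through the GNU pthread_attr_setaffinity_np() extension; the chunking and core numbering are assumptions for illustration only. (OpenMP implementations typically expose similar affinity control through environment variables such as OMP_PROC_BIND and OMP_PLACES.)

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define NTHREADS 4  /* illustrative; would normally match the core count */

    struct chunk { const float *a, *b; float *c; int start, end; };

    static void *vadd_worker(void *arg)
    {
        struct chunk *ck = (struct chunk *)arg;
        for (int i = ck->start; i < ck->end; i++)
            ck->c[i] = ck->a[i] + ck->b[i];
        return NULL;
    }

    void vadd_pthreads(const float *a, const float *b, float *c, int n)
    {
        pthread_t tid[NTHREADS];
        struct chunk ck[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {
            pthread_attr_t attr;
            cpu_set_t set;

            ck[t] = (struct chunk){ a, b, c, t * n / NTHREADS,
                                    (t + 1) * n / NTHREADS };

            /* Pin each worker to one core so its working set stays in
               that core's cache (core numbering here is illustrative) */
            pthread_attr_init(&attr);
            CPU_ZERO(&set);
            CPU_SET(t, &set);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

            pthread_create(&tid[t], &attr, vadd_worker, &ck[t]);
            pthread_attr_destroy(&attr);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
    }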
Sophisticated tools help maximize performance
For more esoteric functions – as compared to the loop adds, subtracts, or multiplies generated by compilers – a more sophisticated approach might be required when optimizing parallel library implementations to improve performance. For example, the SPIRAL project (http://spiral.net) expresses linear transforms as Kronecker products, which are then converted by a program generation system to vector expressions that can be implemented in a variety of architectures. Other commercial vendors who specialize in math libraries have created similar tools that allow algorithms to be expressed in an abstracted form that is then typically processed to generate C or assembly code primitives that map to the instruction set of a particular SIMD platform.
Model-based design is increasingly being used to generate deployed code. Previously, the output of these tools was sufficient to validate the algorithm but not suitable for real-time applications, and often required a phase of manual rewriting to be usable. As the code generation capability of these tools has evolved, more of the resultant code is good enough to use as-is. MATLAB, for instance, can now target GPGPUs as well as x86 processors with the MKL and IPP libraries.
Vendors continue to expand libraries
Processor vendors are continually increasing the availability of math libraries that have been optimized by their internal experts with intimate knowledge of how to coax the maximum possible performance from their architecture. Intel has its Math Kernel Library (MKL), which includes BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, vector math, and more. NVIDIA’s CUDA has support for CUBLAS, CUFFT, CUSP (sparse linear algebra), performance primitives, and more. The performance is hard to beat, but one drawback might be limited operating system support: These libraries tend to stick to the mainstream of Windows and Linux. If a hard real-time operating system is required, the options are more limited; at present, technologies such as CUDA, Intel MKL, and IPP are not available for real-time operating systems.
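As an illustration of how little application code such libraries demand, the running vector-add example written against MKL's vector math component reduces to a single call; this sketch assumes MKL's standard vsAdd routine and header.

    #include "mkl.h"

    void vadd_mkl(const float *a, const float *b, float *c, MKL_INT n)
    {
        /* Element-wise single-precision add, vectorized (and potentially
           threaded) internally by the library */
        vsAdd(n, a, b, c);
    }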
Libraries allow focus on the application
There are many tools available today to assist in writing high-performance math libraries. Vendors like GE Intelligent Platforms that supply math libraries with a consistent API across a multiplicity of processors and operating systems are increasingly exploiting such tools to assist in supporting all these diverse platforms. They still, however, apply deep-rooted knowledge of architectures to coax the ultimate performance possible. By leveraging standard math libraries, such as GE’s AXISLib-VSIPL and tools like AXISView, engineers can rapidly write code that optimally exploits the processor of choice and can view and tune the performance. Parameters such as cache hit/miss ratios and CPU loading can be viewed as time graphs that illustrate how well the algorithm is tuned to the system. Simple reconfiguration of task and thread placement and affinities can be executed within the tools to further fine-tune the performance.
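To make the portability point concrete, application code written against the standard VSIPL C API looks roughly like the sketch below; the same source then recompiles against any conforming implementation (AXISLib-VSIPL being one example) and picks up whatever SIMD and threading optimizations that implementation provides.

    #include <vsip.h>

    int main(void)
    {
        vsip_init((void *)0);

        /* Create three length-1024 single-precision vector views */
        vsip_vview_f *a = vsip_vcreate_f(1024, VSIP_MEM_NONE);
        vsip_vview_f *b = vsip_vcreate_f(1024, VSIP_MEM_NONE);
        vsip_vview_f *c = vsip_vcreate_f(1024, VSIP_MEM_NONE);

        vsip_vfill_f(1.0f, a);
        vsip_vfill_f(2.0f, b);

        /* The optimized implementation hides the SIMD/threading details */
        vsip_vadd_f(a, b, c);

        vsip_valldestroy_f(c);
        vsip_valldestroy_f(b);
        vsip_valldestroy_f(a);
        vsip_finalize((void *)0);
        return 0;
    }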
The good news for application developers is that they can concentrate on the application, with the knowledge that the math libraries they are using will yield the best possible performance from the hardware while maintaining code portability across multiple processor generations and architectures.
GE Intelligent Platforms
www.ge-ip.com