Part II: Vectorization – Programming Model and Specifics
March 02, 2022
Wondering how to make effective use of the vector processing units in next-generation safety-certified MCUs to accelerate your compute-intensive applications?
Let’s start with an introduction to the programming model and the specifics of programming vector processing units.
In our previous post, we introduced vector processing and argued that data-intensive applications based on linear algebra can be sped up by factors greater than ten. This speedup is achieved by applying the same operation to a fixed number of data elements in parallel.
For example, a vector unit might be capable of performing 16 32-bit integer or floating-point additions in one go. In contrast to multicore programming, where work is distributed among multiple cores, vector-processing-enabled cores execute vector calculations as part of their normal program flow, so vector instructions can be freely mixed with scalar instructions. Vector instructions usually retrieve their operands from dedicated vector registers, to which the result is also written. These registers are quite large because they must accommodate all the operands processed in parallel: 16 x 32 bits, i.e., 512 bits, in our example. To improve data locality, some architectures additionally feature high-bandwidth local memory. Managing the mapping of data to registers and local memory, which is usually done manually or by software, is crucial for achieving optimal performance after vectorization.
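To make this concrete, here is a minimal sketch using the GCC/Clang vector extensions. The `v16i32` type name is ours for illustration, and we assume a compiler and target where the 512-bit type maps onto native vector registers; on other targets the compiler falls back to scalar code.

```c
#include <stdint.h>
#include <stdio.h>

/* A 512-bit vector of 16 32-bit integers, matching the example above.
 * vector_size is given in bytes: 64 bytes = 16 x 32 bits = 512 bits. */
typedef int32_t v16i32 __attribute__((vector_size(64)));

int main(void) {
    v16i32 a = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
    v16i32 b = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};

    v16i32 sum = a + b;      /* 16 additions performed in one go */
    int32_t first = sum[0];  /* scalar code mixes freely with vector code */

    printf("sum[0] = %d\n", first);  /* prints 15 */
    return 0;
}
```

Note how the vector addition and the scalar extraction sit side by side in the same function, exactly the mixing of vector and scalar instructions described above.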
Support for vector processing is implemented differently across competing processor architectures. Implementations differ, for example, in the supported operations, the degree of parallelism, and the memory architecture. This explains the absence of a universal, portable method for programming vector units.
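As a small illustration of this divergence, the following sketch performs the same four-element float addition with x86 SSE intrinsics and with Arm NEON intrinsics. The intrinsics themselves are real, but the wrapper name `add4` and the `vec4f` alias are ours; even this one-line operation needs different headers, types, and instruction names on each architecture.

```c
/* The same four-float addition, written once per architecture. */
#if defined(__SSE__)
  #include <xmmintrin.h>
  typedef __m128 vec4f;
  static inline vec4f add4(vec4f a, vec4f b) {
      return _mm_add_ps(a, b);   /* x86 SSE: four single-precision adds */
  }
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  typedef float32x4_t vec4f;
  static inline vec4f add4(vec4f a, vec4f b) {
      return vaddq_f32(a, b);    /* Arm NEON: the same four adds */
  }
#else
  #error "No supported vector extension on this target"
#endif
```

Every new target architecture adds another branch to such code, which is precisely why portable vector programming remains an open problem.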
While there are compilers that support auto-vectorization, they often produce inefficient code, because automatically mapping complex program structures such as loops onto vector instructions is quite a challenge. Writing vector code in assembly or with compiler intrinsics is therefore still common practice. While this potentially yields the best performance, it is time-consuming, error-prone, and requires deep knowledge of the hardware to get right. Another common approach is to use library functions that provide an optimized implementation of pre-defined algorithms for the vector unit. This works well for common algorithms but does not help with specialized problems. There are also efforts to support vectorization through APIs such as OpenMP and OpenCL; however, compilers supporting them are not generally available across target architectures.
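For instance, a loop written to be friendly to auto-vectorization might look like the sketch below. The function name `vec_add` is illustrative; `restrict` and the OpenMP `simd` pragma (enabled with, e.g., -fopenmp-simd on GCC and Clang) are standard mechanisms for giving the compiler the guarantees it needs.

```c
#include <stddef.h>

/* Element-wise addition written for auto-vectorization: 'restrict'
 * rules out aliasing between the arrays, and the OpenMP 'simd' pragma
 * explicitly asks the compiler to map the loop onto vector instructions. */
void vec_add(float *restrict dst,
             const float *restrict a,
             const float *restrict b,
             size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Simple element-wise loops like this vectorize well; once control flow, data-dependent accesses, or loop-carried dependencies enter the picture, compilers quickly fall back to inefficient or scalar code.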
Vector units offer exciting acceleration potential, but programming them remains a substantial challenge. In our next part, we will show you how emmtrix Parallel Studio helps you make efficient use of the hardware and how your model-based Simulink workflow can benefit from our vectorization solution.
Visit our “Vectorization” web page for more information and/or register for our webinar “Vectorization for Infineon AURIX™ TC4x”, which we are scheduling around embedded world 2022.