Loop Unrolling Transformation

From emmtrix Wiki
Jump to navigation Jump to search

Loop unrolling is an optimization technique that reduces the number of iterations in a loop by expanding its body to process multiple elements per iteration. This transformation decreases loop overhead, improves execution efficiency, and can enhance opportunities for parallelization. By reducing control flow instructions, loop unrolling minimizes branching and increases instruction-level parallelism, making it particularly useful for performance-critical applications. While it can lead to larger code size, the trade-off often results in significant runtime improvements.

Partial and Full Loop Unrolling

Loop unrolling can be either partial or full, depending on how many iterations are combined into a single loop body.

  • Partial Unrolling: This technique reduces the number of iterations by a factor of N, known as the unroll factor. For example, with an unroll factor of 4, a loop that originally runs 16 times would now run only 4 times, processing 4 iterations' worth of data in each pass. The remaining iterations (if the iteration count is not perfectly divisible by the unroll factor) are handled in a separate, smaller loop known as a cleanup loop. Partial unrolling balances reduced control overhead with manageable code size.
  • Full Unrolling: In full unrolling, the loop is eliminated entirely, and each iteration is explicitly written out as a separate block of code. This maximizes reduction in loop overhead and allows for aggressive compiler optimizations, such as instruction reordering and parallelism. Full unrolling is typically feasible only for small, fixed-size loops where the number of iterations is known at compile time. While this can lead to significant speed improvements, it also increases code size (code bloat), which can negatively impact instruction cache performance in larger programs.

Loop Unrolling in C/C++ Compilers

Most modern C/C++ compilers apply automatic loop unrolling when aggressive optimization levels (such as -O2 or -O3) are enabled. The decision to unroll a loop depends on factors such as loop trip count, loop body complexity, and potential performance gains. Compilers analyze loops to identify cases where unrolling reduces overhead or exposes further optimization opportunities, such as vectorization.

In addition to automatic unrolling, developers can explicitly influence unrolling behavior through compiler-specific pragmas. These pragmas allow developers to:

  • Force a specific unroll factor.
  • Disable unrolling for performance or code size reasons.
  • Request full unrolling for small loops.

Compiler pragmas for unrolling provide fine-grained control over how loops are transformed, which is useful when compiler heuristics do not align with application-specific performance goals. For example, manually unrolling cache-sensitive loops can improve data locality, while avoiding unrolling in some cases can reduce code bloat.

Loop unroll pragmas

Pragmas are compiler directives that influence how a compiler processes specific sections of code, such as loops. They offer direct control over transformations like loop unrolling, bypassing the compiler's default heuristics. This allows developers to optimize for performance or code size, depending on the application requirements. Below are pragma options available for different compilers.

Generic Pragmas (Applicable across multiple compilers)

  • #pragma unroll(n) — Requests the compiler to unroll the loop by a factor of n.
  • #pragma nounroll — Explicitly disables unrolling for the annotated loop.

Clang Pragmas

  • #pragma clang loop unroll(enable) — Enables loop unrolling.
  • #pragma clang loop unroll(disable) — Disables loop unrolling.
  • #pragma clang loop unroll(full) — Requests full unrolling of the loop.
  • #pragma clang loop unroll_count(4) — Specifies an unroll factor of 4.

GCC Pragmas

  • #pragma GCC unroll — Enables automatic unrolling.
  • #pragma GCC nounroll — Prevents any unrolling.
  • #pragma GCC unroll(UNROLLCOUNT) — Requests a specific unroll factor.

OpenMP Pragmas

OpenMP 5.0 introduced loop unrolling pragmas to allow explicit unrolling in parallel programs:

  • #pragma omp unroll — Enables unrolling with default heuristics.
  • #pragma omp unroll full — Requests full unrolling.
  • #pragma omp unroll partial — Enables partial unrolling.
  • #pragma omp unroll partial(3) — Specifies an unroll factor of 3.

These pragmas give developers flexibility to tailor loop transformations based on hardware characteristics (e.g., cache size, vector register width) or software constraints (e.g., real-time requirements or binary size limits).

Loop Unrolling Transformation in emmtrix Studio

emmtrix Studio supports loop unrolling both through #pragma directives and via the graphical user interface (GUI). Unrolling reduces the number of loop iterations by expanding the loop body to process multiple elements in each iteration.

By default, the transformation applies partial loop unrolling, resulting in two loops:

  • The first loop is the partially unrolled loop, processing multiple iterations per pass. -
  • The second loop handles any remaining iterations that do not fit into the unroll factor (cleanup loop).

emmtrix Studio attempts to calculate a trip count that enables two key optimizations:

  • Full unrolling is applied if the unroll factor is greater than or equal to the trip count.
  • The cleanup loop is omitted if the trip count is exactly divisible by the unroll factor.

Example

In the following example, partial loop unrolling is applied with an unroll factor of 3.

Please note: A condition of i < n - 2 would be incorrect as n - 2 overflow for values like n = INT_MIN.

#pragma EMX_TRANSFORMATION LoopUnroll { "unrollfactor": 3 }
for (int i = 1; i < n; i++) {
    printf("%d\n", i);
}
for (i = 1; (i + 2) < n; i = i + 3) {
    printf("%d\n", i);
    printf("%d\n", i + 1);
    printf("%d\n", i + 2);
}

for (; i < n; i = i + 1) {
    printf("%d\n", i);
}

Parameters

The following parameters can be configured (each parameter includes its pragma keyword and default value):

Parameter Default Value Description
unrollfactor 2 Unroll factor - Specifies the loop unroll factor.
subfunctions false Apply to subfunctions - Specifies whether the transformation should also apply to subfunctions.

External Links