# CS 610: Vectorization

Swarnendu Biswas

Semester 2022-2023-I CSE, IIT Kanpur

Content influenced by many excellent references, see References slide for acknowledgements.

# Material adapted from

M. Garzaran et al. Program Optimization Through Loop Vectorization.M. Voss. Topics in Loop Vectorization.

## Different Levels of Parallelism in Hardware

- Instruction-level Parallelism
  - Microarchitectural techniques like pipelining, OOO execution, and superscalar
- Vector-level Parallelism
  - Use Single Instruction Multiple Data (SIMD) vector processing instructions and units
- Thread-level Parallelism
  - Hyperthreading

#### Vectorization

- Process of transforming a scalar operation on single data elements at a time (SISD) to an operation on multiple data elements at once (SIMD)
- Loop vectorization transforms a program so that the same operation is performed at the same time on several vector elements



M. Voss. Topics in Loop Vectorization.

#### Vectorization



M. Garzaran et al. Program Optimization Through Loop Vectorization.

#### The combined effect of vectorization and threading



#### The Difference Is Growing With Each New Generation of Hardware

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to <a href="http://www.intel.com/performance">http://www.intel.com/performance</a> Configurations at the end of this presentation.

#### Optimization Notice

Copyright © 2018, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.



Kirill Rogozhin, Intel. Vectorization.

#### Intel-Supported SIMD Extensions

| SIMD<br>extensions | Width (bits) | Dual precision<br>(64 bit)<br>calculations | Single<br>precision (32<br>bit)<br>calculations | Introduced |
|--------------------|--------------|--------------------------------------------|-------------------------------------------------|------------|
| SSE2/SSE3/SSE4     | 128          | 2                                          | 4                                               | ~2001-2007 |
| AVX/AVX2           | 256          | 4                                          | 8                                               | ~2011-2015 |
| AVX-512            | 512          | 8                                          | 16                                              | ~2017      |

Other platforms that support SIMD have different extensions

#### SIMD Vectorization

- Use of SIMD units can speed up the program
- Intel SSE has 128-bit vector registers and functional units
  - 4 32-bit single precision floating point numbers
  - 2 64-bit double precision floating point numbers
  - 4 32-bit integer numbers
  - 2 64 bit integer
  - 8 16-bit integer or shorts
  - 16 8-bit bytes or chars
- Assuming a single ALU, these SIMD units can execute 4 single precision floating point number or 2 double precision operations in the time it takes to do only one of these operations by a scalar unit

### 128-bit wide operands using integers

| Int64                                   |             |                |           | Int64 |       |      |      |       |      |      |      |      |
|-----------------------------------------|-------------|----------------|-----------|-------|-------|------|------|-------|------|------|------|------|
| 63 0 63 0                               |             |                |           |       |       |      | 0    |       |      |      |      |      |
| Int                                     | Int32 Int32 |                |           |       | Int32 |      |      | Int32 |      |      |      |      |
| 31                                      | C           | 0 31 0 31 0 31 |           |       |       |      | 0    |       |      |      |      |      |
| Int16                                   | Int16       | Int16          | 6 In      | t16   | Int   | :16  | Int  | :16   | Int  | :16  | Int  | 16   |
| 15 0 15 0 15 0 15 0 15 0 15 0 15 0 15 0 |             |                |           |       |       |      |      |       |      |      |      |      |
| int8 Int8                               | int8 Int8   | int8 li        | int8 Int8 | Int8  | Int8  | Int8 | Int8 | Int8  | Int8 | Int8 | Int8 | Int8 |
| 7 0 7 0                                 | 7070        | 707            | 070       | 7 0   | 7 0   | 7 0  | 7 0  | 7 0   | 7 0  | 7 0  | 7 0  | 7 0  |

Daniel Kusswurm. Modern X86 Assembly Language Programming.

#### SSE Data Types



### AVX2 Data Types



#### Intel-Supported SIMD Extensions



| 64-bit architecture |            |                                                                                               |  |  |  |  |  |
|---------------------|------------|-----------------------------------------------------------------------------------------------|--|--|--|--|--|
| SSE                 | XMM0-XMM15 |                                                                                               |  |  |  |  |  |
| AVX                 | YMM0-YMM15 | Low- order 128 bits of each YMM register are aliased to a corresponding XMM register          |  |  |  |  |  |
| AVX-512             | ZMM0-ZMM31 | Low-order 256 and 128 bits are aliased to registers<br>YMM0-YMM31 and XMM0-XMM31 respectively |  |  |  |  |  |

#### x86-64 Vector Operations

[] – required () - optional

- Example instructions
  - Move: (V)MOV[A/U]P[D/S]
  - Comparing: (V)CMP[P/S][D/S]
  - Arithmetic operations: (V)[ADD/SUB/MUL/DIV][P/S][D/S]
- Instruction decoding
  - **V** AVX
  - P, S packaged, scalar
  - A, U aligned, unaligned
  - D, S double, single
  - B, W, D, Q byte, word, doubleword, quadword integers

#### x86-64 Vector Operations

| Instruction        | Explanation                                                                      |
|--------------------|----------------------------------------------------------------------------------|
| movss xmm1, xmm2   | Move scalar single-precision floating-point value from xmm2 to xmm1              |
| vmovapd xmm1, xmm2 | Move aligned packed double-precision floating-<br>point values from xmm2 to xmm1 |

# AVX Scalar Floating-Point Instruction Examples

| Instruction | Explanation                                                                                 |
|-------------|---------------------------------------------------------------------------------------------|
|             | xmm0[31:0] = xmm1[31:0] +<br>xmm2[31:0]<br>xmm0[127:32] = xmm1[127:32]<br>ymm0[255:128] = 0 |
|             | xmm0[63:0] = xmm1[63:0] +<br>xmm2[63:0]<br>xmm0[127:64] = xmm1[127:64]<br>ymm0[255:128] = 0 |

#### Cumulative (app.) # of Vector Instructions



Kirill Rogozhin, Intel. Vectorization.

## Ways to Vectorize Code

## Vectorize Code

- Auto-vectorizing compiler
- Vector intrinsics

• Assembly language



#### Vectorize Code

- Auto-vectorization
  - Compiler vectorizes automatically No code changes
  - Semi auto-vectorization Use pragmas as hints to guide compiler
  - Explicit vector programming OpenMP SIMD pragmas



• Inline assembly language



• Use SIMD-capable libraries like Intel Math Kernel Library (MKL)

#### Auto-Vectorization

Transparent to programmers

Compilers can apply other transformations

Portability of code across architectures

• Vectorization instructions may differ but compilers take care of it

#### Auto-Vectorization

Transparent to programmers

Compilers can apply other transformations

#### Portability of code across architectures

• Vectorization instructions may differ but compilers take care of it

#### Compilers may fail to vectorize

- Programmers may give hints to help the compiler
- Programmers may have to manually vectorize their code

#### Data Dependences and Vectorization

- Loop dependences guide vectorization
- A statement inside a loop which is not in a cycle of the dependence graph can be vectorized

#### How well do compilers vectorize?

not XLC

26

ICC

26

49

| Compiler<br>Loops | XLC     | ICC | C GCC   |                      |
|-------------------|---------|-----|---------|----------------------|
| Total             |         | 15  | 59      | 1                    |
| Vectorized        | 74      | 75  | 32      | Not Vectorizable XLC |
| Not vectorized    | 85      | 84  | 127     |                      |
| Average Speed Up  | 1.73    | 1.8 | 5 1.30  |                      |
| <u></u>           |         |     |         |                      |
| Compiler          | XLC but |     | ICC but |                      |

not ICC

25



Loops

Vectorized

### How well do compilers vectorize?



By adding manual vectorization the average speedup was 3.78 (versus 1.73 obtained by the XLC compiler)



#### **Experimental results**

- These slides show vectorization results for two different platforms with the following compilers:
  - Report generated by the compiler
  - Execution Time for each platform

Platform 1: Intel Nehalem Intel Core i7 CPU 920@2.67GHz Intel ICC compiler, version 11.1 OS Ubuntu Linux 9.04 Platform 2: IBM Power 7 IBM Power 7, 3.55 GHz IBM xlc compiler, version 11.0 OS Red Hat Linux Enterprise 5.4

The examples use single precision floating point numbers



## **Compiler directives**

```
void test(float* A,float* B,float* C,float* D, float* E)
{
   for (int i = 0; i <LEN; i++){
        A[i]=B[i]+C[i]+D[i]+E[i];
     }
}</pre>
```



### **Compiler directives**



I

## **Compiler directives**





```
for (int i=0;i<LEN;i++) {
   sum = (float) 0.0;
   for (int j=0;j<LEN;j++) {
      sum += A[j][i];
   }
   B[i] = sum;
}</pre>
```

???

```
for (int i=0;i<LEN;i++) {
   sum = (float) 0.0;
   for (int j=0;j<LEN;j++) {
      sum += A[j][i];
   }
   B[i] = sum;
}</pre>
```

```
for (int i=0;i<size;i++) {
   sum[i] = 0;
   for (int j=0;j<size;j++) {
      sum[i] += A[j][i];
   }
   B[i] = sum[i];
}</pre>
```

```
for (int i=0;i<LEN;i++) {
   sum = (float) 0.0;
   for (int j=0;j<LEN;j++) {
      sum += A[j][i];
   }
   B[i] = sum;
}</pre>
```

```
for (int i=0;i<LEN;i++) {
   sum[i] = (float) 0.0;
   for (int j=0;j<LEN;j++) {
      sum[i] += A[j][i];
   }
   B[i]=sum[i];
   }
}</pre>
```

```
for (int i=0;i<LEN;i++) {
   B[i] = (float) 0.0;
   for (int j=0;j<LEN;j++) {
      B[i] += A[j][i];
   }
}</pre>
```

#### **Intel Nehalem**

Compiler report: Loop was not vectorized. Vectorization possible but seems inefficient. Exec. Time scalar code: 3.7 Exec. Time vector code: --Speedup: -- Intel Nehalem Compiler report: Permuted loop was vectorized. scalar code: 1.6 vector code: 0.6 Speedup: 2.6 Intel Nehalem Compiler report: Permuted loop was vectorized. scalar code: 1.6 vector code: 0.6 Speedup: 2.6

```
for (int i=0;i<LEN;i++) {</pre>
  sum = (float) 0.0;
  for (int j=0;j<LEN;j++) {</pre>
    sum += A[j][i];
  B[i] = sum;
```

```
for (int i=0;i<LEN;i++) {</pre>
    sum[i] += A[j][i];
  B[i]=sum[i];
```

```
for (int i=0;i<LEN;i++) {</pre>
B[i] += A[j][i];
```

**IBM Power 7 Compiler report:** Loop was not SIMD vectorized **Exec. Time scalar code:** 2.0 Exec. Time vector code: --Speedup: --

**IBM Power 7 Compiler report:** Loop interchanging applied. Loop was SIMD vectorized scalar code: 0.4 vector code: 0.2 **Speedup:** 2.0

**IBM Power 7 Compiler report:** Loop interchanging applied. Loop was SIMD vectorized scalar code: 0.4 vector code: 0.16 Speedup: 2.7

#### SSE Intrinsics

```
#define n 1024
```

```
attribute ((aligned(16))) float
a[n], b[n], c[n];
```

```
int main() {
   for (i = 0; i < n; i++) {
      c[i]=a[i]*b[i];
   }
}</pre>
```

```
#include <xmmintrin.h>
#define n 1024
    attribute__((aligned(16))) float
    ā[n], b[n], c[n];
```

```
int main() {
    __m128 rA, rB, rC;
    for (i = 0; i < n; i+=4) {
        rA = _mm_load_ps(&a[i]);
        rB = _mm_load_ps(&b[i]);
        rC= _mm_mul_ps(rA,rB);
        _mm_store_ps(&c[i], rC);
    }
</pre>
```

}

# Challenges in Vectorization

### Data Dependences

- Loop dependences guide vectorization
  - Statements not data dependent on each other can be reordered, executed in parallel, or coalesced into a vector operation

A statement inside a loop which is not in a cycle of the dependence graph can be vectorized



### **Data dependences and vectorization**

• Main idea: A statement inside a loop which is not in a cycle of the dependence graph can be vectorized.

```
for (i=1; i<n; i++){

S1 a[i] = b[i] + 1;

S2 c[i] = a[i-1] + 2;

}

a[1:n] = b[1:n] + 1;

c[1:n] = a[0:n-1] + 2;

s_1

s_2

1
```



### **Data dependences and transformations**

- When cycles are present, vectorization can be achieved by:
  - Separating (distributing) the statements not in a cycle
  - Removing dependences
  - Freezing loops
  - Changing the algorithm



# Distributing

```
for (i=1; i<n; i++){</pre>
S1 b[i] = b[i] + c[i];
S2 a[i] = a[i-1]*a[i-2]+b[i];
S3 c[i] = a[i] + 1;
b[1:n-1] = b[1:n-1] + c[1:n-1];
for (i=1; i<n; i++){
      a[i] = a[i-1]*a[i-2]+b[i];
}
c[1:n-1] = a[1:n-1] + 1;
```





### **Removing dependences**



### **Freezing Loops**





# **Changing the algorithm**

- When there is a recurrence, it is necessary to change the algorithm in order to vectorize.
- Compiler use pattern matching to identify the recurrence and then replace it with a parallel version.
- Examples or recurrences include:
  - Reductions (S+=A[i])
  - Linear recurrences (A[i]=B[i]\*A[i-1]+C[i])
  - Boolean recurrences (if (A[i]>max) max = A[i])



# Stripmining

• Stripmining is a simple transformation.

• It is typically used to improve locality.



# Stripmining (cont.)

• Stripmining is often used when vectorizing

```
for (i=1; i<n; i++){
          a[i] = b[i] + 1;
          c[i] = a[i] + 2;
        }
                          stripmine
for (k=1; k<n; k+=q){
/* q could be size of vector register */
  for (i=k; i < k+q; i++){</pre>
     a[i] = b[i] + 1;
     c[i] = a[i-1] + 2;
}
                           vectorize
  for (i=1; i<n; i+=q){
    a[i:i+q-1] = b[i:i+q-1] + 1;
    c[i:i+q-1] = a[i:i+q] + 2;
  }
```



- Loop Vectorization is not always a legal and profitable transformation.
- Compiler needs:
  - Compute the dependences
    - The compiler figures out dependences by
      - Solving a system of (integer) equations (with constraints)
      - Demonstrating that there is no solution to the system of equations
  - Remove cycles in the dependence graph
  - Determine data alignment
  - Vectorization is profitable



```
for (i=0; i<LEN; i+=strip_size){</pre>
 for (i=0; i<LEN; i++){
                                for (j=i; j<i+strip_size; j++)</pre>
S1 a[i]=b[i]+(float)1.0;
                                   a[j]=b[j]+(float)1.0;
S2 c[i]=b[i]+(float)2.0;
                                 for (j=i; j<i+strip_size; j++)</pre>
 }
                                   c[j]=b[j]+(float)2.0;
                                }
i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7
          (S1) (S1)
                    (S1) (S1) (S1)
                                   (S1)
     (S1)
S1
          (S2) (S2)
                    (S2) (S2)
                                   (S2)
```









# **Dependence Graphs and Compiler Vectorization**

- No dependences: previous two slides
- Acyclic graphs:
  - All dependences are forward:
    - Vectorized by the compiler
  - Some backward dependences:
    - Sometimes vectorized by the compiler
- Cycles in the dependence graph
  - Self-antidependence:
    - Vectorized by the compiler
  - Recurrence:
    - Usually not vectorized by the the compiler
  - Other examples

```
for (i=0; i<LEN; i++) {
S1 a[i]= b[i] + c[i]
S2 d[i] = a[i] + (float) 1.0;
}</pre>
```





```
for (i=0; i<LEN; i++) {
S1 a[i]= b[i] + c[i]
S2 d[i] = a[i] + (float) 1.0;
}</pre>
```



```
for (i=0; i<LEN; i++) {
S1 a[i]= b[i] + c[i]
S2 d[i] = a[i] + (float) 1.0;
}</pre>
```





S113

```
for (i=0; i<LEN; i++) {
    a[i]= b[i] + c[i]
    d[i] = a[i] + (float) 1.0;
}</pre>
```

Intel Nehalem Compiler report: Loop was vectorized Exec. Time scalar code: 10.2 Exec. Time vector code: 6.3 Speedup: 1.6 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 3.1 Exec. Time vector code: 1.5 Speedup: 2.0



```
for (i=0; i<LEN; i++) {
S1 a[i]= b[i] + c[i]
S2 d[i] = a[i+1] + (float) 1.0;
}</pre>
```

backward dependence





```
for (i=0; i<LEN; i++) {
S1 a[i]= b[i] + c[i]
S2 d[i] = a[i+1] + (float) 1.0;
}</pre>
```

backward dependence





```
for (i=0; i<LEN; i++) {
S1 a[i]= b[i] + c[i]
S2 d[i] = a[i+1] + (float) 1.0;
}</pre>
```

backward dependence





Reorder of statements

for (i=0; i<LEN; i++) {
 for (i=0; i<LEN; i++) {
 S1 a[i]= b[i] + c[i]
 S2 d[i] = a[i+1]+(float)1.0;
 S2 d[i] = a[i+1] + (float) 1.0;
 S1 a[i]= b[i] + c[i];
 }
}</pre>





backward dependence forward dependence





Intel Nehalem Compiler report: Loop was not vectorized. Existence of vector dependence Exec. Time scalar code: 12.6 Exec. Time vector code: --Speedup: --

Intel Nehalem Compiler report: Loop was vectorized Exec. Time scalar code: 10.7 Exec. Time vector code: 6.2 Speedup: 1.72 Speedup vs non-reordered code:2.03





The IBM XLC compiler generated the same code in both cases

S114

S114\_1

IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 1.2 Exec. Time vector code: 0.6 Speedup: 2.0 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 1.2 Exec. Time vector code: 0.6 Speedup: 2.0



```
for (int i = 1; i < LEN; i++) {
S1 a[i] = d[i-1] + (float)sqrt(c[i]);
S2 d[i] = b[i] + (float)sqrt(e[i]);
}</pre>
```

backward dependence





```
for (int i = 1; i < LEN; i++) {
S1 a[i] = d[i-1] + (float)sqrt(c[i]);
S2 d[i] = b[i] + (float)sqrt(e[i]);
}</pre>
```

backward dependence





```
for (int i = 1; i < LEN; i++) {
S1 a[i] = d[i-1] + (float)sqrt(c[i]);
S2 d[i] = b[i] + (float)sqrt(e[i]);
}</pre>
```

backward dependence







I

 S114
 S114\_1

 for (i=0; i<LEN; i++) {</td>
 for (i=0; i<LEN; i++) {</td>

 a[i]= b[i] + c[i];
 d[i] = a[i+1]+(float)1.0;

 d[i] = a[i+1]+(float)1.0;
 a[i]= b[i] + c[i];

 }
 }

The IBM XLC compiler generated the same code in both cases

S114

S114\_1

IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 3.3 Exec. Time vector code: 1.8 Speedup: 1.8 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 3.3 Exec. Time vector code: 1.8 Speedup: 1.8

I

```
for (int i=0;i<LEN-1;i++){
S1 b[i] = a[i] + (float) 1.0;
S2 a[i+1] = b[i] + (float) 2.0;
}</pre>
```



This loop cannot be vectorized (as it is) Statements cannot be simply reordered



```
for (int i=0;i<LEN-1;i++){
S1 b[i] = a[i] + (float) 1.0;
S2 a[i+1] = b[i] + (float) 2.0;
}</pre>
```



This loop cannot be vectorized (as it is) Statements cannot be simply reordered



```
for (int i=0;i<LEN-1;i++){
S1 b[i] = a[i] + (float) 1.0;
S2 a[i+1] = b[i] + (float) 2.0;
}</pre>
```



This loop cannot be vectorized (as it is) Statements cannot be simply reordered



S115

```
for (int i=0;i<LEN-1;i++){
    b[i] = a[i] + (float) 1.0;
    a[i+1] = b[i] + (float) 2.0;
}</pre>
```

S115

#### **Intel Nehalem**

Compiler report: Loop was not vectorized. Existence of vector dependence Exec. Time scalar code: 12.1 Exec. Time vector code: --Speedup: --



S115

```
for (int i=0;i<LEN-1;i++){
    b[i] = a[i] + (float) 1.0;
    a[i+1] = b[i] + (float) 2.0;
}</pre>
```

S115

**IBM Power 7** 

**Compiler report:** Loop was SIMD vectorized **Exec. Time scalar code:** 3.1 **Exec. Time vector code:** 2.2 **Speedup:** 1.4









Will the IBM XLC compiler vectorize this code as before?



To vectorize, the compiler needs to do this

for (int i=0;i<LEN-1;i++)
 a[i+1]=a[i]+d[i]\*d[i]+c[i]\*c[i]+c[i]\*d[i]+(float)2.0;</pre>

for (int i=0;i<LEN-1;i++)
 b[i]=a[i]+d[i]\*d[i]+c[i]\*c[i]+c[i]\*d[i]+(float) 1.0;</pre>



#### S115

```
for (int i=0;i<LEN-1;i++){
    b[i] =a[i]+(float)1.0;
    a[i+1]=b[i]+(float)2.0;
}</pre>
```

#### S215

```
for (int i=0;i<LEN-1;i++){
    b[i]=a[i]+d[i]*d[i]+c[i]*c[i]+c[i]*d[i];
    a[i+1]=b[i]+(float)2.0;
}</pre>
```

Will the IBM XLC compiler vectorize this code as before?

No, the compiler does not vectorize S215 because it is not cost-effective

```
for (int i=0;i<LEN-1;i++)
    b[i]=a[i]+d[i]*d[i]+c[i]*c[i]+c[i]*d[i]+(float) 1.0;</pre>
```



A loop can be partially vectorized





S1 can be vectorized S2 and S3 cannot be vectorized (as they are)



S116

```
for (int i=1;i<LEN;i++){
    a[i] = b[i] + c[i];
    d[i] = a[i] + e[i-1];
    e[i] = d[i] + c[i];
}</pre>
```

S116

```
for (int i=1;i<LEN;i++){
    a[i] = b[i] + c[i];
    d[i] = a[i] + e[i-1];
    e[i] = d[i] + c[i];
}</pre>
```

S116

Intel Nehalem Compiler report: Loop was partially vectorized Exec. Time scalar code: 14.7 Exec. Time vector code: 18.1 Speedup: 0.8

#### S116

**IBM Power 7** 

Compiler report: Loop was not SIMD vectorized because a data dependence prevents SIMD vectorization Exec. Time scalar code: 13.5 Exec. Time vector code: --Speedup: --



can be vectorized

Self true-dependence can not vectorized (as it is)











is not vectorized d

This is also a self-true dependence. But ... can it be vectorized?





Self true-dependence cannot be vectorized

Yes, it can be vectorized because the dependence distance is 4, which is the number of iterations that the SIMD unit can execute simultaneously.

S119

for (int i=4;i<LEN;i++){
 a[i]=a[i-4]+b[i];
}</pre>

Intel Nehalem Compiler report: Loop was vectorized Exec. Time scalar code: 8.4 Exec. Time vector code: 3.9 Speedup: 2.1 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 6.6 Exec. Time vector code: 1.8

Speedup: 3.7



```
for (int i = 0; i < LEN-1; i++) {
   for (int j = 0; j < LEN; j++)
S1   a[i+1][j] = a[i][j] + b;
}</pre>
```



Can this loop be vectorized?

```
i=0, j=0: a[1][0] = a[0][0] + b

j=1: a[1][1] = a[0][1] + b

j=2: a[1][2] = a[0][2] + b

i=1 j=0: a[2][0] = a[1][0] + b

j=1: a[2][1] = a[1][1] + b

j=2: a[2][2] = a[1][2] + b
```



```
for (int i = 0; i < LEN-1; i++) {
   for (int j = 0; j < LEN; j++)
S1   a[i+1][j] = a[i][j] + (float) 1.0;
}</pre>
```



Can this loop be vectorized?

```
i=0, j=0: a[1][0] = a[0][0] + 1

j=1: a[1][1] = a[0][1] + 1

j=2: a[1][2] = a[0][2] + 1

i=1 j=0: a[2][0] = a[1][0] + 1

j=1: a[2][1] = a[1][1] + 1

j=2: a[2][2] = a[1][2] + 1
```

Dependences occur in the outermost loop. - outer loop runs serially

- inner loop can be vectorized

```
for (int i=0;i<LEN;i++){
    a[i+1][0:LEN-1]=a[i][0:LEN-
1]+b;</pre>
```



```
S121
```

```
for (int i = 0; i < LEN-1; i++) {
  for (int j = 0; j < LEN; j++)
      a[i+1][j] = a[i][j] + 1;
}</pre>
```

Intel Nehalem Compiler report: Loop was vectorized Exec. Time scalar code: 11.6 Exec. Time vector code: 3.2 Speedup: 3.5 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 3.9 Exec. Time vector code: 1.8 Speedup: 2.1



• Cycles can appear because the compiler does not know if there are dependences

```
for (int i=0;i<LEN;i++) {
S1 a[r[i]] = a[r[i]] * (float) 2.0; Is there a value of i such that i' \neq i?
```



Compiler cannot resolve the system

To be safe, it considers that a data dependence is possible for every instance of S1



- The compiler is conservative.
- The compiler only vectorizes when it can prove that it is safe to do it.

```
for (int i=0;i<LEN;i++){
    r[i] = i;
    a[r[i]] = a[r[i]]* (float) 2.0;
}</pre>
```

Does the compiler use the info that r[i] = i to compute data dependences?





S122

#### S123

Intel Nehalem Compiler report: Loop was not vectorized. Existence of vector dependence Exec. Time scalar code: 5.0 Exec. Time vector code: --Speedup: -- Intel Nehalem Compiler report: Partial Loop was vectorized Exec. Time scalar code: 5.8 Exec. Time vector code: 5.7 Speedup: 1.01





I

### **Dependence Graphs and Compiler Vectorization**

- No dependences: Vectorized by the compiler
- Acyclic graphs:
  - All dependences are forward:
    - Vectorized by the compiler
  - Some backward dependences:
    - Sometimes vectorized by the compiler
- Cycles in the dependence graph
  - Self-antidependence:
    - Vectorized by the compiler
  - Recurrence:
    - Usually not vectorized by the the compiler
  - Other examples

# **Loop Transformations**

- Compiler Directives
- Loop Distribution or loop fission
- Reordering Statements
- Node Splitting
- Scalar expansion
- Loop Peeling
- Loop Fusion
- Loop Unrolling
- Loop Interchanging



• When the compiler does not vectorize automatically due to dependences the programmer can inform the compiler that it is safe to vectorize:

```
#pragma ivdep (ICC compiler)
```

```
#pragma ibm independent_loop (XLC compiler)
```



- This loop can be vectorized when k < -3 and k >= 0.
- Programmer knows that k>=0

S1

k =1

If  $(k \ge 0) \rightarrow$  no dependence or self-anti-dependence

a[0]=a[1]+b[0]

a[2]=a[1]+b[1

a[1]=a[2]+b[1]a[2]=a[3]+b[2]



If  $(k < 0) \rightarrow$  self-true dependence

Cannot be vectorized

Can be vectorized



- This loop can be vectorized when k < -3 and k >= 0.
- Programmer knows that k>=0

How can the programmer tell the compiler that  $k \ge 0$ ?



- This loop can be vectorized when k < -3 and k >= 0.
- Programmer knows that k>=0

Intel ICC provides the #pragma ivdep to tell the compiler that it is safe to ignore unknown dependences

```
#pragma ivdep
for (int i=val;i<LEN-k;i++)</pre>
```

```
a[i]=a[i+k]+b[i];
```

wrong results will be obtained if loop is vectorized when -3 < k < 0





| S124 and S124_1                                                                                                                                                       | S124_2                                                                                                                                 |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Intel Nehalem<br>Compiler report: Loop was not<br>vectorized. Existence of vector<br>dependence<br>Exec. Time scalar code: 6.0<br>Exec. Time vector code:<br>Speedup: | Intel Nehalem<br>Compiler report: Loop was<br>vectorized<br>Exec. Time scalar code: 6.0<br>Exec. Time vector code: 2.4<br>Speedup: 2.5 |



| S124_2                                                                           |
|----------------------------------------------------------------------------------|
| #pragma ibm independent_loo<br>needs AIX OS (we ran the<br>experiments on Linux) |
|                                                                                  |

t\_loop



• Programmer can disable vectorization of a loop when the when the vector code runs slower than the scalar code

#pragma novector (ICC compiler)

#pragma nosimd (XLC compiler)



Vector code can run slower than scalar code



Less locality when executing in vector mode

S1 S2 S3

S1 can be vectorized S2 and S3 cannot be vectorized (as they are)

S116

```
#pragma novector
for (int i=1;i<LEN;i++){
    a[i] = b[i] + c[i];
    d[i] = a[i] + e[i-1];
    e[i] = d[i] + c[i];
}</pre>
```

#### S116

Intel Nehalem Compiler report: Loop was partially vectorized Exec. Time scalar code: 14.7 Exec. Time vector code: 18.1 Speedup: 0.8



- It is also called loop fission.
- Divides loop control over different statements in the loop body.

```
for (i=1; i<LEN; i++) {
    a[i]= (float)sqrt(b[i])+
        (float)sqrt(c[i]);
    dummy(a,b,c);
}</pre>
```



- It is also called loop fission.
- Divides loop control over different statements in the loop body.

```
for (i=1; i<LEN; i++) {
    a[i]= (float)sqrt(b[i])+
        (float)sqrt(c[i]);
    dummy(a,b,c);
} for (i=1; i<LEN; i++)
    dummy(a,b,c);</pre>
```

Compiler cannot analyze the dummy function.
As a result, the compiler cannot apply loop distribution, because it does not know if it is a legal transformation
Programmer can apply loop distribution if legal.



S126

```
for (i=1; i<LEN; i++) {
    a[i]= (float)sqrt(b[i])+
        (float)sqrt(c[i]);
    dummy(a,b,c);
}</pre>
```

S126\_1

```
for (i=1; i<LEN; i++)
    a[i]= (float)sqrt(b[i])+
        (float)sqrt(c[i]);
for (i=1; i<LEN; i++)
    dummy(a,b,c);</pre>
```

| S126                                                                                                                               | S126_1                                                                                                                                                                     |
|------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Intel Nehalem<br>Compiler report: Loop was not<br>vectorized<br>Exec. Time scalar code: 4.3<br>Exec. Time vector code:<br>Speedup: | Intel Nehalem<br>Compiler report:<br>- Loop 1 was vectorized.<br>- Loop 2 was not vectorized<br>Exec. Time scalar code: 5.1<br>Exec. Time vector code: 1.1<br>Speedup: 4.6 |



S126

```
for (i=1; i<LEN; i++) {
    a[i]= (float)sqrt(b[i])+
        (float)sqrt(c[i]);
    dummy(a,b,c);
}</pre>
```

S126\_1

```
for (i=1; i<LEN; i++)
    a[i]= (float)sqrt(b[i])+
        (float)sqrt(c[i]);
for (i=1; i<LEN; i++)
    dummy(a,b,c);</pre>
```

S126

#### S126\_1

IBM Power 7

Compiler report: Loop was not SIMD vectorized Exec. Time scalar code: 1.3 Exec. Time vector code: --Speedup: -- IBM Power 7 Compiler report: - Loop 1 was SIMD vectorized. - Loop 2 was not SIMD vectorized Exec. Time scalar code: 1.14 Exec. Time vector code: 1.0 Speedup: 1.14

I

#### **Reordering Statements**





#### **Reordering Statements**





S114

S114\_1

Intel Nehalem Compiler report: Loop was not vectorized. Existence of vector dependence Exec. Time scalar code: 12.6 Exec. Time vector code: --Speedup: --

Intel Nehalem Compiler report: Loop was vectorized. Exec. Time scalar code: 10.7 Exec. Time vector code: 6.2 Speedup: 1.7



### **Reordering Statements**



The IBM XLC compiler generated the same code in both cases

#### S114

S114\_1

IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 3.3 Exec. Time vector code: 1.8 Speedup: 1.8

#### IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 3.3 Exec. Time vector code: 1.8 Speedup: 1.8



## **Node Splitting**

```
for (int i=0;i<LEN-1;i++){
S1 a[i]=b[i]+c[i];
S2 d[i]=(a[i]+a[i+1])*(float)0.5;
}</pre>
```

```
for (int i=0;i<LEN-1;i++){
S0 temp[i]=a[i+1];
S1 a[i]=b[i]+c[i];
S2 d[i]=(a[i]+temp[i])*(float) 0.5;
}</pre>
```







## **Node Splitting**

S126

```
for (int i=0;i<LEN-1;i++){
    a[i]=b[i]+c[i];
    d[i]=(a[i]+a[i+1])*(float)0.5;
}</pre>
```



S126

S126\_1

Intel Nehalem Compiler report: Loop was not vectorized. Existence of vector dependence Exec. Time scalar code: 12.6 Exec. Time vector code: --Speedup: --

Intel Nehalem Compiler report: Loop was vectorized. Exec. Time scalar code: 13.2 Exec. Time vector code: 9.7 Speedup: 1.3



## **Node Splitting**

S126

```
for (int i=0;i<LEN-1;i++){
S1 a[i]=b[i]+c[i];
S2 d[i]=(a[i]+a[i+1])*(float)0.5;
}</pre>
```



```
for (int i=0;i<LEN-1;i++){
S0 temp[i]=a[i+1];
S1 a[i]=b[i]+c[i];
S2 d[i]=(a[i]+temp[i])*(float) 0.5;
}</pre>
```

S126

S126\_1

**IBM Power 7** 

Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 3.8 Exec. Time vector code: 1.7 Speedup: 2.2 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 5.1 Exec. Time vector code: 2.4 Speedup: 2.0



### **Scalar Expansion**

```
for (int i=0;i<n;i++){
S1 t = a[i];
S2 a[i] = b[i];
S3 b[i] = t;
}</pre>
```

```
for (int i=0;i<n;i++){
S1 t[i] = a[i];
S2 a[i] = b[i];
S3 b[i] = t[i];
}</pre>
```







### **Scalar Expansion**





### **Scalar Expansion**



```
S139_1
for (int i=0;i<n;i++){
  t[i] = a[i];
  a[i] = b[i];
  b[i] = t[i];
}</pre>
```

S139

#### S139\_1

**IBM Power 7** 

Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 0.28 Exec. Time vector code: 0.14 Speedup: 2 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 0.28 Exec. Time vector code: 0.14 Speedup: 2.0



- Remove the first/s or the last/s iteration of the loop into separate code outside the loop
- It is always legal, provided that no additional iterations are introduced.
- When the trip count of the loop is not constant the peeled loop has to be protected with additional runtime tests.
- This transformation is useful to enforce a particular initial memory alignment on array references prior to loop vectorization.





- Remove the first/s or the last/s iteration of the loop into separate code outside the loop
- It is always legal, provided that no additional iterations are introduced.
- When the trip count of the loop is not constant the peeled loop has to be protected with additional runtime tests.
- This transformation is useful to enforce a particular initial memory alignment on array references prior to loop vectorization.

,

;



```
a[0] = a[0] + a[0];
for (int i=0;i<LEN;i++){
    for (int i=1;i<LEN;i++){
        a[i] = a[i] + a[0];
        }
        }
</pre>
```

```
a[0]=a[0]+a[0]
a[1]=a[1]+a[0]
a[2]=a[2]+a[0]
```



Self true-dependence is not vectorized

After loop peeling, there are no dependences, and the loop can be vectorized





S127

#### S127\_1

Intel Nehalem Compiler report: Loop was not vectorized. Existence of vector dependence Exec. Time scalar code: 6.7 Exec. Time vector code: --Speedup: --

Intel Nehalem Compiler report: Loop was vectorized. Exec. Time scalar code: 6.6 Exec. Time vector code: 1.2 Speedup: 5.2



• This transformation switches the positions of one loop that is tightly nested within another loop.

```
for (i=0; i<LEN; i++) for (j=0; j<LEN; j++)
   A[i][j]=0.0;
```

for (j=0; j<LEN; j++) for (i=0; i<LEN; i++)</pre> A[i][j]=0.0;





Inner loop cannot be vectorized because of self-dependence



Loop interchange is legal No dependences in inner loop



| S228                          | S228_1                      |
|-------------------------------|-----------------------------|
| Intel Nehalem                 | Intel Nehalem               |
| Compiler report: Loop was not | Compiler report: Loop was   |
| vectorized.                   | vectorized.                 |
| Exec. Time scalar code: 2.3   | Exec. Time scalar code: 0.6 |
| Exec. Time vector code:       | Exec. Time vector code: 0.2 |
| Speedup:                      | Speedup: 3                  |





S228

S228\_1

IBM Power 7 Compiler report: Loop was not SIMD vectorized Exec. Time scalar code: 0.5 Exec. Time vector code: --Speedup: -- IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 0.2 Exec. Time vector code: 0.14 Speedup: 1.42



## Outline

#### 1. Intro

- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
    - Reductions
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization using intrinsics



### Reductions

• Reduction is an operation, such as addition, which is applied to the elements of an array to produce a result of a lesser rank.

```
Sum Reduction
Sum = 0;
for (int i=0;i<LEN;++i){
    sum += a[i];
}

    for (int i=0;i<LEN;++i){
        index = 0;
        for (int i=0;i<LEN;++i){
            if (a[i] > x) {
                 x = a[i];
                 index = i;
        }}
```



#### Reductions

}

#### S131 S132 sum = 0;x = a[0];for (int i=0;i<LEN;++i){</pre> index = 0;sum+= a[i]; for (int i=0;i<LEN;++i){</pre> if (a[i] > x) { x = a[i];index = i;}}

S131

S132

**Intel Nehalem Compiler report:** Loop was vectorized. Exec. Time scalar code: 5.2 **Exec. Time vector code:** 1.2 Speedup: 4.1

**Intel Nehalem** Compiler report: Loop was vectorized. **Exec. Time scalar code:** 9.6 Exec. Time vector code: 2.4 Speedup: 3.9



### Reductions

}

#### S131 S132 sum = 0;x = a[0];for (int i=0;i<LEN;++i){</pre> index = 0;sum+= a[i]; for (int i=0;i<LEN;++i){</pre> if (a[i] > x) { x = a[i];index = i;}}

S131

S132

#### **IBM Power 7**

**Compiler report:** Loop was SIMD vectorized Exec. Time scalar code: 1.1 **Exec. Time vector code:** 0.4 Speedup: 2.4

#### **IBM Power 7** Compiler report: Loop was not SIMD vectorized **Exec.** Time scalar code: 4.4 Exec. Time vector code: --Speedup: --



## Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
    - Induction variables
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics



• Induction variable is a variable that can be expressed as a function of the loop iteration variable

```
float s = (float)0.0;
for (int i=0;i<LEN;i++){
   s += (float)2.;
   a[i] = s * b[i];
}
```

 Induction variable is a variable that can be expressed as a function of the loop iteration variable

```
float s = (float)0.0; for (int i=0;i<LEN;i++){
for (int i=0;i<LEN;i++){
    a[i] = (float)2.*(i+1)*b[i];
    a[i] = s * b[i];
}</pre>
```





}

S133

```
for (int i=0;i<LEN;i++){</pre>
  s += (float)2.;
 a[i] = s * b[i];
```



The Intel ICC compiler generated the same vector code in both cases

S133

S133 1

**Intel Nehalem Compiler report:** Loop was vectorized. Exec. Time scalar code: 6.1 Exec. Time vector code: 1.9 Speedup: 3.1

**Intel Nehalem Compiler report:** Loop was vectorized. Exec. Time scalar code: 8.4 Exec. Time vector code: 1.9 Speedup: 4.2



```
S133
```

```
for (int i=0;i<LEN;i++){</pre>
  s += (float)2.;
 a[i] = s * b[i];
}
```

```
S133_1
```

```
float s = (float)0.0; for (int i=0; i<LEN; i++)
                          a[i] = (float)2.*(i+1)*b[i];
                          }
```

S133

#### S133 1

**IBM Power 7 Compiler report:** Loop was not SIMD vectorized Exec. Time scalar code: 2.7 Exec. Time vector code: --Speedup: --

**IBM Power 7 Compiler report:** Loop was SIMD vectorized Exec. Time scalar code: 3.7 Exec. Time vector code: 1.4 Speedup: 2.6



• Coding style matters:

```
for (int i=0;i<LEN;i++) {
    for (int i=0;i<LEN;i++){
        *a = *b + *c;
        a++; b++; c++;
    }
}</pre>
```

These codes are equivalent, but ...



S134

for (int i=0;i<LEN;i++) { for (int i=0;i<LEN;i++){</pre> \*a = \*b + \*c; a++; b++; c++; }

S134\_1

```
a[i] = b[i] + c[i];
```

S134



Intel Nehalem **Compiler report:** Loop was not vectorized. Exec. Time scalar code: 5.5 Exec. Time vector code: --Speedup: --

Intel Nehalem **Compiler report:** Loop was vectorized. Exec. Time scalar code: 6.1 Exec. Time vector code: 3.2 Speedup: 1.8





The IBM XLC compiler generated the same code in both cases

#### S134

S134\_1

IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 2.2 Exec. Time vector code: 1.0 Speedup: 2.2 IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 2.2 Exec. Time vector code: 1.0 Speedup: 2.2



## Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics



- Vector loads/stores load/store 128 consecutive bits to a vector register.
- Data addresses need to be 16-byte (128 bits) aligned to be loaded/stored
  - Intel platforms support aligned and unaligned load/stores
  - IBM platforms do not support unaligned load/stores



## Why data alignment may improve efficiency

- Vector load/store from aligned data requires one memory access
- Vector load/store from unaligned data requires multiple memory accesses and some shift operations



Reading 4 bytes from address 1 requires two loads



- To know if a pointer is 16-byte aligned, the last digit of the pointer address in hex must be 0.
- Note that if &b[0] is 16-byte aligned, and is a single precision array, then &b[4] is also 16-byte aligned

```
__attribute__ ((aligned(16))) float B[1024];
int main(){
    printf("%p, %p\n", &B[0], &B[4]);
}
Output:
    0x7fff1e9d8580, 0x7fff1e9d8590
```



- In many cases, the compiler cannot statically know the alignment of the address in a pointer
- The compiler assumes that the base address of the pointer is 16-byte aligned and adds a run-time checks for it
  - if the runtime check is false, then it uses another code (which may be scalar)

• Manual 16-byte alignment can be achieved by forcing the base address to be a multiple of 16.

```
__attribute__ ((aligned(16))) float b[N];
float* a = (float*) memalign(16,N*sizeof(float));
```

• When the pointer is passed to a function, the compiler should be aware of where the 16-byte aligned address of the array starts.

```
void func1(float *a, float *b,
float *c) {
    __assume_aligned(a, 16);
    __assume_aligned(b, 16);
    __assume_aligned(c, 16);
for int (i=0; i<LEN; i++) {
    a[i] = b[i] + c[i];
}
```



## **Data Alignment - Example**

```
float A[N] __attribute__((aligned(16)));
float B[N] __attribute__((aligned(16)));
float C[N] __attribute__((aligned(16)));
void test(){
for (int i = 0; i < N; i++){
   C[i] = A[i] + B[i];
}}</pre>
```



## **Data Alignment - Example**

```
float A[N] __attribute__((aligned(16)));
float B[N] __attribute__((aligned(16)));
float C[N] __attribute__((aligned(16)));
```

```
void test1(){
__m128 rA, rB, rC;
for (int i = 0; i < N; i+=4){
  rA = _mm_load_ps(&A[i]);
  rB = _mm_load_ps(&B[i]);
  rC = _mm_add_ps(rA, rB);
  _mm_store_ps(&C[i], rC);
}}</pre>
```

```
void test3(){
   __m128 rA, rB, rC;
for (int i = 1; i < N-3; i+=4){
   rA = _mm_loadu_ps(&A[i]);
   rB = _mm_loadu_ps(&B[i]);
   rC = _mm_add_ps(rA, rB);
   __mm_storeu_ps(&C[i], rC);
}}</pre>
```

```
void test2(){
   __m128 rA, rB, rC;
for (int i = 0; i < N; i+=4){
   rA = _mm_loadu_ps(&A[i]);
   rB = _mm_loadu_ps(&B[i]);
   rC = _mm_add_ps(rA,rB);
   __mm_storeu_ps(&C[i], rC);
}}</pre>
```

| Nanosecond per iteration |            |          |         |
|--------------------------|------------|----------|---------|
|                          | Core 2 Duo | Intel i7 | Power 7 |
| Aligned                  | 0.577      | 0.580    | 0.156   |
| Aligned (unaligned ld)   | 0.689      | 0.581    | 0.241   |
| Unaligned                | 2.176      | 0.629    | 0.243   |



# Alignment in a struct

```
struct st{
    char A;
    int B[64];
    float C;
    int D[64];
};
int main(){
    st s1;
    printf("%p, %p, %p, %p\n", &s1.A, s1.B, &s1.C, s1.D);}
```

Output: 0x7fffe6765f00, 0x7fffe6765f04, 0x7fffe6766004, 0x7fffe6766008

Arrays B and D are not 16-bytes aligned (see the address)



# Alignment in a struct

```
struct st{
    char A;
    int B[64] __attribute__ ((aligned(16)));
    float C;
    int D[64] __attribute__ ((aligned(16)));
};
int main(){
    st s1;
    printf("%p, %p, %p, %p\n", &s1.A, s1.B, &s1.C, s1.D);}
```

Output: 0x7fff1e9d8580, 0x7fff1e9d8590, 0x7fff1e9d8690, 0x7fff1e9d86a0

- Arrays B and D are aligned to 16-bytes (notice the 0 in the 4 least significant bits of the address)
- Compiler automatically does padding



## Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics



• Can the compiler vectorize this loop?

```
void func1(float *a,float *b, float *c){
   for (int i = 0; i < LEN; i++) {
        a[i] = b[i] + c[i];
}</pre>
```



• Can the compiler vectorize this loop?

```
float* a = &b[1];
...
void func1(float *a,float *b, float *c)
{
    for (int i = 0; i < LEN; i++)
        a[i] = b[i] + c[i];
}
b[1]= b[0] + c[0]
    b[2] = b[1] + c[1]
```



• Can the compiler vectorize this loop?

```
float* a = &b[1];
...
void func1(float *a,float *b, float *c)
{
  for (int i = 0; i < LEN; i++)
      a[i] = b[i] + c[i];
}
      a and b are aliasing
  There is a self-true dependence
      Vectorizing this loop would
      be illegal
```



- To vectorize, the compiler needs to guarantee that the pointers are not aliased.
- When the compiler does not know if two pointer are alias, it still vectorizes, but needs to add up-to  $O(n^2)$  run-time checks, where *n* is the number of pointers

When the number of pointers is large, the compiler may decide to not vectorize

```
void func1(float *a, float *b, float *c){
  for (int i=0; i<LEN; i++)
     a[i] = b[i] + c[i];
}</pre>
```



- Two solutions can be used to avoid the run-time checks
- 1. static and global arrays
- 2. \_\_restrict\_\_ attribute



#### 1. Static and Global arrays

```
__attribute__ ((aligned(16))) float a[LEN];
__attribute__ ((aligned(16))) float b[LEN];
__attribute__ ((aligned(16))) float c[LEN];
```

```
void func1(){
for (int i=0; i<LEN; i++)
    a[i] = b[i] + c[i];
}</pre>
```

```
int main() {
...
```

```
func1();
```

```
}
```

```
1. ____restrict___keyword
```

```
void func1(float* __restrict__ a,float* __restrict__ b,
float* __restrict__ c) {
  __assume_aligned(a, 16);
  __assume_aligned(b, 16);
  __assume_aligned(c, 16);
  for int (i=0; i<LEN; i++)</pre>
       a[i] = b[i] + c[i];
}
int main() {
   float* a=(float*) memalign(16,LEN*sizeof(float));
   float* b=(float*) memalign(16,LEN*sizeof(float));
   float* c=(float*) memalign(16,LEN*sizeof(float));
   func1(a,b,c);
}
```



# Aliasing – Multidimensional arrays

• Example with 2D arrays: pointer-to-pointer declaration.

```
void func1(float** __restrict__ a,float**
__restrict__ b, float** __restrict__ c) {
for (int i=0; i<LEN; i++)
    for (int j=1; j<LEN; j++)
        a[i][j] = b[i][j-1] * c[i][j];
}</pre>
```

### Aliasing – Multidimensional arrays

Example with 2D arrays: pointer-to-pointer declaration.
 void func1(float\*\* \_\_restrict\_\_ a,float\*\* \_\_restrict\_\_ b, float\*\* \_\_restrict\_\_ c) {
 for (int i=0; i<LEN; i++)
 for (int j=1; j<LEN; j++)
 a[i][j] = b[i][j-1] \* c[i][j];</li>



\_\_restrict\_\_ only qualifies the first dereferencing of c;

Nothing is said about the arrays that can be accessed through c[i]

## Aliasing – Multidimensional arrays

Example with 2D arrays: pointer-to-pointer declaration.
 void func1(float\*\* \_\_restrict\_\_ a,float\*\* \_\_restrict\_\_ b, float\*\* \_\_restrict\_\_ c) {
 for (int i=0; i<LEN; i++)
 for (int j=1; j<LEN; j++)
 a[i][j] = b[i][j-1] \* c[i][j];</li>



\_\_restrict\_\_ only qualifies the first dereferencing of c;

Nothing is said about the arrays that can be accessed through c[i]

Intel ICC compiler, version 11.1 will vectorize this code.

Previous versions of the Intel compiler or compilers from other vendors, such as IBM XLC, will not vectorize it.



# Aliasing – Multidemensional Arrays

- Three solutions when \_\_restrict\_\_ does not enable vectorization
- 1. Static and global arrays
- 2. Linearize the arrays and use \_\_\_restrict\_\_ keyword
- 3. Use compiler directives



# Aliasing – Multidimensional arrays

1. Static and Global declaration

```
_attribute__ ((aligned(16))) float a[N][N];
void t(){
    a[i][j]....
}
int main() {
    ü();
}
```



# Aliasing – Multidimensional arrays

```
2. Linearize the arrays
```

```
void t(float* __restrict__ A){
    //Access to Element A[i][j] is now A[i*128+j]
    ....
}
```

```
int main() {
    float* A = (float*) memalign(16,128*128*sizeof(float));
    ...
    t(A);
}
```



# Aliasing – Multidimensional arrays

3. Use compiler directives:

```
#pragma ivdep (Intel ICC)
#pragma disjoint(IBM XLC)
```

```
void func1(float **a, float **b, float **c) {
  for (int i=0; i<m; i++) {
    #pragma ivdep
    for (int j=0; j<LEN; j++)
        c[i][j] = b[i][j] * a[i][j];
}}</pre>
```



## Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics



• Array of a struct

```
typedef struct{int x, y, z}
point;
point pt[LEN];
```

```
for (int i=0; i<LEN; i++) {
    pt[i].y *= scale;
}</pre>
```

| point pt[N] | <b>x</b> <sub>0</sub> | <b>У</b> 0 | Z <sub>0</sub> | <b>X</b> <sub>1</sub> | У <sub>1</sub> | Z <sub>1</sub> | <b>x</b> <sub>2</sub> | У <sub>2</sub> | Z <sub>2</sub> | <b>X</b> <sub>3</sub> | У <sub>3</sub> | Z <sub>3</sub> |
|-------------|-----------------------|------------|----------------|-----------------------|----------------|----------------|-----------------------|----------------|----------------|-----------------------|----------------|----------------|
|             | pt[0]                 |            | pt[1]          |                       |                | pt[2]          |                       |                | pt[3]          |                       |                |                |



• Array of a struct

```
typedef struct{int x, y, z}
point;
point pt[LEN];
```

```
for (int i=0; i<LEN; i++) {
    pt[i].y *= scale;
}</pre>
```





• Array of a struct

```
typedef struct{int x, y, z}
     point;
     point pt[LEN];
     for (int i=0; i<LEN; i++) {</pre>
        pt[i].y *= scale;
      }
               vector load vector load vector load
                                      \mathbf{Z}_1 \mathbf{X}_2 \mathbf{Y}_2 \mathbf{Z}_2 \mathbf{X}_3 \mathbf{Y}_3 \mathbf{Z}_3
point pt[N] X<sub>0</sub>
                   |Y_0| |Z_0| |X_1|
vector register
                          y<sub>0</sub>
                                    V_2
(I need)
```

 Array of a struct
 typedef struct{int x, y, z} point; point; point pt[LEN];
 for (int i=0; i<LEN; i++) { pt[i].y \*= scale;

```
\begin{array}{c} \begin{array}{c} \text{vector load} & \text{vector load} \\ \text{point pt[N]} & \textbf{X}_0 & \textbf{y}_0 & \textbf{z}_0 & \textbf{x}_1 & \textbf{y}_1 & \textbf{z}_1 & \textbf{x}_2 & \textbf{y}_2 & \textbf{z}_2 \\ \hline \text{vector register} & \textbf{y}_0 & \textbf{y}_1 & \textbf{y}_2 & \textbf{y}_3 \end{array}
```

```
    Arrays
```

int ptx[LEN], int pty[LEN],
int ptz[LEN];

```
for (int i=0; i<LEN; i++) {
    pty[i] *= scale;
}</pre>
```





(I need)

#### S135

```
typedef struct{int x, y, z}
point;
point pt[LEN];
```

```
for (int i=0; i<LEN; i++) {
    pt[i].y *= scale;
}</pre>
```

#### S135\_1

```
int ptx[LEN], int pty[LEN],
int ptz[LEN];
```

```
for (int i=0; i<LEN; i++) {
    pty[i] *= scale;
}</pre>
```

S135

#### S135\_1

Intel Nehalem Compiler report: Loop was not vectorized. Vectorization possible but seems inefficient Exec. Time scalar code: 6.8 Exec. Time vector code: --Speedup: --

```
Intel Nehalem
Compiler report: Loop was
vectorized.
Exec. Time scalar code: 4.8
Exec. Time vector code: 1.3
Speedup: 3.7
```



#### S135

```
typedef struct{int x, y, z}
point;
point pt[LEN];
```

```
for (int i=0; i<LEN; i++) {
    pt[i].y *= scale;
}</pre>
```

#### S135\_1

```
int ptx[LEN], int pty[LEN],
int ptz[LEN];
```

```
for (int i=0; i<LEN; i++) {
    pty[i] *= scale;
}</pre>
```

S135

#### S135\_1

#### **IBM Power 7**

Compiler report: Loop was not SIMD vectorized because it is not profitable to vectorize Exec. Time scalar code: 2.0 Exec. Time vector code: --Speedup: -- IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 1.8 Exec. Time vector code: 1.5 Speedup: 1.2



## Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics



### **Conditional Statements – I**

- Loops with conditions need #pragma vector always
  - Since the compiler does not know if vectorization will be profitable
  - The condition may prevent from an exception

```
#pragma vector always
for (int i = 0; i < LEN; i++){
    if (c[i] < (float) 0.0)
        a[i] = a[i] * b[i] + d[i];
}</pre>
```



#### **Conditional Statements – I**

S137

S137\_1

```
for (int i = 0; i < LEN; i++){ for (int i = 0; i < LEN; i++){
 if (c[i] < (float) 0.0) if (c[i] < (float) 0.0)
   a[i] = a[i] * b[i] + d[i];
```

```
#pragma vector always
    a[i] = a[i] * b[i] + d[i];
}
```

S137

#### S137 1

**Intel Nehalem Compiler report:** Loop was not vectorized. Condition may protect exception Exec. Time scalar code: 10.4 Exec. Time vector code: --Speedup: --

**Intel Nehalem Compiler report:** Loop was vectorized. **Exec. Time scalar code**: 10.4 Exec. Time vector code: 5.0 Speedup: 2.0



#### **Conditional Statements**

Compiler removes *if conditions* when generating vector code

```
for (int i = 0; i < LEN; i++){
    if (c[i] < (float) 0.0)
        a[i] = a[i] * b[i] + d[i];
}</pre>
```



### **Compiler Directives**

• Compiler vectorizes many loops, but many more can be vectorized if the appropriate directives are used

| Compiler Hints for Intel ICC   | Semantics                               |
|--------------------------------|-----------------------------------------|
| #pragma ivdep                  | Ignore assume data dependences          |
| #pragma vector always          | override efficiency heuristics          |
| #pragma novector               | disable vectorization                   |
| restrict                       | assert exclusive access through pointer |
| attribute ((aligned(int-val))) | request memory alignment                |
| memalign(int-val,size);        | malloc aligned memory                   |
| assume_aligned(exp, int-val)   | assert alignment property               |



### **Compiler Directives**

• Compiler vectorizes many loops, but many more can be vectorized if the appropriate directives are used

| Compiler Hints for IBM XLC     | Semantics                               |
|--------------------------------|-----------------------------------------|
| #pragma ibm independent_loop   | Ignore assumed data dependences         |
| #pragma nosimd                 | disable vectorization                   |
| restrict                       | assert exclusive access through pointer |
| attribute ((aligned(int-val))) | request memory alignment                |
| memalign(int-val,size);        | malloc aligned memory                   |
| alignx (int-val, exp)          | assert alignment property               |



## Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics



#### **Access the SIMD through intrinsics**

- Intrinsics are vendor/architecture specific
- We will focus on the Intel vector intrinsics
- Intrinsics are useful when
  - the compiler fails to vectorize
  - when the programmer thinks it is possible to generate better code than the one produced by the compiler



#### **The Intel SSE intrinsics Header file**

- SSE can be accessed using intrinsics.
- You must use one of the following header files: #include <xmmintrin.h> (for SSE) #include <emmintrin.h> (for SSE2) #include <pmmintrin.h> (for SSE3) #include <smmintrin.h> (for SSE4)
  - These include the prototypes of the intrinsics.



#### Intel SSE intrinsics Data types

- We will use the following data types:
  - \_\_m128 packed single precision (vector XMM register)
    \_\_m128d packed double precision (vector XMM register)
    \_\_m128i packed integer (vector XMM register)
- Example

```
#include <xmmintrin.h>
int main () {
    ...
    __m128 A, B, C; /* three packed s.p. variables */
    ...
}
```



#### **Intel SSE intrinsic Instructions**

• Intrinsics operate on these types and have the format:

\_mm\_instruction\_suffix(...)

 Suffix can take many forms. Among them: ss scalar single precision ps packed (vector) singe precision sd scalar double precision pd packed double precision si# scalar integer (8, 16, 32, 64, 128 bits) su# scalar unsigned integer (8, 16, 32, 64, 128 bits)



#### Intel SSE intrinsics Instructions – Examples

Load four 16-byte aligned single precision values in a vector:

float a[4]={1.0,2.0,3.0,4.0};//a must be 16-byte aligned
\_\_m128 x = \_mm\_load\_ps(a);

Add two vectors containing four single precision values:
 \_\_m128 a, b;
 \_\_m128 c = \_\_mm\_add\_ps(a, b);



#### **Intrinsics (SSE)**

```
#define n 1024
__attribute__ ((aligned(16)))
float a[n], b[n], c[n];
```

```
int main() {
for (i = 0; i < n; i++) {
   c[i]=a[i]*b[i];
  }
}</pre>
```

```
#include <xmmintrin.h>
#define n 1024
__attribute__((aligned(16))) float
a[n], b[n], c[n];
```

```
int main() {
   __m128 rA, rB, rC;
for (i = 0; i < n; i+=4) {
   rA = _mm_load_ps(&a[i]);
   rB = _mm_load_ps(&b[i]);
   rC= _mm_mul_ps(rA, rB);
   _mm_store_ps(&c[i], rC);
}}</pre>
```



#### Intel SSE intrinsics A complete example

#define n 1024

Header file

```
int main() {
float a[n], b[n], c[n];
for (i = 0; i < n; i+=4) {
   c[i:i+3]=a[i:i+3]+b[i:i+3];
  }
}</pre>
```

```
#include <xmmintrin.h>
#define n 1024
__attribute__((aligned(16))) float
a[n], b[n], c[n];
```

```
int main() {
   __m128 rA, rB, rC;
for (i = 0; i < n; i+=4) {
   rA = _mm_load_ps(&a[i]);
   rB = _mm_load_ps(&b[i]);
   rC= _mm_mul_ps(rA, rB);
   _mm_store_ps(&c[i], rC);
}}</pre>
```



#### Intel SSE intrinsics A complete example

```
#define n 1024
```

```
#include <xmmintrin.h>
#define n 1024
__attribute__((aligned(16))) float
a[n], b[n], c[n];
```

```
int main() {
    __m128 rA, rB, rC;
for (i = 0; i < n; i+=4) {
    rA = _mm_load_ps(&a[i]);
    rB = _mm_load_ps(&b[i]);
    rC= _mm_mul_ps(rA, rB);
    _mm_store_ps(&c[i], rC);
}}</pre>
```



#### Intel SSE intrinsics A complete example

#define n 1000

```
#include <xmmintrin.h>
#define n 1024
__attribute__((aligned(16))) float
a[n], b[n], c[n];
```

```
int main() {
   __m128 rA, rB, rC;
for (i = 0; i < n; i+=4) {
   rA = _mm_load_ps(&a[i]);
   rB = _mm_load_ps(&b[i]);
   rC= _mm_mul_ps(rA, rB);
   _mm_store_ps(&c[i], rC);
}}</pre>
```



# **Node Splitting**

```
for (int i=0;i<LEN-1;i++){
S1 a[i]=b[i]+c[i];
S2 d[i]=(a[i]+a[i+1])*(float)0.5;
}</pre>
```

```
for (int i=0;i<LEN-1;i++){
S0 temp[i]=a[i+1];
S1 a[i]=b[i]+c[i];
S2 d[i]=(a[i]+temp[i])*(float) 0.5;
}</pre>
```







```
for (int i=0;i<LEN-1;i++){
    a[i]=b[i]+c[i];
    d[i]=(a[i]+a[i+1])*(float)0.5;
}</pre>
```

```
for (int i=0;i<LEN-1;i++){
   temp[i]=a[i+1];
   a[i]=b[i]+c[i];
   d[i]=(a[i]+temp[i])*(float)0.5;
}</pre>
```

```
Which code runs faster ?
```

```
#define n 1000
int main() {
```

#include <xmmintrin.h>

```
__m128 rA1, rA2, rB, rC, rD;
__m128 r5=_mm_set1_ps((float)0.5)
for (i = 0; i < LEN-4; i+=4) {
    rA2= _mm_loadu_ps(&a[i+1]);
    rB= _mm_load_ps(&b[i]);
    rC= _mm_load_ps(&c[i]);
    rA1= _mm_add_ps(rB, rC);
    rD= _mm_mul_ps(_mm_add_ps(rA1, rA2), r5);
    _mm_store_ps(&a[i], rA1);
    _mm_store_ps(&d[i], rD);
}}
```

Why?



#### S126

```
for (int i=0;i<LEN-1;i++){
    a[i]=b[i]+c[i];
    d[i]=(a[i]+a[i+1])*(float)0.5;
}</pre>
```

#### S126\_1

```
for (int i=0;i<LEN-1;i++){
   temp[i]=a[i+1];
   a[i]=b[i]+c[i];
   d[i]=(a[i]+temp[i])*(float)0.5;
}</pre>
```

#### S126\_2

```
#include <xmmintrin.h>
#define n 1000
```

```
int main() {
    __m128 rA1, rA2, rB, rC, rD;
    __m128 r5=_mm_set1_ps((float)0.5)
for (i = 0; i < LEN-4; i+=4) {
    rA2= _mm_loadu_ps(&a[i+1]);
    rB= _mm_load_ps(&b[i]);
    rC= _mm_load_ps(&c[i]);
    rA1= _mm_add_ps(rB, rC);
    rD= _mm_mul_ps(_mm_add_ps(rA1, rA2), r5);
    _mm_store_ps(&a[i], rA1);
    _mm_store_ps(&d[i], rD);
}}</pre>
```



S126

S126\_1

Intel Nehalem Compiler report: Loop was not vectorized. Existence of vector dependence Exec. Time scalar code: 12.6 Exec. Time vector code: --Speedup: --

Intel Nehalem Compiler report: Loop was vectorized. Exec. Time scalar code: 13.2 Exec. Time vector code: 9.7 Speedup: 1.3

S126\_2

Intel Nehalem Exec. Time intrinsics: 6.1 Speedup (versus vector code): 1.6



S126

S126\_1

| IBM Power 7                    |
|--------------------------------|
| Compiler report: Loop was SIMD |
| vectorized                     |
| Exec. Time scalar code: 3.8    |
| Exec. Time vector code: 1.7    |
| Speedup: 2.2                   |

IBM Power 7 Compiler report: Loop was SIMD vectorized Exec. Time scalar code: 5.1 Exec. Time vector code: 2.4 Speedup: 2.0

S126\_2

IBM Power 7 Exec. Time intrinsics: 1.6 Speedup (versus vector code): 1.5



### Summary

- Microprocessor vector extensions can contribute to improve program performance and the amount of this contribution is likely to increase in the future as vector lengths grow.
- Compilers are only partially successful at vectorizing
- When the compiler fails, programmers can
  - add compiler directives
  - apply loop transformations
- If after transforming the code, the compiler still fails to vectorize (or the performance of the generated code is poor), the only option is to program the vector extensions directly using intrinsics or assembly language.



#### References

- Michael Voss, Software and Services Group, Intel. Topics in Loop Vectorization.
- María Garzarán, Saeed Maleki, William Gropp and David Padua. Program Optimization Through Loop Vectorization. UIUC.
- Daniel Kusswurm Modern X86 Assembly Language Programming.