Application Experiences with Heterogeneous Computing Architecture : "SIMDization"

Hongsuk Yi

hsyi@kisti.re.kr KISTI Supercomputing Center (Korea)

### What do we do?

- Parallel Programming Models in the HC Jungle
  - Heterogeneous computing Hardware and Software
    - HW: Xeon Phi Coprocessors, AMD GPU, NVIDIA GPUs
    - SW: OpenCL, CUDA, Offload, OpenHMPP, MPI, OpenMP, Hybrid
- Main interests: Scientific Libaraies
  - GSL-CL provides high level C interfaces for GNU Scientific Library routines on CPU and GPUs using OpenCL.
    - GSL-CL (http://sourceforge.net/projects/gsl-cl/develop)
  - Development of Parallel application algorithms for nanomaterials
    - KMC (KISTI Monte Carlo)
    - KMD (KISTI Molecular Dynamics)

# Scaling for Multicore and Manycore

- Increasing Parallelism
  - Core scaling (cores)
  - Thread scaling/core
  - Data level parallelism (SIMD) scaling
    - SIMDization (Single Instruction Multiple Data) or Vectorization

#### – Intel AVX

- 256-bit wide register(ymm), 4 doubles/ instruction
- Shipped with Sandy bridge

#### – Intel Phi

• 512-bit wide register(), 8 doubles/ instruction for DP

#### What is vector

- Scalar computing
  - c = a + b, where a, b, c mean
    scalar floating-point variables.
  - ex.  $a = 1 * e^{-2}$
- Vector computing
  - c = a + b, where a, b, c mean arrays(vector) of floating-point variable and ' + ' means element-wise addition

- Both addition operations are executed per one CPU cycle.
- If vectors(floating point arrays) have two elements, 200 % speed is expected than scalar addition.
- Xeon Phi supports vectors with 8 elements in double precision.

### SIMD Fused Multiply Add



#### KMD : KISTI Molecular Dynamics

- Empirical Potential MD
  - is very good algorithm
  - Newtonian EOM + Rather simple form of potentials
  - Our choice is Tersoff potential on covalent C, Si, Ge
    - J. Tersoff, PRL 56, 632 (1986), PRB 37, 6991 (1988)



## "Native" mode is very easy



Near-theoretical linear performance scaling with the number of cores





NISN-CCS mini-workshop

### **SIMDization Strategies**

- Auto Vectorization
  Vij = fc \* (A\*exp(-lambda1 \* rij) B\*bij\*exp(-lambda2 \* rij));
- Intrinsic (SSE, AVX, MIC, Phi)
  v\_tmp = \_mm256\_pow\_pd(\_mm256\_mul\_pd(beta, zij), n1))
- Class (SSE, AVX, MIC, Phi)
  F64vec4 v\_tmp = pow(F64vec4(beta) \* zij, F64vec4(n))
- Assembly

vmulpd 160(%rsp), %ymm1, %ymm0 call \_svml\_pow4

OpenCL (AMD GPU, HD5870, HD6950)
 float4 Bij = pow(1.0 + tmp, -1/(2\*n));

## Implementation

| (a) | vmovups 224(%rsp), %ymm1<br>vmulpd 192(%rsp),%ymm6, %ymm0<br>movq 48(%rbp), %rbx<br>call _svml_pow4<br>vmovups 256 (%rsp), %ymm1<br>vaddpd .L_2il0floatpacket.53(%rip), %ymm0, %ymm0<br>call svml pow4                                                                                                                                                                                                        | vmovupd (%rsp), %<br>vmovupd 640(%rsp), %ymm4<br>vmulpd 96(%rsp), %ymm2, %ymm3<br>vmulpd 64(%rsp), %ymm0, %ymm1<br>vmulpd 128(%rsp), %ymm4, %ymm0<br>vmovupd %ymm3, 192(%rsp)<br>vmovupd %ymm1, 160(%rsp)<br>call _svml_pow4 |  |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
|     | vmovupd 640(%rsp), %ymm1<br>vmovupd %ymm0, (%rsp)<br>vmulpd 160(%rsp), %ymm1, %ymm0<br>call _svml_pow4                                                                                                                                                                                                                                                                                                        | Assembly                                                                                                                                                                                                                     |  |
| (b) | v_tmp = _mm256_pow_pd(_mm256_mul_pd(beta, zij), n1))    Intrinsic      bij = _mm256_pow_pd(_mm256_add_pd(_mm256_set1_pd(1.0), v_tmp), n2);    v_tmp = _mm256_mul_pd(_mm256_set1_pd(A), _mm256_exp_pd(_mm256_mul_pd(lambda1, rij)));      v_tmp1 = _mm256_mul_pd(_mm256_mul_pd(bij, B), _mm256_exp_pd(_mm256_mul_pd(lambda2, rij)));    _mm256_store_pd(Vij, _mm256_mul_pd(fc, _mm256_sub_pd(v_tmp, v_tmp1))); |                                                                                                                                                                                                                              |  |
| (c) | $\label{eq:constraint} \begin{array}{ll} F64vec4 \ \textbf{v\_tmp} = pow(F64vec4(beta) * \textbf{zij}, \ F64vec4(n)); & \textbf{Class} \\ F64vec4 \ \textbf{bij} = pow(F64vec4(1.0) + \textbf{v\_tmp}, \ F64vec4(-1/(2*n))); \\ Storeu(Vij, \ \textbf{fc}*(F64vec4(A)*exp(F64vec4(-lambda1) * \textbf{rij}) - F64vec4(B)*\textbf{bij}*exp(F64vec4(-lambda2) * \textbf{rij}))); \end{array}$                   |                                                                                                                                                                                                                              |  |
| (d) | tmp = pow(beta * zij, n);<br>bij = pow(1.0 + tmp, -1/(2*n));<br>Vij = fc * (A*exp(-lambda1 * rij) - B*bij*exp(-lambda2 * rij));                                                                                                                                                                                                                                                                               |                                                                                                                                                                                                                              |  |
| (e) | <i>float4</i> <b>tmp</b> = pow(beta * <b>zij</b> , n));<br><i>float4</i> <b>bij</b> = pow(1.0 + <b>tmp</b> , -1/(2*n));<br><i>float4</i> <b>Vij</b> = <b>fc</b> * (A*exp(-lambda1 * <b>rij</b> ) - B*bij*exp(-lambda2 * <b>ri</b>                                                                                                                                                                             | i));                                                                                                                                                                                                                         |  |
|     | $b_{ij} = \left(1 + \left(\beta\zeta_{ij}\right)^n\right)^{-1/2n}$ $V_{ij} = f_c(r_{ij})[A \exp(-\lambda_1 r_{ij}) - b_{ij}B \exp(-\lambda_2 r_{ij})]$                                                                                                                                                                                                                                                        |                                                                                                                                                                                                                              |  |
|     | NISN-CCS mini-workshop                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                                                                                                                                                              |  |

#### Data structure : AOS vs SOA

#### • Array of structure (AOS) for random access

- Array of 3-dimensional position vectors
- double position[3 \* n];
- {x1, y1, z1, x2, y2, z2, ... } which is usual coding style.



- Structure of array (SOA) for contiguous acess
  - double position\_x[n], position\_y[n], position\_z[n];
  - {x1, x2, x3, ...}, {y1, y2, y3, ...}, {z1, z2, z3, ...}
  - SOA should be careful because it accompanies over all rewriting the code completely



#### Summary

- We showed simple SIMDization for the Tersoff algorithm is not efficient.
  - Modified the algorithm through profiling and data dependancy
- Full SIMDization gives over 50% performance of the theoretical peak in both AVX and Intel Xeon Phi.
  - SIMDization is easily merged into well established MPI and OpenMP parallelization.
- SIMDization will be a key factor gradually in HPC of molecular dynamics.

#### Contact

Hongsuk Yi, Ph.D. KISTI Supercomputing Center <u>hsyi@kisti.re.kr</u>

#### Thanks to Hogyun Jeong (KISTI) Seungmin Lee (KISTI) Heungsik Kim (KISTI) Byoungwon Choe (Intel)