# Performance of Electronic Structure Calculations on Built-in FPGA Systems

**Seungmin Lee** 

(E: smlee76@kisti.re.kr)

Intel® Parallel Computing Center Korea Institute of Science and Technology Information (KISTI)







Introduction

- Built-in FPGA System
- Methodology
- Results
- Conclusion

## **Accelerator/Co-processor Devices**





## **Processors with a Built-in Accelerator**





## Tegra K1 (ARM Cortex-A15 with NVIDIA GPU)



## AMD APU Kaveri(2013)



## Intel Broadwell Xeon with a Built-in FPGA







# **Workload for Test**

#### **Electronic Structures: Sparse Matrix-vector Multiplications**



### **Flow of NanoElectronics Modeling**



Schrödinger Equation: Electronic Structure

 $H\Psi = E\Psi$ 

Eigenvalue problem

#### Schrödinger Eqs. w/ LANCZOS Algorithm

- → C. Lanczos, J. Res. Natl. Bur. Stand. 45, 255
- Normal Eigenvalue Problem (Electronic Structure)
- Hamiltonian is always symmetric
- Steps for Iteration: Purely Scalable Algebraic Ops.



Compressed Sparse Row(CSR) Format

• Storage for Large-scale Hamiltonian Matrices.

# **Workload for Test**

#### **Execution time of LANCZOS algorithm**

# OCC KISTI

#### Schrödinger Eqs. w/ LANCZOS Algorithm

→ C. Lanczos, J. Res. Natl. Bur. Stand. 45, 255

- Normal Eigenvalue Problem (Electronic Structure)
- Hamiltonian is always symmetric
- Steps for Iteration: Purely Scalable Algebraic Ops.

$$\begin{array}{c} \mathbf{v}_{i}: (\mathrm{Nx1}) \text{ vectors } (i = 0, \dots, \mathbf{K}); \mathbf{a}_{i} \text{ and } \mathbf{b}_{i}: \text{ scalars } (i = 1, \dots, \mathbf{K}) \\ \mathbf{v}_{0} \leftarrow \mathbf{0}, \mathbf{v}_{1} = \text{ random vector with norm } 1; \\ \mathbf{b}_{1} \leftarrow \mathbf{0}; \\ 100p \text{ for } (j=1; j \leq \mathbf{K}; j++) \\ \mathbf{w}_{j} \leftarrow \mathbf{Av}_{j}; \\ \mathbf{a}_{j} \leftarrow \mathbf{w}_{j} \cdot \mathbf{v}_{j}; \\ \mathbf{a}_{j} \leftarrow \mathbf{w}_{j} \cdot \mathbf{a}_{j}\mathbf{v}_{j} - \mathbf{b}_{j}\mathbf{v}_{j-1}; \\ \mathbf{b}_{j+1} \leftarrow ||\mathbf{w}_{j}||; \\ \mathbf{v}_{j+1} \leftarrow \mathbf{w}_{j}/\mathbf{b}_{j+1}; \\ \text{ construct T matrix; } \end{array} \right) T = \left( \begin{array}{cccc} a_{1} & b_{2} & 0 & \cdots & \cdots & 0 \\ b_{2} & a_{2} & b_{3} & \cdots & \vdots \\ 0 & b_{3} & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & b_{k-1} & 0 \\ \vdots & & b_{k-1} & a_{k-1} & b_{k} \\ 0 & \cdots & \cdots & 0 & b_{k} & a_{k} \end{array} \right)$$



MCDRAM

eigenvalues(others)

MemoryOP

0%

DRAM

eigenvalues(dstebz)

Vvdotproduct

#### Execution Time Ratio of Lanczos Algorithm

CACHE

mvmultiplier loop

Others

# **Workload for Test**

Serial code of SpMV



#### Schrödinger Eqs. w/ LANCZOS Algorithm

→ C. Lanczos, J. Res. Natl. Bur. Stand. 45, 255

- Normal Eigenvalue Problem (Electronic Structure)
- Hamiltonian is always symmetric

• Steps for Iteration: Purely Scalable Algebraic Ops.

```
      \mathbf{v}_{i}: (Nx1) \text{ vectors } (i = 0, ..., \mathbf{K}); \mathbf{a}_{i} \text{ and } \mathbf{b}_{i}: \text{ scalars } (i = 1, ..., \mathbf{K}) 
      \mathbf{v}_{0} \leftarrow \mathbf{0}, \mathbf{v}_{1} = \text{ random vector with norm } 1; 
      \mathbf{b}_{1} \leftarrow \mathbf{0}; 
      \mathbf{b}_{0} \leftarrow \mathbf{0}; 
       \mathbf{b}
```

```
for (unsigned int i = 0; i < nSize; i++) {
    double real_sum = 0.0;
    double imaginary_sum = 0.0;
    const unsigned int nSubStart = pMatrixRow[i];
    const unsigned int nSubEnd = pMatrixRow[i + 1];</pre>
```

```
for (unsigned int j = nSubStart; j < nSubEnd; j++) {
   const unsigned int nColIndex = pMatrixColumn[j];
   const double m_real = pMatrixReal[j];
   const double m_imaginary = pMatrixImaginary[j];
   const double v_real = pVectorReal[nColIndex];
   const double v_imaginary = pVectorImaginary[nColIndex];</pre>
```

```
real_sum += m_real * v_real – m_imaginary * v_imaginary;
imaginary_sum += m_real * v_imaginary + m_imaginary * v_real;
```

pResultReal[i] = real\_sum; pResultImaginary[i] = imaginary\_sum;

}

#### **Compressed Sparse Row Format**

• Storage for Large-scale Hamiltonian Matrices.

# **Strategy for Porting to FPGA**

#### **Core Module for SpMV**



#### **Computing Strategy**

- All the multiplications are performed on a FPGA side
- Need to create buffers to store matrix, in/out vectors
- Involves copy in/out to /from FPGA to get the correct results

OpenCL Host Code →

| <pre>clGetPlatformIDs(1,&amp;platform, NULL);<br/>clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &amp;device, NULL);<br/>context = clCreateContext(NULL, 1, &amp;device, NULL, NULL, NULL);<br/>queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);<br/>size_t binary_size;<br/>unsigned char *binary_data = loadBinaryFile("MVMul_fpga.aocx", &amp;binary_size);<br/>program = clCreateProgramWithBinary(context, 1, &amp;device, &amp;binary_size,</pre>                                                                                                                                                             | Select FPGA device<br>and load kernel                                                                                                                 |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>memMatrixRow = clCreateBuffer(context, CL_MEM_READ_ONLY,</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Create buffer for<br>1) row/column index (RO)<br>2) A matrix (RO)<br>3) X vector (RO)<br>4) Y result vector<br>(WO)<br>RO : Read-Only WO : Write-Only |
| <pre>clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &amp;memMatrixRow);<br/>clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &amp;memMatrixCol);<br/>clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &amp;memMatrixReal);<br/>clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *) &amp;memMatrixImaginary);<br/>clSetKernelArg(kernel, 4, sizeof(cl_mem), (void *) &amp;memVectorReal);<br/>clSetKernelArg(kernel, 5, sizeof(cl_mem), (void *) &amp;memVectorImaginary);<br/>clSetKernelArg(kernel, 6, sizeof(cl_mem), (void *) &amp;memResultReal);<br/>clSetKernelArg(kernel, 7, sizeof(cl_mem), (void *) &amp;memResultImaginary);</pre> | Set kernel argument<br>(parameters of func. Call)                                                                                                     |

# Strategy for Porting to FPGA (Cont.)

#### **Core Module for SpMV**



#### **Computing Strategy**

- All the multiplications are performed on a FPGA side
- Need to create buffers to store matrix, in/out vectors
- Involves copy in/out to /from FPGA to get the correct results

OpenCL Host Code → (con't)



# Strategy for Porting to FPGA (Cont.)

#### **Core Module for SpMV**



#### **Computing Strategy**

- All the multiplications are performed on a FPGA side
- Need to create buffers to store matrix, in/out vectors
- Involves copy in/out to /from FPGA to get the correct results

OpenCL Kernel Code  $\rightarrow$ 



## **Environment for Development**



## Setup condition

- Intel FPGA SDK for OpenCL ver. 16.0.2
- Quartus Prime Pro 16.0
- AAL(Accelerator Abstract Layer) Kernel 5.0.2
- OpenCL BSP(board support package) 1.0

## Test environment

- Convergence criterion: 10<sup>-8</sup>eV
- Maximum # of iterations: 10<sup>4</sup>
- Mission: find 10 lowest conduction band energy levels of Si:P QDs

| kistieubuntuus (vark (balla varld (balla v                                          | wild/hint /hast                                    |
|-------------------------------------------------------------------------------------|----------------------------------------------------|
| <pre>kisti@ubuntu:~/work/hello_world/hello_wo<br/>Querying platform for info:</pre> | oria/bin\$ ./nost                                  |
|                                                                                     |                                                    |
| CL_PLATFORM_NAME                                                                    | = Altera SDK for OpenCL                            |
| CL_PLATFORM_VENDOR                                                                  | = Altera Corporation                               |
| CL_PLATFORM_VERSION                                                                 | = OpenCL 1.0 Altera SDK for OpenCL, Version 16.0.2 |
| Querying device for info:                                                           |                                                    |
|                                                                                     |                                                    |
| CL_DEVICE_NAME                                                                      | <pre>= bdw_fpga_v1.0 : BDW FPGA OpenCL BSP</pre>   |
| CL_DEVICE_VENDOR                                                                    | = Intel Corp                                       |
| CL_DEVICE_VENDOR_ID                                                                 | = 4466                                             |
| CL_DEVICE_VERSION                                                                   | = OpenCL 1.0 Altera SDK for OpenCL, Version 16.0.  |
| CL_DRIVER_VERSION                                                                   | = 16.0                                             |
| CL_DEVICE_ADDRESS_BITS                                                              | = 64                                               |
| CL_DEVICE_AVAILABLE                                                                 | = true                                             |
| CL_DEVICE_ENDIAN_LITTLE                                                             | = true                                             |
| CL_DEVICE_GLOBAL_MEM_CACHE_SIZE                                                     | = 32768                                            |
| CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE                                                 | = 0                                                |
| CL_DEVICE_GLOBAL_MEM_SIZE                                                           | = 281474976710656                                  |
| CL_DEVICE_IMAGE_SUPPORT                                                             | = true                                             |
| CL_DEVICE_LOCAL_MEM_SIZE                                                            | = 16384                                            |
| CL_DEVICE_MAX_CLOCK_FREQUENCY                                                       | = 1000                                             |
| CL_DEVICE_MAX_COMPUTE_UNITS                                                         | = 1                                                |
| CL_DEVICE_MAX_CONSTANT_ARGS                                                         | = 8                                                |
| CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE                                                  | = 70368744177664                                   |
| CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS                                                  | = 3                                                |
| CL_DEVICE_MEM_BASE_ADDR_ALIGN                                                       | = 8192                                             |
| CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE                                                  | = 1024                                             |
| CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR                                               | = 4                                                |
| CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT                                              | = 2                                                |
| CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT                                                | = 1                                                |
| CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG                                               | = 1                                                |
| CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT                                              | = 1                                                |
| CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE                                             |                                                    |
| Command queue out of order?                                                         | = false                                            |
| Command queue profiling enabled?                                                    | = true                                             |
| Using AOCX: hello_world.aocx                                                        |                                                    |

# **Optimization with SIMD**

{

#### **Cannot be directly applied: thread ID dependency**



```
kernel void MVMul fpga(
  ___global const int * restrict pMatrixRow,
  ____global const int * restrict pMatrixColumn,
  global const double * restrict pMatrixReal,
  ___global const double * restrict pMatrixImaginary.
  _____global const double * restrict pVectorReal,
  ____global const double * restrict pVectorImaginary,
                double * restrict pResultReal,
  __global
                 double * restrict pResultImaginary)
  __global
  unsigned int tid = get global id(0);
  unsigned int nSubStart = pMatrixRow[tid];
  unsigned int nSubEnd = pMatrixRow[tid+1];
  unsigned int nColIndex;
  double m_real, m_imaginary, v_real, v_imaginary;
  double real sum = 0.0, imaginary sum = 0.0;
  for(unsigned int j=nSubStart; j<nSubEnd; j++) {</pre>
      nColIndex = \pMatrixColumn[j];
      m_real = pMatrixReal[j];
      m_imaginary = pMatrixImaginary[j];
      v real = pVectorReal[nColIndex];
      v_imaginary = pVectorImaginary[nColIndex];
      real sum += m real * v real - m imaginary * v imaginary;
      imaginary sum += m real * v imaginary + m imaginary * v real;
  pResultReal[tid] = real_sum;
  pResultImaginary[tid] = imaginary_sum;
```

#### (Warning Message)

Compiler Warning: Kernel Vectorization: branching is thread ID dependent ... cannot vectorize.

# Changing computing logic to avoid dependency on Thread ID

- Encounters another problem
  - $\rightarrow$  Logic is too complicate to be fit to FPGA

| ;    | Estimated Resource Usage Summary                                                       |       |                                  |     |    |
|------|----------------------------------------------------------------------------------------|-------|----------------------------------|-----|----|
| ;    | Resource                                                                               | +     | Usage                            | ·   |    |
| ;;;; | Logic utilization<br>ALUTs<br>Dedicated logic registers<br>Memory blocks<br>DSP blocks | ;;;;; | 114%<br>49%<br>66%<br>35%<br>14% | Fai | !! |
| -    |                                                                                        | -+    |                                  | ·;  |    |

What about employing more computing unit then?

## **Optimization with copying "compute units"**



#### Number of "compute units"

1.

- The maximum copy units is 6 for this code
- Compilation error (logic util > 100%) w/ 7 units)

#### Report for num\_compute\_units(6)

| ; Estimated Resource Usage Summary |         |  |  |
|------------------------------------|---------|--|--|
| ; Resource                         | + Usage |  |  |
| ; Logic utilization                | ; 93%   |  |  |
| ; ALUTs                            | ; 40%   |  |  |
| ; Dedicated logic registers        | ; 53%   |  |  |
| ; Memory blocks                    | ; 87%   |  |  |
| ; DSP blocks                       | ; 18%   |  |  |
| +                                  |         |  |  |

#### Report for num\_compute\_units(7)

| ; Estimated Resource Usage Summary                                                               |                                            |  |
|--------------------------------------------------------------------------------------------------|--------------------------------------------|--|
| ; Resource                                                                                       | + Usage                                    |  |
| ; Logic utilization<br>; ALUTs<br>; Dedicated logic registers<br>; Memory blocks<br>; DSP blocks | ; 100%<br>; 43%<br>; 57%<br>; 98%<br>; 19% |  |

```
_attribute__((num_compute_units(6)
  kernel void MVMul_fpga(
    __global const int * restrict pMatrixRow,
    ____global const int * restrict pMatrixColumn,
    global const double * restrict pMatrixReal,
    global const double * restrict pMatrixImaginary,
    ____global const double * restrict pVectorReal,
    global const double * restrict pVectorImaginary,
    __global
                  double * restrict pResultReal,
   __global
                  double * restrict pResultImaginary)
    unsigned int tid = get_global_id(0);
    unsigned int nSubStart = pMatrixRow[tid];
    unsigned int nSubEnd = pMatrixRow[tid+1];
    unsigned int nColIndex;
    double m_real, m_imaginary, v_real, v_imaginary;
    double real sum = 0.0, imaginary sum = 0.0;
    for(unsigned int j=nSubStart; j<nSubEnd; j++) {</pre>
        nColIndex = pMatrixColumn[j];
        m real = pMatrixReal[j];
        m_imaginary = pMatrixImaginary[j];
        v_real = pVectorReal[nColIndex];
        v_imaginary = pVectorImaginary[nColIndex];
        real sum += m real * v real - m imaginary * v imaginary;
        imaginary sum += m real * v imaginary + m imaginary * v real;
    pResultReal[tid] = real sum;
    pResultImaginary[tid] = imaginary sum;
}
```

# **Optimization with Memory Pinning**

IPCC KISTI

Non-pinned memory Pinned memory Memory Type

- A well known technique to accelerate data transfer between host and device(FPGA)
- Data transfer is performed implicitly when pinned memory is used

```
double *A;
cl_mem memA = clCreateBuffer(context,CL_MEM_ALLOC_HOST_PTR,
    size, NULL, NULL);
A = (double *) clEnqueueMapBuffer(queue, memA, CL_TRUE,
    CL_MAP_READ, 0, size, 0, NULL, NULL, NULL);
```

## **Conclusion and Future Plan**



- Implementation OpenCL kernel code for the built-in FPGA device
- Two strategies for performance improvement
  - Duplication of compute unit : up-to 6, ~1.58x speed-up
  - Utilization of pinned memory : ~1.64x bandwidth
- SpMV code with Altera Dynamic Profiler for OpenCL to analyze
  - Memory/channel/pipe accesses
  - SIMD usage

| Board<br>Global Memory EW (MEMORY) |                                                                                                                            | 66%_fpga_v1.0<br>25800 М8/s                |              |                    |                                                                |
|------------------------------------|----------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|--------------|--------------------|----------------------------------------------------------------|
|                                    |                                                                                                                            |                                            |              |                    |                                                                |
| File Name                          | Directory                                                                                                                  |                                            |              |                    |                                                                |
| MVMul_fpgi                         | a.cl //home/kisti/work/QAND-3D_for_fpg                                                                                     | a_multi_compute_units/de                   | vice_cu4_pro | file/MVMul_fpga    | .cl                                                            |
|                                    |                                                                                                                            |                                            |              |                    |                                                                |
| Line #                             | Source: MVMul_fpga.cl                                                                                                      | Attributes                                 | Stall%       | Occupancy%         | Bandwidth                                                      |
| 1                                  | #pragma OPENCL EXTENSION cl_khr_fp64 : enable                                                                              |                                            |              |                    |                                                                |
| 2                                  |                                                                                                                            |                                            |              |                    |                                                                |
| 3                                  | _attribute_((num_compute_units(4)))                                                                                        |                                            |              |                    |                                                                |
| 4                                  | kernel void MVMul_fpga(                                                                                                    |                                            |              |                    |                                                                |
| 5                                  | _global const int * restrict pMatrixRow,                                                                                   |                                            |              |                    |                                                                |
| 6                                  | global const int * restrict pMatrixColumn,                                                                                 |                                            |              |                    |                                                                |
| 7                                  | _global const double * restrict pMatrixReal,                                                                               |                                            |              |                    |                                                                |
| 8                                  | _global const double * restrict pMatrixImaginary.                                                                          |                                            |              |                    |                                                                |
| 9                                  | _global const double * restrict pvectorReal,                                                                               |                                            |              |                    |                                                                |
| 10                                 | _global const double * restrict pvectorimaginary.                                                                          |                                            |              |                    |                                                                |
| 11                                 | global double * restrict pResultReal,                                                                                      |                                            |              |                    |                                                                |
| 12                                 | _global double * restrict pResultImaginary)                                                                                |                                            |              |                    |                                                                |
| 13                                 | {                                                                                                                          |                                            |              |                    |                                                                |
| 14                                 | int tid = get_global_id(0);                                                                                                |                                            |              |                    |                                                                |
| 15                                 | int nSubStart = pMatrixRow[tid]:                                                                                           | (_global{MEMORY}.re                        | (14.5%)      | (0.4%)             | (26.2MB/s, 81.68%Efficiency)                                   |
| 16                                 | int nSubEnd = pMatrixRow[tid+1];                                                                                           |                                            |              |                    |                                                                |
| 17                                 | double real_sum = 0.8, imaginary_sum = 8.0;                                                                                |                                            |              |                    |                                                                |
| 18                                 |                                                                                                                            |                                            |              |                    |                                                                |
| 19                                 | for(unsigned int j=nSubStart;   <nsubend; j++)="" td="" {<=""><td></td><td>(</td><td>(2.0.0.1)</td><td></td></nsubend;>    |                                            | (            | (2.0.0.1)          |                                                                |
| 20                                 | double m_real = pMatrixReal[j]:                                                                                            | (_global {MEMORY}.re                       |              | (12.9%)            | (4117.6MB/s, 19.83%Efficiency)                                 |
| 21<br>22                           | double m_imaginary = pMatriximaginary[j];                                                                                  | (global{MEMORV},re                         | (55.71%)     | (12.9%)            | (4117.6MB/s, 19.83%Efficiency)                                 |
| 22                                 | intercelledes attests column 10.                                                                                           | ( elebel (MTMORV) en                       | (40.714))    | (1.7.0%)           | (3870 BMR/s, 10 55% Efficiency)                                |
| 23<br>24                           | int nColindex = pMatrixColumn[j];                                                                                          | (global {MEMORV}, re                       |              | (12.9%)<br>(12.9%) |                                                                |
| 24<br>25                           | double v_real = pVectorReal[nColindex];<br>double v_imaginary = pVectorimaginary[nColindex];                               | (_global{MEMORY}.re<br>( global{MEMORY}.re |              | (12.9%)            | (97.0MB/s, 100.00%Efficiency)<br>(97.0MB/s, 100.00%Efficiency) |
| 25                                 | dooble v_imaginary = pvectorimaginary(hcblindeid)                                                                          | "_guoda [MEMORY], PE                       | (74.94%)     | (12.0%)            | (er, umbrs, 100.00%Emclency)                                   |
| 20                                 | real sum +- m real *v real - m imaginary *v imaginary:                                                                     |                                            |              |                    |                                                                |
| 28                                 | imaginary sum += m_real *v_real - m_imaginary *v_imaginary:<br>imaginary sum += m_real *v_imaginary + m_imaginary *v_real; |                                            |              |                    |                                                                |
| 28<br>29                           | <pre>imaginary_sum += m_rear + v_imaginary + m_imaginary + v_rea;</pre>                                                    |                                            |              |                    |                                                                |
| 30                                 | pResultReal(tid) = real sum;                                                                                               | ( global {MEMORY}, wri                     | (0.04%)      | (0.4%)             | (40.9MB/s, 52.32%Efficiency)                                   |
| 30                                 | pResultimaginary[tid] = imaginary sum;                                                                                     | (_global {MEMORY},wri                      |              | (0.4%)             | (40.9MB/s, 52.32%Efficiency)<br>(40.9MB/s, 52.32%Efficiency)   |
| 32                                 | presolutinaginary(uo) = maginary_sum;                                                                                      | <pre>igooar(mEMORY).wn</pre>               | . (0.01%)    | (0.470)            | (40.8mb/s, 32.52%Emclency)                                     |

## **Questions? / Comments?**



# Thank you!