# Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

Hoon Ryu, Ph.D. (E: <u>elec1020@kisti.re.kr</u>)

Principal Researcher / Korea Institute of Science and Technology Information (KISTI) Principal Investigator / KISTI Intel® Parallel Computing Center









# **Electronic Structure Calculations**

## An application that has customers in a a HUGE range



- The state of motion of electrons in an electrostatic field created by the stationary nuclei.
- Quantum Physics; Quantum Chemistry; Nanoscale Materials and Devices
  - $\rightarrow$  Physics, Chemistry, Materials Science, Electrical Engineering and Mechanical Engineering ETC.
- Huge customers in the society of computational science



## 

Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

# **Electronic Structure Calculations**

In a perspective of "numerical analysis"

- Two PDE-coupled Loop: <u>Schrödinger Equation</u> and <u>Poisson Equation</u>
- Both equation involve system matrices (Hamiltonian and Poisson)
  - $\rightarrow$  DOFs of those matrices are proportional to the # of grids in the simulation domains



- Schrödinger Equations
  - $\rightarrow$  Normal Eigenvalue Problem  $H\Psi = E\Psi$
- Poisson Equations
  - → Linear System Problem  $-\nabla(\epsilon \nabla V) = \rho \rightarrow Ax = b$



#### Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

# **Electronic Structure Calculations**

In a perspective of "numerical analysis"

- Two PDE-coupled Loop: <u>Schrödinger Equation</u> and <u>Poisson Equation</u>
- Both equation involve system matrices (Hamiltonian and Poisson)
  - $\rightarrow$  DOFs of those matrices are proportional to the # of grids in the simulation domains



Schrödinger Equations

Poisson Equations

- $\rightarrow$  <u>Normal Eigenvalue Problem</u>
- $H\Psi = E\Psi$
- $\rightarrow$  Linear System Problem  $-\nabla(\varepsilon \nabla V) = \rho \rightarrow A = b$

How large are these system matrices? Why do we need to handle those?



# **Needs for "Large" Electronic Structures**

## **Needs for High Performance Computing**



1. Quantum Simulations of "Realizable" Nanoscale Materials and Devices

 $\rightarrow$  Needs to handle large-scale atomic systems (~ A few tens of nms)



Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

# **Development Strategy: DD, Matrix Handling**

## Large-scale Schrödinger, Poisson Eqns.



| Schrödinger Equation<br>• Normal Eigenvalue Problem (Electronic Structure)<br>• Hamiltonian is always symmetric $H\Psi = E\Psi$ |                                    |                    |        | Poisson Equation• Linear System Problem (Electrostatics: Q-V)• Poisson matrix is always symmetric $-\nabla(\epsilon \nabla V) = \rho$ |
|---------------------------------------------------------------------------------------------------------------------------------|------------------------------------|--------------------|--------|---------------------------------------------------------------------------------------------------------------------------------------|
| <text><list-item><list-item></list-item></list-item></text>                                                                     | Adiag(0) W(0,1)<br>W(1,0) Adiag(1) | W(1,2)<br>Adiag(2) | W(2,3) | 4 TB Hamiltonian Onsite Coupling<br>(3) (3,4) (3,7)                                                                                   |

#### Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

# **Development Strategy: Numerical Algorithms**

## **Schrödinger Equations**



 $= E\Psi$ 



#### Schrödinger Eqs. w/ LANCZOS Algorithm

- → C. Lanczos, J. Res. Natl. Bur. Stand. 45, 255
- Normal Eigenvalue Problem (Electronic Structure)
- Hamiltonian is always symmetric

Original Matrix 
$$\rightarrow$$
 T matrix-reduction  $H\Psi$ 

• Steps for Iteration: Purely Scalable Algebraic Ops.

$$\mathbf{v}_{i}: (Nx1) \text{ vectors } (i = 0, ..., \mathbf{K}); \mathbf{a}_{i} \text{ and } \mathbf{b}_{i}: \text{ scalars } (i = 1, ..., \mathbf{K})$$
$$\mathbf{v}_{0} \leftarrow \mathbf{0}, \mathbf{v}_{1} = \text{ random vector with norm } 1;$$
$$\mathbf{b}_{1} \leftarrow \mathbf{0};$$
$$\text{loop for } (j=1; j \leq \mathbf{K}; j++)$$
$$\mathbf{w}_{j} \leftarrow \mathbf{Av}_{j};$$
$$\mathbf{a}_{j} \leftarrow \mathbf{w}_{j} \cdot \mathbf{v}_{j};$$
$$\mathbf{w}_{j} \leftarrow \mathbf{w}_{j} - \mathbf{a}_{j}\mathbf{v}_{j} - \mathbf{b}_{j}\mathbf{v}_{j-1};$$
$$\mathbf{b}_{j+1} \leftarrow ||\mathbf{w}_{j}||;$$
$$\mathbf{v}_{j+1} \leftarrow ||\mathbf{w}_{j}||;$$
$$\mathbf{v}_{j+1} \leftarrow \mathbf{w}_{j} / \mathbf{b}_{j+1};$$
$$\text{construct T matrix;}$$
$$\mathbf{r} = \begin{pmatrix} a_{1} & b_{2} & 0 & \cdots & 0 \\ b_{2} & a_{2} & b_{3} & \cdots & 0 \\ b_{2} & a_{2} & b_{3} & \cdots & \cdots & 0 \\ 0 & b_{3} & \ddots & \ddots & 0 \\ \vdots & \cdots & \cdots & b_{k-1} & 0 \\ \vdots & \cdots & b_{k-1} & a_{k-1} & b_{k} \\ 0 & \cdots & \cdots & 0 & b_{k} & a_{k} \end{pmatrix}$$

# **Development Strategy: Numerical Algorithms**

## **Poisson Equations**





#### Poisson Eqs. w/ CG Algorithm

- $\rightarrow$  A Problem of Solving Linear Systems
- Conv. Guaranteed: Symmetric & Positive Definite
- Poisson is always S & PD.

 $-\nabla(\varepsilon\nabla V) = \rho$ 

 Steps for Iteration: Purely Scalable Algebraic Ops.

We want to solve 
$$Ax = b$$
. First compute  $\mathbf{r}_0 = \mathbf{b} - Ax_0$ ,  $\mathbf{p}_0 = \mathbf{r}_0$   
loop for (j=1; j<=K; j++)  
 $\mathbf{a}_j \leftarrow \langle \mathbf{r}_j \bullet \mathbf{r}_j \rangle / \langle A\mathbf{p}_j \bullet \mathbf{p}_j \rangle$ ;  
 $x_{j+1} \leftarrow x_j + \mathbf{a}_j \mathbf{p}_j$ ;  
 $\mathbf{r}_{j+1} \leftarrow \mathbf{r}_j - \mathbf{a}_j A\mathbf{p}_j$ ;  
if ( $||\mathbf{r}_{j+1}|| / ||\mathbf{r}_0|| < e$ )  
declare  $\mathbf{r}_{j+1}$  is the solution of  $Ax = b$  and break the loop  
 $\mathbf{c}_j \leftarrow \langle \mathbf{r}_{j+1} \bullet \mathbf{r}_{j+1} \rangle / \langle \mathbf{r}_j \bullet \mathbf{r}_j \rangle$ ;  
 $\mathbf{p}_{j+1} \leftarrow \mathbf{r}_{j+1} + \mathbf{c}_j \mathbf{p}_j$ ;  
end loop

# **Performance Bottleneck?**

## **Matrix-vector Multiplier: Sparse Matrices**



**Vector Dot-Product (VVDot)** 



#### **Main Concerns for Performance**

- Collective Communication
  - → May not be the main bottleneck as we only need to collect a single value from a single MPI process
- Matrix-vector Multiplier
  - → Communication happens, but would not be a critical problem as it only happens between adjacent ranks
  - $\rightarrow$  Cache-miss affects vectorization efficiency

#### (Sparse) Matrix-vector Multiplier (MVMul)



# **Performance w/ Intel® KNL Processors**

## Single-node Performance: The power of MCDRAM





#### Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

 $2^{8}$ 

# **Performance w/ Intel® KNL Processors**

## Single-node Performance: Speed

#### **Description of BMT Target and Test Mode**

- 10 CB states in 16x43x43(nm<sup>3</sup>) [100] Si:P quantum dot
  - → Material candidates for Si Quantum Info. Processors (Nature Nanotech. 9, 430)
  - $\rightarrow$  15.36Mx15.36M Hamiltonian Matrix
- KNL (Xeon Phi 7210); Up to 256 (64x4) cores; Quad Mode
- Specs of Other Platforms
  - $\rightarrow$  Xeon(V4): 24 cores of Broadwell (BW) 2.50GHz
  - → Xeon(V4)+KNC: 24 cores BW + 2 KNC 7120 cards
  - → Xeon(V4)+P100: 24 cores BW + 2 P100 cards
  - $\rightarrow$  KNL(HBM): the one described so far

#### Results

- KNL slightly beats Xeon(V4)+P100
  - $\rightarrow$  Copy-time (CPIN): a critical bottleneck of PCI-E devices
  - → P100 shows better kernel speed, but the overall benefit reduces due to data-transfer between host and devices
  - $\rightarrow$  CPIN would even increase if we consider periodic BCs
- Another critical figure of merit: Energy-efficiency





# Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

# **Performance w/ Intel® KNL Processors**

# Single-node Performance: Energy Consumption







#### Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors

# **Performance w/ Intel® KNL Processors**

## **Multi-node Performance: Scalability**

#### **Description of BMT Target and Test Mode**

- 5 CB states in 27x33x33(nm<sup>3</sup>) [100] Si:P quantum dot
  - → Material candidates for Si Quantum Info. Processors (Nature Nanotech. 9, 430)
  - $\rightarrow$  14.4Mx14.4M Hamiltonian Matrix
- KNL (Xeon Phi 7250) nodes; Up to 272 (68x4) cores/node
  → (4 MPI processes + 68 threads) per node
  - $\rightarrow$  Quad / Flat mode, No omni-path (10G network)
  - $\rightarrow$  Strong scalability measured up to 3 nodes

### Results

#### H. Ryu, Intel® HPC Developer Conference (2017)

- Speed-enhancement with HBM becomes larger as a single node takes larger workload
- Inter-node strong scalability is quite nice with no HBM
  2.26x anod up with 2x computing expanse (no HBN)
  - $\rightarrow$  2.36x speed-up with 3x computing expense (no HBM)
  - → ~78% scalability (2.36/3) is what we usually get from multi-core base HPCs (Tachyon-II HPC in KISTI)
  - $\rightarrow$  1.58x speed-up with HBM (prob. size is not large enough)







## **KISTI Intel® Parallel Computing Center**

- Introduction to Code Functionality
- Main Numerical Problems and Performance Bottleneck
- Performance (speed and energy consumption) in a single KNL node
- Strong scalability in multiple KNL nodes
- (Appendix) Strategies for offload computing (for PCI-E devices)

# Thanks for your attention!!



# Appendix: Strategy for offload-computing

Asynchronous Offload (for Xeon(V4) + KNC, Xeon(V4) + GPU)

The real bottleneck of computing: Overcome with asynchronous offload <u>H. Ryu et al., Comp. Phys. Commun. (2016)</u> (http://dx.doi.org/10.1016/j.cpc.2016.08.015)

- Vector dot-product is not expensive: All-reduce, but small communication loads
- · Vector communication is not a big deal: only communicates between adjacent layers
- Sparse-matrix-vector multiplication is a big deal: Host and PCI-E device shares computing load



Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors



# **Appendix: Strategy for offload-computing**

**Data-transfer and Kernel Functions for GPU Computing** 

#### Data-transfer between host and GPU Devices

- 3x increased bandwidth with pinned memory
- Overlap of computation and data-transfer with • asynchronous streams

## Speed-up of GPU Kernel Function (MVMul)

Treating several rows at one time with WARPs





#### [Asynchronous Data Transfer with Pinned Memory]



# [Synchronous Data Transfer with Pageable Memory]

Hoon Ryu / Energy-efficient, scalable computing of extremely large electronic structures with KNL processors



H. Ryu et al., J. Comp. Elec. (2018)

(http://dx.doi.org/10.1007/s10825-018-1138-4)