

## **Open Source HPC Hardware**

## **John Shalf**

David Donofrio, Farzad Fatollahi-Fard

Lawrence Berkeley National Laboratory

### **CCS/LBL Collaboration Meeting**

University of Tsukuba, March 5, 2018





Office of Science

### Technology Scaling Trends: Exascale in 2021... and then what?





Computer Architecture Laboratory: Exascale Design Space Exploration



## **Beyond Moore Computing Taxonomy**





#### Future of Computing








#### Extreme HW Specialization Happening Now

This trend is already well underway in the broader electronics industry: cell phones and even mega-datacenters (Google TPU, Microsoft FPGA accelerators).

It will happen to HPC too... will we be ready?





## **Chisel Overview**

Productive, flexible language for hardware design and simulation

#### Not "Scala to Gates" – it is structural.

- Describes hardware functionality
- Creates a graph representation of the hardware that is flattened, with each node translated to Verilog or C++

## Get benefits of modern high level languages

- Inheritance
- Complex Types
- Modularity
- Polymorphism

Hardware Generators are a more efficient technique for generating designs
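To make the "structural, not Scala to gates" point concrete, here is a minimal sketch in plain Scala. It is not Chisel's API: the names `Node`, `In`, `Xor`, `emit`, and `parity` are hypothetical, chosen only to show a host-language program building a hardware graph and then flattening each node into Verilog-style text.

```scala
// Hypothetical mini-IR for illustration only; this is NOT Chisel's API.
object GraphSketch {
  sealed trait Node
  final case class In(name: String) extends Node
  final case class Xor(a: Node, b: Node) extends Node

  // Flatten the graph: each node is translated into a Verilog-style expression.
  def emit(n: Node): String = n match {
    case In(name)  => name
    case Xor(a, b) => s"(${emit(a)} ^ ${emit(b)})"
  }

  // Ordinary Scala code acts as the generator: it builds a parity tree for any width.
  def parity(width: Int): Node = {
    require(width >= 1, "width must be at least 1")
    (0 until width).map(i => In(s"x[$i]"): Node).reduce(Xor(_, _))
  }

  def main(args: Array[String]): Unit =
    println(s"assign p = ${emit(parity(4))};")
}
```

The Scala code never "becomes" gates. It runs once, constructs the graph, and the emitter walks that graph. Changing `width` yields a different design from the same source, which is the sense in which a generator is more productive than a single hand-written instance.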













## **Generator Example: Cache**

```scala
// Parameters
class Cache(cache_type:   Int = DIR_MAPPED,
            line_size:    Int = 128,
            cache_depth:  Int = 16,
            write_policy: Int = WRITE_THRU
           ) extends Component {
  // IO connections
  val io = new Bundle() {
    val cpu = new IoCacheToCPU();
    val mem = new IoCacheToMem().flip();
  }
  // Local parameters
  val addr_idx_width = log2(cache_depth).toInt
  val addr_off_width = log2(line_size/32).toInt
  val addr_tag_width = 32 - addr_idx_width -
                       addr_off_width - 2

  // Generator body
  if (cache_type == DIR_MAPPED)
    ...
}
```
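The local parameters in the cache generator are plain arithmetic on the constructor arguments, so they can be checked outside of Chisel. The sketch below is plain Scala under two assumptions that the slide does not state explicitly: `line_size` is measured in bits (so `line_size/32` counts 32-bit words per line), and the trailing `- 2` accounts for the byte offset within a 32-bit word. The names `CacheWidths`, `log2Exact`, and `Widths` are hypothetical.

```scala
object CacheWidths {
  // Integer log2 for exact powers of two; a stand-in for the slide's log2(...).toInt.
  def log2Exact(n: Int): Int = {
    require(n > 0 && (n & (n - 1)) == 0, s"$n is not a power of two")
    Integer.numberOfTrailingZeros(n)
  }

  final case class Widths(idx: Int, off: Int, tag: Int)

  // Defaults mirror the generator's default parameters.
  def widths(lineSize: Int = 128, cacheDepth: Int = 16, addrBits: Int = 32): Widths = {
    val idx = log2Exact(cacheDepth)      // selects one of cacheDepth lines
    val off = log2Exact(lineSize / 32)   // selects a 32-bit word within a line
    val tag = addrBits - idx - off - 2   // assumed: 2 low bits select a byte in a word
    Widths(idx, off, tag)
  }

  def main(args: Array[String]): Unit = {
    val w = widths()
    println(s"index=${w.idx} offset=${w.off} tag=${w.tag}")
  }
}
```

With the default parameters this gives a 4-bit index, a 2-bit word offset, and a 24-bit tag, which together with the 2 byte-offset bits cover the full 32-bit address.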





### The Rise of Open Source Software: Will Hardware Follow Suit?



- Rapid growth in the adoption and number of open source software projects
- More than 95% of web servers run Linux variants, approximately 85% of smartphones run Android variants
- Will open source hardware ignite the semiconductor industry? Is RISC-V the hardware industry's Linux?

## Why Open Source Hardware?

- More productive and credible hardware research
  - Generate \*real\* hardware that can be measured
  - by reducing the cost of development (Chisel)
  - by creating and sharing open hardware (RISC-V, OpenSoC)
- More innovation
  - Don't need to be a big company to play
  - Engage academic, lab research community in DSE
- Lower Cost / Complexity for Development
  - Share software stack (complete compilers, debug, Linux ports)
  - Focus NRE and license on new/innovative IP blocks
  - Stop squeezing license costs out of items that students can implement in a summer (license \*hard\* stuff)
- Whether the future is ASICs or FPGAs, you need more productive hardware generation!!!





### Open-Source, Chisel-Based Processors Built on an Open ISA

- RISC-V Core Taxonomy
  - Out-of-order (BOOM)
  - In-Order (Rocket)
  - IoT (Z-Scale, Sodor)
- Can Generate for Different Addressing Widths
  - 32, 64, 128-bit addressing
  - Double precision floating point
  - Vector accelerators
- Complete SW stack available
  - GNU and LLVM compilers
  - Linux implementations







#### "Open Source" doesn't mean "Low Performance"

- Offers a good proxy for commercial cores such as ARM
- Gives complete visibility inside the core design



| Category              | ARM Cortex A5                  | RISC-V Rocket                  |
|-----------------------|--------------------------------|--------------------------------|
| ISA                   | 32-bit ARM v7                  | 64-bit RISC-V v2               |
| Architecture          | Single-Issue In-Order 8-stage  | Single-Issue In-Order 5-stage  |
| Performance           | 1.57 DMIPS/MHz                 | 1.72 DMIPS/MHz                 |
| Process               | TSMC 40GPLUS                   | TSMC 40GPLUS                   |
| Area w/o Caches       | 0.27 mm<sup>2</sup>            | 0.14 mm<sup>2</sup>            |
| Area with 16K Caches  | 0.53 mm<sup>2</sup>            | 0.39 mm<sup>2</sup>            |
| Area Efficiency       | 2.96 DMIPS/MHz/mm<sup>2</sup>  | 4.41 DMIPS/MHz/mm<sup>2</sup>  |
| Frequency             | >1 GHz                         | >1 GHz                         |
| Dynamic Power         | <0.080 mW/MHz                  | 0.034 mW/MHz                   |
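The Area Efficiency row can be reproduced from the Performance row and the area with 16K caches. A small plain-Scala check follows; the object and function names are hypothetical.

```scala
object AreaEfficiency {
  // DMIPS/MHz per mm^2, using the area that includes the 16K caches.
  def perArea(dmipsPerMhz: Double, areaMm2: Double): Double =
    dmipsPerMhz / areaMm2

  val cortexA5: Double = perArea(1.57, 0.53)
  val rocket: Double   = perArea(1.72, 0.39)

  def main(args: Array[String]): Unit = {
    println(f"Cortex A5: $cortexA5%.2f DMIPS/MHz/mm^2")
    println(f"Rocket:    $rocket%.2f DMIPS/MHz/mm^2")
    println(f"Ratio:     ${rocket / cortexA5}%.2fx")
  }
}
```

Both values round to the table's 2.96 and 4.41, so by this metric Rocket comes out roughly 1.5x more area-efficient than the Cortex A5 in the same process.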



#### **Performance Comparison Across RISC-V Processor Families and ARM Equivalents**





## **OpenSoC Fabric**

## An Open-Source, Flexible, Parameterized NoC Generator

- Part of the CoDEx tool suite
- Written in Chisel
- Dimensions, topology, VCs all configurable
- Fast functional C++ model for functional validation
- Verilog-based description for FPGA or ASIC
  - Synthesis path enables accurate power / energy modeling
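As a rough illustration of what "dimensions, topology, VCs all configurable" means for design space exploration, the sketch below models a 2D (optionally concentrated) mesh as a plain Scala case class. This is not OpenSoC Fabric's actual parameter interface: `MeshConfig` and its field names are hypothetical, and only the standard mesh counting formulas are used.

```scala
object MeshSketch {
  // Hypothetical configuration record; OpenSoC Fabric's real parameters may differ.
  final case class MeshConfig(cols: Int, rows: Int, concentration: Int, virtualChannels: Int) {
    require(cols >= 1 && rows >= 1 && concentration >= 1 && virtualChannels >= 1)

    // One router per mesh position.
    def routers: Int = cols * rows

    // In a concentrated mesh, several endpoints share each router.
    def endpoints: Int = routers * concentration

    // Bidirectional router-to-router links in a 2D mesh without wraparound.
    def links: Int = cols * (rows - 1) + rows * (cols - 1)
  }

  def main(args: Array[String]): Unit = {
    // Sweep a few hypothetical design points.
    for (cfg <- Seq(MeshConfig(2, 2, 1, 2), MeshConfig(4, 4, 4, 2))) {
      println(s"$cfg -> routers=${cfg.routers} endpoints=${cfg.endpoints} links=${cfg.links}")
    }
  }
}
```

Because every design point is just a value, sweeping the space amounts to iterating over a list of configurations and handing each one to the generator.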









**Top Level Router Diagram** 







### **OpenSoC: Functional Hierarchy**





## **Design Space Exploration of NoC Hardware**

(all aspects of NOC parameterized, including topology)

### Some common topologies











*An analysis of on-chip interconnection networks for large-scale chip multiprocessors*, ACM Transactions on Architecture and Code Optimization (TACO), April 2010



## SC16 Demonstration of Scalable SoC Emulation on FPGAs

A 96-core SoC with local store and HMC memory subsystem

Full RTL implementation of the SoC





Shockingly (*but accidentally*) similar to Sunway node architecture


### 2 people spent 2 months to create





- Z-Scale processors connected in a **Concentrated Mesh**
- **4 Z-scale processors**
- 2x2 Concentrated mesh with 2 virtual channels
- Micron HMC Memory



## **Energy/Power Estimates for Co-design Toolflow**








**FPGA vs. ASIC**

|                            | FPGA              | ASIC           |
|----------------------------|-------------------|----------------|
| Cost for first unit (NRE)  | \$1,200-\$3,000   | \$2M-\$5M      |
| Cost for 20,000th unit     | \$1,200-\$3,000   | \$100          |
| Clock rate                 | 0.1-0.3 GHz       | 1-2 GHz (10x)  |
| Area efficiency            | baseline          | 10x FPGA       |
| Energy efficiency          | baseline          | 10x-100x FPGA  |
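The FPGA and ASIC cost figures above imply a break-even volume beyond which the ASIC is cheaper in total. The plain-Scala sketch below estimates it under a simplifying assumption that is not on the slide: the ASIC's \$100 is treated as a flat per-unit cost at every volume, and the FPGA price does not drop with volume. The names `BreakEven` and `units` are hypothetical.

```scala
object BreakEven {
  // Smallest volume n with asicNre + n * asicUnit <= n * fpgaUnit.
  def units(asicNre: Double, asicUnit: Double, fpgaUnit: Double): Long = {
    require(fpgaUnit > asicUnit, "the ASIC must be cheaper per unit to ever break even")
    math.ceil(asicNre / (fpgaUnit - asicUnit)).toLong
  }

  def main(args: Array[String]): Unit = {
    // Best case for the ASIC: low NRE, expensive FPGA.
    println(s"optimistic:  ${units(2e6, 100, 3000)} units")
    // Worst case for the ASIC: high NRE, cheap FPGA.
    println(s"pessimistic: ${units(5e6, 100, 1200)} units")
  }
}
```

Under these assumptions the crossover falls between roughly 700 and 4,500 units, well below the 20,000-unit volume quoted on the slide.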



## What are the NRE and MFR Costs?

(rough idea of the cost model; again, anonymized)

| Parameter          | Value                                                       |
|--------------------|-------------------------------------------------------------|
| Foundry process    | TSMC                                                        |
| Technology         | 40nm 1P10M (m1 7x2y), RDL, Bump, VTCp, ESD                  |
| Die size estimate  | 16mm x 16mm                                                 |
| IP                 | See IP Summary Slide                                        |
| Package            | 1156 FCBGA, 3:2:3-layer, 35x35 body size, 1 mm ball pitch   |
| Tester platform    | Agilent-93K-640-300MHz                                      |
| Test time          | Wafer Sort: 10s, Final Test: 10s                            |

| Component                                                                                                          | Amount       |
|--------------------------------------------------------------------------------------------------------------------|--------------|
| Manufacturing NRE: Masks, 12 Prototype Char Wafers, 150 Prototypes, Process Eng, Product Eng, Project Management   | \$1,512,970  |
| Package NRE: Package Tooling and Package Engineering                                                               | \$23,460     |
| Test Development NRE: Test Engineering and Tester Rental Time                                                      | \$82,800     |
| IP NRE: IP Licensing Fees, Support and Maintenance                                                                 | \$918,000    |
| Characterization NRE: Char units, Tester rental, Test Engineering, Process Engineering, Char report                | \$52,000     |
| Qualification NRE: Q&R Engineering, HTOL, TMCL, HTSL, UHAST, ESD & LU                                              | \$225,000    |
| Total NRE                                                                                                          | \$2,814,230  |

| IP Description                   | Cost     |
|----------------------------------|----------|
| PCIe Gen 2 PHY (could be IB)     | \$200K   |
| PCIe Gen 2 End point controller  | \$80K    |
| DDR3 PHY                         | \$338K   |
| DDR3 Controller                  | \$100K   |

Unit cost by volume tier:

| First 2.5 Ku | Next 2.5 Ku | Next 5 Ku | Next 10 Ku | Additional |
|--------------|-------------|-----------|------------|------------|
| \$121.60     | \$119.76    | \$117.97  | \$87.78    | \$74.80    |
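The figures above are enough for a small cost calculator. The plain-Scala sketch below sums the NRE line items and prices a production run, assuming that each volume tier's price applies only to the units that fall inside that tier (a marginal-pricing reading that the slide does not spell out). Prices are held in integer cents to avoid rounding drift; all names are hypothetical.

```scala
object CostModel {
  // NRE line items in whole dollars, taken from the slide.
  val nreItems: Seq[(String, Long)] = Seq(
    "Manufacturing"    -> 1512970L,
    "Package"          -> 23460L,
    "Test development" -> 82800L,
    "IP"               -> 918000L,
    "Characterization" -> 52000L,
    "Qualification"    -> 225000L
  )
  val totalNre: Long = nreItems.map(_._2).sum

  // (tier size in units, unit price in cents); units beyond the last tier
  // are priced at additionalCents.
  val tiers: Seq[(Long, Long)] =
    Seq(2500L -> 12160L, 2500L -> 11976L, 5000L -> 11797L, 10000L -> 8778L)
  val additionalCents: Long = 7480L

  // Assumed marginal pricing: fill each tier in order, then the open-ended rate.
  def productionCents(units: Long): Long = {
    val (total, left) = tiers.foldLeft((0L, units)) {
      case ((acc, remaining), (size, price)) =>
        val n = math.min(remaining, size)
        (acc + n * price, remaining - n)
    }
    total + left * additionalCents
  }

  def main(args: Array[String]): Unit = {
    val units = 20000L
    val production = productionCents(units) / 100.0
    println(s"Total NRE: $totalNre USD")
    println(f"Production cost for $units units: $production%.2f USD")
    println(f"Average unit cost: ${production / units}%.2f USD")
  }
}
```

The line items sum to the slide's \$2,814,230 total. Under the marginal-pricing assumption, 20,000 units cost \$2,071,050 to manufacture, an average of about \$103.55 per unit.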

With Marty Deneroff



## See http://www.socforhpc.org/

Multi-agency + industry and academia workshop to flesh out this approach to HPC.

Accelerate how we exploit hardware customization beyond exascale







- Exponentially Increasing Parallelism (central challenge for ECP)
  - **Trend**: End of exponential clock frequency scaling (end of Dennard scaling)
  - **Consequence:** *Exponentially increasing parallelism*
- End of Lithography as Primary Driver for Technology Improvements
  - **Trend**: Tapering of lithography scaling
  - **Consequence:** *Many forms of heterogeneous acceleration (not just GPGPUs)*
- Data Movement Heterogeneity and Increasingly Hierarchical Machine
  - **Trend**: Moving data operands costs more than computation performed on them
  - **Consequence**: *More heterogeneity in data movement performance and energy*
- Performance Heterogeneity
  - **Trend**: *Heterogeneous execution rates from contention and power management*
  - **Consequence**: Extreme variability and heterogeneity in execution rates
- Diversity of Emerging Memory and Storage Technologies
  - **Trend**: *Emerging memory technologies and stall in performance improvements*
  - **Consequence**: Disruptive changes to our storage environment



## Are we content to target 2021?









- More credible hardware research
  - Generate \*real\* hardware that can be measured
  - by reducing the cost of development (Chisel)
  - by creating and sharing open hardware (RISC-V, OpenSoC)
- More innovation
  - Don't need to be a big company to play
  - Engage academic, lab research community in DSE
- Lower Cost / Complexity for Development
  - Focus NRE and license on new/innovative IP blocks
  - Stop squeezing license costs out of items that students can implement in a summer (license \*hard\* stuff)



## **The End**

For more information go to http://www.cal-design.org/

- OpenSoC: http://www.opensocfabric.org/
- SOCforHPC: http://www.socforhpc.org/





International Symposium on Computer Architecture (ISCA) 2010, 44 accepted papers (18% accept rate)

**Types of papers:** 

- Proposals (no/few numbers): 2 papers
- Modeling techniques: 4 papers
- Real machines (analyzed, or used for eval.): 6 papers
- New device technology (photonics, MTRAM): 4 papers
- Outer memory system (LLC/DRAM/NVM): 9 papers
- Processor or inner-cache mechanisms: 19 papers

