Title: FPGA Datacenters for HyperScale HPC
Speaker: Andrew Putnam
Process technology improvements have historically allowed an effortless expansion of the capacity and capabilities of computers and the cloud with few changes to the underlying software or programming model. However, the end of Dennard Scaling means that performance and efficiency gains will rely on the customization of the hardware for each application. Yet customizing hardware for each application runs contrary to the trend of moving more and more applications to a common hardware infrastructure – the Cloud.
Microsoft’s Catapult project has brought the power and performance of FPGA-based reconfigurable computing to hyperscale datacenters, accelerating major production cloud applications such as Bing web search and Microsoft Azure, and enabling a new generation of machine learning and artificial intelligence applications. These diverse workloads are accelerated using the same underlying hardware by using highly programmable silicon.
The presence of ubiquitous and programmable silicon in the datacenter enables a new era of hardware/software co-design across a wide variety of workloads, opening up affordable and efficient performance across an enormous set of workloads. Catapult is now deployed in nearly every new server across the more than a million machines that make up the Microsoft hyperscale cloud.
In this talk, I will describe the Catapult configurable cloud architecture, why we chose this architecture, and the tools and techniques that have made Catapult successful to date. I will discuss areas where traditional hardware and software development flows fall short, the domains where this programmable hardware holds the most potential, and how this technology can enable new computing frontiers.
Andrew Putnam is a Principal Hardware Development Engineer in a collaboration between Microsoft Azure and Microsoft AI & Research. He is currently leading the Azure SmartNIC FPGA team, which is responsible for AccelNet, the fastest public cloud network in the world. Andrew received a dual B.A./B.S. from the University of San Diego in 2003, and an M.S. and Ph.D. in Computer Science and Engineering from the University of Washington in 2006 and 2009, respectively. His research focuses on reconfigurable computing, future datacenter design, and computer architecture, with an emphasis on seeing research through from concept to prototype to technology transfer. He was a founding member of the Microsoft Catapult project, which was the first to put Field Programmable Gate Arrays (FPGAs) into production hyperscale datacenters, doubling the capacity of each server for web search and creating the fastest network in the cloud. Catapult has been deployed in nearly every server Microsoft has purchased since 2015, and it forms the foundation of Microsoft's cloud network, storage, and AI acceleration efforts.
I will discuss the characteristics of cutting-edge FPGA high-level synthesis tools (e.g., Intel FPGA SDK for OpenCL, Intel HLS) and performance optimization strategies. I will also outline key software challenges in the adoption of FPGAs into future FPGA-enabled heterogeneous platforms for edge and HPC.
In designing data-flow-based custom hardware for FPGAs, we have to optimize the mapping of operators to hardware resources under a trade-off between resource consumption and operating frequency. However, the design space is too large to find an optimal solution with a brute-force approach. In this talk, I formulate this hardware optimization as a combinatorial optimization problem and introduce our preliminary development of an automatic optimization tool.
I will report the progress of this development and performance results of the optimized hardware.
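The combinatorial nature of the problem can be made concrete with a toy sketch: each operator has several candidate implementations trading area for achievable clock frequency, the design clock is limited by the slowest chosen implementation, and the total area must fit a budget. All operator names and numbers below are hypothetical, and exhaustive enumeration is shown only to illustrate why it cannot scale to real designs (the space grows exponentially in the number of operators).

```python
from itertools import product

# Hypothetical operator library: each operator has candidate implementations
# described as (area in ALMs, max operating frequency in MHz).
# All figures are illustrative, not taken from any real FPGA design.
variants = {
    "fmul": [(300, 250), (520, 400)],    # compact vs. deeply pipelined
    "fadd": [(200, 300), (350, 450)],
    "sqrt": [(900, 200), (1500, 350)],
}
ALM_BUDGET = 2200  # total resource budget (illustrative)

def best_mapping(variants, budget):
    """Enumerate one implementation per operator; the design clock is
    limited by the slowest chosen variant (min fmax). Return the feasible
    mapping with the highest achievable frequency."""
    ops = list(variants)
    best = None
    for choice in product(*(variants[o] for o in ops)):
        area = sum(c[0] for c in choice)
        if area > budget:
            continue  # violates the resource constraint
        fmax = min(c[1] for c in choice)
        if best is None or fmax > best[0]:
            best = (fmax, dict(zip(ops, choice)))
    return best

fmax, mapping = best_mapping(variants, ALM_BUDGET)
print(fmax, mapping)  # best achievable clock and the chosen variants
```

With n operators and k variants each, this loop visits k^n mappings, which is exactly the blow-up that motivates replacing brute force with a combinatorial optimization formulation.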
Title: Computation/Communication offloading to FPGA with GPU
Speaker: Taisuke Boku
Recent FPGAs such as the Intel Arria 10/Stratix 10 or Xilinx UltraScale provide high-speed external optical links with up to 100 Gbps of performance each. Moreover, some FPGAs have up to four channels of these links, providing 400 Gbps, which exceeds the performance of InfiniBand EDR or HDR. It is therefore natural to combine these links with computation to realize highly efficient and strongly scalable multi-node FPGA systems.
At CCS, University of Tsukuba, we have been developing a multi-hybrid cluster system that combines CPUs, GPUs, and FPGAs for all-round HPC. The GPU is currently the most widely used accelerator for highly data-parallel computation with high memory bandwidth; however, it is not suitable for low-parallelism or irregular processing with non-SIMD behavior. We complement these GPU characteristics with FPGAs, which provide efficient processing and support data communication through their own links. The bottleneck here is that PCIe Gen3 x16 is the only communication channel between the GPU and the FPGA on a node.
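The bandwidth gap behind this argument can be sketched with theoretical peak figures. The PCIe Gen3 numbers below follow the standard 8 GT/s per lane with 128b/130b encoding; the FPGA and InfiniBand figures are the peaks quoted above. These are theoretical link rates, not measured application throughput.

```python
# Rough peak-bandwidth comparison: the FPGA's external optical links far
# exceed the PCIe Gen3 x16 channel that connects GPU and FPGA on a node.

def pcie_gen3_gbps(lanes):
    # 8 GT/s per lane, 128b/130b encoding -> ~7.88 Gbps usable per lane
    return 8.0 * lanes * 128 / 130

links_gbps = {
    "PCIe Gen3 x16 (GPU<->FPGA)": pcie_gen3_gbps(16),
    "InfiniBand EDR": 100.0,
    "InfiniBand HDR": 200.0,
    "FPGA optical links (4 x 100G)": 400.0,
}
for name, gbps in links_gbps.items():
    print(f"{name}: {gbps:.1f} Gbps ({gbps / 8:.2f} GB/s)")
```

The ~126 Gbps PCIe path is roughly a third of the FPGA's aggregate link bandwidth, which is why moving communication onto the FPGA's own links, rather than staging everything through the host, is attractive.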
In this talk, I will summarize our research on enabling OpenCL user code to exploit high-speed inter-FPGA communication as well as GPU-FPGA communication, supporting our concept of "AiS: Accelerator in Switch" to maximize the utilization of today's FPGA features.
Our first application is computational astrophysics for research on the early-stage universe based on radiative transfer and gravity.
Title: Developing applications with OmpSs@FPGA
Speaker: Xavier Martorell
In this talk, we will show the offloading techniques developed in the OmpSs programming model for GPUs working with CUDA and OpenCL, and the newly developed support for offloading onto FPGAs. We will compare them and highlight the benefits and drawbacks of each approach. We will also present results obtained for matrix multiplication on the AXIOM board, based on a Zynq UltraScale+ chip, where our environment achieves 40 GFlops running a combination of tasks on the FPGA and the SMP cores.
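The task decomposition that makes such a combined FPGA+SMP run possible can be illustrated in plain Python (this is a sequential sketch, not the actual OmpSs@FPGA code, which expresses the same structure with task pragmas): the matrix is split into tiles, and each tile-product is an independent task that a runtime could schedule on either an FPGA accelerator or an SMP core, subject to the inout dependence on the destination tile.

```python
# Tiled matrix multiplication: each innermost tile-product corresponds to
# one task in a task-based model such as OmpSs. This version runs the
# "tasks" sequentially; a runtime would execute independent ones in
# parallel across FPGA and SMP resources.

def matmul_tiled(A, B, n, bs):
    """C = A * B for n x n matrices given as nested lists, bs x bs tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                # One 'task': C-tile (ii,jj) += A-tile (ii,kk) * B-tile (kk,jj).
                # Tasks sharing a C-tile carry an inout dependence; tasks on
                # distinct C-tiles are independent and schedulable anywhere.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The tile size `bs` is the knob that trades task granularity against scheduling overhead, which is also what determines how well the FPGA kernel and the SMP cores can be kept busy simultaneously.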
Title: Case Studies: HPC Acceleration with Intel FPGAs
Speaker: Brad Kadet
This session will introduce Intel’s FPGA acceleration ecosystem for HPC, including data analytics and machine learning, through several case studies. Offering flexibility, deterministic low latency, and high power efficiency, FPGAs present an ideal platform for accelerating applications such as RDB/NoSQL, Spark, genomics, and financial transactions. A complete acceleration solution, however, must include not only silicon, but software, IP, and board infrastructure to enable end-to-end design.
Description: The 500 club (Top500, Graph500, Green500, etc.) has a huge influence on HPC vendors and communities. Today, CPU- and GPU-based systems dominate those ranking lists; only a few FPGA-based systems have previously been ranked in any 500 list. Because of reconfigurability, the performance of FPGA systems depends largely on the implementation quality of FPGA designs, so benchmarking FPGAs is elusive. The goal of this panel session is to discuss benchmarking methodology for FPGAs and reconfigurable computers, sketch out a reconf500 (*1) benchmark suite, and establish a technical working group to prepare a proposal for recognizable venues, such as an SC Birds-of-a-Feather session, as a follow-up.
*1 reconf500 is a tentative name. Suggestions are welcome!
Topics include, but are not limited to:
– What workloads (simulation, data, learning, edge/fog, cloud)?
– What components (computing accelerator, network accelerator, storage accelerator)?
– What programming models (OpenMP, OpenCL, OpenACC, HLS, or HDL)?
– What metrics and how to aggregate measurements?
– How to compare with CPU and GPU?
– How to measure the benefit of reconfigurability?
– Reference results (we need some initial starting points): who is interested to contribute?
– Reference FPGA platforms
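On the metrics-and-aggregation question, one common approach (used, for example, by SPEC-style suites) is to normalize each workload's throughput against a reference platform and combine the ratios with a geometric mean, so that no single workload dominates the score. The sketch below illustrates this; the workload names, numbers, and the per-watt variant are all made up for illustration and are not a proposal from the panel.

```python
from math import prod

# Illustrative reconf500-style aggregation: geometric mean of per-workload
# speedups over a reference system. All figures are invented.

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

reference = {"search": 1.0, "graph": 1.0, "cnn": 1.0}  # reference throughputs
system    = {"search": 2.0, "graph": 0.5, "cnn": 4.0}  # system under test

ratios = [system[w] / reference[w] for w in reference]
score = geomean(ratios)
print(f"aggregate score: {score:.3f}")

# A Green500-style efficiency variant divides by measured power:
power_watts = 250.0
print(f"score per kW: {score / (power_watts / 1000):.3f}")
```

The geometric mean rewards balanced performance: the 0.5x slowdown on one workload here pulls the score well below the arithmetic mean of the same ratios, which is one answer to "how to aggregate measurements" that penalizes accelerator designs tuned to a single benchmark.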
One potential risk in benchmarking FPGAs is that we tend to position FPGAs as simple accelerators, which can obscure the real potential of FPGAs (e.g., as network accelerators or data-flow processors). Input from panelists and attendees on how benchmarking can highlight these FPGA-specific strengths would be a valuable addition to this session.
In terms of format, we will start the panel session with short presentations by the panelists (approximately 5 minutes each), and then move into an interactive discussion.