We will exhibit at booth #1133 at SC23 (The International Conference for High Performance Computing, Networking, Storage, and Analysis), held in Denver, USA, from November 12 to 17.
Booth Talk Schedule
Nov. 13 (Mon)
Kengo Nakajima, Toshihiro Hanawa （The University of Tokyo/JCAHPC, Japan）
*Long but “Straight” Road to OFP-II in JCAHPC
JCAHPC, the Joint Center for Advanced High Performance Computing, has been a collaboration between the University of Tsukuba and the University of Tokyo since 2013. Oakforest-PACS (OFP), with more than 8,000 Intel Xeon Phi processors, was the first system of JCAHPC. At its first appearance in the TOP500 list in November 2016, it ranked 6th in the world and first in Japan. After the retirement of the K computer in 2019, OFP served as the de facto National Flagship System until Fugaku began operation in 2020. OFP retired in March 2022. OFP-II, the successor of OFP, will start operation in January 2025. It consists of two types of node groups: Group-A with general-purpose CPUs only, and Group-B with GPUs. The peak performance of Group-B, with NVIDIA H100 GPUs, will be more than 70 PF. The RFP for OFP-II was released in late August 2023, and the bid opening takes place immediately before SC23. In this presentation, we will give an overview of the various activities related to OFP-II and report the bid opening results.
Estela Suarez （Juelich Supercomputing Centre, Germany）
*Modular Supercomputing Architecture: from DEEP to JUPITER
The variety of applications using or striving to use high performance computing (HPC) resources is growing. The challenge for HPC technology providers and hosting sites is to come up with HPC system designs able to cope with such application diversity, while keeping energy consumption and operational costs at bay. The Modular Supercomputing Architecture (MSA) addresses these challenges by orchestrating diverse resources like CPUs, GPUs, and accelerators at a system level, organizing them into compute modules. Each module is configured to cater to specific application classes and user requirements. Users have the flexibility to determine how many nodes to use within each module, aligning their hardware requirements with their application’s needs. Advanced scheduling mechanisms are used to maximize system utilization. This talk will detail MSA’s key features, hardware, and software components, and discuss MSA’s prospects in the context of Exascale computing.
Jonathan Carter（Lawrence Berkeley National Laboratory, USA）
*Current and Future Computing at Berkeley Lab
US High-Performance Computing at DOE National Laboratories is facing new challenges and opportunities as we enter the Exascale era. There appear to be four major areas: sustaining the exascale HPC ecosystem and improving beyond it; new computing paradigms, such as quantum computing; more tightly coupling computing resources and experimental data; and an ecosystem to enable Artificial Intelligence to be used as a tool for scientific discovery. I will briefly discuss the challenges and opportunities in each.
Nov. 13 (Mon)
*NVIDIA Grace-Hopper superchips for accelerated performance, efficiency, and programmability.
The NVIDIA Grace-Hopper GH200 superchip is designed to address the most complex computational problems at every scale of AI and HPC, from today’s most demanding transformer-based large-language models to the complexity of today’s state-of-the-art industrial and scientific computing applications. This brief presentation will highlight key features of the Grace-Hopper superchip platform, which combines the NVIDIA Grace CPU and the NVIDIA Hopper GPU with the coherent NVLink Chip-to-Chip (C2C) interconnect in a single package. Grace-Hopper improves developer productivity by accelerating their programming model with the full stack of the standards-based NVIDIA software ecosystem, and advances datacenter sustainability with leading energy efficiency.
Keita Teranishi（Oak Ridge National Laboratory, USA）
*Toward Extreme Heterogeneous Programming Environment with CHARM-SYCL.
Addressing performance portability across a variety of accelerator architectures presents a significant challenge in developing applications and programming systems for high-performance computing environments. Recent programming systems dedicated to performance portability have notably enhanced productivity in response to this challenge. However, the complexity escalates when compute nodes are equipped with multiple types of accelerators, each with its own distinct performance attributes, optimal data layouts, and binary formats. To effectively manage the complexities of programming in multi-accelerator environments, we introduce CHARM-SYCL, a performance-portable programming model for extreme heterogeneity. This platform integrates a SYCL-based performance-portability programming frontend with the IRIS runtime backend designed for extremely heterogeneous architectures at Oak Ridge National Laboratory. Our initial assessments suggest that CHARM-SYCL offers a promising increase in productivity and maintains reasonable performance when compared to vendor-specific programming systems and runtimes.
Franck Cappello（Argonne National Laboratory, USA）
*AI-augmented SWARM-based resilience for Integrated Research Infrastructures
The community spent a dozen years developing production-level checkpointing techniques, such as VeloC, capable of capturing and saving extreme volumes of data with negligible overhead for exascale scientific parallel application executions. A novel category of systems will emerge within the next ten years: Integrated Research Infrastructures (IRIs). These infrastructures will connect supercomputers, scientific instrument facilities, large-scale data repositories, and collections of edge devices to form nationwide execution environments that users will share to run scientific workflows. The characteristics of IRIs and the workflow execution constraints raise a new set of unexplored research questions regarding resilience, especially execution state management. In this talk, we will first review the projected characteristics of IRIs and the user constraints regarding workflow executions. One projected characteristic of IRIs is the practical difficulty (probable impossibility) of capturing consistent states of the full system: resilience mechanisms will likely need to work with only an approximate system view. To address this unique resilience design characteristic, the DOE-funded SWARM project will explore a novel resilience approach based on AI-augmented distributed agents, where each node of the IRI runs an agent whose view of the system is limited to its neighbors. We will review the open research questions raised by this revolutionary approach (fault types, fault detection, fault notification, execution state capture, and management) and some potential directions to address them.