Title: DHARMA: Distributed asyncHronous Adaptive Resilient Management of Applications
Speaker: Dr. Keita Teranishi, Sandia National Laboratories
Resilience at extreme scale is now a widely recognized concern in high-performance computing (HPC). As we approach billion-thread parallelism and a rapidly increasing number of components on the path to exascale, the mean time between failures (MTBF) will continue to shrink. To ensure that applications make forward progress as efficiently as possible, in both time and energy, new programming models and runtime tools are needed.
To deal with the massive parallelism of future systems, a number of many-task programming models are emerging as alternatives to the MPI-based single-program multiple-data (SPMD) model. They tend to emphasize data flow, launching a task as soon as its input data is available. State-of-the-art many-task programming models use several techniques to optimize task scheduling and data placement in order to improve load balance, reduce communication overhead, and increase concurrency. However, these techniques become very challenging to implement once resilience becomes the central concern.
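As a rough illustration of the data-flow idea described above, the following Python sketch launches a task as soon as all of its inputs have been published. It is a toy, single-process model under stated assumptions: the class `DataflowScheduler` and its methods `put` and `submit` are hypothetical names chosen for this example and do not come from DHARMA or any particular runtime.

```python
from collections import defaultdict


class DataflowScheduler:
    """Toy data-flow scheduler (hypothetical): a task runs once all inputs exist."""

    def __init__(self):
        self.data = {}                    # data name -> value, for available data
        self.waiting = defaultdict(list)  # data name -> tasks blocked on that item
        self.order = []                   # task names, in the order they executed

    def put(self, name, value):
        """Publish a data item and launch every task that just became ready."""
        self.data[name] = value
        for task in self.waiting.pop(name, []):
            task["missing"].discard(name)
            if not task["missing"]:
                self._run(task)

    def submit(self, name, fn, inputs, output):
        """Register a task; it runs immediately if its inputs are already present."""
        task = {
            "name": name,
            "fn": fn,
            "inputs": list(inputs),
            "output": output,
            "missing": {i for i in inputs if i not in self.data},
        }
        if not task["missing"]:
            self._run(task)
        else:
            for item in task["missing"]:
                self.waiting[item].append(task)

    def _run(self, task):
        """Execute the task and publish its result, which may unblock others."""
        args = [self.data[i] for i in task["inputs"]]
        self.order.append(task["name"])
        self.put(task["output"], task["fn"](*args))


if __name__ == "__main__":
    sched = DataflowScheduler()
    # Tasks are submitted before their inputs exist; nothing runs yet.
    sched.submit("double", lambda c: 2 * c, ["c"], "d")
    sched.submit("sum", lambda a, b: a + b, ["a", "b"], "c")
    sched.put("a", 1)
    sched.put("b", 2)  # "sum" becomes ready, then its output readies "double"
    print(sched.order, sched.data["d"])
```

Execution order here is driven entirely by data availability, not by submission order. A real many-task runtime would additionally distribute tasks and data across nodes, which is where scheduling, placement, and resilience interact.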
With these concerns in mind, we propose DHARMA: Distributed asyncHronous Adaptive Resilient Management of Applications. DHARMA uses a distributed hash table (DHT) to maintain metadata about data and tasks so that it can recover quickly from fail-stop node crashes. The DHT provides fast lookup of the entities of a many-task programming model, including data and tasks as well as their duplicates.
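To make the role of a DHT concrete, here is a minimal sketch, assuming a consistent-hashing ring in which each metadata entry is stored on the next `replicas` nodes after its key. This is not DHARMA's actual design, which the abstract does not specify; `MetadataDHT`, its methods, and the key format `"task:42"` are hypothetical, and the sketch omits re-replication after a failure.

```python
import hashlib
from bisect import bisect_right


class MetadataDHT:
    """Toy in-process DHT (hypothetical): consistent hashing with replication."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.alive = set(nodes)
        # Place every node on the ring at the position given by its hash.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.points = [h for h, _ in self.ring]
        self.store = {n: {} for n in nodes}  # per-node metadata tables

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def owners(self, key):
        """Return the distinct nodes that follow the key's position on the ring."""
        start = bisect_right(self.points, self._hash(key))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def put(self, key, meta):
        """Write the metadata to every live owner of the key."""
        for node in self.owners(key):
            if node in self.alive:
                self.store[node][key] = meta

    def get(self, key):
        """Read from the first live owner that holds the key, else None."""
        for node in self.owners(key):
            if node in self.alive and key in self.store[node]:
                return self.store[node][key]
        return None

    def fail(self, node):
        """Model a fail-stop crash: the node disappears along with its table."""
        self.alive.discard(node)
        self.store[node].clear()


if __name__ == "__main__":
    dht = MetadataDHT(["n0", "n1", "n2", "n3"], replicas=2)
    dht.put("task:42", {"state": "running", "inputs": ["data:7"]})
    dht.fail(dht.owners("task:42")[0])  # crash the primary owner
    print(dht.get("task:42"))           # still served by the surviving replica
```

Because any participant can compute the owners of a key locally from the hash, a lookup needs no central directory, and the metadata of a crashed node remains reachable through its replica.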
In this talk, we describe the architecture of DHARMA and its development using the SST/macro simulator, which allows debugging and performance analysis for future extreme-scale systems.
This is joint work with Janine Bennett, Robert Clay, John Floren, Ken Franko, Saurabh Hukerikar, Samuel Knight, Hemanth Kolla, Greg Sjaardema, Nicole Slattengren, and Jeremiah Wilke.