Seminar: Reliable Operation of Heterogeneous Systems: Challenges and Opportunities

Lishan Yang

College of William & Mary

Friday, March 4, 2022
10:00 AM
1100 Torgersen Hall

Abstract

Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications but remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. One of the major challenges in the domain of GPU reliability is to accurately measure the resilience of general-purpose GPU (GPGPU) applications to transient faults. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable resources available on the GPU. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on application error resilience is impractical. As an alternative, fault site selection techniques have been proposed to achieve high accuracy with fewer fault injection experiments. However, most of the existing methods in the literature focus on only one input.

In this talk, I will discuss how to perform input-aware resilience estimation and fortification on GPGPU applications. First, I will introduce an input-aware estimation methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by focusing on the effect of input size on the application resilience profile. Then, based on the observations from the estimation, I will present a fortification methodology that aims to map threads with the same resilience characteristics to the same warp and perform protection accordingly. Finally, I will discuss my future research directions.

Biography

Lishan Yang is a Ph.D. candidate in the Computer Science Department at the College of William & Mary, under the supervision of Prof. Evgenia Smirni. Her research interests include GPU architecture, reliability analysis, performance analysis, workload characterization of large-scale systems, and the reliability of HPC and large-scale systems. Her Ph.D. research has been published in top conferences such as MICRO, SIGMETRICS, and ICSE. Before coming to W&M, she received her bachelor's degree in computer science from the University of Science and Technology of China (USTC) in 2016.