Seminar: Beyond High Availability: Taming Complex Failures in Cloud Systems
Ph.D. Candidate, Dept. of Computer Science
Johns Hopkins University
Wednesday, February 8, 2023
11:00 AM - 12:00 PM
Decades of effort have made today's cloud systems robust and reliable. Yet complex failures have emerged as a major new threat to cloud system availability. As cloud systems continue to grow in scale and complexity, failures become increasingly subtle. Such failures break the fundamental assumptions that fault-tolerant designs rely on. As a result, existing cloud systems handle them inadequately, leading to service unavailability and enormous economic loss. In this talk, I will share my experience building systems to handle three emerging types of failures. First, I will discuss the scenario in which system components fail partially. I will describe my solution, OmegaGen, which generates customized checkers for system components to detect and localize such partial failures. Then I will talk about another tricky scenario, in which system components fail silently. I will share findings from my study that demystify such subtle issues and present OathKeeper, a tool that infers likely rules from past failures to help expose silent issues. I will also briefly touch on fail-slow problems by discussing my project RESIN, which addresses memory leaks in cloud-scale infrastructure. Finally, I will conclude with exciting directions for building dependable clouds.
Chang Lou is a Ph.D. candidate in the Department of Computer Science at Johns Hopkins University, advised by Professor Ryan (Peng) Huang. His main research interests are distributed systems and operating systems. Specifically, his work focuses on enhancing the ability of distributed systems to detect, localize, and react to complex failures at runtime. His work on partial failures received a USENIX NSDI Best Paper Award (2020). His work on handling memory leaks in cloud infrastructure has been deployed globally on millions of servers at Microsoft Azure.