Seminar: Co-designing Distributed Systems and Storage Stacks for Improved Reliability

Ramnatthan Alagappan

Postdoctoral Researcher, VMware Research Group

Thursday, February 3, 2022

10:00 AM

Via Zoom

 

Abstract

Distributed storage systems form the core of modern cloud services. Like much systems software, these systems are built using layering: designers layer distributed protocols (e.g., Paxos, 2PC) upon local storage stacks. Such layering abstracts away details of the local storage stack from the layers above, easing development. I will show that such black-box layering, unfortunately, masks vital information, resulting in poor reliability. I will then demonstrate that it is greatly beneficial to expose useful information across layers of a distributed storage system (while hiding unimportant details). In particular, I will show that reliability can be significantly improved by co-designing distributed systems and storage stacks. In the first half of the talk, I will show how local problems in the storage layer can lead to data loss, corruption, and unavailability in widely used distributed storage systems. I will then present CTRL, a new approach that co-designs the storage stack and the distributed layers so that they cooperate to perform correct recovery. I have implemented CTRL in two practical systems and will show that it incurs negligible performance overhead while significantly improving resiliency to storage faults.

Biography

Ram Alagappan is a postdoctoral researcher at the VMware Research Group. He earned his Ph.D. at the University of Wisconsin-Madison, working with Professors Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau. His work has been published at top systems venues, has been invited to journals, and has won three best paper awards (FAST '17, '18, and '20). His dissertation also received an honorable mention for the UW CS Best Dissertation. His open-source frameworks have had a practical impact: these tools have exposed more than 80 severe vulnerabilities across 20 widely used systems. Ideas from his work on CTRL have been adopted by a financial database to make it resilient to storage faults.