Virginia Tech

Seminar: Automated Testing and Debugging for Data-centric Software

Muhammad Ali Gulzar

PhD Candidate, University of California, Los Angeles

Monday, February 17, 2020
10:00am - 11:00am
655 McBryde Hall


Data-intensive scalable computing (DISC) systems such as MapReduce, Google FlumeJava, and Apache Spark are commonly used today to process terabytes of data. At this scale, rare, buggy corner cases frequently show up in production, causing a job to crash after running for days or, worse, to silently produce corrupted output. Unfortunately, in this domain, “testing on a random sample” rarely guarantees reliability, and “printf” debugging is expensive. Compared to traditional software, data-centric software poses new challenges for automated debugging and testing because of its scale, distributed nature, and new programming paradigms.

In this talk, I will describe the insights behind techniques that make automated debugging and testing feasible for data-centric software. I will first emphasize the key differences between traditional and data-centric software and how they pose unique engineering challenges. Next, I will tackle those challenges on two fronts: debugging and testing. First, I will present BigDebug and BigSift, which redesign interactive and automated debugging primitives for data-centric software. I will show how we leverage ideas from systems and database research to cut debugging time in half and perform precise root-cause analysis in a fraction of the job execution time. Second, I will discuss BigTest, which systematically explores dataflow program paths and automatically generates test data that is orders of magnitude smaller yet several times more effective at revealing critical bugs. Finally, I will conclude with a broader vision of designing productivity toolkits to support the growing needs of data-centric software in ML, AI, and data science.


Muhammad Ali Gulzar is a Ph.D. candidate in the Department of Computer Science at the University of California, Los Angeles. His research designs and builds systems that improve developer productivity through automated debugging and testing of data-centric software. These systems bring together a unique combination of ideas from software engineering, distributed systems, and databases to accelerate the development of reliable big data applications. His research tools have inspired commercial data processing tools and have been recognized with the 2017 Google Ph.D. Fellowship, the 2018 ACM SRC gold medal, and the 2016 “Best of VLDB” award.