Virginia Tech

Seminar: Data Quality in the Deep Learning Era

Ismini Lourentzou

Research Scientist, IBM Almaden Research Center

Monday, April 20, 2020
12:30pm - 1:30pm
(Zoom Only)


Deep learning has been applied with great success in a variety of domains, opening opportunities for achieving human-level performance in many applications, but with models often trained on millions of annotated instances. To produce reliable solutions, neural models typically depend on three basic data-related characteristics: volume, representative input and a good set of labels for the task at hand. In real-world scenarios, however, only a small portion of the available data is labeled, clean and structured. As such, data quality proves to be a critical factor for the success of machine learning. Domain knowledge and human intervention are key components of strong predictive performance; however, encoding such information in a learned model is often non-trivial and costly.

In this talk, I will present methods that can be incorporated into any machine learning model to improve data quality in terms of input quality, label quality and training procedures. First, I will focus on active learning strategies for acquiring labels for an arbitrary user-defined task. I will show that the best-performing active learning strategy depends on the task at hand and will introduce an iterative elimination algorithm that learns a combination of active learning acquisition functions on the fly, maximizing annotation performance early in the process. Then, I will propose techniques for efficiently utilizing additional unlabeled data during the training process and move on to describe a hybrid encoder-decoder model for lexical normalization that can serve as a pre-processing step for NLP tools to adapt to noisy informal text. Finally, I will conclude with ongoing work and future directions.


Ismini Lourentzou is a Research Scientist at IBM Almaden Research Center, Intelligence Augmentation Team. Her research interests lie at the intersection of machine learning, data science and big data. Her work is focused on statistical methods that improve the applicability of machine learning in high-expertise interdisciplinary domains with limited annotated data, such as genomics & public health, education, humanities, social computing and wearable technologies.

Ismini recently obtained her Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign (UIUC), under the supervision of Professor ChengXiang Zhai. She was invited to participate in the Rising Stars in EECS 2019 workshop and has received a Microsoft Azure Research Award, an Outstanding Teaching Assistant Award and an IBM Invention Plateau. She holds two bachelor's degrees, one in Computer Science from the Athens University of Economics and Business, Greece and one in Business Administration from the University of West Attica, Greece (formerly known as Technological Educational Institute of Athens). Prior to UIUC, she worked in the Greek banking sector for nearly a decade, holding a variety of positions at National Bank of Greece, and as a technical reviewer for technology-related publishing companies.