Seminar: Machine Translation for All: Improving Machine Translation in Low Resource, Domain Mismatch & Noisy Training Settings
PhD Candidate, The Johns Hopkins University
Friday, February 26, 2021
Machine translation uses machine learning to automatically translate text from one language to another and has the potential to reduce language barriers. Recent improvements in machine translation, driven partly by deep neural network approaches, have made it more widely usable. However, like most deep learning algorithms, neural machine translation is sensitive to the quantity and quality of training data, and therefore produces poor translations for some languages and styles of text. Machine translation training data typically comes in the form of parallel text: sentences translated between the two languages of interest. Limited quantities of parallel text are available for most language pairs, leading to a low-resource problem. Even when training data is available in the desired language pair, it is frequently formal text, leading to a domain mismatch when models are used to translate a different type of data, such as social media or medical text. Neural machine translation currently performs poorly in low-resource and domain mismatch settings; my work aims to overcome these limitations and make machine translation a useful tool for all users.
In this talk, I will discuss a method for improving translation in low-resource settings, Simulated Multiple Reference Training (SMRT; Khayrallah et al., 2020), which uses a paraphraser to simulate training on all possible translations per sentence. I will also discuss work on improving domain adaptation (Khayrallah et al., 2018) and work on analyzing the effect of noisy training data (Khayrallah and Koehn, 2018).
Huda Khayrallah is a PhD candidate in Computer Science at The Johns Hopkins University, where she is advised by Philipp Koehn. She is part of the Center for Language and Speech Processing and the machine translation group. She works on applied machine learning for Natural Language Processing, primarily machine translation. Her work focuses on overcoming deep learning’s sensitivity to the quantity and quality of the training data, including low-resource and domain adaptation settings. In summer 2019, she was a research intern at Lilt, working on translator-in-the-loop machine translation. She holds an MSE in Computer Science from Johns Hopkins (2017) and a BA in Computer Science from UC Berkeley (2015). More information can be found on her website: http://www.cs.jhu.edu/~huda