Seminar: Rethinking AI Systems Through Efficient Model Communication
Yuhan Liu
PhD Student
University of Chicago
Monday, February 23, 2026
9:30 - 10:30 a.m.
1100 Torgersen Hall
Abstract
For decades, AI models have interacted with humans directly through human-centric inputs and outputs. Today, they are deployed far more ubiquitously and are often embedded in complex software systems, interacting with other models or with software rather than directly with humans. This paradigm shift raises a natural question: can models interact with other models and with software using model-native languages?
In this talk, I will present my work on facilitating model-native interactions among models and between models and software. To enable more efficient and practical model interactions through model-native states (i.e., the KV cache) in LLM systems, my work CacheGen is the first system to share the KV cache across different user queries by compressing it into compact bitstreams, and my work DroidSpeak is the first system to share the KV cache across different models. My research has made real-world impact through the open-source project LMCache, which is widely used in production by top-tier AI companies. Together, these works make LLM inference 5–10× faster than state-of-the-art inference engines. To enable more accurate model-to-software communication, my work ChameleonAPI encodes software code structure into model-native loss functions, allowing models to be retrained for up to 43% higher application-level accuracy in vision applications.
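To illustrate the core idea of KV-cache reuse that these systems build on, below is a minimal conceptual sketch. It is not CacheGen, DroidSpeak, or LMCache themselves, and it omits their key contributions (compressing the cache into compact bitstreams and sharing it across queries and models); it only shows the basic mechanism of prefilling a shared prompt prefix once and reusing its KV cache for later queries, here via Hugging Face transformers' past_key_values interface. The model name, prompts, and helper function are illustrative assumptions.

```python
# Conceptual sketch of prefix KV-cache reuse: prefill a shared prefix once,
# then serve new queries on top of the cached prefix without recomputing it.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "gpt2"  # placeholder small model, chosen only for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# A long shared context that many user queries have in common.
shared_prefix = "System: you answer questions about the attached report.\n"
prefix_inputs = tok(shared_prefix, return_tensors="pt")

# Prefill the shared prefix once and keep its KV cache.
prefix_cache = DynamicCache()
with torch.no_grad():
    prefix_cache = model(**prefix_inputs, past_key_values=prefix_cache).past_key_values

def answer(query: str) -> str:
    """Serve a new query on top of the cached prefix, skipping its prefill."""
    inputs = tok(shared_prefix + query, return_tensors="pt")
    # Copy the cache because generation appends new key/value entries to it.
    out = model.generate(**inputs,
                         past_key_values=copy.deepcopy(prefix_cache),
                         max_new_tokens=32)
    return tok.decode(out[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

print(answer("Question: what is the report's main finding?"))
```

In a real serving system the cached prefix would be stored, compressed, and shared across requests, machines, and (in DroidSpeak's setting) models, which is where the bulk of the engineering and the reported speedups come from.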
Biography
Yuhan Liu is a final-year PhD student at the University of Chicago, co-advised by Junchen Jiang and Shan Lu. Her research interests lie in building efficient, large-scale systems and networking support for ML model inference. Her work has appeared in top computer systems and networking conferences such as OSDI, SIGCOMM, and NSDI. She was named an MIT EECS Rising Star and received a EuroSys Best Paper Award and UChicago's Neubauer PhD Fellowship for her research. She also leads two open-source projects that build a large-scale KV caching layer for efficient LLM inference and are used in production at over 30 companies, including Google Cloud, Amazon AWS, NVIDIA, and IBM.