In collaboration with Kevin Johnson, MD's AI for Ambulatory Care Innovation lab, we're reimagining what patient care would look like if the clinic were instrumented with modern robotic sensors, enabling real-time clinical decision support. Throughout the encounter, and even
before the provider entered the room, multimodal AI models would quantify the patient's physical, cognitive,
and emotional health, augmenting the clinical exam and enabling
quantitative longitudinal assessment. As first steps toward this goal, we are developing ML models to characterize patient-provider
interactions (Jang et al., 2025), patient gait, and cognitive issues, framing the problem as medical visual question answering (Park et al., 2025). Critically, we're developing these methods to preserve patient privacy, ensure transparency and explainability, and avoid interference with the clinician-patient relationship. We're also exploring related approaches for spatio-temporal clinical understanding of surgery (Liao et al., 2025a; 2025b) (with Daniel Hashimoto, MD) and in trauma bays (with Jeremy Cannon, MD).
We're developing ambient AI systems that collaborate with the provider and patient to deliver improved clinical decision support.
Technology has increasingly hindered meaningful engagement
between patients and providers during primary care visits, often detracting from effective communication. However,
artificial intelligence (AI) advancements present new opportunities
to enhance and improve patient-provider communication. A promising
application is the use of AI to identify
and highlight agenda items for discussion during visits and
to summarize relevant clinical details in real-time. This study
explores the feasibility, potential, and challenges of developing a
real-time automated agenda-setting system leveraging
generative AI, specifically large language models (LLMs).
From a dataset of recorded and annotated simulation visits,
we evaluate the performance of LLMs in identifying agenda
items and capturing associated clinical details within the conversation
flow. In particular, we focus on the impact of real-time constraints and
contextual factors on the ability to detect and summarize relevant items.
Our findings suggest that optimizing performance requires balancing
contextual information supplied through summaries against the
raw conversation itself. Based on these results, we discuss the
challenges involved in developing a real-time agenda-setting
system and offer recommendations for future advancements.
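The core context-management tradeoff the abstract describes (how much of the LLM's input to spend on a running summary of earlier dialogue versus verbatim recent turns) might be sketched as follows. This is a hypothetical scheme; the function and parameter names are illustrative, not taken from the paper:

```python
def build_prompt_context(turns, running_summary, recent_k=4):
    """Assemble the context passed to the LLM at each step of the visit:
    a running summary of older dialogue plus the last `recent_k` turns
    verbatim. Older turns are represented only through the summary,
    keeping the prompt small enough for real-time use."""
    older, recent = turns[:-recent_k], turns[-recent_k:]
    parts = []
    if older and running_summary:
        parts.append("SUMMARY OF EARLIER CONVERSATION: " + running_summary)
    parts += [f"{speaker}: {text}" for speaker, text in recent]
    return "\n".join(parts)
```

Raising `recent_k` gives the model more verbatim conversation but a longer prompt; lowering it leans on the summary, which may omit details needed to detect an agenda item.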
@inproceedings{Jang2025Towards,
  author    = {Jang, Kuk Jin and Bhatti, Sameer and Pugh, Sydney and Maduno, Chimezie and Sridhar, Sarang and Mopidevi, Sriharsha and Eaton, Eric and Johnson, Kevin},
  title     = {Towards a Real-time Clinical Agenda Setting System for Enhancing Clinical Interactions in Primary Care Visits},
  booktitle = {Workshop on Large Language Models and Generative AI for Health at AAAI 2025},
  year      = {2025},
}
Multimodal large language models (MLLMs) can simultaneously
process visual, textual, and auditory data, capturing
insights that complement human analysis. However, existing
video question-answering (VidQA) benchmarks and datasets
often exhibit a bias toward a single modality, despite being
intended to require advanced reasoning that integrates diverse
modalities to answer the queries.
In this work, we introduce the modality importance score
(MIS) to identify such bias. It is designed to assess which
modality embeds the necessary information to answer the
question. Additionally, we propose an innovative method using
state-of-the-art MLLMs to estimate the modality importance,
which can serve as a proxy for human judgments of
modality perception. With this MIS, we demonstrate the presence
of unimodal bias and the scarcity of genuinely multimodal
questions in existing datasets. We further validate the
modality importance score with multiple ablation studies to
evaluate the performance of MLLMs on permuted feature
sets. Our results indicate that current models do not effectively
integrate information due to modality imbalance in existing
datasets. Our proposed MLLM-derived MIS can guide
the curation of modality-balanced datasets that advance multimodal
learning and enhance MLLMs’ capabilities to understand
and utilize synergistic relations across modalities.
@inproceedings{Park2025AssessingModalityBias,
  author    = {Park, Jean and Jang, Kuk Jin and Alasaly, Basam and Mopidevi, Sriharsha and Zolensky, Andrew and Eaton, Eric and Lee, Insup and Johnson, Kevin},
  title     = {Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {39},
  number    = {19},
  pages     = {19821--19829},
  doi       = {10.1609/aaai.v39i19.34183},
  year      = {2025},
}
Learning efficient visual representations across heterogeneous
unlabeled datasets remains a central challenge in federated learning. Effective federated representations
require features that are jointly informative across clients while disentangling client-specific
factors without supervision. We thus introduce FORLA, a novel framework
for federated object-centric representation learning and feature adaptation using
unsupervised slot attention. At the core of our method is a shared feature adapter,
trained collaboratively across clients to adapt features from foundation models, and
a shared slot attention module that learns to reconstruct the adapted features. To
optimize this adapter, we design a two-branch student–teacher architecture. In each
client, a student decoder learns to reconstruct full features from foundation models,
while a teacher decoder reconstructs their adapted, low-dimensional counterpart.
The shared slot attention module bridges cross-domain learning by aligning object-level
representations across clients. Experiments on multiple real-world datasets
show that our framework not only outperforms centralized baselines on object
discovery but also learns a compact, universal representation that generalizes well
across domains. This work highlights federated slot attention as an effective tool
for scalable, unsupervised visual representation learning from cross-domain data
with distributed concepts. Our code, data, and pretrained models are available at:
https://github.com/PCASOlab/FORLA.
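At the heart of the framework is slot attention: features compete to be explained by a small set of slots, and each slot converges toward the mean of the features it claims. The following is a numerics-only toy of that mechanism (one simplified dot-product iteration, no learned projections), not the FORLA architecture itself:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def slot_attention_step(slots, feats):
    """One simplified slot-attention iteration: each feature distributes
    attention over slots (softmax across slots, so slots compete for
    features), then each slot is updated to the attention-weighted mean
    of the features assigned to it."""
    attn = [softmax([dot(f, s) for s in slots]) for f in feats]
    new_slots = []
    for j in range(len(slots)):
        w = [attn[i][j] for i in range(len(feats))]
        total = sum(w)
        new_slots.append([
            sum(w[i] * feats[i][d] for i in range(len(feats))) / total
            for d in range(len(feats[0]))
        ])
    return new_slots
```

Iterating this update on features drawn from two clusters pulls each slot toward one cluster, which is the object-binding behavior the shared slot-attention module exploits across clients.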
@inproceedings{Liao2025FORLA,
  author    = {Liao, Guiqiu and Jogan, Matjaz and Eaton, Eric and Hashimoto, Daniel A.},
  title     = {FORLA: Federated Object-Centric Representation Learning with Slot Attention},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
}
Weakly supervised video object segmentation (WSVOS)
enables the identification of segmentation maps without requiring
extensive annotations of object masks, relying instead on coarse
video labels indicating object presence.
WSVOS in surgical videos is, however, more challenging
due to the complex interaction of multiple transient objects,
such as surgical tools moving in and out of the surgical field.
In this scenario, state-of-the-art WSVOS methods struggle
to learn accurate segmentation maps. We address this problem by
introducing ViDeo Spatio-Temporal disentanglement
Networks (VDST-Net), a framework to disentangle complex
spatio-temporal object interactions using semi-decoupled
knowledge distillation to predict high-quality class activation
maps (CAMs). A teacher network is designed to help a
temporal-reasoning student network resolve activation conflicts,
as the student leverages temporal dependencies when
specifics about object location and timing in the video are
not provided. We demonstrate the efficacy of our framework
on a challenging surgical video dataset where objects are,
on average, present in less than 60% of annotated frames,
and compare our method to state-of-the-art methods on
surgical data and on a public dataset commonly used to
benchmark WSVOS. Our method outperforms state-of-the-art
techniques and generates accurate segmentation masks
under video-level weak supervision. Our code is available
at: https://github.com/PCASOlab/VDST-net.
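The "semi-decoupled" idea, in which the teacher constrains the student only where it is itself confident, can be illustrated with a masked distillation loss over class activation maps. This is an illustrative sketch of the principle, not the actual VDST-Net objective:

```python
def masked_distillation_loss(student_cam, teacher_cam, conf=0.7):
    """Toy semi-decoupled distillation over 2D activation maps: the
    student's CAM is pulled toward the teacher's only where the teacher
    is confident (activation strongly on, >= conf, or strongly off,
    <= 1 - conf). In ambiguous regions the student is left free, e.g.
    to resolve the conflict with temporal reasoning."""
    total, n = 0.0, 0
    for s_row, t_row in zip(student_cam, teacher_cam):
        for s, t in zip(s_row, t_row):
            if t >= conf or t <= 1.0 - conf:  # teacher confident here
                total += (s - t) ** 2
                n += 1
    return total / n if n else 0.0
```

Decoupling the loss this way keeps the teacher from propagating its own activation conflicts into the student, while still anchoring the student in regions the teacher resolves cleanly.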
@inproceedings{Liao2025DisentanglingSpatioTemporal,
  author    = {Liao, Guiqiu and Jogan, Matjaz and Sambasastry, Sai Koushik Samudrala and Eaton, Eric and Hashimoto, Daniel},
  title     = {Disentangling Spatio-Temporal Knowledge for Weakly Supervised Object Detection and Segmentation in Surgical Video},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2025},
}