Saksham Khatwani, MS | LARK: NLP & AI Research Lab @ CU

A Milestone Year - LARK Lab’s EMNLP 2025 with Two Papers!

Wed, 12 Nov 2025 00:00:00 +0000

We’re thrilled to share that two of our papers were accepted and presented at EMNLP 2025, held this year in Suzhou, China: one in the Main Conference and one in the Findings of EMNLP.

As one of the top two international conferences in Natural Language Processing (NLP) and Artificial Intelligence (AI), EMNLP represents a premier venue for cutting-edge research in language technologies.

This milestone marks the first year of the LARK Lab, highlighting our core research agenda at the intersection of large language models, reasoning, and trustworthy AI in healthcare.

Together, these works exemplify LARK Lab’s commitment to advancing foundational NLP methods for safe and explainable AI in critical decision-making.

Paper 1: Main Conference
Title: Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification (Kruse et al., EMNLP 2025)

Paper 2: Findings Track
Title: Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction (Kruse et al., Findings EMNLP 2025)

🔍 Looking Ahead: Core Research Directions in LARK Lab

These two papers represent the core pillars of our lab’s research agenda:

Uncertainty estimation and calibration in LLMs, ensuring that a model’s stated confidence reflects how reliable its decisions actually are.
Temporal reasoning and multi-modal understanding, enabling LLMs to interpret longitudinal and context-rich clinical data.
Knowledge-grounded generation and mechanistic interpretability, bridging foundational NLP with safe and transparent AI applications in healthcare.

Together, these studies form the basis of our ongoing work on Safe and Transparent Diagnostic AI, supported by the National Library of Medicine (NLM) R00 award [LM014308], and will continue to shape our future grant projects and collaborations.

🌟 Acknowledgement: We are deeply grateful to our collaborators and co-authors whose expertise and dedication made these works possible.

On the MUSE paper (our Main Conference paper), we collaborated with colleagues at the University of Wisconsin-Madison:

Dr. Majid Afshar
Dr. Anoop Mayampurath
Dr. Guanhua Chen

They are our lab’s long-term collaborators.

For our Findings paper, we thank:

Internal - CU Anschutz

Dr. Elizabeth Goldberg (Emergency Medicine)
Dr. Samantha Stonbraker (College of Nursing)

Northeastern University - for contributing valuable insights from their NLP and HCI experience

Dr. Bingsheng Yao
Dr. Dakuo Wang

Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification

Wed, 12 Nov 2025 00:00:00 +0000

Large language models (LLMs) are increasingly used in high-stakes settings, which raises a key question: how confident are they in their predictions? This paper argues that most prior work on calibration and uncertainty quantification focuses on individual models and often ignores the potential benefit of model diversity. The authors posit that because different LLMs are trained differently (and language itself is Zipfian), their predictions will differ in complementary ways.

The key idea is to leverage that diversity by aggregating subsets of LLMs to produce better uncertainty estimates. To this end, the paper proposes MUSE (Multi-LLM Uncertainty via Subset Ensembles), an information-theoretic method that uses Jensen-Shannon Divergence (JSD) to pick well-calibrated subsets of LLMs and then aggregates their predictions.

Empirically, the work shows that on binary prediction tasks, MUSE improves both calibration and predictive performance compared to a single model or a naive ensemble. The authors also explore using MUSE’s outputs to fine-tune an LLM specifically for calibration, via chain-of-thought distillation.
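To make the subset-then-aggregate idea concrete, here is a minimal sketch for a single binary prediction. It is not the selection procedure from the paper: the greedy rule, the `max_jsd` threshold, and the function names are illustrative assumptions. It only shows how JSD can measure disagreement among models and gate which ones are averaged.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def jsd(dists):
    """Equal-weight generalized Jensen-Shannon divergence of several distributions:
    entropy of the mixture minus the mean entropy of the members."""
    n = len(dists)
    mixture = [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]
    return entropy(mixture) - sum(entropy(d) for d in dists) / n

def bernoulli(p):
    """Turn P(y=1) into the two-outcome distribution [P(y=0), P(y=1)]."""
    return [1.0 - p, p]

def select_and_aggregate(probs, max_jsd=0.05):
    """Hypothetical greedy rule (an assumption, not the paper's algorithm):
    visit models from most to least confident, keep a model only if the
    subset's JSD stays at or below max_jsd, then average the kept models."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5), reverse=True)
    chosen = [order[0]]
    for i in order[1:]:
        candidate = [bernoulli(probs[j]) for j in chosen + [i]]
        if jsd(candidate) <= max_jsd:
            chosen.append(i)
    aggregated = sum(probs[i] for i in chosen) / len(chosen)
    return sorted(chosen), aggregated

# Four hypothetical models' P(y=1) for one input; the third disagrees strongly.
subset, p_hat = select_and_aggregate([0.92, 0.85, 0.20, 0.75])
print(subset, round(p_hat, 2))
```

In this toy run the dissenting model is left out and the remaining three are averaged. Whether excluding dissent helps or hurts calibration depends on the task, which is why the subset choice is the interesting part of the method.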

Why does this matter for our lab?

The theme aligns directly with our interest in LLM calibration and uncertainty estimation, especially in applied domains (e.g., clinical or critical-care settings).
It shifts the lens from “one unreliable LLM” to “multiple LLMs being aggregated”. This opens a new design space: ensemble strategies, subset selection, and metrics beyond accuracy.
The use of an information-theoretic metric (JSD) for subset identification is elegant and theoretically grounded, so there may be opportunities to generalize or adapt it to our own multi-model pipelines (e.g., multi-agent LLMs in ICU decision support).
The fine-tuning via chain-of-thought distilled from the ensemble is interesting: it suggests that uncertainty-aware calibration is not only a post-hoc measure, but can be baked into model training.
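As an illustration of what "metrics beyond accuracy" can look like on a binary task, the sketch below computes two common calibration measures: the Brier score and a simple binned expected calibration error (ECE). The binning scheme (ten equal-width bins over the predicted P(y=1)) is an assumption chosen for brevity, and nothing here is taken from the paper's evaluation code.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted P(y=1) and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE on the positive class: within each bin, compare the mean
    predicted P(y=1) with the observed fraction of positives, then take the
    average gap weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - frac_pos)
    return ece

# Two hypothetical models with identical accuracy at a 0.5 threshold.
labels = [1, 1, 1, 0]
hedged = [0.75, 0.75, 0.75, 0.75]
overconfident = [1.0, 1.0, 1.0, 1.0]
print(expected_calibration_error(hedged, labels))
print(expected_calibration_error(overconfident, labels))
```

Both toy models predict the positive class for every case and so have the same accuracy, but only the first reports a confidence that matches how often it is right. Accuracy alone cannot see that difference.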

GenAI4Health workshop at NeurIPS 2025 - Presentation

Mon, 03 Nov 2025 00:00:00 +0000

We are delighted to announce that our paper, “Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning,” has been accepted to the GenAI4Health workshop at NeurIPS 2025!

In this paper, we explore a new paradigm for incorporating knowledge graphs (KGs) by framing our problem as reward modeling over KG paths. This framing is motivated by a fundamental observation from computational complexity theory: verifying a solution is often easier than generating one from scratch.

We examine five different task formulations and train models using various techniques, including supervised fine-tuning, preference alignment, and chain-of-thought distillation. While we observe promising results for task-specific formulations, we also notice brittleness in terms of generalizability.

This research is a collaboration with:

Dr. Majid Afshar (University of Wisconsin–Madison)
Dr. Dmitriy Dligach (Loyola University Chicago)

Join us for the lightning talk! Saksham Khatwani will present our work on December 6. Slides and code will be released soon. Stay tuned!