Shiyue Hu, MS | LARK: NLP & AI Research Lab @ CU

A Milestone Year - LARK Lab’s EMNLP 2025 with Two Papers!

Wed, 12 Nov 2025 00:00:00 +0000

We’re thrilled to share that two of our papers were accepted and presented at EMNLP 2025, held this year in Suzhou, China: one in the Main Conference and one in the Findings of EMNLP.

As one of the top two international conferences in Natural Language Processing (NLP) and Artificial Intelligence (AI), EMNLP represents a premier venue for cutting-edge research in language technologies.

This milestone marks the first year of the LARK Lab, highlighting our core research agenda at the intersection of large language models, reasoning, and trustworthy AI in healthcare.

Together, these works exemplify LARK Lab’s commitment to advancing foundational NLP methods for safe and explainable AI in critical decision-making.

Paper 1: Main Conference
Title: Simple Yet Effective: An Information‑Theoretic Approach to Multi‑LLM Uncertainty Quantification (Kruse et al., EMNLP 2025)

Paper 2: Findings Track
Title: Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction (Kruse et al., Findings EMNLP 2025)

🔍 Looking Ahead: Core Research Directions in LARK Lab

These two papers represent the core pillars of our lab’s research agenda:

Uncertainty estimation and calibration in LLMs, ensuring that the model makes honest decisions.
Temporal reasoning and multi-modal understanding, enabling LLMs to interpret longitudinal and context-rich clinical data.
Knowledge-grounded generation and mechanistic interpretability, bridging foundational NLP with safe and transparent AI applications in healthcare.

Together, these studies form the basis of our ongoing work on Safe and Transparent Diagnostic AI, supported by the National Library of Medicine (NLM) R00 award [LM014308], and will continue to shape our future grant projects and collaborations.

🌟 Acknowledgement: We are deeply grateful to our collaborators and co-authors whose expertise and dedication made these works possible.

For the MUSE paper (our Main Conference paper), we collaborated with colleagues at the University of Wisconsin-Madison:

Dr. Majid Afshar
Dr. Anoop Mayampurath
Dr. Guanhua Chen

They are our lab’s long-term collaborators.

For our Findings paper, we thank:

Internal - CU Anschutz

Dr. Elizabeth Goldberg (Emergency Medicine)
Dr. Samantha Stonbraker (College of Nursing)

Northeastern University - for contributing valuable insights from their NLP and HCI experience

Dr. Bingsheng Yao
Dr. Dakuo Wang

AMIA 2025 NLP Annual Symposium - Poster

Fri, 24 Oct 2025 00:00:00 +0000

We are excited to announce that our paper “Evaluating Large Language Models for Summarizing Long Clinical Texts and Longitudinal Patient Trajectories” has been accepted to the AMIA 2025 NLP Annual Symposium!

In this paper, we systematically evaluate several state-of-the-art open-source LLMs, along with their Retrieval-Augmented Generation (RAG) and chain-of-thought (CoT) variants, on long-context clinical summarization and prediction tasks. Our study re-engineers existing EHR-based tasks, such as discharge summarization and diagnosis prediction, to test how well LLMs integrate structured and unstructured patient data over time.

We find that while longer context windows improve input integration, they do not consistently enhance clinical reasoning, particularly in temporal progression and the prediction of rare diseases. This work establishes a foundation for evaluating LLMs on complex, multi-modal clinical data and highlights key challenges in achieving temporally coherent clinical reasoning.

Collaborating Authors:

Dr. Samantha Stonbraker and Dr. Elizabeth Goldberg – University of Colorado Anschutz Medical Campus
Dr. Bingsheng Yao and Dr. Dakuo Wang – Northeastern University

Please stop by our poster during the poster session on November 18, 2025, from 5:00 to 6:30 PM.

Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction

Thu, 04 Sep 2025 00:00:00 +0000

This work takes on a pressing challenge in clinical AI: using LLMs to handle longitudinal patient trajectories (structured and unstructured EHR data over time) for summarization and prediction tasks. The authors evaluate several open‐source LLMs, along with their Retrieval-Augmented Generation (RAG) and chain-of-thought (CoT) variants, on re‐engineered tasks from two public EHR datasets.

Key findings:

Using long context windows (i.e., more of the patient history) does help input integration (i.e., the model sees more data) but does not consistently improve clinical reasoning.
LLMs continue to struggle with temporal progression (i.e., reasoning about changes over time).
Use of RAG led to some improvements in reducing hallucination, but did not fully fix the temporal or rare‐disease limitations.
The study thus establishes a benchmark and evaluation foundation for LLMs on multi-modal, long clinical text with temporal reasoning, an important gap in this domain.

Why does this matter for our lab?

This is closely aligned with our work: we focus on diagnostic AI in critical care, dealing with EHRs, sequential annotation, temporal progression, and uncertainty. This paper addresses precisely those pain points: lengthy clinical documents, temporal reasoning, and rare conditions.
The finding that more context doesn’t automatically yield better reasoning is crucial — it alerts us to avoid the “context window = silver bullet” assumption.
The use of RAG and chain‐of‐thought in a clinical setting highlights what methods already exist — and where they fall short — providing a clearer roadmap for where we might innovate (e.g., multi-agent LLMs, uncertainty calibration, mechanistic bias analysis).