One paper about Large Language Models for Tabular Data Embedding is accepted to EMNLP 2024 as Findings!

We are thrilled to announce that our paper, “When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications?” has been accepted to the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2024 as a Findings paper!

This research presents a comprehensive evaluation of large language models (LLMs) on electronic health record (EHR) data for medical diagnostic and prognostic tasks.

While LLMs have made significant strides in natural language processing and data analysis, their application to tabular data, especially the numerical data central to clinical settings, remains underexplored. In this study, we investigate the effectiveness of vector representations from the last hidden states of LLMs for predicting key medical outcomes such as diagnoses, length of stay, and mortality. We compare these LLM embeddings against raw numerical EHR data using traditional machine learning algorithms, such as eXtreme Gradient Boosting (XGBoost), that are designed to excel at tabular data learning.
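The comparison above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the paper's actual pipeline: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, a fixed random projection stands in for an LLM's last-hidden-state embedding (a real pipeline would serialize each record to text and run it through the model), and all data, feature dimensions, and the toy outcome are synthetic.

```python
# Sketch: raw numeric EHR features vs. (simulated) LLM embeddings,
# each fed to a gradient-boosted tree classifier and scored by AUROC.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "EHR" table: 500 patients, 8 numeric features (e.g., labs, vitals).
X_raw = rng.normal(size=(500, 8))
y = (X_raw[:, 0] + 0.5 * X_raw[:, 1] > 0).astype(int)  # toy outcome label

# Stand-in for an LLM embedding: a fixed nonlinear projection of each
# record into a higher-dimensional vector. A real pipeline would take
# the last hidden state of the LLM over the serialized record instead.
W = rng.normal(size=(8, 64))
X_emb = np.tanh(X_raw @ W)

def auc_for(X, y):
    """Train a gradient-boosted tree model and return held-out AUROC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"raw-feature AUROC: {auc_for(X_raw, y):.3f}")
print(f"embedding AUROC:   {auc_for(X_emb, y):.3f}")
```

The same scoring function is applied to both feature sets, so any gap in AUROC reflects the representation rather than the classifier.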

Our findings are promising. Although raw data still holds an edge on most medical ML tasks, zero-shot embeddings from instruction-tuned LLMs achieve competitive results. This suggests exciting directions for future research in medical applications, particularly in how LLMs can be leveraged to enhance machine learning classifiers for medical prediction tasks.

Our paper is currently available on arXiv. We’re excited to present these findings at EMNLP. Stay tuned for more updates on our research!

Yanjun Gao
Assistant Professor

My research interests include Natural Language Generation, Semantic Representation, Summarization Evaluation, Graph-based NLP, and AI applications in medicine and education.