TMLR 2025

StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

Yishan Wang,  Tsai-Ning Wang,  Mathias Funk,  Aaqib Saeed

Eindhoven University of Technology

[Interactive demo: a heart or lung sound recording and a clinical question (text + audio) are passed to the StethoLM audio-language model, which returns a classification or free-text response.]

* Outputs are model-generated responses for research purposes only and not intended for clinical deployment.

Abstract

Listening to heart and lung sounds — auscultation — is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support.

We present StethoLM, the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction–response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories. Through multi-stage training combining supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data.
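The preference-optimization stage mentioned above can be illustrated with a minimal numpy sketch of the standard DPO objective (Rafailov et al., 2023). The log-probabilities and the `beta` value below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss on a single preference pair.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model. The loss is low when
    the policy prefers the chosen response more strongly than the
    reference model does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Toy example: the policy already favors the chosen response relative to
# the reference, so the margin is positive and the loss is below log(2).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_logp_chosen=-6.0, ref_logp_rejected=-8.0, beta=0.1)
```

Minimizing this loss over many (chosen, rejected) response pairs pushes the model toward the clinically preferred answers without needing an explicit reward model.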

Overview

StethoLM architecture overview
StethoLM architecture. A domain-adapted audio encoder (COLA) is connected to MedGemma-4B-IT via an MLP prefix projector with LoRA fine-tuning, enabling instruction-driven clinical reasoning over cardiopulmonary recordings.
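The projector's role can be sketched in a few lines of numpy: a pooled audio-encoder feature vector is mapped through a small MLP into the language model's embedding space and prepended to the embedded instruction as soft prefix tokens. All dimensions and the prefix length below are illustrative, not the actual COLA or MedGemma-4B-IT sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

AUDIO_DIM, LM_DIM, HIDDEN, N_PREFIX = 512, 2048, 1024, 8  # illustrative sizes

# Two-layer MLP projector: audio features -> N_PREFIX soft tokens in LM space.
W1 = rng.standard_normal((AUDIO_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, N_PREFIX * LM_DIM)) * 0.02

def project(audio_feat):
    """Map one pooled audio embedding to a sequence of prefix embeddings."""
    h = np.maximum(audio_feat @ W1, 0.0)        # simple ReLU nonlinearity
    return (h @ W2).reshape(N_PREFIX, LM_DIM)   # prefix token embeddings

audio_feat = rng.standard_normal(AUDIO_DIM)     # pooled encoder output
text_emb = rng.standard_normal((16, LM_DIM))    # embedded instruction tokens
lm_input = np.concatenate([project(audio_feat), text_emb], axis=0)
```

In the full system only the projector and the LoRA adapters in the backbone are trained; the sketch above shows just the shape bookkeeping that lets audio evidence enter the language model as ordinary token embeddings.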

StethoBench Task Coverage

StethoBench covers seven distinct clinical task categories, spanning the full range of auscultation analysis — from simple binary decisions to complex multi-step clinical reasoning. The 77,027 instruction–response pairs are synthesized from 16,125 labeled recordings across cardiac and pulmonary domains.
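One way such instruction–response pairs can be synthesized is by expanding each labeled recording through per-task templates. The templates and field names below are hypothetical illustrations, not the actual StethoBench prompts:

```python
# Hypothetical per-task templates; the real StethoBench prompts differ.
TEMPLATES = {
    "classification": ("Is this {organ} sound normal or abnormal?",
                       "The recording is {label}."),
    "detection": ("Which acoustic events are present in this {organ} recording?",
                  "The recording contains: {events}."),
}

def synthesize_pairs(record):
    """Expand one labeled recording into one pair per task template."""
    pairs = []
    for task, (instr, resp) in TEMPLATES.items():
        pairs.append({
            "task": task,
            "instruction": instr.format(**record),
            "response": resp.format(**record),
            "audio": record["path"],
        })
    return pairs

record = {"organ": "lung", "label": "abnormal",
          "events": "wheeze, crackle", "path": "rec_001.wav"}
pairs = synthesize_pairs(record)
```

Applying templates for all seven task categories to every labeled recording is what lets a few thousand recordings yield tens of thousands of training pairs.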

StethoBench task categories
Clinical task categories in StethoBench. Examples of instructions (left) and model responses (right) across all seven task types.
- Classification: Binary detection of normal vs. abnormal sounds
- Detection: Identify specific acoustic events (e.g., wheeze, crackle, murmur)
- Reporting: Generate a structured clinical description of findings
- Reasoning: Explain the clinical significance of acoustic observations
- Diff. Diagnosis: Rank candidate diagnoses by likelihood from audio evidence
- Comparison: Compare findings across multiple recordings or time points
- Location: Identify the anatomical site of auscultation findings

Results

StethoLM achieves 71.8% BERTScore F1 and 47.8% clinical accuracy on StethoBench, outperforming all baselines. Across four out-of-domain datasets it averages 64.8% BERTScore F1 and 25.2% clinical accuracy, ranking first on three of the four benchmarks.
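BERTScore compares generated and reference text via contextual token embeddings rather than exact word overlap. A toy numpy sketch of the core computation (real BERTScore uses BERT-family embeddings and IDF weighting; the random vectors here stand in for those embeddings):

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style F1 from token embeddings (toy sketch).

    cand_emb, ref_emb: (n_tokens, dim) arrays of per-token embeddings.
    Precision and recall are greedy best-match cosine similarities;
    F1 is their harmonic mean.
    """
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                        # pairwise cosine similarities
    precision = sim.max(axis=1).mean()   # best match for each candidate token
    recall = sim.max(axis=0).mean()      # best match for each reference token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
ref = rng.standard_normal((5, 8))
score_same = bertscore_f1(ref, ref)      # identical texts give a perfect score
```

Because matching is done in embedding space, clinically equivalent phrasings (e.g., "abnormal heart sound" vs. "pathological murmur") can score highly even without word-level overlap, which is why it complements the stricter clinical-accuracy metric.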

| Task | LLM | Pengi | LTU | GAMA | Gemma3N | Gemini-2.5-Flash | Qwen2.5-Omni | StethoLM |
|---|---|---|---|---|---|---|---|---|
| Classification | 48.6 / 4.9 | 33.6 / 7.6 | 49.0 / 2.5 | 46.4 / 7.5 | 51.3 / 31.3 | 49.3 / 49.5 | 58.9 / 31.0 | 75.5 / 66.4 |
| Detection | 43.0 / 2.8 | 30.8 / 4.3 | 43.1 / 3.6 | 44.0 / 4.2 | 47.1 / 11.5 | 45.9 / 19.2 | 52.5 / 22.0 | 70.4 / 47.9 |
| Reporting | 41.1 / 2.2 | 30.7 / 4.1 | 45.1 / 2.0 | 44.7 / 2.0 | 47.5 / 9.2 | 49.8 / 13.2 | 56.7 / 12.7 | 72.8 / 36.2 |
| Reasoning | 41.8 / 0.0 | 27.6 / 1.0 | 43.7 / 4.0 | 46.5 / 5.3 | 45.2 / 9.0 | 47.6 / 22.2 | 55.6 / 29.0 | 71.4 / 44.1 |
| Diff. Diagnosis | 42.3 / 3.3 | 26.6 / 4.1 | 44.9 / 4.0 | 44.6 / 4.0 | 44.5 / 3.7 | 43.6 / 13.2 | 60.2 / 19.8 | 67.7 / 30.6 |
| Comparison | 45.1 / 2.3 | 28.4 / 1.7 | 47.2 / 2.1 | 45.2 / 1.9 | 47.7 / 12.7 | 46.9 / 15.4 | 57.8 / 22.7 | 70.7 / 40.5 |
| Location | 44.4 / 4.3 | 26.4 / 1.1 | 46.4 / 4.4 | 46.0 / 2.2 | 49.0 / 11.0 | 45.6 / 14.3 | 54.6 / 11.3 | 72.0 / 36.8 |
| Overall | 43.8 / 2.8 | 29.2 / 3.4 | 45.6 / 3.2 | 45.3 / 3.9 | 47.5 / 12.6 | 47.0 / 21.0 | 56.5 / 21.2 | 71.8 / 47.8 |

Each cell reports BertS / Acc, where BertS = BERTScore F1 (%) and Acc = clinical accuracy (%).

| Dataset | LLM | Pengi | LTU | GAMA | Gemma3N | Gemini-2.5-Flash | Qwen2.5-Omni | StethoLM |
|---|---|---|---|---|---|---|---|---|
| TR | 44.0 / 0.5 | 27.7 / 1.6 | 44.2 / 0.6 | 44.0 / 0.3 | 48.4 / 5.6 | 44.8 / 7.5 | 60.7 / 17.7 | 66.2 / 25.7 |
| CinC | 39.5 / 1.1 | 29.4 / 1.3 | 42.3 / 0.4 | 42.6 / 0.7 | 44.5 / 5.2 | 45.9 / 21.5 | 54.0 / 12.4 | 63.3 / 22.2 |
| BMD-HS | 45.0 / 2.5 | 27.3 / 1.4 | 47.2 / 1.2 | 46.7 / 4.4 | 48.2 / 11.7 | 40.7 / 20.9 | 58.6 / 17.3 | 67.3 / 30.4 |
| FluSense | 45.8 / 0.3 | 30.6 / 15.6 | 45.8 / 9.4 | 47.3 / 22.3 | 50.7 / 9.4 | 52.3 / 14.3 | 59.4 / 37.1 | 61.5 / 23.2 |
| Overall | 43.6 / 1.1 | 28.8 / 5.0 | 44.9 / 2.9 | 45.2 / 6.9 | 48.0 / 8.0 | 45.9 / 16.1 | 58.2 / 21.1 | 64.8 / 25.2 |

Each cell reports BertS / Acc, where BertS = BERTScore F1 (%) and Acc = clinical accuracy (%).

Citation

@article{wang2025stetholm,
  title   = {StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks},
  author  = {Wang, Yishan and Wang, Tsai-Ning and Funk, Mathias and Saeed, Aaqib},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://openreview.net/forum?id=i9RuUH9Jyj}
}