Abstract
Listening to heart and lung sounds — auscultation — is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support.
We present StethoLM, the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction–response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories. Through multi-stage training combining supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data.
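StethoLM's released implementation is not reproduced here, but the fusion pattern the abstract describes, an audio encoder feeding a medical language-model backbone, is commonly realized with a learned projection from the encoder's feature space into the LM's embedding space. Below is a minimal sketch under that assumption; the module names, dimensions, and wiring are hypothetical, not the paper's code.

```python
# Minimal sketch of an audio-to-LM fusion adapter. All names and
# dimensions are illustrative assumptions, not StethoLM's code.
import torch
import torch.nn as nn

class AudioToLMAdapter(nn.Module):
    """Projects audio-encoder features into the LM's token-embedding space."""

    def __init__(self, audio_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from an audio encoder
        return self.proj(audio_feats)  # (batch, time, lm_dim)

# The projected audio tokens are prepended to the instruction's text
# embeddings and decoded by the language-model backbone.
adapter = AudioToLMAdapter()
audio_feats = torch.randn(1, 50, 768)   # placeholder encoder output
text_embeds = torch.randn(1, 20, 4096)  # placeholder text embeddings
lm_inputs = torch.cat([adapter(audio_feats), text_embeds], dim=1)
```

In the multi-stage recipe the abstract mentions, such a model would first be supervised fine-tuned on the instruction-response pairs and then aligned with direct preference optimization on preference-ranked responses.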
Overview
StethoBench Task Coverage
StethoBench covers seven distinct clinical task categories, spanning the full range of auscultation analysis, from simple binary decisions to complex multi-step clinical reasoning. The 77,027 instruction–response pairs are synthesized from 16,125 labeled recordings across cardiac and pulmonary domains. The seven categories are listed below, with an illustrative example pair sketched after the list.
- Classification: binary detection of normal vs. abnormal sounds
- Detection: identify specific acoustic events (e.g., wheeze, crackle, murmur)
- Reporting: generate a structured clinical description of findings
- Reasoning: explain the clinical significance of acoustic observations
- Differential diagnosis: rank candidate diagnoses by likelihood from audio evidence
- Comparison: compare findings across multiple recordings or time points
- Location: identify the anatomical site of auscultation findings
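To make the task formats concrete, the sketch below shows what an instruction-response pair for the detection category might look like; the field names, file path, and wording are illustrative assumptions, not StethoBench's actual schema.

```python
# Hypothetical StethoBench-style instruction-response pair for the
# detection task. Field names and content are illustrative only.
detection_example = {
    "audio": "recordings/lung_0153.wav",  # hypothetical file path
    "task": "detection",
    "instruction": "Listen to this lung recording and identify any "
                   "adventitious sounds present.",
    "response": "Coarse crackles are audible during inspiration. "
                "No wheezes are detected.",
}
```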
Results
StethoLM achieves 71.8% BERTScore F1 and 47.8% clinical accuracy overall on StethoBench, outperforming all baselines on every task. On four out-of-distribution datasets it achieves 64.8% BERTScore F1 and 25.2% clinical accuracy, ranking first on three of the four.
In-domain results on StethoBench, by task:

| Task | LLM | Pengi | LTU | GAMA | Gemma3N | Gemini-2.5-Flash | Qwen2.5-Omni | StethoLM |
|---|---|---|---|---|---|---|---|---|
| Classification | 48.6 / 4.9 | 33.6 / 7.6 | 49.0 / 2.5 | 46.4 / 7.5 | 51.3 / 31.3 | 49.3 / 49.5 | 58.9 / 31.0 | 75.5 / 66.4 |
| Detection | 43.0 / 2.8 | 30.8 / 4.3 | 43.1 / 3.6 | 44.0 / 4.2 | 47.1 / 11.5 | 45.9 / 19.2 | 52.5 / 22.0 | 70.4 / 47.9 |
| Reporting | 41.1 / 2.2 | 30.7 / 4.1 | 45.1 / 2.0 | 44.7 / 2.0 | 47.5 / 9.2 | 49.8 / 13.2 | 56.7 / 12.7 | 72.8 / 36.2 |
| Reasoning | 41.8 / 0.0 | 27.6 / 1.0 | 43.7 / 4.0 | 46.5 / 5.3 | 45.2 / 9.0 | 47.6 / 22.2 | 55.6 / 29.0 | 71.4 / 44.1 |
| Diff. Diagnosis | 42.3 / 3.3 | 26.6 / 4.1 | 44.9 / 4.0 | 44.6 / 4.0 | 44.5 / 3.7 | 43.6 / 13.2 | 60.2 / 19.8 | 67.7 / 30.6 |
| Comparison | 45.1 / 2.3 | 28.4 / 1.7 | 47.2 / 2.1 | 45.2 / 1.9 | 47.7 / 12.7 | 46.9 / 15.4 | 57.8 / 22.7 | 70.7 / 40.5 |
| Location | 44.4 / 4.3 | 26.4 / 1.1 | 46.4 / 4.4 | 46.0 / 2.2 | 49.0 / 11.0 | 45.6 / 14.3 | 54.6 / 11.3 | 72.0 / 36.8 |
| Overall | 43.8 / 2.8 | 29.2 / 3.4 | 45.6 / 3.2 | 45.3 / 3.9 | 47.5 / 12.6 | 47.0 / 21.0 | 56.5 / 21.2 | 71.8 / 47.8 |
Each cell reports BertS / Acc, where BertS = BERTScore F1 (%) and Acc = clinical accuracy (%).
Out-of-distribution results, by dataset:

| Dataset | LLM | Pengi | LTU | GAMA | Gemma3N | Gemini-2.5-Flash | Qwen2.5-Omni | StethoLM |
|---|---|---|---|---|---|---|---|---|
| TR | 44.0 / 0.5 | 27.7 / 1.6 | 44.2 / 0.6 | 44.0 / 0.3 | 48.4 / 5.6 | 44.8 / 7.5 | 60.7 / 17.7 | 66.2 / 25.7 |
| CinC | 39.5 / 1.1 | 29.4 / 1.3 | 42.3 / 0.4 | 42.6 / 0.7 | 44.5 / 5.2 | 45.9 / 21.5 | 54.0 / 12.4 | 63.3 / 22.2 |
| BMD-HS | 45.0 / 2.5 | 27.3 / 1.4 | 47.2 / 1.2 | 46.7 / 4.4 | 48.2 / 11.7 | 40.7 / 20.9 | 58.6 / 17.3 | 67.3 / 30.4 |
| FluSense | 45.8 / 0.3 | 30.6 / 15.6 | 45.8 / 9.4 | 47.3 / 22.3 | 50.7 / 9.4 | 52.3 / 14.3 | 59.4 / 37.1 | 61.5 / 23.2 |
| Overall | 43.6 / 1.1 | 28.8 / 5.0 | 44.9 / 2.9 | 45.2 / 6.9 | 48.0 / 8.0 | 45.9 / 16.1 | 58.2 / 21.1 | 64.8 / 25.2 |
Each cell reports BertS / Acc, where BertS = BERTScore F1 (%) and Acc = clinical accuracy (%).
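The BertS metric in both tables is standard BERTScore F1. As a minimal illustration of how such a score is computed, the snippet below uses the open-source bert-score package with its defaults; the paper's exact scoring configuration is not specified here and may differ.

```python
# BERTScore F1 between a model response and a reference answer,
# using the bert-score package (pip install bert-score). Defaults
# are shown; the paper's evaluation settings may differ.
from bert_score import score

candidates = ["Coarse crackles are audible during inspiration."]
references = ["Inspiratory coarse crackles are present."]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item() * 100:.1f}%")
```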
Citation
```bibtex
@article{wang2025stetholm,
  title   = {StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks},
  author  = {Wang, Yishan and Wang, Tsai-Ning and Funk, Mathias and Saeed, Aaqib},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://openreview.net/forum?id=i9RuUH9Jyj}
}
```