Abstract
Listening to heart and lung sounds — auscultation — is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support.
We present StethoLM, the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction–response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories. Through multi-stage training combining supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data.
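StethoLM's released implementation is not reproduced here, but the fusion pattern the abstract describes, an audio encoder feeding a medical language-model backbone, is commonly realized with a learned projection from the encoder's feature space into the LM's embedding space. Below is a minimal sketch under that assumption; the module names, dimensions, and wiring are hypothetical, not the paper's code.

```python
# Minimal sketch of an audio-to-LM fusion adapter. All names and
# dimensions are illustrative assumptions, not StethoLM's code.
import torch
import torch.nn as nn

class AudioToLMAdapter(nn.Module):
    """Projects audio-encoder features into the LM's token-embedding space."""

    def __init__(self, audio_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from an audio encoder
        return self.proj(audio_feats)  # (batch, time, lm_dim)

# The projected audio tokens are prepended to the instruction's text
# embeddings and decoded by the language-model backbone.
adapter = AudioToLMAdapter()
audio_feats = torch.randn(1, 50, 768)   # placeholder encoder output
text_embeds = torch.randn(1, 20, 4096)  # placeholder text embeddings
lm_inputs = torch.cat([adapter(audio_feats), text_embeds], dim=1)
```

In the multi-stage recipe the abstract mentions, such a model would first be supervised fine-tuned on the instruction-response pairs and then aligned with direct preference optimization on preference-ranked responses.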
Overview
StethoBench Task Coverage
StethoBench covers seven distinct clinical task categories, spanning the full range of auscultation analysis, from simple binary decisions to complex multi-step clinical reasoning. The 77,027 instruction–response pairs are synthesized from 16,125 labeled recordings across cardiac and pulmonary domains. The seven categories are listed below, with an illustrative example pair sketched after the list.
- Classification: binary detection of normal vs. abnormal sounds
- Detection: identify specific acoustic events (e.g., wheeze, crackle, murmur)
- Reporting: generate a structured clinical description of findings
- Reasoning: explain the clinical significance of acoustic observations
- Differential diagnosis: rank candidate diagnoses by likelihood from audio evidence
- Comparison: compare findings across multiple recordings or time points
- Location: identify the anatomical site of auscultation findings
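To make the task formats concrete, the sketch below shows what an instruction-response pair for the detection category might look like; the field names, file path, and wording are illustrative assumptions, not StethoBench's actual schema.

```python
# Hypothetical StethoBench-style instruction-response pair for the
# detection task. Field names and content are illustrative only.
detection_example = {
    "audio": "recordings/lung_0153.wav",  # hypothetical file path
    "task": "detection",
    "instruction": "Listen to this lung recording and identify any "
                   "adventitious sounds present.",
    "response": "Coarse crackles are audible during inspiration. "
                "No wheezes are detected.",
}
```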
Results
StethoLM achieves 71.8% BERTScore F1 and 47.8% clinical accuracy overall on StethoBench, outperforming all baselines on every task. On four out-of-distribution datasets it achieves 64.8% BERTScore F1 and 25.2% clinical accuracy, ranking first on three of the four.
In-domain results on StethoBench, by task:

| Task | LLM | Pengi | LTU | GAMA | Gemma3N | Gemini-2.5-Flash | Qwen2.5-Omni | StethoLM |
|---|---|---|---|---|---|---|---|---|
| Classification | 48.6 / 4.9 | 33.6 / 7.6 | 49.0 / 2.5 | 46.4 / 7.5 | 51.3 / 31.3 | 49.3 / 49.5 | 58.9 / 31.0 | 75.5 / 66.4 |
| Detection | 43.0 / 2.8 | 30.8 / 4.3 | 43.1 / 3.6 | 44.0 / 4.2 | 47.1 / 11.5 | 45.9 / 19.2 | 52.5 / 22.0 | 70.4 / 47.9 |
| Reporting | 41.1 / 2.2 | 30.7 / 4.1 | 45.1 / 2.0 | 44.7 / 2.0 | 47.5 / 9.2 | 49.8 / 13.2 | 56.7 / 12.7 | 72.8 / 36.2 |
| Reasoning | 41.8 / 0.0 | 27.6 / 1.0 | 43.7 / 4.0 | 46.5 / 5.3 | 45.2 / 9.0 | 47.6 / 22.2 | 55.6 / 29.0 | 71.4 / 44.1 |
| Diff. Diagnosis | 42.3 / 3.3 | 26.6 / 4.1 | 44.9 / 4.0 | 44.6 / 4.0 | 44.5 / 3.7 | 43.6 / 13.2 | 60.2 / 19.8 | 67.7 / 30.6 |
| Comparison | 45.1 / 2.3 | 28.4 / 1.7 | 47.2 / 2.1 | 45.2 / 1.9 | 47.7 / 12.7 | 46.9 / 15.4 | 57.8 / 22.7 | 70.7 / 40.5 |
| Location | 44.4 / 4.3 | 26.4 / 1.1 | 46.4 / 4.4 | 46.0 / 2.2 | 49.0 / 11.0 | 45.6 / 14.3 | 54.6 / 11.3 | 72.0 / 36.8 |
| Overall | 43.8 / 2.8 | 29.2 / 3.4 | 45.6 / 3.2 | 45.3 / 3.9 | 47.5 / 12.6 | 47.0 / 21.0 | 56.5 / 21.2 | 71.8 / 47.8 |
Each cell reports BertS / Acc, where BertS = BERTScore F1 (%) and Acc = clinical accuracy (%).
Out-of-distribution results, by dataset:

| Dataset | LLM | Pengi | LTU | GAMA | Gemma3N | Gemini-2.5-Flash | Qwen2.5-Omni | StethoLM |
|---|---|---|---|---|---|---|---|---|
| TR | 44.0 / 0.5 | 27.7 / 1.6 | 44.2 / 0.6 | 44.0 / 0.3 | 48.4 / 5.6 | 44.8 / 7.5 | 60.7 / 17.7 | 66.2 / 25.7 |
| CinC | 39.5 / 1.1 | 29.4 / 1.3 | 42.3 / 0.4 | 42.6 / 0.7 | 44.5 / 5.2 | 45.9 / 21.5 | 54.0 / 12.4 | 63.3 / 22.2 |
| BMD-HS | 45.0 / 2.5 | 27.3 / 1.4 | 47.2 / 1.2 | 46.7 / 4.4 | 48.2 / 11.7 | 40.7 / 20.9 | 58.6 / 17.3 | 67.3 / 30.4 |
| FluSense | 45.8 / 0.3 | 30.6 / 15.6 | 45.8 / 9.4 | 47.3 / 22.3 | 50.7 / 9.4 | 52.3 / 14.3 | 59.4 / 37.1 | 61.5 / 23.2 |
| Overall | 43.6 / 1.1 | 28.8 / 5.0 | 44.9 / 2.9 | 45.2 / 6.9 | 48.0 / 8.0 | 45.9 / 16.1 | 58.2 / 21.1 | 64.8 / 25.2 |
Each cell reports BertS / Acc, where BertS = BERTScore F1 (%) and Acc = clinical accuracy (%).
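The BertS metric in both tables is standard BERTScore F1. As a minimal illustration of how such a score is computed, the snippet below uses the open-source bert-score package with its defaults; the paper's exact scoring configuration is not specified here and may differ.

```python
# BERTScore F1 between a model response and a reference answer,
# using the bert-score package (pip install bert-score). Defaults
# are shown; the paper's evaluation settings may differ.
from bert_score import score

candidates = ["Coarse crackles are audible during inspiration."]
references = ["Inspiratory coarse crackles are present."]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item() * 100:.1f}%")
```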
Citation
```bibtex
@article{wang2025stetholm,
  title   = {StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks},
  author  = {Wang, Yishan and Wang, Tsai-Ning and Funk, Mathias and Saeed, Aaqib},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://openreview.net/forum?id=i9RuUH9Jyj}
}
```