Auscultation, particularly of heart sounds, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, WavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method from [1] (which relies on acoustic features) and show that, overall, representation vectors from pre-trained FMs offer performance comparable to the baseline. Notably, HR estimation using the representations from the audio encoder of the in-house CLAP model outperforms the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.
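To make the layer-wise probing setup concrete, below is a minimal sketch of how frozen FM representations can be extracted per layer and probed for HR estimation. It assumes the public facebook/hubert-base-ls960 checkpoint, 16 kHz PCG clips, and a simple ridge regressor as a stand-in for the paper's HR estimation model; the variables `pcg_waveforms` and `hr_labels` are hypothetical placeholders, not part of the released pipeline.

```python
# Minimal sketch: layer-wise probing of a frozen pre-trained FM (HuBERT) for HR estimation.
# Assumptions: PCG clips resampled to 16 kHz; `pcg_waveforms` (list of 1-D numpy arrays)
# and `hr_labels` (beats per minute) are hypothetical placeholders; a ridge regressor
# stands in for the paper's HR estimation model.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def layerwise_embeddings(waveform, sr=16000):
    """Return one mean-pooled embedding per hidden layer of the frozen FM."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (1, frames, dim) tensors, one per layer
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

# Stack pooled representations into a (num_layers, num_clips, dim) array.
per_clip = [layerwise_embeddings(w) for w in pcg_waveforms]   # hypothetical data
feats = np.stack([np.stack(layers) for layers in per_clip], axis=1)
y = np.asarray(hr_labels)

# Probe each layer with a simple regressor and report MAE in BPM.
for layer_idx in range(feats.shape[0]):
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats[layer_idx], y, test_size=0.2, random_state=0)
    reg = Ridge(alpha=1.0).fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, reg.predict(X_te))
    print(f"layer {layer_idx:2d}: MAE = {mae:.2f} BPM")
```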
- † University of North Carolina at Chapel Hill
- § Johns Hopkins University
- ‡ Work done while at Apple