Head-related transfer functions (HRTFs) are fundamental to spatial audio via binaural rendering. Personalized HRTFs have been shown to improve localization accuracy and reduce perceptual artifacts and directional ambiguities. However, acquiring such HRTFs is time-consuming and requires costly measurement setups. To address this limitation, this article investigates the use of deep learning models to estimate personalized HRTFs from ear shape representations. We propose and evaluate three architectures with various types of input data and identify the minimum achievable spectral distance error when predicting true HRTF magnitude spectra. The best model we evaluated achieves a test log-spectral distortion (LSD) of 4.93 dB. We also establish a performance ranking based on input data types and architectural choices.
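For readers unfamiliar with the metric, the following is a minimal sketch of one common definition of log-spectral distortion: the root-mean-square difference, in dB, between a true and a predicted magnitude spectrum. The abstract does not state the exact formula or how values are averaged over directions and subjects, so the function name `log_spectral_distortion`, the `eps` guard, and the single-spectrum scope are assumptions made for illustration only.

```python
import math


def log_spectral_distortion(true_mag, pred_mag, eps=1e-12):
    """RMS of the per-bin dB difference between two magnitude spectra.

    Assumed definition (not taken from the paper):
        LSD = sqrt( mean_k ( 20 * log10(|H_k| / |H_hat_k|) )^2 )
    `eps` is a hypothetical guard against division by zero and log(0).
    """
    if len(true_mag) != len(pred_mag):
        raise ValueError("spectra must have the same number of frequency bins")
    if not true_mag:
        raise ValueError("spectra must not be empty")

    squared_sum = 0.0
    for h, h_hat in zip(true_mag, pred_mag):
        diff_db = 20.0 * math.log10((h + eps) / (h_hat + eps))
        squared_sum += diff_db ** 2
    return math.sqrt(squared_sum / len(true_mag))


# Usage: a prediction that is uniformly twice the true magnitude is off by
# about 6.02 dB in every bin, so the LSD is also about 6.02 dB.
true_spectrum = [1.0, 0.5, 0.25, 0.125]
doubled = [2.0 * v for v in true_spectrum]
lsd_identical = log_spectral_distortion(true_spectrum, true_spectrum)
lsd_doubled = log_spectral_distortion(true_spectrum, doubled)
```

Under this definition, a reported LSD would typically be obtained by further averaging the per-spectrum values over source directions, ears, and test subjects.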
This paper describes an experiment investigating how the localisation performance of a neural network for Sound Source Localisation, named `SampleDOA\_SR', is affected by reducing the sample rate of the audio training data. Reducing the sample rate has several benefits, most notably a reduction in training time. The goal is to determine a sample rate that balances localisation accuracy against training time. This information will inform the future training of a neural network for Sound Source Localisation to be used in a stereo upmixing pipeline. The results indicate that reducing the sample rate from 48 kHz to below 4 kHz causes a significant decrease in localisation accuracy. Above 4 kHz, however, the decrease in localisation accuracy is minimal while training time is reduced significantly. This suggests that, provided the application does not require the highest level of accuracy, a minimal reduction in localisation performance may be an acceptable price for a large reduction in training time, which would also reduce the environmental impact of model training. A sample rate of 16 kHz is suggested as a suitable balance between accuracy and training time.
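As an illustration of what reducing the sample rate involves, the sketch below decimates a signal by an integer factor (for example, 48 kHz to 16 kHz with a factor of 3): it low-pass filters below the new Nyquist frequency and then keeps every third sample. The abstract does not say how the training audio was resampled, so the filter design (a Hamming-windowed sinc with 63 taps and a 10% guard band) and the function names `lowpass_taps` and `decimate` are assumptions, not the authors' method.

```python
import math


def lowpass_taps(cutoff, num_taps=63):
    """Hamming-windowed sinc low-pass filter.

    `cutoff` is a fraction of the input sample rate (0 < cutoff < 0.5).
    `num_taps` should be odd so the filter has a single centre tap.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2.0 * cutoff
        else:
            sinc = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def decimate(signal, factor, num_taps=63):
    """Anti-alias filter, then keep every `factor`-th sample."""
    # Cut off slightly below the new Nyquist frequency (0.5 / factor).
    # The 10% guard band is an arbitrary choice for this sketch.
    taps = lowpass_taps(0.9 * 0.5 / factor, num_taps)
    half = num_taps // 2
    out = []
    for centre in range(0, len(signal), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = centre + k - half
            if 0 <= idx < len(signal):  # treat samples outside the signal as zero
                acc += tap * signal[idx]
        out.append(acc)
    return out


# Usage: 10 ms of audio at 48 kHz becomes 160 samples at 16 kHz.
constant = [1.0] * 480
tone_12k = [math.sin(2.0 * math.pi * 12000 * n / 48000) for n in range(480)]
constant_16k = decimate(constant, 3)
tone_16k = decimate(tone_12k, 3)  # 12 kHz lies above the new 8 kHz Nyquist limit
```

The training-time saving comes from the shorter input: each second of audio carries one third as many samples at 16 kHz as at 48 kHz, at the cost of discarding content above 8 kHz.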
Binaural hearing supports effective communication in complex acoustic environments by enabling listeners to segregate spatially separated sound sources, a benefit referred to as spatial release from masking (SRM). The spatial cues that give rise to SRM are determined by the head-related transfer function (HRTF). Although individual HRTFs are generally considered optimal for accurate localisation, prior work suggests they do not necessarily maximise performance across all aspects of spatial perception, including SRM. This motivates the concept of application-specific HRTFs. Here, we propose an application-specific HRTF augmentation method to improve speech intelligibility in cocktail-party scenarios, focusing on front–back configurations where SRM is limited. HRTFs are parameterised using principal component analysis and optimised via a differentiable auditory-model-based objective to enhance spectral cues while constraining interaural level differences. The method yields model-predicted SRM gains of 4–9 dB without inducing substantial predicted lateralisation artefacts.
This study is motivated by the ambition to determine the ‘best’-matching HRTFs during an onboarding task for an audio-only virtual reality (VR) experience, using a ‘shooting down sound sources’ task. It responds in particular to the needs of blind and visually impaired gamers, who may rely more heavily on accurate rendering of auditory spatial cues to succeed in an audio-only VR experience. We present an exploratory study using an experimental VR test platform that renders ‘target’ sound sources in a virtual environment and logs tracking characteristics of the head, hand-held controller and body while participants localise and ‘shoot’ audible ‘targets’ that are visible (for task familiarisation) or invisible. Four game-relevant sound stimuli and three different HRTFs were tested across eight sessions on two separate days. We show data collected from fifteen sighted participants, which demonstrate an ability to localise the sound sources accurately. The tracking data suggest various search patterns (e.g. hemisphere swaps and direction reversals) associated with ‘weak’ localisation cues and possible ambiguities. These search patterns are likely all quantifiable via angular error, response time, path length, search directions, number of reversals, and search speed, as determined from the tracking characteristics.
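Two of the metrics listed above, angular error and number of reversals, can be sketched as follows. This is an illustrative sketch only: the abstract does not define how the metrics are computed, so the coordinate convention (azimuth and elevation in degrees), the function names `angular_error_deg` and `count_reversals`, and the `min_step` jitter threshold are all assumptions.

```python
import math


def angular_error_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions.

    Directions are given as (azimuth, elevation) in degrees; this
    convention is assumed, not taken from the study.
    """
    a1, e1, a2, e2 = (math.radians(v) for v in (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def count_reversals(yaw_deg, min_step=1.0):
    """Count changes of rotation direction in a sequence of yaw samples.

    Steps smaller than `min_step` degrees are ignored as tracking jitter
    (a hypothetical threshold chosen for this sketch).
    """
    reversals = 0
    last_sign = 0
    for prev, cur in zip(yaw_deg, yaw_deg[1:]):
        # Wrap the step into [-180, 180) so crossing 0/360 is not a jump.
        step = (cur - prev + 180.0) % 360.0 - 180.0
        if abs(step) < min_step:
            continue
        sign = 1 if step > 0 else -1
        if last_sign != 0 and sign != last_sign:
            reversals += 1
        last_sign = sign
    return reversals


# Usage: pointing straight ahead at a target located 90 degrees to the side,
# and a head-yaw trace that changes rotation direction twice.
error = angular_error_deg(0.0, 0.0, 90.0, 0.0)
reversal_count = count_reversals([0.0, 10.0, 20.0, 15.0, 5.0, 10.0])
```

Path length and search speed could be derived in the same way, by summing the wrapped angular steps and dividing by the elapsed time between tracking samples.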