In the September 1975 issue of the JAES, Michael Gerzon summarized the benefits of emerging Ambisonic audio technology and its potential for the spatial capture of concert hall impulse responses in his short paper “Recording Concert Halls for Posterity”. Gerzon was reflecting upon the challenge that Richard Heyser wrote about in 1974, of defining and encoding into our recordings “a Rosetta Stone” signal that would allow us to unlock the data embedded with the audio and so undo “the spectral, spatial and dynamics limitations of a recording of a great artist” (Heyser, JAES, May 1974). Angelo Farina, in collaboration with Waves in 2003, also presented a paper titled “Recording Concert Halls for Posterity”. This paper was named in honour of Gerzon and reported on the development of a high-quality library of spatial room impulse responses from famous concert halls and opera houses around the world, just as Gerzon and Heyser had envisaged. Arguably this paper reintroduced Ambisonic technology and spherical harmonics as a compact, efficient and flexible means of encoding and decoding an acoustic environment for use in modern music production using emerging real-time convolution audio effects plugins. However, it was the arrival of the Oculus Rift Virtual Reality headset in 2013 that really started to bring the interactive benefits of Ambisonics to a much wider audience through new audio workflows and game engine development platforms. This interactive game engine technology has in turn unlocked challenges and brought new opportunities in how creatives envisage and build future immersive extended reality (XR) experiences, and these foundational Ambisonic tools have become a fundamental part of our audio programming pipelines and sound design workflows.
This XR Futures presentation will reflect upon these and related developments in immersive audio for virtual/augmented reality and immersive games, and how research at the University of York’s AudioLab has taken a parallel path into future extended reality (XR) experience design through the XR Stories and CoSTAR Live Lab projects. What role does our “Rosetta Stone for audio” now have in unlocking the potential of future extended reality experiences?
Emerging wearable devices such as smartglasses and extended reality headsets demand high-quality spatial audio capture from compact, head-worn microphone arrays. Ambisonics provides a device-agnostic spatial audio representation by mapping array signals to spherical harmonic (SH) coefficients. In practice, however, accurate encoding remains challenging. While traditional linear encoders are signal-independent and robust, they amplify low-frequency noise and suffer from high-frequency spatial aliasing. On the other hand, neural network approaches can outperform linear encoders but they often assume idealized microphones and may perform inconsistently in real-world scenarios. To leverage their complementary strengths, we introduce a residual-learning framework that refines a linear encoder with corrections from a neural network. Using measured array transfer functions from smartglasses, we compare a UNet-based encoder from the literature with a new recurrent attention model. Our analysis reveals that both neural encoders only consistently outperform the linear baseline when integrated within the residual learning framework. In the residual configuration, both neural models achieve consistent and significant improvements across all tested metrics for in-domain data and moderate gains for out-of-domain data. Yet, coherence analysis indicates that all neural encoder configurations continue to struggle with directionally accurate high-frequency encoding.
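The residual configuration described above can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Tikhonov-regularised least-squares design of the linear encoder, the function names `linear_encoder` and `residual_encode`, the array shapes, the random placeholder data, and the `correction` callable standing in for the neural network are all hypothetical.

```python
import numpy as np

def linear_encoder(A, Y, reg=1e-3):
    """Regularised least-squares encoder (an assumed, common baseline design).

    A: (M, Q) array transfer functions, M microphones x Q directions.
    Y: (K, Q) spherical harmonics sampled at the same Q directions.
    Returns E with shape (K, M) such that E @ A approximates Y.
    """
    M = A.shape[0]
    gram = A @ A.conj().T + reg * np.eye(M)   # Hermitian, positive definite
    rhs = (Y @ A.conj().T).conj().T           # (M, K)
    # E = Y A^H (A A^H + reg I)^-1, obtained via a solve on the Hermitian transpose.
    return np.linalg.solve(gram.conj().T, rhs).conj().T

def residual_encode(x, E, correction):
    """Linear SH estimate plus a learned correction term."""
    base = E @ x
    return base + correction(x, base)

# Placeholder data standing in for measured transfer functions (assumption).
rng = np.random.default_rng(0)
M, Q, K = 5, 4, 4
A = rng.standard_normal((M, Q)) + 1j * rng.standard_normal((M, Q))
Y = rng.standard_normal((K, Q))
E = linear_encoder(A, Y, reg=1e-6)

# A correction that returns zeros reduces the model to the linear baseline.
zero_correction = lambda x, base: np.zeros_like(base)
x = A[:, 0]  # microphone signals for a unit-amplitude source in direction 0
sh = residual_encode(x, E, zero_correction)
```

With a zero correction the output equals the linear baseline, so in this kind of design the network only has to learn the difference between the linear estimate and the target coefficients.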
In this work, the authors evaluate a higher-order Ambisonic (HOA) renderer that compensates for reverberant characteristics of the intended listening room; this is accomplished by decoding a HOA signal to control points distributed around a boundary surrounding the listening area, then convolving the control signal with a compensation filter derived via matrix inversion of room impulse responses (RIR) from loudspeakers to control points in the frequency domain. First, a comparison is performed over renderers utilizing increasing control point density and evaluated using simulated RIRs. Then, robustness of the renderer to simulation inaccuracy is evaluated experimentally in a listening room. Metrics of reconstructed soundfield directionality and reverberation are compared to those obtained from a conventional HOA decoder, and results demonstrate an increase in source directivity and a reduction in reverberation time for both directional and diffuse stimuli.
The tools for building social capital in any career are based on networking, mentorship, and role models. For immersive audio, underrepresented groups are upskilling, teaching others, and innovating in order to pursue their ambitions. Dr. Leslie Gaston-Bird talks about her initiative "Immersive and Inclusive Audio", which has been running for over five years, and how the Pro Tools | Dolby Atmos Certification plays a role in the efforts of women and minorities to "leak up, not out" of the immersive audio career pipeline.
Higher-Order Ambisonics (HOA) reproduction with conventional mode-matching decoders can exhibit the so-called “ring of silence,” characterised by sound level reduction in specific spatial or spectral regions. This effect arises in loudspeaker reproduction when the number of loudspeakers exceeds that required by the Ambisonic order, and in binaural rendering when head-related transfer functions (HRTFs) are sampled at a higher spatial resolution than supported by the input signal. This paper investigates the extent to which advanced Ambisonic decoding strategies can mitigate this artefact. In particular, decoders based on Lasso regularisation and magnitude least-squares (magLS) are evaluated through numerical simulations in both loudspeaker and binaural reproduction scenarios. The results show that both approaches significantly reduce the prominence of the ring of silence compared to conventional minimum-norm mode-matching decoders. In loudspeaker reproduction, a more uniform spatial distribution of SPL is obtained, while in binaural rendering, spectral consistency is improved. An interpretation of these results is proposed, linking the observed behaviour to the underlying optimisation criteria of the decoding process. The results indicate that the ring of silence is not an inherent limitation of Ambisonics, but rather a consequence of the decoding strategy, and can be effectively mitigated through appropriate decoder design.
Immersive audio production for XR, virtual environments, and live performance is increasingly defined by a diversity of spatial formats and rendering systems, including Ambisonics and object-based approaches. While these enable complex spatial experiences, they also result in fragmented workflows and limited interoperability across production and playback contexts. This workshop explores spatial audio as a flexible and transferable practice rather than a system-bound process. It introduces Grapes 3D Audio Control as a system-independent control approach that enables users to work with spatial audio across different environments without being tied to a specific rendering pipeline. Participants will engage in hands-on exercises to create, control, and adapt spatial audio scenes across multiple contexts, including XR applications, media installations, and live setups. The focus lies on maintaining spatial intent while working across heterogeneous systems and technical conditions. The workshop combines practical exploration with short demonstrations and structured discussion. It explicitly creates space for exchange on different workflows and production strategies, bringing together perspectives from sound design, audio engineering, live operation, and XR development. By focusing on interoperability, workflow design, and real-world application, the workshop aims to provide participants with practical strategies for working with spatial audio across systems, while contributing to a broader discussion on how immersive audio production can become more flexible, portable, and sustainable.
The SONICOM Ecosystem is a repository dedicated to spatial hearing and binaural audio. It provides means to store data as databases and tools (including their metadata), to create relations between them, and to enable specific data visualization tailored to the needs of the auditory community. It also enables persistent publications via digital object identifiers (DOIs) and supports the authors along their typical process of publishing scientific articles. In this workshop, we will guide the participants through the key features of the SONICOM Ecosystem and show how the Ecosystem can support researchers during their publication workflow.
Binaural signal synthesis is typically formulated as forward modelling using head-related transfer functions (HRTFs). We explore an inverse auditory modelling perspective in which binaural ear signals are estimated directly from a source signal and its azimuth. We present a lightweight complex-valued neural network that predicts frequency-domain binaural filters from the input source spectrum and azimuthal direction, which are then applied to synthesize binaural signals. Controlled experiments evaluate how excitation bandwidth and angular sampling density affect reconstruction and generalization. Results show accurate spectral reconstruction and interpolation to unseen source directions even when training uses sparse angular grids, while bandwidth strongly influences problem conditioning and error behaviour. This work focuses on characterizing compact signal-conditioned inverse models as efficient components for binaural signal generation.
Single-sided deafness (SSD) reduces access to binaural cues and can make spatial-audio localization difficult in virtual reality (VR). This study investigated short-term localization training under simulated SSD in a VR task using generic, non-individualized head-related transfer function (HRTF) rendering with head-movement-contingent auditory updating, and examined whether an enhanced HRTF could improve performance by emphasizing monaurally available spectral cues at the better-hearing ear. The rationale was that, although directional judgment in normal binaural listening depends strongly on interaural differences, monaural listening must rely more heavily on direction-dependent spectral characteristics that remain available at the better-hearing ear. Twenty normal-hearing participants performed a 13-source horizontal-plane localization task using a VR headset and headphones under simulated SSD. Participants were assigned to either normal-HRTF training or enhanced-HRTF training (n = 10 each). The experiment comprised pre-test, three training sessions, and post-test, and all participants were tested with both normal and enhanced HRTFs, yielding four train-test combinations. Performance was evaluated using accuracy (ACC), mean absolute error (MAE), and response time (RT). Localization performance improved with training under the present VR simulated-SSD condition. ACC increased and MAE decreased from pre-test to post-test, whereas RT showed no clear change. No significant overall between-group difference in cumulative improvement was observed. However, during training, the enhanced-HRTF group showed a significant first-session advantage, and matched train-test combinations showed descriptively larger gains than mismatched combinations. 
These results suggest that short-term VR localization training can improve directional judgment under simulated SSD and that enhancing monaural spectral cues may provide an early benefit by making direction-specific patterns easier to associate with source direction. The findings are limited to localization performance in the present VR task under simulated SSD and should not be directly generalized to clinical SSD populations, real-world auditory rehabilitation, or broader everyday 3D spatial-audio experience.
Head-related transfer functions (HRTFs) are fundamental to spatial audio via binaural rendering. Personalized HRTFs have been shown to improve localization accuracy and reduce perceptual artifacts and directional ambiguities. However, acquiring such HRTFs is time-consuming and requires costly measurement setups. To address this limitation, this article investigates the use of deep learning models to estimate personalized HRTFs from ear shape representations. We propose and evaluate three different architectures with various types of input data and identify the minimum achievable spectral distance error when predicting true HRTF magnitude spectra. The best model we evaluated achieves a test Log Spectral Distortion (LSD) of 4.93 dB. We also established a performance ranking based on input data types and architectural choices.
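For readers unfamiliar with the metric, a commonly used form of Log Spectral Distortion is the root-mean-square difference, in dB, between two magnitude spectra. The sketch below assumes that form for a single direction and a single ear; the abstract does not state how the reported value is averaged over directions, ears, or subjects, and the function name and the `eps` guard are illustrative.

```python
import math

def log_spectral_distortion(h_true, h_pred, eps=1e-12):
    """RMS difference in dB between two magnitude spectra of equal length."""
    if len(h_true) != len(h_pred):
        raise ValueError("spectra must have the same number of frequency bins")
    squared_db = [
        (20.0 * math.log10((abs(t) + eps) / (abs(p) + eps))) ** 2
        for t, p in zip(h_true, h_pred)
    ]
    return math.sqrt(sum(squared_db) / len(squared_db))

# Identical spectra give zero distortion.
lsd_same = log_spectral_distortion([1.0, 0.5, 0.25], [1.0, 0.5, 0.25])
# A uniform tenfold magnitude error is a 20 dB offset in every bin.
lsd_offset = log_spectral_distortion([1.0, 1.0], [10.0, 10.0])
```

Because the error is squared per bin before averaging, a few badly predicted notches can dominate the score even when most of the spectrum is matched closely.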
This paper describes an experiment to investigate how the localisation performance of a neural network for Sound Source Localisation named ‘SampleDOA_SR’ would be affected by reducing the sample rate of the audio training data. Reducing the sample rate has several benefits, most notably a reduction in training time. The goal is to determine an appropriate sample rate which balances both localisation accuracy and training time. This information will be used to inform the future training of a neural network for Sound Source Localisation which will be used in a stereo upmixing pipeline. The results of this experiment indicate that reducing the sample rate from 48 kHz down to below 4 kHz results in a significant decrease in localisation accuracy. However, above 4 kHz, the decrease in localisation accuracy is minimal whilst training time is reduced significantly. This suggests that, provided the particular application for the model does not require the highest level of accuracy, a minimal reduction in localisation performance may be acceptable to obtain a large reduction in training time, which would also reduce the environmental impact of the model training. A sample rate of 16 kHz is suggested as a suitable balance between accuracy and training time.
Binaural hearing supports effective communication in complex acoustic environments by enabling listeners to segregate spatially separated sound sources, a benefit referred to as spatial release from masking (SRM). The spatial cues that give rise to SRM are determined by the head-related transfer function (HRTF). Although individual HRTFs are generally considered optimal for accurate localisation, prior work suggests they do not necessarily maximise performance across all aspects of spatial perception, including SRM. This motivates the concept of application-specific HRTFs. Here, we propose an application-specific HRTF augmentation method to improve speech intelligibility in cocktail-party scenarios, focusing on front–back configurations where SRM is limited. HRTFs are parameterised using principal component analysis and optimised via a differentiable auditory-model-based objective to enhance spectral cues while constraining interaural level differences. The method yields model-predicted SRM gains of 4–9 dB without inducing substantial predicted lateralisation artefacts.
This study is motivated by an ambition to determine the ‘best’-matching HRTFs during an onboarding task for an audio-only virtual reality (VR) experience using a ‘shooting down sound sources’ task. It is further motivated by the needs of blind and visually impaired gamers, who may rely more crucially on accurate rendering of auditory spatial cues for succeeding in the audio-only VR experience. We present an exploratory study applying an experimental VR test platform that renders ‘target’ sound sources in a virtual environment and logs tracking characteristics of head, hand-held controller and body while participants localise and ‘shoot’ audible ‘targets’ that are visible (for task familiarisation) and invisible. Four game-relevant sound stimuli and three different HRTFs were tested across eight sessions on two separate days. In this study, we show data collected from fifteen seeing participants, which demonstrate an ability to localise the sound sources accurately. The tracking data suggest various search patterns (e.g. hemisphere swaps and direction reversals) associated with ‘weak’ localisation cues and possible ambiguities. The search patterns are likely all quantifiable via angular error, response time, path length, search directions, number of reversals, and search speed as determined from the tracking characteristics.
This study introduces needlets, a specific class of spherical wavelets, for spatial audio applications. Needlets are constructed in the spherical harmonic domain, are mathematically well defined, possess good localisation properties, and facilitate multiresolution analysis. However, because they form a tight frame, they are redundant and therefore require sparsification for practical applications. We propose a comprehensive spatial audio framework based on needlets, spanning encoding through to head-tracking-enabled binaural rendering. In this framework, a sound scene is encoded into a redundant needlet dictionary, which is subsequently sparsified using a novel algorithm. The resulting sparse representation is then decoded for headphone reproduction. Scene rotation is achieved by applying SO(3) rotation matrices to the sparse representation. The perceptual implications of the framework’s design parameters were evaluated using objective metrics and compared with those of Ambisonics. Initial results show that the proposed framework can achieve better tonal and spatial fidelity than third- and fourth-order Ambisonics Magnitude Least-Squares decoding while using a similar number of channels. Moreover, the proposed framework has been shown to allow users to tune the reproduced sound scene while maintaining fidelity.
Source directivity constitutes a fundamental acoustic property of musical instruments, describing the variation of radiated sound pressure as a function of direction. This behavior is dependent on the geometry, material properties, and excitation mechanisms of the instrument, and plays an important role in spatial sound perception. In the real world, the directional characteristics of a source contribute significantly to how sound is localized, how timbre is perceived across different listening positions, how sound is captured with different microphone techniques and placements, and how sound interacts with the surrounding environment. Yet, despite its importance, source directivity is often simplified or neglected in contemporary spatial audio rendering approaches, particularly within AR/VR/XR applications where computational constraints and system complexity frequently dictate design choices. Directivity is thus a defining acoustic signature of each instrument. Acoustic directivity measurements are based on demanding and carefully controlled procedures. Typically, they are conducted in anechoic or low-reverberation environments using dense microphone arrays, and rely on controlled excitation mechanisms to improve measurement accuracy and repeatability. It should be acknowledged, however, that there exists a gap between acoustic research and its practical integration into immersive media technologies. Many current XR applications rely on simplified or generic source models, prioritizing computational efficiency and ease of implementation over acoustic accuracy. While there is a clear benefit in the use of simplified directivity approaches, such practices reduce the perceptual realism and fidelity of the reproduced sound field. This raises critical questions: To what extent does accurate directivity contribute to perceptual realism?
Are approximations sufficient, and under what conditions do they compromise the experience? This workshop addresses these questions by exploring both the scientific foundations and practical implications of incorporating source directivity into AR/VR/XR systems. It is structured in three parts, offering theoretical information and practical perspectives on the role of sound source directivity in immersive audio applications. The first part discusses source directivity and its importance in sound emission, perception, and spatial realism. Emphasis will be placed on recent research involving the capture and analysis of directivity patterns of the human singing voice across different music genres and of traditional Greek musical instruments. Two directivity databases dedicated to this research, which are publicly available through the SONICOM Ecosystem repository (https://ecosystem.sonicom.eu/), will also be presented, along with an overview of their structure, content, and potential applications. The second part focuses on the integration of directivity data into spatial audio rendering pipelines for AR/VR/XR environments. Participants will be introduced to the latest updates of the SOFA (Spatially Oriented Format for Acoustics) conventions specifically created for storing and exchanging directivity information. In addition, the Binaural Rendering Toolbox (BRT), developed within the SONICOM project, will be presented as a practical tool that facilitates the implementation of directivity-aware rendering workflows. The third part is a critical discussion of the practical implications of using accurate or approximated directivity data in immersive audio applications. Drawing on results from selected case studies, the session will evaluate the perceptual and computational trade-offs involved, offering guidance on when high-precision data is necessary and when simplified models may suffice in AR/VR/XR applications.
Pressure-matching (PM) for personal sound zones (PSZ) can achieve high contrast at nominal control points, but the performance may degrade when transfer functions are mismatched. We introduce a neural method that maps transfer functions to loudspeaker weights using a single-frequency input network with parameters shared across frequencies. We evaluate the robustness under position shifts, additive transfer-function noise, and added reflections, and compare against PM with Tikhonov regularization. Results show improved robustness to structured perturbations such as listener displacement, whereas regularized PM remains more resilient to unstructured random transfer-function noise and reverberation. We further explain these results using a perturbation projection based on the singular value decomposition. Finally, we analyze different regularization mechanisms induced by the network and derive practical guidelines for neural PSZ filter optimization.
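The regularised baseline mentioned above can be illustrated with the standard closed-form solution for a single frequency, w = (GᴴG + λI)⁻¹Gᴴp. The sketch below is an assumption-laden illustration rather than the paper's setup: the function names, the zone layout (three bright and three dark control points, four loudspeakers), the random transfer functions, and the two regularisation values are all hypothetical.

```python
import numpy as np

def pressure_matching(G, p_target, reg):
    """Tikhonov-regularised least-squares loudspeaker weights for one frequency.

    G: (M, L) transfer functions from L loudspeakers to M control points.
    p_target: (M,) desired pressures, e.g. 1 in the bright zone, 0 in the dark zone.
    reg: regularisation weight lambda >= 0.
    """
    L = G.shape[1]
    lhs = G.conj().T @ G + reg * np.eye(L)
    rhs = G.conj().T @ p_target
    return np.linalg.solve(lhs, rhs)

def contrast_db(G, w, bright, dark):
    """Ratio of mean squared pressure in the bright zone to the dark zone, in dB."""
    p = G @ w
    e_bright = np.mean(np.abs(p[bright]) ** 2)
    e_dark = np.mean(np.abs(p[dark]) ** 2)
    return 10.0 * np.log10(e_bright / e_dark)

# Placeholder transfer functions (assumption): 6 control points, 4 loudspeakers.
rng = np.random.default_rng(1)
G = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))
p_target = np.array([1, 1, 1, 0, 0, 0], dtype=complex)
bright, dark = slice(0, 3), slice(3, 6)

w_weak = pressure_matching(G, p_target, reg=1e-6)
w_strong = pressure_matching(G, p_target, reg=1.0)
print(contrast_db(G, w_weak, bright, dark), contrast_db(G, w_strong, bright, dark))
```

Increasing `reg` shrinks the weight norm, which limits the array effort and makes the solution less sensitive to errors in `G`, typically at the cost of some contrast at the nominal control points.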
To enable dynamic control in transaural personal sound zone (PSZ) systems, accurate binaural room impulse responses (BRIRs) at various listener positions are needed. Since it is impractical to measure BRIRs at all possible positions, interpolation from a sparse set of measured positions can be used. Although numerous BRIR interpolation methods exist, their effectiveness in sound field control applications remains unclear. In this paper, we propose a sub-band interpolation method that combines linear interpolation for frequencies below 2000 Hz with sinusoidal representation networks for frequencies above 2000 Hz. The interpolated BRIRs are then applied in a PSZ control system. Simulation results demonstrate that this hybrid approach significantly improves system performance over a wider frequency range.
Special premiere projection of Vaulted Harmonies 360° for AVARIG attendees at the planetarium, Science and Industry Museum. Address: 30 avenue Corentin-Cariou, Paris 19e. AVARIG-provided ticket required for entry.