In the September 1975 issue of the JAES, Michael Gerzon summarized the benefits of emerging Ambisonic audio technology and its potential for the spatial capture of concert hall impulse responses in his short paper Recording Concert Halls for Posterity. Gerzon was reflecting upon the challenge that Richard Heyser wrote about in 1974, of defining and encoding into our recordings “a Rosetta Stone” signal that would allow us to unlock the data embedded with the audio and so undo “the spectral, spatial and dynamics limitations of a recording of a great artist” (Heyser, JAES, May 1974). Angelo Farina, in collaboration with Waves in 2003, also presented a paper titled Recording Concert Halls for Posterity. This paper was named in honour of Gerzon and reported on the development of a high-quality library of spatial room impulse responses from famous concert halls and opera houses around the world, just as Gerzon and Heyser had envisaged. Arguably this paper reintroduced Ambisonic technology and spherical harmonics as a compact, efficient and flexible means of encoding and decoding an acoustic environment for use in modern music production using emerging real-time convolution audio effects plugins. However, it was the arrival of the Oculus Rift Virtual Reality headset in 2013 that really started to bring the interactive benefits of Ambisonics to a much wider audience through new audio workflows and game engine development platforms. This interactive game engine technology has in turn raised new challenges and brought new opportunities in how creatives envisage and build future immersive extended reality (XR) experiences, and these foundational Ambisonic tools have become a fundamental part of our audio programming pipelines and sound design workflows.
This XR Futures presentation will reflect upon these and related developments in immersive audio for virtual/augmented reality and immersive games, and how research at the University of York’s AudioLab has taken a parallel path into future extended reality (XR) experience design through the XR Stories and CoSTAR Live Lab projects. What role does our “Rosetta Stone for audio” now have in unlocking the potential of future extended reality experiences?
Emerging wearable devices such as smartglasses and extended reality headsets demand high-quality spatial audio capture from compact, head-worn microphone arrays. Ambisonics provides a device-agnostic spatial audio representation by mapping array signals to spherical harmonic (SH) coefficients. In practice, however, accurate encoding remains challenging. While traditional linear encoders are signal-independent and robust, they amplify low-frequency noise and suffer from high-frequency spatial aliasing. On the other hand, neural network approaches can outperform linear encoders but they often assume idealized microphones and may perform inconsistently in real-world scenarios. To leverage their complementary strengths, we introduce a residual-learning framework that refines a linear encoder with corrections from a neural network. Using measured array transfer functions from smartglasses, we compare a UNet-based encoder from the literature with a new recurrent attention model. Our analysis reveals that both neural encoders only consistently outperform the linear baseline when integrated within the residual learning framework. In the residual configuration, both neural models achieve consistent and significant improvements across all tested metrics for in-domain data and moderate gains for out-of-domain data. Yet, coherence analysis indicates that all neural encoder configurations continue to struggle with directionally accurate high-frequency encoding.
In this work, the authors evaluate a higher-order Ambisonic (HOA) renderer that compensates for reverberant characteristics of the intended listening room; this is accomplished by decoding a HOA signal to control points distributed around a boundary surrounding the listening area, then convolving the control signal with a compensation filter derived via matrix inversion of room impulse responses (RIR) from loudspeakers to control points in the frequency domain. First, a comparison is performed over renderers utilizing increasing control point density and evaluated using simulated RIRs. Then, robustness of the renderer to simulation inaccuracy is evaluated experimentally in a listening room. Metrics of reconstructed soundfield directionality and reverberation are compared to those obtained from a conventional HOA decoder, and results demonstrate an increase in source directivity, and a reduction in reverberation time for both directional and diffuse stimuli.
Higher-Order Ambisonics (HOA) reproduction with conventional mode-matching decoders can exhibit the so-called “ring of silence,” characterised by sound level reduction in specific spatial or spectral regions. This effect arises in loudspeaker reproduction when the number of loudspeakers exceeds that required by the Ambisonic order, and in binaural rendering when head-related transfer functions (HRTFs) are sampled at a higher spatial resolution than supported by the input signal. This paper investigates the extent to which advanced Ambisonic decoding strategies can mitigate this artefact. In particular, decoders based on Lasso regularisation and magnitude least-squares (magLS) are evaluated through numerical simulations in both loudspeaker and binaural reproduction scenarios. The results show that both approaches significantly reduce the prominence of the ring of silence compared to conventional minimum-norm mode-matching decoders. In loudspeaker reproduction, a more uniform spatial distribution of SPL is obtained, while in binaural rendering, spectral consistency is improved. An interpretation of these results is proposed, linking the observed behaviour to the underlying optimisation criteria of the decoding process. The results indicate that the ring of silence is not an inherent limitation of Ambisonics, but rather a consequence of the decoding strategy, and can be effectively mitigated through appropriate decoder design.
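For orientation, the conventional and Lasso decoders compared above can be written in one common textbook form. This is a sketch under stated assumptions, not necessarily the formulation used in the paper: $\mathbf{Y}$ denotes the $(N+1)^2 \times L$ matrix of spherical harmonics sampled at the $L$ loudspeaker (or HRTF) directions and is assumed to have full row rank, $\mathbf{b}$ is the order-$N$ Ambisonic signal vector, $\mathbf{g}$ the loudspeaker gains, and $\lambda$ a hypothetical regularisation weight. The magLS criterion is not shown.

```latex
% Mode matching: choose gains g so that the reproduced SH coefficients equal b.
% With more loudspeakers than coefficients the system is underdetermined.
\mathbf{Y}\,\mathbf{g} = \mathbf{b}, \qquad L > (N+1)^2

% Conventional minimum-norm (least-energy) solution
\mathbf{g}_{\mathrm{MN}} = \mathbf{Y}^{\mathsf{H}} \left( \mathbf{Y}\mathbf{Y}^{\mathsf{H}} \right)^{-1} \mathbf{b}

% Lasso-regularised alternative: the l1 penalty favours sparse gain vectors
\mathbf{g}_{\mathrm{Lasso}} = \arg\min_{\mathbf{g}} \; \tfrac{1}{2} \left\| \mathbf{Y}\mathbf{g} - \mathbf{b} \right\|_2^2 + \lambda \left\| \mathbf{g} \right\|_1
```

The two solutions satisfy (or approximate) the same mode-matching constraint but differ in which gain vector they prefer among the many candidates, which is the kind of difference in optimisation criterion the abstract points to.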
Immersive audio production for XR, virtual environments, and live performance is increasingly defined by a diversity of spatial formats and rendering systems, including Ambisonics and object-based approaches. While these enable complex spatial experiences, they also result in fragmented workflows and limited interoperability across production and playback contexts. This workshop explores spatial audio as a flexible and transferable practice rather than a system-bound process. It introduces Grapes 3D Audio Control as a system-independent control approach that enables users to work with spatial audio across different environments without being tied to a specific rendering pipeline. Participants will engage in hands-on exercises to create, control, and adapt spatial audio scenes across multiple contexts, including XR applications, media installations, and live setups. The focus lies on maintaining spatial intent while working across heterogeneous systems and technical conditions. The workshop combines practical exploration with short demonstrations and structured discussion. It explicitly creates space for exchange on different workflows and production strategies, bringing together perspectives from sound design, audio engineering, live operation, and XR development. By focusing on interoperability, workflow design, and real-world application, the workshop aims to provide participants with practical strategies for working with spatial audio across systems, while contributing to a broader discussion on how immersive audio production can become more flexible, portable, and sustainable.
This study introduces needlets, a specific class of spherical wavelets, for spatial audio applications. Needlets are constructed in the spherical harmonic domain, are mathematically well defined, possess good localisation properties, and facilitate multiresolution analysis. However, because they form a tight frame, they are redundant and therefore require sparsification for practical applications. We propose a comprehensive spatial audio framework based on needlets, spanning encoding through to head-tracking-enabled binaural rendering. In this framework, a sound scene is encoded into a redundant needlet dictionary, which is subsequently sparsified using a novel algorithm. The resulting sparse representation is then decoded for headphone reproduction. Scene rotation is achieved by applying SO(3) rotation matrices to the sparse representation. The perceptual implications of the framework’s design parameters were evaluated using objective metrics and compared with those of Ambisonics. Initial results show that the proposed framework can achieve better tonal and spatial fidelity than third- and fourth-order Ambisonics Magnitude Least-Squares decoding while using a similar number of channels. Moreover, the proposed framework has been shown to allow users to tune the reproduced sound scene while maintaining fidelity.
Pressure matching (PM) for personal sound zones (PSZ) can achieve high contrast at nominal control points, but the performance may degrade when transfer functions are mismatched. We introduce a neural method that maps transfer functions to loudspeaker weights using a single-frequency input network with parameters shared across frequencies. We evaluate the robustness under position shifts, additive transfer-function noise, and added reflections, and compare against PM with Tikhonov regularization. Results show improved robustness to structured perturbations such as listener displacement, whereas regularized PM remains more resilient to unstructured random transfer-function noise and reverberation. We further explain these results using a perturbation projection based on the singular value decomposition. Finally, we analyze different regularization mechanisms induced by the network and derive practical guidelines for neural PSZ filter optimization.
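The Tikhonov-regularized PM baseline has a standard closed form at each frequency. As a sketch, with symbols that are assumptions rather than the paper's notation: $\mathbf{G}$ is the $M \times L$ matrix of transfer functions from $L$ loudspeakers to $M$ control points, $\mathbf{d}$ the target pressures (for example, the desired signal in the bright zone and zero in the dark zone), $\mathbf{w}$ the loudspeaker weights, and $\lambda > 0$ the regularization weight.

```latex
\mathbf{w}_{\mathrm{PM}}
  = \arg\min_{\mathbf{w}} \; \left\| \mathbf{G}\mathbf{w} - \mathbf{d} \right\|_2^2 + \lambda \left\| \mathbf{w} \right\|_2^2
  = \left( \mathbf{G}^{\mathsf{H}}\mathbf{G} + \lambda \mathbf{I} \right)^{-1} \mathbf{G}^{\mathsf{H}} \mathbf{d}
```

A larger $\lambda$ limits the weight magnitudes, which generally reduces sensitivity to errors in $\mathbf{G}$ at the cost of contrast at the nominal control points.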
To enable dynamic control in transaural personal sound zone (PSZ) systems, accurate binaural room impulse responses (BRIRs) at various listener positions are needed. Since it is impractical to measure BRIRs at all possible positions, interpolation from a sparse set of measured positions can be used. Although numerous BRIR interpolation methods exist, their effectiveness in sound field control applications remains unclear. In this paper, we propose a sub-band interpolation method that combines linear interpolation for frequencies below 2000 Hz with sinusoidal representation networks for frequencies above 2000 Hz. The interpolated BRIRs are then applied in a PSZ control system. Simulation results demonstrate that this hybrid approach significantly improves system performance over a wider frequency range.
The Vaulted Harmonies project reconstructs the acoustic and musical heritage of Notre-Dame de Paris through immersive audio-visual experiences spanning multiple centuries of the cathedral's architectural evolution, developed as part of the Past Has Ears at Notre-Dame (PHEND) research project. Building upon the dome screening presented earlier at AVARIG 2026, this presentation focuses on the spatial audio production workflow underpinning the project and the perceptual trade-offs involved in adapting it across dissemination formats. Starting from room impulse responses derived from geometrical acoustic simulations and convolved with anechoic multichannel musical recordings using RoomZ, a dynamic room impulse response panner developed as part of the project, higher-order ambisonic renderings were generated through continuous interpolation along cinematically designed camera trajectories. The 360° dome version, premiered at the Planetarium of the Cité des Sciences et de l'Industrie, decoded these renderings to a 5.1+1 loudspeaker layout. This presentation complements that screening by examining the underlying HOA production chain and presenting selected scenes in third-order ambisonic reproduction, enabling direct perceptual comparison between the two output formats, though in a conventional frontal video projection context. Topics discussed include the design of spatially coherent auralisation trajectories, maintaining a coherent audio-visual narrative across successive historical reconstructions of the cathedral, and the trade-offs related to output format and deployment context, situated within a broader workflow designed to support wide public dissemination of immersive heritage experiences.
Numerous approaches have been taken to address the problem of generating navigable virtual models for multi-volume acoustic spaces. The general practice for creating empirically informed interactive models of multi-volume acoustic spaces, as embodied by the Spatially Oriented Format for Acoustics, is to discretely sample emitter-receiver pair positions. For a user to then navigate between these discrete positions involves cross-fading, blending, or otherwise perceptually interpolating between corresponding zones. This paper outlines a new approach which instead involves the continuous three-dimensional sampling of acoustic spaces, much as is done with 3D visual spaces in photogrammetry. To achieve this result, a first-of-its-kind consolidated ambisonic impulse response capturing apparatus has been designed and built. This apparatus combines a 3rd-order ambisonic microphone array with a 2nd-order ambisonic loudspeaker array and is designed to be moved through a space with maximal ease. AD/DA conversion, playback, and recording are all handled on a central compute platform. In parallel, a software workflow has been developed which can be implemented in Unreal Engine, as well as other game engines. To solve general issues of spatial audio in game engines, a custom encoding and decoding framework has been implemented. Then, to map the continuous ambisonic impulse response onto a virtual space, a spline mirroring the sampling path is drawn through the space. On the DSP side, an impulse response is extracted from any arbitrary point along the spline by way of the Common-slope Model for coupled spaces. Future work for better addressing early reflections and minimizing the theoretical intermediary of the Common-slope Model is discussed. Additionally, a special use case for visualizing acoustic energy in architectural acoustics is explored.
Despite significant advances in the development and adoption of spatial audio, many musicians do not embed the technology within their creative processes. Instead, spatial audio technologies are more often used to create immersive adaptations of fundamentally frontal compositions or performances. This paper presents and evaluates a means of spatial music making, referred to as the immersive drum circle. The system facilitates group performance and composition, in which participants stand in a circle and perform on electronic percussion pads, with sound spatialised so that the listener experiences the music as if positioned within the ensemble. The system’s design is presented alongside implementation details, as well as feedback from musicians obtained as part of an educational workshop which aimed to inform how spatial audio can be used creatively in music. In addition to interacting with the system, participants auditioned the resulting spatial music across three playback scenarios representing: gaming, tracked and non-tracked headphone-based music consumption, and a live concert environment. The results show that the immersive drum circle system is a viable tool for music creation and a practical means of inspiring future compositional techniques.
Channel-based mixing has long been the standard paradigm for audio professionals in both studio and live performance contexts, owing to its intuitive, signal-oriented workflow. While this approach excels in conventional stereo and multichannel formats, it offers limited native support for advanced spatial applications. As immersive audio formats become increasingly popular for virtual and augmented reality, new mechanisms are needed to allow engineers to work directly within spatial domains rather than adapting channel-based setups to fit immersive content. This paper presents an interactive software system for mixing First-Order Ambisonic (FOA) soundfields in real time. The system accepts A-format recordings from tetrahedral microphones, including the Sennheiser Ambeo and the Core Sound TetraMic. It converts them to B-format using a directional beamforming approach, in which the individual dimensions (left, right, front, back, top, bottom) are independently crossfaded between sources. Each directional beam is extracted via a dot product with a corresponding spherical harmonic steering vector, crossfaded between the two input soundfields, and re-encoded into a new B-format signal using an outer product reconstruction. The program architecture is designed to accommodate future extensions to additional Ambisonic microphone formats, including the Røde SoundField, the Zylia ZM-1, and the Eigenmike. This moves the system toward a microphone-agnostic encoding and mixing platform. By parameterizing the encoding stage through a gain-factored matrix, the directional steering vectors can be reconfigured to reflect the capsule geometry of any tetrahedral target microphone array. Future iterations of the program can extend the beamforming framework beyond the first-order WXYZ dimensions, enabling the integration of higher-order ambisonic encoding to improve spatial resolution and directional accuracy.
The primary contribution of this program is a streamlined interface for ambisonic processing, designed to make scene-based mixing accessible to engineers and sound designers working in immersive audio production.
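The beam extraction, crossfade, and re-encoding steps described above can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the authors' implementation: the channel order [W, X, Y, Z], the axis convention (X front, Y left, Z up), the unit W gain in the steering vectors, and the re-encoding scale factors, which are chosen so that the six outer products sum to the identity matrix.

```python
# Hypothetical sketch of per-direction crossfading between two FOA frames.
# Assumed channel order: [W, X, Y, Z]; assumed axes: X front, Y left, Z up.

DIRECTIONS = {
    "front": (1.0, 0.0, 0.0),
    "back": (-1.0, 0.0, 0.0),
    "left": (0.0, 1.0, 0.0),
    "right": (0.0, -1.0, 0.0),
    "top": (0.0, 0.0, 1.0),
    "bottom": (0.0, 0.0, -1.0),
}


def steering_vector(u):
    """First-order steering vector for unit direction u (assumed W gain of 1)."""
    return (1.0, u[0], u[1], u[2])


def reencode_vector(u):
    """Re-encoding vector scaled so that sum_d e_d s_d^T is the 4x4 identity."""
    return (1.0 / 6.0, u[0] / 2.0, u[1] / 2.0, u[2] / 2.0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def crossfade_frame(frame_a, frame_b, fades):
    """Crossfade one B-format sample frame from soundfield A to soundfield B.

    fades maps a direction name to a value in [0, 1]:
    0 keeps soundfield A in that direction, 1 selects soundfield B.
    Directions missing from fades default to 0.
    """
    out = [0.0, 0.0, 0.0, 0.0]
    for name, u in DIRECTIONS.items():
        g = fades.get(name, 0.0)
        s = steering_vector(u)
        # Beam extraction by dot product, then a linear crossfade of the beams.
        beam = (1.0 - g) * dot(s, frame_a) + g * dot(s, frame_b)
        # Outer-product reconstruction: add this beam back into B-format.
        e = reencode_vector(u)
        for ch in range(4):
            out[ch] += e[ch] * beam
    return out


if __name__ == "__main__":
    a = [1.0, 0.5, -0.25, 0.1]
    b = [0.2, -0.3, 0.4, 0.0]
    # Take the front and top beams from B, everything else from A.
    print(crossfade_frame(a, b, {"front": 1.0, "top": 1.0}))
```

With these scale factors, setting every fade to 0 returns soundfield A unchanged and setting every fade to 1 returns soundfield B, so intermediate settings blend the two direction by direction.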
Spatial audio is spreading in applications such as virtual and augmented reality and immersive games. The higher-order ambisonic (HOA) format is particularly useful in this context. Transmitting spatial information requires multiple channels, e.g., 16 channels for third-order ambisonics, resulting in increased memory requirements for storage and higher bitrates for communication. Therefore, efficient compression algorithms are necessary for such content. The recently standardized IVAS codec allows the coding of HOA content for communication use-cases. Here, we propose to evaluate it in comparison with a basic multi-mono approach across a variety of content types and spatialization methods. Results show that IVAS outperforms the multi-mono approach at the same bitrate. In particular, this codec exploits inter-channel correlation to reduce the bitrate. We point out that it is therefore especially robust for signals with a high inter-channel correlation, such as those composed of a limited number of plane waves. Conversely, the multi-mono approach is unable to exploit this correlation and performs poorly on this type of signal.
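The channel count quoted above follows from the (N+1)^2 rule for a full 3D Ambisonic signal of order N. The short sketch below works through that arithmetic and shows what a total bitrate means per channel in a multi-mono scheme. The function names, the even split across channels, and the 256 kbps budget are assumptions for illustration, not values from the evaluation.

```python
def hoa_channel_count(order: int) -> int:
    """Number of channels in a full 3D Ambisonic signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def multi_mono_rate_per_channel(total_kbps: float, order: int) -> float:
    """Per-channel bitrate if the total is split evenly (an assumption)."""
    return total_kbps / hoa_channel_count(order)


if __name__ == "__main__":
    for n in range(1, 5):
        print(f"order {n}: {hoa_channel_count(n)} channels")
    # Hypothetical total budget, not a figure from the paper.
    print(multi_mono_rate_per_channel(256.0, 3), "kbps per channel")
```

Because each channel is coded independently in such a scheme, the per-channel rate shrinks quadratically with order, which is one reason a codec that exploits inter-channel correlation is attractive.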
A demonstration of a newly developed network spatial audio engine and its client software, showing how an object-based audio performance can be presented simultaneously locally and in a virtual venue. I’ll play multiple tracks of audio and position data from a laptop, as a surrogate for a local performance, and stream this object-based audio into a 6 DoF virtual audio space. I’ll then show how a remote audience can join the same audio space via web browsers, listen to the music and explore the space. Finally, I’ll show a resolved ambisonic mix of the original performance and the sounds of the remote audience being streamed back into the auditorium and played out. I’ll walk through the whole chain and show how all audio and data transfer is done with simple data structures and standard, non-proprietary streaming protocols.
Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences, etc.). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the ongoing discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendations for sound technologists and developers, based on our findings, to guide the development of better-informed AI tools for sound design.
Spatial audio in extended reality (XR) has traditionally been framed as a localization tool, guiding users toward discrete virtual objects or events. This paper reframes this object-centered paradigm by presenting audio formgiving, an approach in which sound defines continuous zones demarcated by boundaries that users encounter through embodied movement. We present a mixed-reality study that investigates how participants perceive, reconstruct, and navigate such sound zones. We report our findings on reconstruction accuracy and boundary ambiguities across different sound zone shapes and sizes, and how movement trajectories relate to zone recognition, as well as participants’ strategies for navigating and identifying different types of sound zones.
Modern auditory rehabilitation faces significant challenges in speech discrimination within complex, noisy acoustic environments. The use of Augmented Reality interfaces based on "virtual sound objects" proposes the separation and selective enhancement of audio sources, while the Auracast standard (Bluetooth LE Audio) emerges as the ideal mechanism to distribute these independent streams with low latency. However, the advancement of such selective listening strategies is strictly limited by proprietary commercial ecosystems and a complete lack of open-source research platforms that adhere to the physical and power constraints of wearable devices. To bridge this gap, this work develops an open-source Auracast application on the Tiresias open-hardware platform, establishing an accessible "front-end" infrastructure for auditory interaction. The architecture was implemented on the Nordic nRF5340 SoC utilizing the Zephyr RTOS. Preliminary evaluations on a development kit successfully validated the protocol stack integration, demonstrating stream stability. Ongoing work focuses on porting the firmware to the Tiresias board and integrating the ADAU1787 audio codec, aiming to empirically quantify the end-to-end latency and energy efficiency of the embedded system.
While Virtual Reality offers transformative potential for immersive storytelling, the heavy reliance on visual stimuli often excludes Blind and Visually Impaired audiences. Conventional accessibility methods, such as linear Audio Description, frequently struggle to keep pace with the non-linear, explorative nature of virtual environments, resulting in an "accessibility chasm" where traditional two-dimensional solutions fail to support non-visual navigation. This research addresses these limitations through a User-Centred Design approach, centred on the thematic analysis of semi-structured focus groups involving twelve experienced Blind and Visually Impaired videogame players from the Royal National Institute of Blind People. The inquiry explored four themes: spatial sound navigation, audio description integration, haptic efficacy, and the social dimensions of virtual interfaces. Findings indicate that non-visual spatial exploration requires a multifaceted auditory system utilizing 3D sound, predictable sound effects, and abstract sound signifiers, paired with a hybrid audio description model balancing functional and affective narration. To mitigate the risk of cognitive overload, participants identified haptic feedback as a critical tool for tactile confirmation and attentional guidance, serving as a non-auditory anchor that complements the primary soundscape. These user-led insights, together with real-life examples from accessible video games, inform the development of ‘Description Spheres’: interactive virtual objects embedded within virtual environments that serve as multi-sensory hubs. By integrating spatialized audio, localized haptics, and experimental audio description, the system enables a transition to a dynamic, exploratory model that translates complex visual-spatial data into intuitive, non-visual sensory ecosystems, offering a scalable blueprint for inclusive design.