This paper presents a neural network architecture for binaural rendering of first-order Ambisonics (FOA) signals, enabling headphone listeners to perceive immersive spatial audio from Ambisonic content without requiring individualized Head-Related Transfer Function measurements at inference time. The model operates in the STFT domain using Complex Ratio Masks (CRM). Unlike magnitude-mask methods that process only the omnidirectional channel and discard phase, the proposed model predicts a shared CRM pair (left and right ear) applied via complex-valued multiplication to all four FOA channels with directional weighting. The omnidirectional channel W contributes at unit weight while directional channels Y, Z, X are weighted at a reduced level, preserving both magnitude and phase information from the full soundfield. The input representation extends standard spectral features with three intensity vector channels that encode sound arrival direction at each time-frequency bin, providing the network with explicit spatial information alongside magnitude and phase cues. Training uses a multi-objective loss that combines waveform-level accuracy (SI-SDR), multi-resolution spectral reconstruction at three complementary time-frequency scales, and interaural level and phase difference terms to jointly optimize signal fidelity and spatial cue preservation. The encoder-decoder backbone is a four-level UNet with residual convolutional blocks and channel-spatial attention at every level, totaling approximately four million parameters. Evaluation against a prior magnitude-masking architecture with 28 million parameters shows that the CRM variant achieves comparable spatial cue preservation with a seven-fold parameter reduction while gaining access to phase information. Processing the signal in a single STFT-domain forward pass avoids the sequential inference of autoregressive time-domain models, yielding computational efficiency suitable for real-time virtual reality deployment.
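The masking step described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the value `dir_weight=0.5`, the weighted sum over channels, and the intensity formula Re{W* · [X, Y, Z]} are assumptions made for illustration. The abstract only states that W has unit weight, that Y, Z, X are weighted at a reduced level, and that three intensity vector channels are used as input features.

```python
import numpy as np

def apply_crm(foa_stft, crm_left, crm_right, dir_weight=0.5):
    """Apply a shared complex ratio mask pair to all four FOA channels.

    foa_stft : complex array (4, F, T), channel order W, Y, Z, X (as in the abstract).
    crm_left, crm_right : complex masks of shape (F, T).
    dir_weight : hypothetical reduced weight for Y, Z, X (value not given in the abstract).
    """
    weights = np.array([1.0, dir_weight, dir_weight, dir_weight])
    masked_left = crm_left[None] * foa_stft    # complex multiplication per channel
    masked_right = crm_right[None] * foa_stft
    # Assumed combination: weighted sum over the four masked channels.
    left = np.tensordot(weights, masked_left, axes=1)
    right = np.tensordot(weights, masked_right, axes=1)
    return left, right

def intensity_features(foa_stft):
    """Three intensity channels per time-frequency bin, assumed as Re{conj(W) * [X, Y, Z]}."""
    w, y, z, x = foa_stft
    return np.real(np.conj(w)[None] * np.stack([x, y, z]))
```

Because the mask is complex, each output bin can differ from the input in both magnitude and phase, which is the property the abstract contrasts with magnitude-only masking.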
Teaching and Research Assistant, Gdańsk University of Technology
Researcher at the Department of Multimedia Systems, Gdańsk University of Technology, with a focus on audio machine learning, psychoacoustics, signal processing, automatic speech recognition, and deepfake audio detection. His work sits at the intersection of immersive audio and AI...
This study investigates the perceptually sufficient ambisonic order for beamforming in complex acoustic scenes, defined as the minimum spatial resolution above which no audible improvement is perceived. Two beamforming methods were evaluated: hypercardioid and MVDR beamforming. In contrast to previous studies, the case of an ideal microphone array was considered, in order to evaluate the beamforming methods independently of ambisonic encoding error. Sound scenes were generated using room acoustic simulations and encoded into ambisonic signals. A perceptual evaluation was conducted using a three-interval/two-alternative forced choice (3I/2AFC) test design with an adaptive procedure. The experiment used a production-constrained reference (7th-order) and a high-order reference (19th-order). Results showed that the required order depends on the beamforming method and the characteristics of the sound scene. Diffuseness profiles can be used to analyze the influence of the ambisonic order on the sound field diffuseness and to evaluate whether the directional information available is sufficient to support effective adaptive beamforming.
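To give a sense of scale for the two reference orders, the short sketch below computes the channel count of a full 3D Ambisonic signal, (N + 1)^2, and the theoretical directivity index of a maximum-directivity beam of order N, 20 log10(N + 1) dB. Treating the study's hypercardioid beamformer as this maximum-directivity pattern is an assumption, and the function names are illustrative.

```python
import math

def num_channels(order):
    """Number of spherical harmonic channels for a 3D Ambisonic signal of a given order."""
    return (order + 1) ** 2

def max_directivity_index_db(order):
    """Directivity index (dB) of an ideal maximum-directivity beam of the given order."""
    return 10.0 * math.log10((order + 1) ** 2)

for n in (7, 19):
    print(n, num_channels(n), round(max_directivity_index_db(n), 2))
```

The 7th-order reference carries 64 channels and the 19th-order reference carries 400, which illustrates why the former is described as production-constrained.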
Ambisonics is a scene-based spatial audio format that has been around since the 1970s. In recent years its popularity has increased, with inclusion in game engines (such as Unity and Unreal) and distribution standards (like ADM or IAMF). Despite this, many practitioners view Ambisonics as being complex, mathematical, or academic. This tutorial explains Ambisonics through the lens of practical decision-making. Instead of equations, it covers the choices audio professionals are required to make when working on a project, with a particular emphasis on the audible consequences of those choices. The tutorial enables attendees to develop an intuitive understanding of Ambisonics through explanations of theory, combined with listening examples and workflow demonstrations. The topics covered in this tutorial are:
• Fundamentals: What Ambisonics is and how it differs from channel-based and object-based formats, and why it is well suited to VR, AR, and game audio.
• Encoding: How audio sources are converted to Ambisonics, and how to choose the Ambisonic order based on perceptual and computational trade-offs, as well as delivery constraints.
• Conventions: Common channel ordering (ACN vs FuMa) and gain normalisation (SN3D vs N3D) conventions, and what happens when things get mismatched.
• Processing: What kinds of effects can be used on Ambisonic signals while preserving the spatial integrity.
• Decoding and binaural rendering: How Ambisonic signals are converted to loudspeaker or binaural signals. The impact of head-tracking and HRTF selection on the binaural rendering.
• Mixed-order projects: What the options are when working with mixed-order sources, and the audible artefacts that can arise.
The tutorial will provide brief practical demonstrations of setting up an Ambisonics project in Pro Tools and Reaper, two widely used DAWs for immersive audio.
By the end of the tutorial, attendees will have a practical understanding of the main concepts of Ambisonics and will know how the practical choices they make affect the final audio. They will also be familiar with the main workflow pitfalls and how to avoid them. The tutorial assumes familiarity with general audio production concepts (DAW use, signal routing, mixing). However, no prior experience with Ambisonics or spatial audio formats is required. It is suitable for sound designers, composers, and audio engineers working in or interested in immersive media.
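The channel-ordering and normalisation conventions listed among the tutorial topics can be made concrete for the first-order case. The sketch below is not taken from the tutorial, and its function names are illustrative. It converts a FuMa B-format signal (order W, X, Y, Z, with W attenuated by 1/√2) to ACN order with SN3D normalisation (W, Y, Z, X), and then rescales SN3D to N3D, where degree-1 channels gain a factor of √3.

```python
import numpy as np

def fuma_to_acn_sn3d_foa(fuma):
    """Convert first-order FuMa (W, X, Y, Z) to ACN/SN3D (W, Y, Z, X).

    fuma : array of shape (4, samples).
    """
    w, x, y, z = fuma
    # FuMa stores W at -3 dB; first-order X, Y, Z already match SN3D gains.
    return np.stack([w * np.sqrt(2.0), y, z, x])

def sn3d_to_n3d_foa(acn_sn3d):
    """Rescale first-order ACN/SN3D to ACN/N3D: N3D = SN3D * sqrt(2 * degree + 1)."""
    gains = np.array([1.0, np.sqrt(3.0), np.sqrt(3.0), np.sqrt(3.0)])
    return gains[:, None] * acn_sn3d
```

If a FuMa signal is played through an ACN decoder without the reordering, the X, Y, and Z channels are swapped and sources appear in the wrong directions. Skipping the gain correction instead alters the balance between the omnidirectional and directional components.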
Emerging wearable devices such as smartglasses and extended reality headsets demand high-quality spatial audio capture from compact, head-worn microphone arrays. Ambisonics provides a device-agnostic spatial audio representation by mapping array signals to spherical harmonic (SH) coefficients. In practice, however, accurate encoding remains challenging. While traditional linear encoders are signal-independent and robust, they amplify low-frequency noise and suffer from high-frequency spatial aliasing. On the other hand, neural network approaches can outperform linear encoders but they often assume idealized microphones and may perform inconsistently in real-world scenarios. To leverage their complementary strengths, we introduce a residual-learning framework that refines a linear encoder with corrections from a neural network. Using measured array transfer functions from smartglasses, we compare a UNet-based encoder from the literature with a new recurrent attention model. Our analysis reveals that both neural encoders only consistently outperform the linear baseline when integrated within the residual learning framework. In the residual configuration, both neural models achieve consistent and significant improvements across all tested metrics for in-domain data and moderate gains for out-of-domain data. Yet, coherence analysis indicates that all neural encoder configurations continue to struggle with directionally accurate high-frequency encoding.
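The residual idea in this abstract can be sketched as a linear encoder whose output is corrected by an additive learned term. The sketch below rests on assumptions: the Tikhonov-regularised least-squares design is one common way to build a linear encoder, not necessarily the one used in the paper. `correction_fn` is a hypothetical stand-in for the neural network, and the additive combination is inferred from the phrase "refines a linear encoder with corrections".

```python
import numpy as np

def linear_encoder(H, Y, reg=1e-3):
    """Regularised least-squares encoder for one frequency bin.

    H : (mics, directions) array transfer functions.
    Y : (sh_channels, directions) target spherical harmonic values.
    Returns E of shape (sh_channels, mics) minimising ||E H - Y||^2 + reg * ||E||^2.
    """
    Hh = H.conj().T
    return Y @ Hh @ np.linalg.inv(H @ Hh + reg * np.eye(H.shape[0]))

def residual_encode(x, E, correction_fn):
    """Linear encoding plus a learned correction (assumed additive).

    x : (mics, frames) microphone spectra at one frequency bin.
    correction_fn : hypothetical placeholder for the network; returns (sh_channels, frames).
    """
    return E @ x + correction_fn(x)
```

With a correction that returns zeros, the output falls back to the linear baseline. This fallback is one plausible reason a residual configuration behaves more consistently than a network that must learn the full mapping.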
In this work, the authors evaluate a higher-order Ambisonic (HOA) renderer that compensates for reverberant characteristics of the intended listening room; this is accomplished by decoding a HOA signal to control points distributed around a boundary surrounding the listening area, then convolving the control signal with a compensation filter derived via matrix inversion of room impulse responses (RIR) from loudspeakers to control points in the frequency domain. First, a comparison is performed over renderers utilizing increasing control point density and evaluated using simulated RIRs. Then, robustness of the renderer to simulation inaccuracy is evaluated experimentally in a listening room. Metrics of reconstructed soundfield directionality and reverberation are compared to those obtained from a conventional HOA decoder, and results demonstrate an increase in source directivity, and a reduction in reverberation time for both directional and diffuse stimuli.
Higher-Order Ambisonics (HOA) reproduction with conventional mode-matching decoders can exhibit the so-called “ring of silence,” characterised by sound level reduction in specific spatial or spectral regions. This effect arises in loudspeaker reproduction when the number of loudspeakers exceeds that required by the Ambisonic order, and in binaural rendering when head-related transfer functions (HRTFs) are sampled at a higher spatial resolution than supported by the input signal. This paper investigates the extent to which advanced Ambisonic decoding strategies can mitigate this artefact. In particular, decoders based on Lasso regularisation and magnitude least-squares (magLS) are evaluated through numerical simulations in both loudspeaker and binaural reproduction scenarios. The results show that both approaches significantly reduce the prominence of the ring of silence compared to conventional minimum-norm mode-matching decoders. In loudspeaker reproduction, a more uniform spatial distribution of SPL is obtained, while in binaural rendering, spectral consistency is improved. An interpretation of these results is proposed, linking the observed behaviour to the underlying optimisation criteria of the decoding process. The results indicate that the ring of silence is not an inherent limitation of Ambisonics, but rather a consequence of the decoding strategy, and can be effectively mitigated through appropriate decoder design.
Channel-based mixing has long been the standard paradigm for audio professionals in both studio and live performance contexts, owing to its intuitive, signal-oriented workflow. While this approach excels in conventional stereo and multichannel formats, it offers limited native support for advanced spatial applications. As immersive audio formats become increasingly popular for virtual and augmented reality, new mechanisms are needed to allow engineers to work directly within spatial domains rather than adapting channel-based setups to fit immersive content. This paper presents an interactive software system for mixing First-Order Ambisonic (FOA) soundfields in real time. The system accepts A-format recordings from tetrahedral microphones, including the Sennheiser Ambeo and the Core Sound TetraMic. It converts them to B-format using a directional beamforming approach, in which the individual dimensions (left, right, front, back, top, bottom) are independently crossfaded between sources. Each directional beam is extracted via a dot product with a corresponding spherical harmonic steering vector, crossfaded between the two input soundfields, and re-encoded into a new B-format using an outer product reconstruction. The program architecture is designed to accommodate future extensions to additional Ambisonic microphone formats, including the Røde SoundField, the Zylia ZM-1, and the Eigenmike. This moves the system toward a microphone-agnostic encoding and mixing platform. By parameterizing the encoding stage through a gain-factored matrix, the directional steering vectors can be reconfigured to reflect the capsule geometry of any tetrahedral target microphone array. Future iterations of the program can extend the beamforming framework beyond the first-order WXYZ dimensions, enabling the integration of higher-order ambisonic encoding to improve spatial resolution and directional accuracy.
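The extract, crossfade, and re-encode chain described in this abstract can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the program's code. It assumes ACN channel order (W, Y, Z, X) with SN3D gains, six axis-aligned look directions, and a diagonal reconstruction gain chosen so that the chain returns the input unchanged when both soundfields are identical. All names are hypothetical.

```python
import numpy as np

# Axis-aligned look directions as (x, y, z); the labels follow the abstract.
DIRECTIONS = {
    "front": (1, 0, 0), "back": (-1, 0, 0),
    "left": (0, 1, 0), "right": (0, -1, 0),
    "top": (0, 0, 1), "bottom": (0, 0, -1),
}

def steering_vector(x, y, z):
    """First-order steering vector, assumed ACN order (W, Y, Z, X) with SN3D gains."""
    return np.array([1.0, y, z, x], dtype=float)

# For these six beams, sum_d s_d s_d^T = diag(6, 2, 2, 2); this assumed gain inverts it.
RECON_GAIN = np.array([1 / 6, 1 / 2, 1 / 2, 1 / 2])

def crossfade_soundfields(b_a, b_b, fades):
    """Crossfade two B-format signals of shape (4, samples) independently per direction.

    fades : dict mapping a direction name to a value in [0, 1]; 0 keeps A, 1 keeps B.
    """
    out = np.zeros_like(b_a, dtype=float)
    for name, xyz in DIRECTIONS.items():
        s = steering_vector(*xyz)
        beam_a = s @ b_a                      # dot product extracts the beam from A
        beam_b = s @ b_b                      # and the same beam from B
        g = fades.get(name, 0.0)
        mixed = (1.0 - g) * beam_a + g * beam_b
        out += np.outer(s, mixed)             # outer product re-encodes the beam
    return RECON_GAIN[:, None] * out
```

For example, `crossfade_soundfields(a, b, {"front": 1.0})` replaces only the forward-facing beam of soundfield `a` with that of `b`, leaving the other five beams untouched.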
The primary contribution of this program is a streamlined interface for ambisonic processing, designed to make scene-based mixing accessible to engineers and sound designers working in immersive audio production.