Name: (P) Complex Ratio Mask Ambisonics-to-Binaural Rendering with Intensity Vector Features and Perceptual Multi-Objective Optimization
Start: 2026-07-01T10:00:00+0200
End: 2026-07-01T12:30:00+0200

Schedule as of May 2026 - subject to change

Default Time Zone is EDT - Eastern Daylight Time


Wednesday July 1, 2026 10:00am - 12:30pm CEST

Jussieu:Room 1

This paper presents a neural network architecture for binaural rendering of first-order Ambisonics (FOA) signals, enabling headphone listeners to perceive immersive spatial audio from Ambisonic content without requiring individualized Head-Related Transfer Function (HRTF) measurements at inference time. The model operates in the STFT domain using Complex Ratio Masks (CRMs). Unlike magnitude-mask methods, which process only the omnidirectional channel and discard phase, the proposed model predicts a shared CRM pair (one mask per ear) that is applied via complex-valued multiplication to all four FOA channels with directional weighting. The omnidirectional channel W contributes at unit weight, while the directional channels Y, Z, and X are weighted at a reduced level, preserving both magnitude and phase information from the full soundfield. The input representation extends standard spectral features with three intensity vector channels that encode the direction of sound arrival at each time-frequency bin, giving the network explicit spatial information alongside magnitude and phase cues. Training uses a multi-objective loss that combines waveform-level accuracy (SI-SDR), multi-resolution spectral reconstruction at three complementary time-frequency scales, and interaural level and phase difference terms, jointly optimizing signal fidelity and spatial cue preservation. The encoder-decoder backbone is a four-level UNet with residual convolutional blocks and channel-spatial attention at every level, totaling approximately four million parameters. Evaluation against a prior magnitude-masking architecture with 28 million parameters shows that the CRM variant achieves comparable spatial cue preservation with a seven-fold parameter reduction while gaining access to phase information. Processing the signal in a single STFT-domain forward pass avoids the sequential inference of autoregressive time-domain models, yielding computational efficiency suitable for real-time virtual reality deployment.
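The signal-level operations named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the directional weight of 0.5, the unit-length normalisation of the intensity vector, and the reading of "directional weighting" as a mask applied to a weighted sum of the four channels are all assumptions made for illustration. The intensity vector, SI-SDR, and interaural level and phase differences follow their standard textbook definitions.

```python
import numpy as np

EPS = 1e-8  # small constant to avoid division by zero


def intensity_vector(foa_stft):
    """Intensity vector features, one 3-vector per time-frequency bin.

    foa_stft: complex array of shape (4, F, T), channels ordered W, Y, Z, X.
    Returns a real array of shape (3, F, T) holding Re{conj(W) * [Y, Z, X]},
    normalised to unit length (the normalisation is an assumption).
    """
    w = foa_stft[0]
    iv = np.real(np.conj(w)[None, :, :] * foa_stft[1:])
    norm = np.linalg.norm(iv, axis=0, keepdims=True)
    return iv / (norm + EPS)


def apply_crm(foa_stft, crm_left, crm_right, directional_weight=0.5):
    """Apply a shared CRM pair to a weighted mix of the FOA channels.

    W enters at unit weight; Y, Z, X enter at `directional_weight`
    (hypothetical value, the abstract only says "a reduced level").
    Masks are complex arrays of shape (F, T).
    """
    weights = np.array([1.0] + [directional_weight] * 3)
    mix = np.einsum("c,cft->ft", weights, foa_stft)
    # Complex multiplication changes both magnitude and phase.
    return crm_left * mix, crm_right * mix


def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    alpha = np.dot(estimate, target) / (np.dot(target, target) + EPS)
    projection = alpha * target
    noise = estimate - projection
    return 10.0 * np.log10((np.sum(projection**2) + EPS) / (np.sum(noise**2) + EPS))


def ild_ipd(left_stft, right_stft):
    """Interaural level difference (dB) and phase difference (radians)."""
    ild = 20.0 * np.log10((np.abs(left_stft) + EPS) / (np.abs(right_stft) + EPS))
    ipd = np.angle(left_stft * np.conj(right_stft))
    return ild, ipd
```

In a training setup along the lines the abstract describes, the masks would come from the network and the SI-SDR and ILD/IPD quantities would feed into loss terms; the multi-resolution spectral term is omitted here for brevity.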

Speakers

Szymon Zaporowski

Teaching and Research Assistant, Gdańsk University of Technology

Researcher at the Department of Multimedia Systems, Gdańsk University of Technology, with a focus on audio machine learning, psychoacoustics, signal processing, automatic speech recognition, and deepfake audio detection. His work sits at the intersection of immersive audio and AI...

Bartłomiej Mróz

Jussieu:Room 1, 4 place Jussieu, Paris 5e

HOA, Poster

AES 2026 AVARIG Conference
