Immersive virtual reality (VR) is increasingly used to simulate concert experiences, yet it remains unclear whether its experiential advantages are accompanied by corresponding physiological changes when audiovisual content is held constant. The present study compared head-mounted immersive VR concert playback with tablet-based video in a fully counterbalanced within-subject design. Musical content and audio reproduction were identical across conditions, isolating the effect of visual immersion. Results showed consistently higher subjective ratings in VR across all measures, including music-induced affect, chills, perceived presence, and performance liking. In contrast, physiological differences were comparatively small: electrodermal activity showed only a modest increase in VR, and heart rate variability did not reliably differentiate between conditions. These findings suggest that immersive VR substantially enhances subjective music experience, particularly in terms of presence and affective engagement, while corresponding changes in autonomic physiology are limited under controlled conditions. The results indicate that subjective and physiological responses were differentially sensitive to the presentation-format manipulation.
Audio Augmented Reality (AAR) can be experienced through different navigation techniques that may influence presence and spatial perception. This paper investigates the effects of navigation type and listener perspective on exploration behavior, presence, and localization accuracy in AAR systems built with consumer hardware. A within-subjects study compared four conditions: virtual navigation, virtual navigation with head tracking, physical navigation, and physical navigation with head tracking. Fifteen participants completed exploration and sound localization tasks in each condition. Results show that physical navigation increased presence and improved exploration behavior and localization performance compared to virtual navigation, while head tracking paired with a non-individualized HRTF for binaural rendering did not produce significant effects.
Virtual Reality has emerged as a promising medium for high-stakes training, yet its predominantly visual design places disproportionate demands on attentional resources, limiting capacity for other task-relevant information. Spatial audio cues exploit the underutilized auditory channel to redistribute this load, with demonstrated improvements in reaction time, search efficiency, and situational awareness. However, when audio cues are spatially incongruent with visual targets, task performance degrades. The cognitive and behavioral costs of such incongruency, particularly under increasing visual complexity, remain underexplored. This pilot study examines how audiovisual spatial incongruency affects mental workload and task performance through a within-subjects VR experiment in which 15 participants complete a search-and-respond task across congruent and incongruent audiovisual conditions at three levels of visual complexity. Reaction time, target accuracy, timeouts, and subjective workload are measured across 10 trials per participant. Audiovisual incongruency is hypothesized to increase mental workload and impair performance, with effects amplified under higher visual complexity. Findings will inform spatial audio design for immersive training systems and motivate further investigation into tolerance thresholds for audiovisual misalignment.
This paper introduces MAV-C, an offline, signal-based framework for the joint objective estimation of Audio-Visual Complexity (AVC) in locally-rendered interactive games. MAV-C integrates entropy-based Acoustic Scene Complexity (ASC) features with multi-scale visual complexity metrics adapted to video via optical flow variance, and fuses modality-specific scores via Minkowski pooling. Features are normalized to a common scale relative to analytical bounds, ensuring cross-sequence comparability. We present the framework architecture, report initial verification results on synthetic stimuli with known complexity properties, and outline a parametric sensitivity analysis evaluating the effect of Entropy Weight Method (EWM) regularization, motion scaling, and pooling exponent on discriminability across gameplay sequences of varying complexity.
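The fusion step can be illustrated with a minimal sketch of weighted Minkowski pooling. The abstract does not give MAV-C's exact formula, so the function name `minkowski_pool`, the equal default weights, and the example exponents below are assumptions chosen for illustration only.

```python
def minkowski_pool(scores, p=2.0, weights=None):
    """Pool normalized scores (each in [0, 1]) with a weighted Minkowski mean.

    Hypothetical helper: the weights and exponent are illustrative,
    not values taken from MAV-C.
    """
    if p <= 0:
        raise ValueError("pooling exponent p must be positive")
    if weights is None:
        # Assumption: equal weighting when no modality weights are given.
        weights = [1.0 / len(scores)] * len(scores)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = sum(w * (s ** p) for w, s in zip(weights, scores))
    return total ** (1.0 / p)


# Example: an acoustic score of 0.2 and a visual score of 0.8.
audio_score, visual_score = 0.2, 0.8
linear = minkowski_pool([audio_score, visual_score], p=1.0)
emphasized = minkowski_pool([audio_score, visual_score], p=8.0)
```

With weights that sum to one, the pooled value stays in [0, 1], and a larger exponent pulls the result toward the more complex modality, which is why the pooling exponent is a natural candidate for a sensitivity analysis.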
This paper presents a neural network architecture for binaural rendering of first-order Ambisonics (FOA) signals, enabling headphone listeners to perceive immersive spatial audio from Ambisonic content without requiring individualized Head-Related Transfer Function measurements at inference time. The model operates in the STFT domain using Complex Ratio Masks (CRM). Unlike magnitude-mask methods that process only the omnidirectional channel and discard phase, the proposed model predicts a shared CRM pair (left and right ear) applied via complex-valued multiplication to all four FOA channels with directional weighting. The omnidirectional channel W contributes at unit weight while directional channels Y, Z, X are weighted at a reduced level, preserving both magnitude and phase information from the full soundfield. The input representation extends standard spectral features with three intensity vector channels that encode sound arrival direction at each time-frequency bin, providing the network with explicit spatial information alongside magnitude and phase cues. Training uses a multi-objective loss that combines waveform-level accuracy (SI-SDR), multi-resolution spectral reconstruction at three complementary time-frequency scales, and interaural level and phase difference terms to jointly optimize signal fidelity and spatial cue preservation. The encoder-decoder backbone is a four-level UNet with residual convolutional blocks and channel-spatial attention at every level, totaling approximately four million parameters. Evaluation against a prior magnitude-masking architecture with 28 million parameters shows that the CRM variant achieves comparable spatial cue preservation with a seven-fold parameter reduction while gaining access to phase information. Processing the signal in a single STFT-domain forward pass avoids the sequential inference of autoregressive time-domain models, yielding computational efficiency suitable for real-time virtual reality deployment.
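As a rough illustration of the mask application described above, the sketch below applies a shared left/right complex ratio mask to one time-frequency bin of an FOA signal. It is not the authors' implementation: the function name `apply_shared_crm`, the summation of the weighted channels, and the directional weight of 0.5 are assumptions, since the abstract only says that the directional channels are weighted "at a reduced level".

```python
def apply_shared_crm(foa_bin, mask_left, mask_right, directional_weight=0.5):
    """Apply one shared complex mask pair to a single time-frequency bin.

    foa_bin: dict mapping 'W', 'Y', 'Z', 'X' to complex STFT values.
    mask_left, mask_right: complex ratio masks predicted for this bin.
    directional_weight: hypothetical reduced weight for Y, Z and X.
    """
    weights = {
        "W": 1.0,  # omnidirectional channel at unit weight
        "Y": directional_weight,
        "Z": directional_weight,
        "X": directional_weight,
    }
    # Because the mask is shared across channels, masking each channel and
    # summing equals masking the weighted sum (assumed combination rule).
    mixed = sum(weights[ch] * foa_bin[ch] for ch in ("W", "Y", "Z", "X"))
    # Complex multiplication scales magnitude and rotates phase together.
    return mask_left * mixed, mask_right * mixed


# Example bin with made-up values.
bin_values = {"W": 1 + 0j, "Y": 0 + 2j, "Z": 0j, "X": 2 + 0j}
left, right = apply_shared_crm(bin_values, mask_left=1j, mask_right=0.5 + 0j)
```

In a full model the same operation would run over every time-frequency bin of the STFT before an inverse transform per ear; the per-bin form is shown here only to make the complex multiplication explicit.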
Teaching and Research Assistant, Gdańsk University of Technology
Researcher at the Department of Multimedia Systems, Gdańsk University of Technology, with a focus on audio machine learning, psychoacoustics, signal processing, automatic speech recognition, and deepfake audio detection. His work sits at the intersection of immersive audio and AI...
This study investigates the perceptually sufficient ambisonic order for beamforming in complex acoustic scenes, defined as the minimum spatial resolution above which no audible improvement is perceived. Two beamforming methods were evaluated: hypercardioid and MVDR beamforming. In contrast to previous studies, the case of an ideal microphone array was considered, in order to evaluate the beamforming methods independently of ambisonic encoding error. Sound scenes were generated using room acoustic simulations and encoded into ambisonic signals. A perceptual evaluation was conducted using a three-interval/two-alternative forced choice (3I/2AFC) test design with an adaptive procedure. The experiment used a production-constrained reference (7th-order) and a high-order reference (19th-order). Results showed that the required order depends on the beamforming method and on the characteristics of the sound scene. Diffuseness profiles can be used to analyze the influence of the ambisonic order on the sound field diffuseness and to evaluate whether the directional information available is sufficient to support effective adaptive beamforming.
Binaural rendering in extended reality (XR) often employs static acoustic profiles that may not correspond to the user’s visual environment, potentially leading to cross-modal incongruence and the room divergence effect. However, the influence of acoustic–visual mismatch on immersion and cognitive load in interactive six-degrees-of-freedom (6DoF) environments remains unclear. This study investigated the impact of acoustic–visual divergence on presence and subjective workload during real-time object interaction. An ITU-R BS.1116-3 compliant critical listening room was reconstructed at 1:1 scale in Unreal Engine 5. Ten critical listeners navigated the environment using a Meta Quest 3 headset while performing a 6DoF hand-tracking task. Spatial audio with virtual acoustics was rendered through OSC. Three acoustic conditions were evaluated: acoustically matched (RT60 = 0.21 s), anechoic (RT60 = 0 s), and highly reverberant (RT60 = 2.0 s). Presence and workload were assessed using the IPQ and NASA-TLX. Results showed a significant reduction in Spatial Presence only between the matched and highly reverberant conditions, while workload remained unaffected. The findings suggest that excessive reverberation disrupts environmental plausibility, whereas reflection absence can be partially compensated by visual and sensorimotor cues.
Wave Field Synthesis (WFS) is a well-established spatialization technique that solves the sweet-spot limitation of conventional sound reinforcement and uniquely allows the synthesis of focused sources, that is, virtual sources positioned between the loudspeaker array and the listener. Despite its potential for extended reality (XR), WFS has remained confined to specialized environments such as live performance, installation or post-production workflows, with no accessible open-source tooling for research and creative authoring. We present AT_WaveSpace, an open-source WFS engine built on JUCE, distributed under the MIT licence, and integrated into the Unity game engine, designed to democratize WFS for researchers, developers and creators. Building on the methodological framework of the SoundScape Renderer, which combined WFS engine development with a perceptual research platform, AT_WaveSpace serves simultaneously as a spatial audio delivery tool and as an experimental tool. A perceptual evaluation of near-field distance perception of focused WFS sources, a dimension absent from prior literature, was conducted using this framework. Using a Midpoint Comparison procedure, participants were unable to rank sources at 40–100 cm consistently, while they ranked sources at 120–150 cm in the correct order. Spectral centroid analysis reveals a distance-dependent timbral variation in the proximal zone whose physical origin remains unclear. Low-frequency ILD remains the primary candidate cue for correct ranking at 120–150 cm. Perspectives for further studies are outlined.
Sound engineers doing live mixing for theatre must manage the balance between on-stage acoustic sources and electroacoustic sounds diffused in the IRCAM:Gallery, triggering, spatialising and mixing pre-recorded and live sounds while actors perform. Typically confined to the control booth of unfamiliar venues, they need to adapt to a listening perspective that differs significantly from the audience's experience in the stalls or the balconies. This work engages with sound studies, virtual acoustics, and archival practices to investigate complementary questions. How does the acoustic dissociation between the control booth and the rest of the venue influence the technical and aesthetic decisions of sound engineers? How would contemporary engineers interpret archived theatrical soundtracks when guided by annotated scripts? To address these questions, the research unfolds in four stages: capture multiple High Order Ambisonics Impulse Responses from emblematic theatres in São Paulo, Brazil, combining flexible source setups and multiple listening positions; use them to build a real-time convolution engine, integrating the IRs with actors' voices and archival soundtracks from the collection of Brazilian theatrical sound designer Tunica Teixeira; invite sound engineers to perform mixing tasks in a virtual acoustic environment, guided by Tunica's annotated scripts; and use the task metrics and structured questionnaires to assess the impact of multi-perspective listening on their technical and aesthetic decisions.
Navigation tasks are often used as a fun and engaging method of exploring and interacting with video game environments. 3D open-world games afford curiosity-driven navigation for players, providing opportunities to follow their agency and interact with points of interest within an environment. However, this is commonly a visually-motivated task that is seldom accessible to Blind and Low Vision (BLV) gamers. Given the impact of this barrier, it is imperative to design navigation systems for games that are driven by auditory information to provide equal opportunities for BLV gamers to engage with open-world game environments. A noted lack of understanding among game developers evidences the need for dialogue between researchers, BLV gamers and developers. In collaboration with both BLV gamers and developers, we present the first co-designed prototypes for a customisable, Blind-accessible auditory navigation toolkit in 3D open-world video game environments. We build on a series of dialogic discussions with Disabled gamers who have experienced barriers in their gameplay experiences and present three navigation tools. We document the design of these tools and present themes from the analysis of both co-design phases. We present discussions on player agency, action precision, gameplay fluidity and cognitive load, categorisation and identification, sound preference, and tutorialisation and learnability. From these themes, we derive design insights that highlight the barriers and considerations for auditory navigation in video games.
Pleyel.exe is an interactive documentary presented as a video game, exploring the evolving landscape of the Carrefour Pleyel district in Saint-Denis. Through free navigation within immersive 3D scans generated with Gaussian splatting, visitors can wander through sites in transition. As they explore, they encounter residents’ testimonies, drawn from in-situ recorded and carefully edited interviews, offering personal perspectives on the neighborhood and its ongoing transformations.
Realistic reproduction of spatial reverberation is essential for immersive audio applications, including virtual reality and interactive gaming. While geometrical acoustics methods enable efficient rendering, they do not fully capture wave phenomena such as low-frequency modal behavior and diffraction, which are particularly significant in small spaces. Wave-based simulations provide higher physical accuracy but at substantial computational cost. This paper extends VSVerb, a 4pi sampling reverberator based on virtual sound sources (VS) extracted via sound intensity analysis, to use pressure and three-axis particle velocity computed by a discontinuous Galerkin finite element method (dG-FEM) simulation, enabling reverberation that reflects the wave-based acoustic characteristics of virtual spaces to be generated. Experiments conducted in a university lecture room demonstrate that simulation-based VS distributions and their corresponding impulse responses closely match those derived from actual measurements. Comparison with measured impulse responses and geometrical acoustics ray tracing shows that the proposed method produces room acoustic parameters, including clarity and definition, closer to the measured reference across most metrics and frequency bands. A tendency to underestimate reverberation time was observed, which may be addressed through improved simulation modeling or post-processing. Furthermore, the VS distribution extracted from a single simulation can be adapted to different receiver positions by re-estimating the geometric contribution of each VS, enabling 6DoF navigation support without additional simulation. These results indicate the potential of the proposed framework for wave-based interactive reverberation in virtual spaces.
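For readers unfamiliar with the room acoustic parameters mentioned above, the following sketch computes broadband clarity and definition from an impulse response using the standard energy-ratio definitions with a 50 ms split. The function name `clarity_and_definition` is hypothetical, the impulse response is assumed to start at the direct-sound arrival, and the per-band filtering that the abstract's frequency-band comparison implies is omitted.

```python
import math


def clarity_and_definition(impulse_response, sample_rate_hz, early_ms=50.0):
    """Return (clarity in dB, definition as a ratio) for an impulse response.

    Assumes the first sample coincides with the direct-sound arrival.
    """
    split = int(round(sample_rate_hz * early_ms / 1000.0))
    early_energy = sum(x * x for x in impulse_response[:split])
    late_energy = sum(x * x for x in impulse_response[split:])
    total_energy = early_energy + late_energy
    if total_energy == 0.0:
        raise ValueError("impulse response contains no energy")
    # Definition: share of the total energy arriving before the split.
    definition = early_energy / total_energy
    # Clarity: early-to-late energy ratio on a decibel scale.
    if late_energy == 0.0:
        clarity_db = math.inf
    elif early_energy == 0.0:
        clarity_db = -math.inf
    else:
        clarity_db = 10.0 * math.log10(early_energy / late_energy)
    return clarity_db, definition


# Toy response at 1 kHz: 50 ms of amplitude 2 followed by 50 ms of amplitude 1.
toy_ir = [2.0] * 50 + [1.0] * 50
c50_db, d50 = clarity_and_definition(toy_ir, sample_rate_hz=1000)
```

The same function applies equally to measured and simulated responses, so the two can be compared on identical terms.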
We present Music of the Spheres (MOTS), an immersive virtual reality (VR) sequencer that integrates natural interaction with hybrid spatial audio reproduction for music composition and performance. MOTS enables users to create and manipulate sound objects arranged in a 3D step sequencer surrounding the user. Using hand gestures, users can instantiate, position, and remove sounds, simultaneously composing both temporal and spatial musical structures. The system combines binaural reproduction for private preview in the headset and Ambisonic loudspeaker reproduction for shared listening in audience-oriented experiences. In this paper, we discuss the implementation of MOTS and highlight design considerations for intuitive musical interfaces that are uniquely crafted for VR. We also present the results of a survey of 27 participants at a public exhibition, which indicate positive responses in terms of immersion and usability, as well as a coherent spatial audio experience across the hybrid reproduction system. Finally, we outline future directions, including expanded controls, collaborative functionality, and improved spatial audio rendering.
Audio Definition Model (ADM) metadata is central to contemporary object-based audio production and sits at the core of Dolby Atmos workflows. Yet in open research, rapid prototyping, immersive media development, and playback on irregular loudspeaker arrays, Atmos-derived material remains difficult to inspect, translate, and deploy without relying on proprietary tooling. This creates a persistent gap between the spatial audio formats used in industry and the open systems available to researchers, developers, and immersive venues. This paper presents CULT DSP, an open-source spatial audio toolchain designed to address that gap by separating transcoding, scene exchange, playback, and authoring into distinct but interoperable roles. CULT ingests spatial audio exports, extracts and normalizes scene metadata, and exports it for later use; in the authoring direction, the same module packages LUSID scene data and mono stems back into ADM/BWF output. LUSID provides a stable scene and package structure shared across the stack. Spatial Root is a layout-agnostic playback engine for real-time and offline rendering on custom loudspeaker arrays. Its EngineSession API exposes the runtime as a C++ interface used by the GUI, CLI, and external host applications. Four implementation projects extend the toolchain: Spatial Seed uses CULT and LUSID for procedural authoring from stems; LUSIDstreamer treats LUSID frames as lightweight scene-state packets; immersive-allo-root embeds Spatial Root in an AlloLib audiovisual application; and ue-root prototypes a game-engine-facing host path. Together, they show how Atmos-derived metadata can be reused for playback, authoring, inspection, and immersive media development rather than used only for final delivery.
This paper presents an illustrative cross-device evaluation of spatial audio reproduction in smart glasses and XR headsets using binaural in-ear recordings and external sound-level measurements on four anonymized commercial devices. The evaluation is organized around baseline playback behavior, cue fidelity, sound leakage, and robustness to wearing variability, with metrics derived from broadband-noise and swept-sine measurements. The results reveal distinct device behaviors, including differences in channel balance, interchannel signal behavior, preservation of HRTF-encoded binaural cues, perturbation of real-world acoustic cues, external sound radiation, and sensitivity to reseating. Rather than establishing a product ranking, this study demonstrates how the benchmark supports structured cross-device interpretation of wearable XR spatial audio systems.
A virtual artificial head (VAH) can be used to imprint a listener’s head-related transfer functions (HRTFs) onto a recording using a filter-and-sum beamforming approach. The previous version of the so-called Vikk, consisting of 24 microphones, was able to recreate HRTFs with low interaural errors, including temporal and spectral distortions up to 5 kHz. A simulation of a revised topology demonstrated an increased frequency range up to 8 kHz, motivating us to examine whether the range could be extended further, ideally beyond the audible range. We simulated two microphone topologies with different arrangement strategies based on either a Golomb ruler or Vogel’s spiral. In addition, scaling and weighting were applied to create a denser microphone placement in the centre of the array. Vogel’s spiral achieved results comparable to the Golomb ruler with 24 microphones and is easier to rescale with a larger number of microphones and parametric weighting. For this reason, we selected a weighted Vogel’s spiral to investigate how the number of microphones affects temporal and spectral distortions. Increasing the number of microphones to 32 reduced temporal and spectral distortions, although spectral distortions on the contralateral ear remained above 10 kHz. Further increasing the number to 64 microphones reduced spectral distortions and extended the usable frequency range up to 16 kHz. These results demonstrate the suitability of the Vikk64 for high-quality reproduction of binaural auralisations in the horizontal plane. Additionally, we outline how combining the Vikk64 with a VR180 camera enables the recording of audiovisual scenes that can be reproduced in virtual reality.
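A planar Vogel's spiral layout of the kind mentioned above can be sketched as follows. The golden-angle increment is the standard Vogel construction; the function name `vogel_spiral`, the `density_exponent` parameter used to crowd microphones toward the centre, and the 0.1 m radius are assumptions for illustration, since the abstract does not specify the weighting or array dimensions actually used for the Vikk.

```python
import math

# Golden angle in radians (about 137.5 degrees), as in Vogel's construction.
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def vogel_spiral(num_mics, radius_m, density_exponent=0.5):
    """Return (x, y) microphone positions in metres on a Vogel's spiral.

    density_exponent = 0.5 gives the classic near-uniform area coverage;
    larger values (hypothetical weighting) pull points toward the centre.
    """
    positions = []
    for n in range(num_mics):
        # Normalized index in (0, 1) keeps every point inside the radius.
        r = radius_m * ((n + 0.5) / num_mics) ** density_exponent
        theta = n * GOLDEN_ANGLE
        positions.append((r * math.cos(theta), r * math.sin(theta)))
    return positions


def mean_radius(positions):
    """Average distance of the positions from the array centre."""
    return sum(math.hypot(x, y) for x, y in positions) / len(positions)


uniform_layout = vogel_spiral(64, radius_m=0.1)
dense_centre_layout = vogel_spiral(64, radius_m=0.1, density_exponent=1.0)
```

Because the layout is generated from an index and two parameters, changing the microphone count from 24 to 32 or 64 only requires a different `num_mics`, which reflects the rescaling convenience the abstract attributes to the spiral.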