AES 2026 AVARIG Conference: Full Schedule

Schedule as of May 2026 - subject to change

Default Time Zone is EDT - Eastern Daylight Time

10:30am CEST

A survey of HRTF dataset use in academia and industry reveals no de facto standard

Wednesday July 1, 2026 10:30am - 11:00am CEST

Head-related transfer functions (HRTFs) are crucial for plausible binaural audio playback for virtual, augmented, and mixed-reality applications. In such applications, humans show higher sound-localisation accuracy and higher perceived externalisation, and experience less colouration, when using their individual HRTFs compared to non-individual HRTFs. Because high-quality individual HRTFs require cumbersome measurements in specialised facilities, applications often use non-individual or dummy-head HRTFs as a practical alternative. Humans are able to adapt to non-individual HRTFs, which leads to a localisation performance comparable to that achieved with individual HRTFs. Therefore, adaptation to non-individual HRTFs could be a practical alternative whenever individual HRTFs are unavailable; however, this would only be possible if the same non-individual standard HRTF were used across different applications. To find out whether this is the case, we conducted a survey on HRTF usage among 76 professionals working in the field of spatial audio. The findings suggest that there is currently no de facto standard HRTF. Surprisingly, only half of those with access to individual HRTFs are actually using them, and most would be willing to switch to a default HRTF set if one were established.

Speakers

Fabian Brinkmann

Katharina Pollack

Nils Meyer-Kahlen

Pedro Lladó

Jussieu:Conf 1 4, place Jussieu Paris 5e

HRTFs, Lecture

11:00am CEST

Assessing localisation and localisation uncertainty for off-centre listening in a stereo loudspeaker setup

Wednesday July 1, 2026 11:00am - 11:30am CEST

Jussieu:Conf 1

In loudspeaker-based reproduction, the spatial quality deteriorates when the listeners move outside the sweet spot. While this seems well known in the spatial audio community, perceptual data that allow this effect to be quantified are scarce, which makes it difficult to suggest solutions for off-centre listening. In this study, we collected perceptual data to test three main hypotheses: (H1) that localisation for stereo reproduction over loudspeakers and over headphones with binaural recordings of a dummy head would result in similar perceptual outcomes, (H2) that translation of the listener is equivalent to adding the corresponding interchannel time and level differences in the sweet spot, and (H3) that the spread of localisation responses is correlated to the localisation uncertainty perceived by the listener. Regarding H1, the responses for binaural recordings and loudspeakers were equivalent within a 2° margin. Regarding H2, localisation off-centre produced only a shift in the responses compared to interchannel time and level differences in the sweet spot. Regarding H3, the spread in localisation responses strongly correlated with the perceived uncertainty ratings. Altogether, the results suggest that a localisation test using binaural recordings in the sweet spot, including interaural time and level differences, may be sufficient to characterise off-centre localisation and localisation uncertainty for stereo reproduction.

Speakers

Pedro Lladó

Rapolas Daugintis

Enzo De Sena

Jussieu:Conf 1 4, place Jussieu Paris 5e

Stereo, Lecture

11:30am CEST

Crosstalk Cancellation in Loudspeaker Arrays: Effects of Directivity, Array Size, and Listener Position

Wednesday July 1, 2026 11:30am - 12:30pm CEST

Jussieu:Conf 1

Crosstalk Cancellation (CTC) is a technology that enables binaural audio reproduction over loudspeakers. The performance of a CTC system depends on multiple factors, including the geometry of the system, the characteristics of the loudspeakers, and the accuracy of the plant models used to design the CTC filters. While previous studies have examined some of these factors, the combined influence of loudspeaker directivity, array size and listener position has received limited attention. This study models loudspeakers with a spherical pole cap and uses interpolated Neumann KU 100 head-related transfer functions to generate accurate plant responses. CTC filters are computed using a Tikhonov-regularised pseudoinverse approach, and numerical simulations are performed to evaluate the impact of directivity, array geometry and listener orientation on CTC performance.

Speakers

Francesco Veronesi

Filippo Fazi

Jacob Hollebon

Jussieu:Conf 1 4, place Jussieu Paris 5e

Transaural, Lecture

12:00pm CEST

The illusion of elevated lateral sources using loudspeakers in the horizontal plane

Wednesday July 1, 2026 12:00pm - 12:30pm CEST

Jussieu:Conf 1

Spectral manipulation techniques offer a means of generating virtual sound-source elevation using horizontal loudspeakers. In comparison to cross-talk cancellation systems, these techniques can be more flexible and operate with even a single loudspeaker. However, the azimuthal stability of such approaches remains uncharacterised. This study evaluates the effectiveness of magnitude-based difference-spectrum filtering across lateral source positions, including intermediate positions rendered via amplitude panning, in loudspeaker-based reproduction. Direction-dependent filters derived from a mean HRTF magnitude response were applied over a horizontal-plane loudspeaker array, with physically elevated loudspeakers at matched azimuths serving as perceptual references. Perceived virtual elevation was quantified using the illusion ratio, a novel metric expressing virtual elevation shift as a proportion of the physical elevation shift at each azimuth. Virtual elevation reached approximately 50% of the physical elevation shift at central azimuths, decreasing significantly with lateral displacement, consistent with the reduced effectiveness of monaural spectral cues at lateral positions. A greater virtual elevation effect was observed for ipsilateral rather than contralateral source positions relative to the filter ear. Stimulus class did not significantly alter the azimuth-dependent structure of the effect. These results demonstrate that magnitude-based spectral elevation synthesis produces a measurable and robust elevation effect, most pronounced for central sources.

Speakers

Rapolas Daugintis

Enzo De Sena

Monty Bland

Pedro Lladó

Jussieu:Conf 1 4, place Jussieu Paris 5e

Transaural, Lecture

12:30pm CEST

Trading of interaural time and level differences for stimuli presented using a novel two-listener virtual imaging system

Wednesday July 1, 2026 12:30pm - 1:00pm CEST

Jussieu:Conf 1

Extensive research has investigated the relative influence of interaural level and time differences (ILDs and ITDs) on the perceived position of aural stimuli. Historically, these cues have been compared using trading methods with stimuli presented over headphones. For the purpose of virtual audio applications using multichannel techniques, it is important to establish whether interaural cues are exploited similarly in such listening conditions. In this work, trading experiments were carried out both with stimuli presented over headphones and using a novel two-listener crosstalk cancellation array. Listener responses revealed similar trading behaviour in the crosstalk cancellation case when compared to the headphones case. At the lowest frequency tested, the measured trading behaviour is considered less reliable due to inaccuracies in reproduction of the target stimuli. With this exception, this work demonstrates that the general trends observed in historical ILD/ITD trading experiments also apply to stimuli presented using crosstalk cancellation, namely increased sensitivity to ILD and decreased sensitivity to ITD with increasing frequency.

Speakers

Isaac Lambert

Vlad Paul

Philip Nelson

Jussieu:Conf 1 4, place Jussieu Paris 5e

Transaural, Lecture

2:00pm CEST

Primary Source Dominance and Acoustic Scene Complexity in 6DoF VR Audio Evaluation

Wednesday July 1, 2026 2:00pm - 2:30pm CEST

Jussieu:Conf 1

In virtual reality (VR) experiences, a primary source often refers to a sound object designated for a central role within a scene, contrasted with contextual background sources. While such sources are typically assumed to guide perceptual attention, it remains unclear whether a designated primary source maintains dominance in overall audio quality evaluation as acoustic scene complexity increases, particularly in six-degrees-of-freedom (6DoF) scenarios. This study investigates how per-source rendering quality and scene complexity influence overall audio quality evaluation in 6DoF VR. Rendering quality was manipulated independently for a primary source and background sources, and scene complexity was varied based on the number of sources. Rank-order elimination-by-aspects (EBA) was applied to test dominance patterns across conditions. Results indicate that under low scene complexity, overall evaluation mainly depended on primary source rendering quality. However, in high-complexity multisource scenes, this dominance was no longer observed, and evaluation dependence became distributed across sources.

Speakers

Haowen Zhao

Damian Murphy

Jussieu:Conf 1 4, place Jussieu Paris 5e

Cognition, Lecture

2:30pm CEST

Investigating the Perceptual Relevance of Voice Directivity in Virtual Vocal Instruction Environments

Wednesday July 1, 2026 2:30pm - 3:00pm CEST

Jussieu:Conf 1

Given the limited research on the use of extended reality (XR) technologies in remote music instruction from the perspective of music tutors, this work examines the perceptual importance of voice directivity within a virtual reality (VR) environment. In particular, the perceptual ability to discriminate differences between a measured vocal directivity pattern and a slightly modified omnidirectional directivity pattern is investigated. Two listening tests were conducted to probe directivity perception under (i) static and (ii) dynamic listener conditions within a simulated music practice room, integrating 3rd-order Ambisonics Room Impulse Responses (RIRs) and head-tracked binaural reproduction within an interactive Unity-based interface. The static test used an ABX discrimination model to assess directivity detectability as a function of location and stimulus content. The dynamic test involved free navigation around a virtual singer and the evaluation of the perceived directional plausibility, the naturalness of the sound emission, and the adequacy of the experience for the assessment of the singer’s vocal characteristics. The results suggest that while listeners can detect differences between vocal directivity patterns under controlled listening conditions, such differences may become less perceptually salient during dynamic interaction within a virtual environment. Nevertheless, the overall positive evaluations in the dynamic listening test indicate that the implemented spatial audio approach provides a plausible and effective auditory experience, supporting its potential use in XR-based applications for remote music instruction and performance evaluation.

Speakers

Eleni Tavelidou

Konstantinos Bakogiannis

Areti Andreopoulou

Jussieu:Conf 1 4, place Jussieu Paris 5e

Cognition, Lecture

3:00pm CEST

Personality Perception in Acoustic Virtual Reality

Wednesday July 1, 2026 3:00pm - 3:30pm CEST

Jussieu:Conf 1

This study investigates how perceived auditory distance in Virtual Reality (VR) influences social perception, specifically personality attribution. Building on research linking social and physical distance, the work explores whether speakers who sound closer or farther away are judged differently in terms of personality traits. Using the SONICOM 3D Speaker Personality Corpus, the study analysed 360 spatialised speech samples from 120 speakers. Each sample was evaluated by 10 listeners, who rated both perceived auditory distance and five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism). Ratings were averaged to obtain a single score per recording. The analysis proceeded in two stages. First, correlations between perceived distance and personality traits were examined. Results showed significant relationships for two traits: Extraversion and Agreeableness. Specifically, speakers perceived as closer were judged as less extraverted but more agreeable, while more distant speakers were perceived as more extraverted and less agreeable. Second, machine learning models were developed to predict personality trait scores from the speech data. A feedforward neural network achieved above-chance classification performance across all traits. When the model was extended to jointly predict both personality traits and perceived distance, performance improved significantly for all traits. This suggests that perceived distance and personality attribution are linked, and that this relationship includes non-linear patterns not captured by simple correlation analysis. Overall, this is the first study to demonstrate an interaction between perceived physical distance and speech-based personality judgments. The findings highlight the importance of spatial audio in shaping social perception in VR and Extended Reality (XR). 
They suggest that manipulating the perceived distance of virtual speakers could influence how users interpret social cues, potentially enhancing the design of virtual agents for roles such as teachers, assistants, or companions.

Speakers

Eva Fringi

Nisreen Alshubaily

Stephen A. Brewster

Lorenzo Picinali

Tanaya Guha

Alessandro Vinciarelli

Jussieu:Conf 1 4, place Jussieu Paris 5e

Cognition, Lecture

4:30pm CEST

The Effect of Authentic Spatial Sound on Verbal Working Memory in Online Virtual Reality Learning Environments

Wednesday July 1, 2026 4:30pm - 5:00pm CEST

Jussieu:Conf 1

This paper investigates the impact of authentic spatial audio on verbal working memory (WM) within a WebXR-based virtual reality learning environment (VRLE). While prior virtual reality (VR) research has predominantly focused on visual modalities, the influence of auditory realism, particularly authentic spatialised sound, on cognitive performance remains underexplored. To address this gap, a controlled within-subjects experiment was conducted using an adapted automated operation span (AOSPAN) task under two conditions: with and without authentic spatial sound. A total of 40 participants completed the study using a head-mounted display in a controlled laboratory setting. The VRLE was implemented using web-based technologies, incorporating ambisonics audio capture and real-time spatial sound rendering. Statistical analysis revealed no significant differences in WM performance across conditions for all measured metrics, including OSPAN score, total correct recall, and error rates. However, results consistently showed a non-significant trend toward improved performance in the presence of authentic spatial sound. In contrast, subjective measures indicated substantial enhancements in perceived presence, immersion, realism, and user preference when spatial ambient audio was enabled. These findings suggest that while authentic spatial sound does not significantly influence verbal WM performance, it enhances experiential quality without increasing cognitive load. The study highlights the importance of incorporating realistic auditory environments in VR design for education, supporting user engagement while maintaining cognitive neutrality.

Speakers

Vincent Russell

David Murphy

Jussieu:Conf 1 4, place Jussieu Paris 5e

Cognition, Lecture

5:00pm CEST

Assessing the Impact of Spatial Audio on Cognitive Load and Memory Retention for Virtual Training Simulation in Virtual Reality

Wednesday July 1, 2026 5:00pm - 5:30pm CEST

Jussieu:Conf 1

This paper examines whether spatialised audio improves cognitive load and memory retention in Virtual Reality training. Using a commercial VR public speaking module developed by BODYSWAPS, 1,350 real-world users were randomly assigned to either a standard audio (control) or fully spatialised audio with virtual acoustics (study) condition. The study ran over a three-year period, making this the largest study of its kind. Participants completed an exit survey rating five data points: comfort, concentration, realism, retention, and simulation. The spatialised audio group reported consistently higher scores overall, with a statistically significant improvement in perceived comfort (p = 0.006, d ≈ 0.44). Directional improvements were also observed in realism and retention, though these did not reach statistical significance. Gaze-time analysis revealed that the spatialised audio group spent more time looking at the primary coaching figures, suggesting that spatial audio may support sustained attentional focus on key instructional sources. The findings indicate that spatial audio design is a meaningful contributor to VR training quality, particularly in comfort and perceived realism, with promising trends for learning efficacy.

Speakers

Oliver Kadel

Tomasz Rudzki

Tom Szirtes

Gavin Kearney

Jussieu:Conf 1 4, place Jussieu Paris 5e

Cognition, Lecture

5:30pm CEST

SCHuBERT: a real-time end-to-end model for piano music emotion recognition

Wednesday July 1, 2026 5:30pm - 6:00pm CEST

Jussieu:Conf 1

In this study, we present SCHuBERT, a real-time end-to-end Piano Music Emotion Recognition (PMER) system that operates directly on raw audio and fine-tunes DistilHuBERT for short-window classification on the Valence–Arousal (V–A) plane. Designed for low latency and high responsiveness, the system is particularly well suited for immersive applications such as Virtual and Augmented Reality (VR/AR). Compared with both audio- and symbolic-domain baselines, SCHuBERT achieves strong accuracy in four-quadrant (4Q) classification as well as in binary arousal and valence tasks, while maintaining low computational overhead for real-time operation.

Speakers
