Binaural rendering in extended reality (XR) often employs static acoustic profiles that may not correspond to the user’s visual environment, potentially leading to cross-modal incongruence and the room divergence effect. However, the influence of acoustic–visual mismatch on immersion and cognitive load in interactive six-degrees-of-freedom (6DoF) environments remains unclear. This study investigated the impact of acoustic–visual divergence on presence and subjective workload during real-time object interaction. An ITU-R BS.1116-3-compliant critical listening room was reconstructed at 1:1 scale in Unreal Engine 5. Ten critical listeners navigated the environment using a Meta Quest 3 headset while performing a 6DoF hand-tracking task. Spatial audio with virtual acoustics was rendered through OSC. Three acoustic conditions were evaluated: acoustically matched (RT60 = 0.21 s), anechoic (RT60 = 0 s), and highly reverberant (RT60 = 2.0 s). Presence and workload were assessed using the Igroup Presence Questionnaire (IPQ) and the NASA Task Load Index (NASA-TLX), respectively. Spatial Presence was significantly reduced only in the highly reverberant condition relative to the matched condition, while workload was unaffected by acoustic condition. The findings suggest that excessive reverberation disrupts environmental plausibility, whereas the absence of reflections can be partially compensated for by visual and sensorimotor cues.