Evaluating binaural renderings against stereo mixes is frequently confounded by "content bias," in which listeners' musical preferences obscure their assessments of spatial quality. To address this, we propose an interpretable predictive model that combines a pairwise differential approach (the Delta Strategy) with a dimension-wise attention neural network. The model achieves a competitive sign accuracy of 68.4%, outperforming traditional baselines. Crucially, the attention mechanism provides retrospective interpretability, revealing fundamental acoustic trade-offs in spatial upmixing: aggressive decorrelation for image widening compromises localization precision and timbral fullness, whereas successful externalization depends heavily on mid-side energy redistribution. The framework offers a robust evaluation tool for spatial algorithms and actionable psychoacoustic guidance for immersive audio production.
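As a rough illustration of the quantities named above, the following minimal sketch shows one way a pairwise difference, a dimension-wise attention weighting, and sign accuracy could be computed. It is not the paper's implementation: the function names (`delta_features`, `dimension_attention`, `sign_accuracy`), the two-dimensional toy features, the uniform attention scores, the handling of ties, and all numeric values are assumptions made for illustration only.

```python
import numpy as np

def delta_features(binaural_feats, stereo_feats):
    """Per-pair feature difference: binaural minus stereo for the same content.

    Differencing within a pair is assumed here to cancel content-specific
    offsets shared by both renderings of the same piece.
    """
    return np.asarray(binaural_feats, dtype=float) - np.asarray(stereo_feats, dtype=float)

def dimension_attention(delta, scores):
    """Softmax over feature dimensions (hypothetical form of the attention).

    Returns the per-dimension weights and the re-weighted deltas. In a trained
    network the scores would be learned; here they are passed in directly.
    """
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return weights, weights * delta

def sign_accuracy(pred_delta, true_delta):
    """Fraction of pairs whose predicted preference direction matches the true one."""
    pred = np.sign(np.asarray(pred_delta, dtype=float))
    true = np.sign(np.asarray(true_delta, dtype=float))
    mask = true != 0  # assumption: pairs rated as ties carry no direction and are skipped
    return float(np.mean(pred[mask] == true[mask]))

# Toy data: three content items, two hypothetical acoustic feature dimensions.
binaural = np.array([[0.8, 0.2], [0.4, 0.9], [0.5, 0.5]])
stereo = np.array([[0.5, 0.4], [0.6, 0.3], [0.5, 0.1]])

d = delta_features(binaural, stereo)
weights, weighted = dimension_attention(d, np.zeros_like(d))  # uniform scores give equal weights
pred = weighted.sum(axis=1)           # stand-in for the model's predicted rating difference
true = np.array([1.0, -0.5, 2.0])     # made-up listener rating differences (binaural minus stereo)
acc = sign_accuracy(pred, true)
print(f"sign accuracy: {acc:.3f}")
```

Under these assumptions, the attention weights can be read off per dimension after prediction, which is the sense in which such a weighting would support retrospective inspection of which acoustic differences drove a predicted preference.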