5
Until now, we have considered binaural audio only on its own and have not discussed it in the context of VR. As stated previously, the definition of VR is broad and covers many uses, from training medical students and playing video games to creating interactive and immersive art exhibits. In all cases, the goal is to immerse the subject in a virtual world and provide some form of entertainment or experience. Binaural audio adds an extra layer of immersion, so that the environment not only looks as though we are in it but also sounds as though we are [11].
Additionally, we have not yet discussed how the audio is delivered to the listener. The two main options are headphones and loudspeakers. The choice of playback setup matters because the processing required for headphones differs from that required for loudspeakers, and each option has its own advantages and disadvantages.
Headphones are more widely used for VR due to their convenience. Because they form a largely closed environment, audio meant for the left ear reaches the left ear without affecting the right, and vice versa. Less background noise is audible while wearing them. Additionally, headphones are more affordable than loudspeakers of the same quality and require little to no setup to get the full effect. The downsides of headphones are that they have their own transfer function that needs to be filtered out, that they are sometimes not considered equivalent to free-field listening [12], and that Inside-Head Locatedness (IHL) can occur, which is the sense that the sound is coming from inside the head. Despite these drawbacks, headphones are still generally used more often than loudspeakers: their downsides are more easily remedied and, again, they are more convenient for both the developer or audio engineer and the consumer.
Loudspeakers, as mentioned previously, are used less often than headphones. Unlike headphones, loudspeakers do not provide a closed environment, so audio meant for the left ear is also heard to some extent in the right, and vice versa; this is called Cross-Talk. It must be remedied by a process called Cross-Talk Cancellation (XTC). The audio is also altered by reflections from the room, and background noise is not blocked out. Loudspeakers also require a more precise setup to achieve the proper effect, which the average person may not do correctly, and they generally cost more than headphones of the same audio quality [1]. That said, loudspeakers are less prone to IHL and can sound more realistic, as the sound is not played directly at the listener’s ears. One may choose loudspeakers over headphones because of the nature of the VR environment, such as a virtual environment that multiple people can physically occupy at once [12][13][14].
Eliminating Inside-Head Locatedness
Inside-Head Locatedness is the sense that the sound is coming from inside the head, along the interaural axis (also known as lateralisation), which is something binaural audio aims to avoid. It can be reduced by using better or more customised HRTFs, accounting for head movement, and adding effects such as reverberation that match the space you are “in” [1][15]. The search for a practical way to obtain more customised HRTFs is still ongoing [9].
An important barrier to note is that, in commercial or general use, you cannot produce customised or personalised HRTFs for every consumer. This means engineers pick generalised HRTFs that work well enough for a large number of people.
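Of the remedies listed above, added reverberation is the simplest to illustrate. The following is a minimal sketch in Python with NumPy, not a method taken from the cited references. It assumes the room response can be crudely approximated by exponentially decaying noise; in practice a measured binaural room impulse response would be used instead. The function names, the `rt60` decay time and the `wet` mix level are hypothetical choices made for the example.

```python
import numpy as np

def synthetic_room_ir(fs, rt60, rng):
    """Exponentially decaying noise: a crude stand-in for a measured room response."""
    n = int(fs * rt60)
    t = np.arange(n) / fs
    decay = 10.0 ** (-3.0 * t / rt60)  # amplitude falls by 60 dB after rt60 seconds
    return rng.standard_normal(n) * decay

def add_reverb(binaural, ir_left, ir_right, wet=0.3):
    """Mix a reverberant tail into a binaural signal of shape (2, n_samples).

    ir_left and ir_right are assumed to have the same length.
    """
    n_in = binaural.shape[1]
    out = np.zeros((2, n_in + len(ir_left) - 1))
    out[:, :n_in] = binaural                           # dry (already HRTF-filtered) signal
    out[0] += wet * np.convolve(binaural[0], ir_left)  # left-ear tail
    out[1] += wet * np.convolve(binaural[1], ir_right) # right-ear tail
    return out

# Example: a single click in both ears, 50 ms decay time (assumed values).
rng = np.random.default_rng(0)
fs = 8000
ir_l = synthetic_room_ir(fs, 0.05, rng)
ir_r = synthetic_room_ir(fs, 0.05, rng)  # independent noise per ear
click = np.zeros((2, 100))
click[:, 0] = 1.0
reverberant = add_reverb(click, ir_l, ir_r)
```

Independent noise is used for each ear so that the two tails are decorrelated, which is one plausible way to suggest an enclosing space; matching the decay time to the virtual room is left to the designer.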
Cross-Talk Cancellation
As stated previously, Cross-Talk occurs when audio meant for one ear (the ipsilateral ear) is heard, at least in part, by the other (the contralateral ear) [13]. This is an inevitable outcome of using loudspeakers to produce binaural audio, as one ear can never be closed off from the other. The solution is a process called Cross-Talk Cancellation (XTC). Static XTC works only for a single listener in the “sweet spot” position relative to the speakers. Its filters are derived from transfer functions measured at that exact location, and are therefore dependent on it: left speaker to left ear ($H_{LL}$), left speaker to right ear ($H_{LR}$), right speaker to right ear ($H_{RR}$), and right speaker to left ear ($H_{RL}$). The signals at the ears can be described as:

\[
\begin{aligned}
Z_L &= H_{LL}\,Y_L + H_{RL}\,Y_R, \\
Z_R &= H_{LR}\,Y_L + H_{RR}\,Y_R,
\end{aligned}
\]
where $Z_L$ and $Z_R$ are the signals at the ears, $X_L$ and $X_R$ are the input signals, and $Y_L$ and $Y_R$ are the signals emitted by the left and right speakers, respectively. By requiring the ear signals to reproduce the inputs ($Z_L = X_L$, $Z_R = X_R$), $Y_L$ and $Y_R$ can be solved for, and an XTC filter for each of the cross-talk paths follows from that solution.
We then use the four cross-talk filters, $\mathrm{XTC}_{LL}$, $\mathrm{XTC}_{LR}$, $\mathrm{XTC}_{RR}$ and $\mathrm{XTC}_{RL}$, to filter the input signals and cancel the cross-talk [12]. There are several options for making XTC dynamic, depending on how freely the listener can move. These include increasing the number of loudspeakers used, moving the “sweet spot” to follow the listener, and providing XTC at multiple listening locations [13].
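Solving for the speaker signals, as described above, amounts to inverting a 2×2 matrix of transfer functions at each frequency. The following is a minimal sketch in Python with NumPy, not the procedure of [12]. The function names, the toy transfer functions (a unit direct path and an attenuated, delayed cross path) and the convention that entry `[i, j]` of each filter matrix maps input `j` to speaker `i` are all assumptions made for the example.

```python
import numpy as np

def plant_matrices(H_LL, H_LR, H_RL, H_RR):
    """Stack one 2x2 matrix per frequency bin so that [Z_L, Z_R] = H @ [Y_L, Y_R].

    H_AB is the transfer function from speaker A to ear B.
    """
    H_LL, H_LR, H_RL, H_RR = (
        np.asarray(h, dtype=complex) for h in (H_LL, H_LR, H_RL, H_RR)
    )
    H = np.empty((H_LL.shape[0], 2, 2), dtype=complex)
    H[:, 0, 0] = H_LL  # left speaker  -> left ear  (direct)
    H[:, 0, 1] = H_RL  # right speaker -> left ear  (cross-talk)
    H[:, 1, 0] = H_LR  # left speaker  -> right ear (cross-talk)
    H[:, 1, 1] = H_RR  # right speaker -> right ear (direct)
    return H

def ideal_xtc_filters(H):
    """Invert every 2x2 matrix; entry [i, j] maps input j to speaker i (assumed convention)."""
    return np.linalg.inv(H)

# Toy transfer functions (assumed for illustration, not measured).
n_bins = 16
w = np.linspace(0.0, np.pi, n_bins)      # normalised angular frequency
direct = np.ones(n_bins, dtype=complex)  # ipsilateral path
cross = 0.7 * np.exp(-1j * 3.0 * w)      # contralateral path: quieter and delayed

H = plant_matrices(direct, cross, cross, direct)
C = ideal_xtc_filters(H)

# Feed a signal to the left input only and check what arrives at each ear.
X = np.stack([np.ones(n_bins), np.zeros(n_bins)], axis=1).astype(complex)
Y = np.einsum("fij,fj->fi", C, X)  # speaker signals
Z = np.einsum("fij,fj->fi", H, Y)  # ear signals
```

In this idealised setting the ear signals `Z` match the inputs `X` to within rounding error, so the right ear receives nothing from a left-only input. Real transfer functions are measured, position dependent and often poorly conditioned at some frequencies, so a plain inverse like this is only a starting point.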
There are several issues that need to be addressed with XTC. Although a “perfect” XTC filter can be computed fairly easily, it causes problems such as severe spectral colouring, even in the sweet spot, and requires frequency-dependent, constant-parameter regularisation. Additionally, because of the reflections that occur when using loudspeakers, accurate localisation requires XTC levels above 20 dB, which are difficult to achieve in practice. Fortunately, even moderate levels of XTC have been shown to help localisation significantly.
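The trade-off between cancellation and colouring can be illustrated with a small numerical sketch. It assumes Tikhonov regularisation with a single constant `beta`, which is one common form and may not be exactly the scheme intended above. The toy transfer functions, the function names, and the use of peak filter gain as a rough proxy for spectral colouring are likewise assumptions made for the example.

```python
import numpy as np

def toy_plant(n_bins=64, g=0.9, delay=3.0):
    """Symmetric toy speaker-to-ear matrices: unit direct path, attenuated delayed cross path."""
    w = np.linspace(0.0, np.pi, n_bins)
    cross = g * np.exp(-1j * w * delay)
    H = np.empty((n_bins, 2, 2), dtype=complex)
    H[:, 0, 0] = 1.0
    H[:, 1, 1] = 1.0
    H[:, 0, 1] = cross
    H[:, 1, 0] = cross
    return H

def regularised_xtc(H, beta):
    """Tikhonov-regularised inverse per bin: (H^H H + beta I)^-1 H^H."""
    Hh = np.conj(np.swapaxes(H, -1, -2))
    return np.linalg.solve(Hh @ H + beta * np.eye(2), Hh)

def xtc_level_db(H, C):
    """Ratio, in dB, of wanted to cross-talk signal at the left ear for each bin."""
    R = H @ C  # overall response from inputs to ears
    leak = np.maximum(np.abs(R[:, 0, 1]), 1e-12)  # floor avoids division by zero
    return 20.0 * np.log10(np.abs(R[:, 0, 0]) / leak)

def peak_gain_db(C):
    """Largest filter magnitude over all bins, used here as a colouring proxy."""
    return 20.0 * np.log10(np.abs(C).max())

H = toy_plant()
C_ideal = regularised_xtc(H, beta=0.0)  # reduces to the plain inverse
C_reg = regularised_xtc(H, beta=0.05)   # beta value chosen arbitrarily

print("peak gain (dB):", peak_gain_db(C_ideal), "->", peak_gain_db(C_reg))
print("worst XTC level (dB):", xtc_level_db(H, C_ideal).min(), "->", xtc_level_db(H, C_reg).min())
```

With this toy model, raising `beta` lowers the peak filter gain but also lowers the achievable XTC level, which mirrors the compromise described above. The numbers it prints say nothing about a real loudspeaker setup, where room reflections further limit the attainable cancellation.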