Abstract— The Ambisonics format has recently gained popularity as a
directional (spatial) audio encoding format for 360-degree videos, virtual
reality and similar applications, with major video distribution platforms such
as YouTube and Facebook adopting it for 360-degree videos. One of the most
important characteristics of Ambisonics is that it does not require the
speaker layout to be predefined at encoding time. Instead, the encoded
representation can be decoded for any given speaker layout, which gives users
the flexibility to choose a layout and decode the Ambisonics representation
for it. First-order Ambisonics encoding of a sound field requires four audio
channels, and the directional information (spatialization) can be further
improved by moving to higher-order Ambisonics encoding with a larger number of
channels. Rendering spatial audio over loudspeakers requires a large number of
speakers (6 or 8 speakers for 5.1 or 7.1 surround, respectively) placed in a
specific arrangement around the listener. All of this hardware can be replaced
with headphones and Ambisonics-to-binaural rendering software. Binaural
rendering is based on creating the effect of a virtual speaker on headphones
using a Head Related Transfer Function (HRTF). This paper presents the studies
that focus on the advantages of Ambisonics over traditional surround sound
techniques, and a method for implementing an Ambisonics binaural rendering
system.

 

Keywords— Ambisonics, binaural rendering, 360-degree audio, audio
spatialization, audio in virtual reality

________________________________________________________________________________________________________

   
I. INTRODUCTION

Ambisonics is a method of recording and reproducing audio
in full 360-degree surround. It treats an audio scene as a sphere of sound
arriving from all directions around a center point. The center point is where
the microphone is placed while recording, or where the listener's sweet spot
is located while rendering.

Traditional surround sound technology has several
drawbacks. The most important is that it works only with a predefined array of
speakers to produce the output sound field. By contrast, Ambisonics does not
render the audio signal for a predefined set of speakers; it can render audio
on the fly for any user-defined speaker array. It works not only for a static
sound field but also for a rotating one, which makes it suitable for real-time
applications.

With a traditional approach, the sound tends to jump from
one speaker to another when the sound field rotates. Ambisonics uses a number
of virtual speakers, so the transition is smooth even when the sound field is
rotated. Traditional surround sound techniques are also front-biased, whereas
Ambisonics distributes the sound evenly in 3D space.

Traditional techniques have difficulty representing
sound beyond the horizontal plane, whereas Ambisonics also handles elevation,
so the effect is more immersive. Rendering the Ambisonics format over speakers
requires a minimum of 4 speakers for first-order Ambisonics, and more speakers
for higher-order Ambisonics.

Binaural rendering of Ambisonics, i.e. playback over
headphones (only two speakers, left and right), can be achieved using the
virtual Ambisonics approach.

 

  
II.  KEY TERMINOLOGIES USED WITH AMBISONICS AND ITS
DECODING

 

1.      Ambisonics B-format: B-format is a widely
used format for recording a sound field using the Ambisonics technique. It has
4 channels: W, X, Y and Z. W: omnidirectional sound pressure. X: front-back
direction with respect to the listener. Y: left-right direction with respect
to the listener. Z: up-down direction with respect to the listener.

 

2.      Recording and Encoding B-format:
Recording is done with a special sound field microphone. Conceptually, it
behaves as one omnidirectional microphone (the W channel) and three
figure-of-eight microphones (the X, Y and Z channels). Physically, it is made
up of four cardioid capsules arranged in a tetrahedron, whose outputs can be
combined as needed to provide the desired polar patterns.

 

3.      Decoding: The decoder's job is to
produce loudspeaker signals that create a good illusion of the required
directional sound field [5]. The Ambisonics format can be rendered on any
speaker layout using a suitable decoder.

 

4.      The virtual Ambisonics approach: To
transform the sound field into binaural audio, the Ambisonics signal is first
decoded onto a virtual array of speakers. HRTFs are then applied to the mono
output of each speaker to generate a binaural signal per speaker, and these
signals are superimposed to get the final headphone output.

 

5.      Binaural Rendering: Binaural rendering
converts the speaker outputs into headphone output (binaural left and right)
by applying Head Related Transfer Functions (HRTFs).

 

6.      HRIR and HRTF: Humans locate a sound
source using cues derived from each ear individually and by comparing the cues
from both ears. The comparison yields two differences: the time difference and
the intensity difference between the two ears. The interaction of the sound
with the human body modifies the original sound before it enters the ear.
These modifications are captured by the HRIR (Head Related Impulse Response),
which encodes the source location. Filtering a sound with HRIRs makes it
appear to the user to be played from the desired location, which is how
virtual surround sound is generated. The HRTF is the Fourier transform of the
HRIR.
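
The four B-format channels described in item 1 can be illustrated by encoding
a single mono source at a known direction. The sketch below is only an
illustration under stated assumptions: the function name encode_b_format is
hypothetical, azimuth is taken as counter-clockwise from straight ahead, and W
uses the traditional 1/sqrt(2) weighting. Other conventions exist and would
change the constants.

```python
import numpy as np

def encode_b_format(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order B-format rows (W, X, Y, Z).

    Assumed convention (not taken from the paper): azimuth is measured
    counter-clockwise from straight ahead, elevation upward from the
    horizontal plane, and W carries the traditional 1/sqrt(2) weighting.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    s = np.asarray(signal, dtype=float)
    w = s / np.sqrt(2.0)               # omnidirectional pressure
    x = s * np.cos(az) * np.cos(el)    # front-back component
    y = s * np.sin(az) * np.cos(el)    # left-right component
    z = s * np.sin(el)                 # up-down component
    return np.stack([w, x, y, z])

# A source directly to the listener's left, on the horizontal plane.
b = encode_b_format(np.ones(4), azimuth_deg=90.0, elevation_deg=0.0)
```

For this source, essentially all of the directional energy lands in the Y
(left-right) channel, while X and Z stay at zero up to rounding.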

 

 

 

Figure 2.1

 

(Fig 2.1 Source: By The original uploader was
Soumyasch at English Wikipedia – Transferred from en.wikipedia to Commons., CC
BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3848567)

 

The HRTFs for the left
and right ears describe the filtering of a sound source x(t) before it is
perceived at the left and right ears as xL(t) and xR(t), respectively, as
illustrated in Figure 2.1. How an ear perceives a sound from a point in space
is characterized by a head-related transfer function. To synthesize a binaural
sound from a particular point in space, a pair of HRTFs (left and right) can
be used. In summary, an HRTF is a transfer function that describes how a sound
from a specific point in space will arrive at the ear.
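
The filtering described above can be sketched in the time domain: each ear
signal is the source convolved with that ear's HRIR. This is a minimal
illustration, not a measured result. The two HRIRs below are toy placeholders
invented for the example (a source on the left, so the right ear hears it
later and quieter), and the function name binauralize is hypothetical.

```python
import numpy as np

def binauralize(mono, hrir_left, hrir_right):
    """Filter one mono source with an HRIR pair to get xL and xR."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right

# Toy HRIRs (assumed, not measured): the left ear receives the sound
# immediately, the right ear 3 samples later at half the amplitude.
hrir_left = np.array([1.0, 0.0, 0.0, 0.0])
hrir_right = np.array([0.0, 0.0, 0.0, 0.5])

mono = np.array([1.0, 2.0, 3.0])
left, right = binauralize(mono, hrir_left, hrir_right)

# The HRTF is the Fourier transform of the HRIR.
hrtf_left = np.fft.rfft(hrir_left)
```

The delay and attenuation in the right-ear output correspond to the
interaural time and intensity differences. A real system would use measured
HRIR pairs in place of the placeholders.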

III. RELATED WORK

Michael Gerzon criticized the traditional surround sound
approaches and gave criteria for the design of surround sound systems [1]. The
traditional quadraphonic systems never gave optimum results. The aim of these
systems was to duplicate the effect of the ‘original 4 track tape’, but they
failed to do so [3]. Peter Fellgett argued that the existing techniques were
inadequate in a number of ways: they were restricted to a fixed number of
speakers, and production required 4 channels to be available [3]. Moreover,
these techniques rely on encoding speaker channel information, which can be
rendered only on a predefined speaker layout (5.1, 7.1) and otherwise does not
give the intended effect. In addition, the traditional surround techniques
were limited to the horizontal plane and excluded height. They are suitable
only when the image is stable and do not suit real-time applications: when the
audio scene rotates, the output jumps from one speaker to another because the
speaker layout is fixed, discrete and predefined. These approaches gave poor
results even under ideal conditions [3]. They suffered from the ‘hole in the
middle’ effect and became unusable in less ideal situations, for example when
the room is non-square or when the listener is not at the sweet spot [5]. The
use of the 4th channel always degraded the localization quality through the
‘hole in the middle’ effect mentioned above. Thus only 3 channels were
recommended, and the use of the 4th channel remained an open question. This
led to the addition of periphonic (height) information [3].

While the traditional surround sound techniques had their
limitations, Ambisonics, developed in the early 1970s by Peter Fellgett [3]
and Michael Gerzon [4], is a way of recording and reproducing surround sound
in both the horizontal and vertical dimensions. It gives a more immersive
experience to the listener and provides full upward compatibility with any
number of loudspeakers in a user-defined configuration. The traditional
approaches, by contrast, failed to give the intended immersive audio effects,
required a significantly higher number of channels to improve the sound
quality, required the speaker layout to be predefined, and needed the listener
to be at one particular position.

Monophonic reproduction provided information about
direction and distance only. Stereo then added explicit information for a
front sector no more than 60 degrees wide [3]. Beyond this, various techniques
were suggested that used more loudspeakers or more channels, or that extended
the directional information beyond 60 degrees. Because these were separate
approaches, Ambisonics aimed to combine them into an integrated whole [2]. The
main aim of the Ambisonics technology was to record, convey and regenerate
accurate and repeatable surround sound with the perfect directional effect
[3]. It is a surround sound technology that aims not to make the loudspeakers
audible as separate sources of sound [1], [2]. Ambisonics can be used with any
larger number of loudspeakers in a reasonable configuration, thus providing
full upward compatibility. Moreover, it is not limited to any particular
number of channels: the more channels, the higher the directional resolution
[5]. The technique is based on a precise and unambiguous specification of how
the encoding should handle directionality, in contrast with the quadraphonic
approach, which handled only 4 directional signals [5]. It defines the
encoding such that all directions are equally covered, in contrast to the
traditional techniques [5]. Ambisonics is attractive because it covers the
full 360 degrees of sound information with a limited number of channels: 4
channels (first-order Ambisonics) can be rendered on 4 or more speakers in a
user-defined layout. In contrast to traditional surround sound techniques,
Ambisonics can create a smooth, continuous and stable sound field even when
the sound field rotates, because it is not tied to any particular speaker
layout, which makes it suitable for real-time applications.

A Stanford paper by Cedric Yue and Teun de Planque emphasizes
the importance of implementing an Ambisonics-based 360-degree audio system for
a high-quality virtual reality experience. They rightly note that a good deal
of work in the field of virtual reality has focused on graphics, while audio
has received less attention to date [8].

To make 360-degree audio suitable for mobile
applications and for rendering over headphones, Markus Noisternig, Thomas
Musil, Alois Sontacchi and Robert Höldrich introduce a virtual Ambisonics
approach in the paper ‘3D Sound Reproduction using a Virtual Ambisonics
Approach’. The paper summarizes how the virtual Ambisonics approach can be
used to emulate a 360-degree immersive audio experience over headphones using
HRTFs and intermediate rendering over a virtual speaker array. It explains
that convincing binaural sound reproduction requires filtering the sound
sources with HRTFs. Moreover, it suggests incorporating head tracking for
further improvements in localization [6], [7], [10].

Angelo Farina and Emanuele Ugolotti described a software
implementation of B-format encoding and decoding. They introduced the basic
decoding equation, which computes the feed Fi for a specific speaker i in the
loudspeaker array:

                                                     

Fi = ½ * [G1*W + G2*(X*cos(α) + Y*cos(β) + Z*cos(γ))]                    (equation 1)

where α, β and γ are the angles between the direction of speaker i and the X,
Y and Z axes.

 

The paper specifies the values of the gains G1 and G2 for
different regular array configurations. However, this equation works only for
regular or nearly regular speaker configurations, such as square or hexagonal
arrays, and does not work well for irregular speaker configurations [14].
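
As a rough illustration of equation 1, the sketch below decodes B-format to a
square horizontal array. It reads equation 1 as one half of the bracketed sum
and takes the three cosine terms as the direction cosines of each speaker.
The gain values G1 = sqrt(2) and G2 = 1 are placeholders chosen for the
example (they give each speaker a cardioid-like pickup when W carries a
1/sqrt(2) weighting); they are not the values tabulated in the cited paper,
and the function name decode_regular is hypothetical.

```python
import numpy as np

def decode_regular(b_format, speaker_dirs_deg, g1=np.sqrt(2.0), g2=1.0):
    """Apply equation 1 to get one feed per speaker.

    b_format: array of shape (4, n_samples) holding W, X, Y, Z.
    speaker_dirs_deg: list of (azimuth, elevation) pairs in degrees.
    g1, g2: placeholder gains (assumed, not from the cited paper).
    """
    w, x, y, z = b_format
    feeds = []
    for az_deg, el_deg in speaker_dirs_deg:
        az, el = np.radians(az_deg), np.radians(el_deg)
        cos_alpha = np.cos(az) * np.cos(el)   # angle to the X axis
        cos_beta = np.sin(az) * np.cos(el)    # angle to the Y axis
        cos_gamma = np.sin(el)                # angle to the Z axis
        directional = x * cos_alpha + y * cos_beta + z * cos_gamma
        feeds.append(0.5 * (g1 * w + g2 * directional))
    return np.array(feeds)

# Square array, and a unit source encoded at azimuth 45 degrees.
square = [(45, 0), (135, 0), (225, 0), (315, 0)]
s = np.ones(3)
az = np.radians(45.0)
b = np.array([s / np.sqrt(2.0), s * np.cos(az), s * np.sin(az), np.zeros(3)])
feeds = decode_regular(b, square)
```

With these placeholder gains, the speaker in the source direction receives the
full signal, the opposite speaker receives nothing, and the two side speakers
receive half each.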

On the other hand, Markus Noisternig, Thomas Musil, Alois
Sontacchi and Robert Höldrich introduced a set of equations using the
Moore-Penrose pseudoinverse that also works for irregular speaker
configurations.

 

If P is the vector of signals fed to the sources (the
loudspeakers), the first-order Ambisonics B-format is given as:

 

B = C * P                                                                (equation 2)

 

Since we already have B, i.e. the Ambisonics B-format (W,
X, Y, Z channels), and need to regenerate P, P can be calculated as:

P = pinv(C) * B                                                          (equation 3)

 

Here C is the encoding matrix generated from the speaker
configuration, i.e. from the azimuth and elevation of each speaker (each
column represents one speaker), and pinv is the pseudoinverse. The matrix P
thus holds the mono output signal for each loudspeaker. HRTFs can then be
applied to each of these signals to get a binaural output per speaker, and
these outputs are superimposed to get the final left and right headphone
outputs [6], [7], [10].
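
Equations 2 and 3 can be tried out numerically. The following is a minimal
sketch, assuming the same first-order encoding gains per speaker as are
commonly used for B-format (1/sqrt(2) for W, direction cosines for X, Y and
Z). The function name encoding_matrix and the irregular layout are
hypothetical, chosen only for illustration.

```python
import numpy as np

def encoding_matrix(speaker_dirs_deg):
    """Build the 4 x N matrix C; column n holds the W, X, Y, Z gains
    for speaker n (assumed first-order gains, W weighted by 1/sqrt(2))."""
    dirs = np.radians(np.asarray(speaker_dirs_deg, dtype=float))
    az, el = dirs[:, 0], dirs[:, 1]
    return np.vstack([
        np.full(az.shape, 1.0 / np.sqrt(2.0)),
        np.cos(az) * np.cos(el),
        np.sin(az) * np.cos(el),
        np.sin(el),
    ])

# Hypothetical irregular layout: (azimuth, elevation) in degrees.
layout = [(30, 0), (-30, 0), (110, 0), (-110, 0), (0, 60), (180, -45)]

C = encoding_matrix(layout)     # shape (4, 6)
D = np.linalg.pinv(C)           # decoder matrix, shape (6, 4)

# Any B-format block (4 channels x 8 samples here) can be decoded.
B = np.random.default_rng(0).standard_normal((4, 8))
P = D @ B                       # equation 3: one row per loudspeaker

# Re-encoding the speaker feeds gives back the original B-format.
B_check = C @ P
```

Because this C has full row rank, re-encoding the decoded feeds with equation
2 reproduces B, which is a convenient sanity check for a decoder matrix.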

Shu-Nung Yao described the customization and real-time
implementation of binaural rendering, in which the listener is asked to select
the closest-matching dataset from a database, and presented both subjective
and objective measurements of the experienced audio quality [12].

In his presentation paper, Bruce Wiggins discusses
the algorithm used by Google and analyses the virtual Ambisonics approach for
binaural rendering with respect to interaural time, level and spectrum
differences. He implemented Ambisonics from 1st to 35th order and carried out
the corresponding analysis; the results and conclusions are given in the paper
[13].

IV. PROPOSED SYSTEM

Figure 4.1 illustrates the architecture of the proposed system. The
blocks of the system are explained below.

1.      Generating the decoder matrix: The
speaker configuration (the azimuth and elevation of each speaker) will be
taken as input, and this function will generate a decoder matrix.

2.      Rendering output to virtual speaker array:
This function will take the 4 channels (W, X, Y and Z) and the decoder matrix
as inputs and generate a mono output for each speaker of the speaker array.

3.      The CIPIC HRTF Database: This is the
database which provides the HRIR pairs (left and right) for a range of azimuth
and elevation pairs, covering each of the speakers.

4.      Adder: This is a simple adder which
will add the outputs of all HRIR-l filters and of all HRIR-r filters to
generate a single final left and right binaural signal.
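
The four blocks above can be chained together as in the following sketch. It
is an outline under stated assumptions rather than the project's actual
implementation: all function names are hypothetical, the cube-shaped virtual
speaker layout is an arbitrary choice, and random placeholder HRIRs stand in
for the pairs that would be looked up in the CIPIC database.

```python
import numpy as np

def decoder_matrix(speaker_dirs_deg):
    """Block 1: pseudoinverse of the first-order encoding matrix."""
    dirs = np.radians(np.asarray(speaker_dirs_deg, dtype=float))
    az, el = dirs[:, 0], dirs[:, 1]
    c = np.vstack([
        np.full(az.shape, 1.0 / np.sqrt(2.0)),   # W gain (assumed weighting)
        np.cos(az) * np.cos(el),                 # X gain
        np.sin(az) * np.cos(el),                 # Y gain
        np.sin(el),                              # Z gain
    ])
    return np.linalg.pinv(c)

def render_to_speakers(b_format, decoder):
    """Block 2: one mono feed per virtual speaker."""
    return decoder @ b_format

def render_binaural(feeds, hrirs_left, hrirs_right):
    """Blocks 3 and 4: filter each feed with its HRIR pair, then sum."""
    n_out = feeds.shape[1] + hrirs_left.shape[1] - 1
    left = np.zeros(n_out)
    right = np.zeros(n_out)
    for feed, h_l, h_r in zip(feeds, hrirs_left, hrirs_right):
        left += np.convolve(feed, h_l)
        right += np.convolve(feed, h_r)
    return left, right

# Hypothetical cube of 8 virtual speakers: (azimuth, elevation) in degrees.
layout = [(45, 35), (135, 35), (-135, 35), (-45, 35),
          (45, -35), (135, -35), (-135, -35), (-45, -35)]

# Placeholder data: random HRIRs and random B-format input, NOT CIPIC data.
rng = np.random.default_rng(0)
hrirs_left = 0.1 * rng.standard_normal((len(layout), 32))
hrirs_right = 0.1 * rng.standard_normal((len(layout), 32))
b_format = rng.standard_normal((4, 256))

decoder = decoder_matrix(layout)
feeds = render_to_speakers(b_format, decoder)
left, right = render_binaural(feeds, hrirs_left, hrirs_right)
```

In a real system, the placeholder arrays would be replaced by the measured
HRIR pair nearest to each virtual speaker direction, and long signals would
usually be processed block by block with FFT-based convolution.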

 

 

 

Figure 4.1

  
V. CONCLUSION

Ambisonics has many advantages over the traditional approaches.
It can also be used for real-time applications by applying the appropriate
rotations to the sound field. It gives better audio effects than the
previously used approaches, which is why the technology has been adopted by
Facebook, Google and many other companies working in virtual reality. It has a
wide range of applications in 360-degree videos, high-end gaming and other
virtual reality applications. Combining Ambisonics with the virtual Ambisonics
approach to generate binaural output eliminates the need for loudspeaker
hardware, so it can be used for mobile applications too.

This paper has been presented with reference to my ongoing project on
binaural rendering of Ambisonics B-format; the next aim is to complete the
project using the virtual Ambisonics approach. Furthermore, the work can be
extended to take head-tracking coordinates from VR devices and improve
localization by applying rotations to the sound field.
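
A rotation of the sound field can be sketched for the simplest case, a
rotation about the vertical axis (yaw). For first-order B-format this only
mixes the X and Y channels. The function name rotate_yaw and the sign
convention (a positive angle turns the scene counter-clockwise when seen from
above) are assumptions made for this example.

```python
import numpy as np

def rotate_yaw(b_format, angle_deg):
    """Rotate a first-order B-format signal (rows W, X, Y, Z) about the
    vertical axis. W and Z are unchanged; X and Y are mixed."""
    t = np.radians(angle_deg)
    rotation = np.array([
        [1.0, 0.0,        0.0,       0.0],
        [0.0, np.cos(t), -np.sin(t), 0.0],
        [0.0, np.sin(t),  np.cos(t), 0.0],
        [0.0, 0.0,        0.0,       1.0],
    ])
    return rotation @ b_format

# A unit source straight ahead: W = 1/sqrt(2), X = 1, Y = 0, Z = 0.
front = np.array([[1.0 / np.sqrt(2.0)], [1.0], [0.0], [0.0]])

# Turning the scene by 90 degrees moves the source to the left (Y axis).
rotated = rotate_yaw(front, 90.0)
```

To compensate for a head turn reported by a tracker, the scene would be
rotated by the opposite angle before decoding, so that sources stay fixed in
the room rather than following the head.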
