DATASET FOR EVALUATION OF THE PERFORMANCE OF SOUND SOURCE LOCALIZATION ALGORITHMS USING TETRAHEDRAL MICROPHONE ARRAYS

For the development and evaluation of sound source localization and separation methods, a concise audio dataset with complete geometrical information about the room, the positions of the sound sources, and the microphone array is needed. Computer simulation of such audio and geometrical data often relies on simplifications and is sufficiently accurate only for a specific set of conditions; it is therefore generally desirable to evaluate algorithms on real-world data. For three-dimensional sound source localization or direction-of-arrival estimation, a non-coplanar microphone array is needed, and the simplest and most general type of non-coplanar array is a tetrahedral array. There is a lack of openly accessible real-world audio datasets obtained using such arrays. We present an audio dataset for the evaluation of sound source localization algorithms that involve tetrahedral microphone arrays. The dataset is complete with the geometrical information of the room, the positions of the sound sources, and the microphone array. Array audio data was captured for two tetrahedral microphone arrays with different distances between microphones and with one or two active sound sources. Since the sound source signals were speech signals, the dataset is also suitable for speech recognition and direction-of-arrival estimation.


Introduction
Sound source localization (SSL) and separation are among the key elements in developing novel, speech-based human-machine interaction (HMI) systems. Information on the sound source position in space or the direction of arrival (DoA) can be used to enhance audio and speech signals in such ambient intelligence systems, allowing for better source separation and thus a higher quality of operation (Brutti et al., 2008). The development of methods and algorithms for sound source localization requires rigorous testing on realistic data.
Most SSL algorithms rely on an array of microphones whose signals are further processed to obtain an estimate of the direction of arrival (DoA) of the sound source, or of the position of the sound source relative to the microphone array. The main classes of sound source DoA estimation methods are: a) time difference of arrival (TDoA) based methods; b) beamforming-based methods; and c) subspace transformation based methods (Lollmann et al., 2018).

Electronics and electrical engineering / Elektronika ir elektros inžinerija

*Corresponding author. E-mail: saulius.sakavicius@vgtu.lt
Mokslas - Lietuvos ateitis / Science - Future of Lithuania, ISSN 2029-2341 / eISSN 2029-2252, 2020, Volume 12, Article ID: mla.2020.11462, 1-8, https://doi.org/10.3846/mla.2020

It can be shown (Guentchev, 1997) that the minimum number of detectors required to obtain an unambiguous solution in three-dimensional space is four, and that such a solution is unique. Several authors have researched the localization of sound sources using tetrahedral microphone arrays (Alameda-Pineda & Horaud, 2014; Ozeki & Hamada, 2006). Nevertheless, these authors did not release the audio data used for the development and evaluation of their methods as an openly accessible dataset.
Recently, several learning-based sound source localization methods were proposed (Adavanne et al., 2017; Takeda & Komatani, 2017; Chakrabarty & Habets, 2019). Learning-based SSL methods need a huge amount of training audio samples, and it is nearly impossible to produce such a large real-world audio dataset. Thus, for such methods, a synthetic or semi-synthetic audio dataset is most often created (simulated). Nevertheless, it is desirable to evaluate the performance of such methods on real-world data, and a concordance between the simulation and the real-world data is expected. Training audio and geometrical data can be simulated in a virtual environment which is modeled after a real-world counterpart. To achieve this, it is necessary to know the exact parameters of the real-world environment, such as the dimensions and the acoustic properties of the room and the relative positions of the sound sources, the microphones, and the walls of the room. To be usable for the estimation of a three-dimensional sound source position or a two-dimensional DoA (azimuth and elevation), the positions of the sound sources in the dataset must not be coplanar and must exhibit at least some degree of variance along all three axes. For the evaluation of SSL methods aimed at speech-based HMI systems, it is desirable that the sound source signals in the dataset are human speech signals. While there are several audio datasets aimed at SSL problems, they all lack some of the information or features described earlier: either the room dimensions or the position of the reference point relative to the room walls is unknown, or the sound sources are positioned on the same plane, or the sound source signal is not speech.
Thus, we present a simple dataset that satisfies all of the aforementioned demands: audio recordings were produced with a tetrahedral microphone array, using speech signals, with one or two simultaneously active sound sources, and with known dimensions of the room and known positions of the microphones and the sound sources relative to the walls of the room.

Previous work
Several audio datasets focused on sound source localization and separation tasks have been presented earlier (Le Roux et al., 2015). The LOCATA dataset, presented as part of the IEEE-AASP Challenge on Acoustic Source Localization and Tracking, consists of audio recordings of one or two moving and up to four static sound sources, captured with a multitude of microphone arrays, with the number of microphones per array ranging from 2 (a binaural system using a dummy head) to 32 (the Eigenmike EM32 spherical array). The shortcoming of the LOCATA dataset is that neither the room dimensions nor the distance of the origin of the coordinate system to a corner of the room is presented. This limits the usage of the LOCATA dataset for the evaluation of learning-based SSL methods, such as those presented by He et al. (2018, 2019) or Chakrabarty and Habets (2019), where the model is trained on semi-synthetic data, as it is impossible to accurately simulate an environment matching the real-world one. Also, the moving sound sources were human subjects walking in front of the microphone array and talking, so there is limited variance of the height of the sound sources relative to the origin of the coordinate system.
The Sound Source Localization for Robots (SSLR) dataset is a collection of real robot audio recordings for the development and evaluation of sound source localization methods, recorded using the Softbank robot Pepper, including robot ego-noise and overlapping multiple speech sources (He et al., 2018). The origin of the coordinate system for this dataset is the center of the microphone array, but no information is given about the rooms in which the dataset was collected or the positions of the microphone array within those rooms. Moreover, the sound sources remain stationary while the robot head pans from side to side. Thus the microphone-room spatial relationship is constantly changing, which is not the case in many ambient intelligence and surveillance systems, where the array is stationary for the duration of operation. Therefore, this dataset may not be well suited for evaluating the performance of static arrays.
The Drone Egonoise and Localization (DREGON) dataset is aimed at evaluating SSL using microphone arrays embedded in an unmanned aerial vehicle (UAV). The dataset contains both clean and noisy in-flight audio recordings, continuously annotated with the 3D position of the target sound source using an accurate motion capture system (Strauss et al., 2018). The dataset includes a description of the room geometry and its reverberation time, and speech signals were used for the static sound source. The downside of this dataset is that the microphone array is mounted on the UAV and is not stationary.
Collectively, none of the mentioned datasets feature a tetrahedral microphone array. We present a dataset for the evaluation of the performance of sound source localization algorithms that is captured by a static tetrahedral microphone array (two sets of experiments with different array geometries). We have used one or two static, simultaneously active sound sources with human speech signals. Our presented dataset includes thorough and explicit measurements of the room and the positions of the microphones and the sound sources with the origin of the coordinate system coinciding with one corner of the room.

Methods and materials
In this section, we present the methods of dataset acquisition. For all audio recordings, a Tascam US20×20 USB audio interface was used. All recordings were performed with a sampling rate fs = 44.1 kHz and a quantization resolution Q = 16 bit. All spatial measurements were made manually using a measuring tape with a precision of ±0.0005 m. The dataset consists of audio files of the microphone array, audio files of the sound sources, the room impulse response (RIR) measurement data, and information about the positions of the sound sources, the microphones, and the geometry of the room. The array audio data was recorded for two array geometries. For each geometry, there were 10 cases with one active speech sound source and 10 cases with two active sound sources. As a result, a dataset of 40 different microphone and sound source combinations was produced, along with three RIR measurements, each using a different combination of source and microphone positions. The format and acquisition methods of each of these elements are discussed in the next section.
We also present the results of a computer simulation, using the image-source model for RIR generation (Allen & Berkley, 1976), with the same parameters as the real-world data, to determine the level of discrepancy between the results of the simulation and the real-world RIR measurements.
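The image-source principle can be illustrated with a minimal sketch for a cuboid ("shoebox") room. This is a simplified stand-in for the pyroomacoustics implementation actually used in the paper: it assumes a uniform absorption coefficient, integer-sample delays, and illustrative source and microphone positions.

```python
import numpy as np
from itertools import product

def shoebox_rir(dims, src, mic, alpha, fs, max_order=2, c=343.0, length=0.25):
    """Minimal image-source RIR for a cuboid room (Allen & Berkley style).

    dims: room dimensions (m); alpha: average absorption coefficient.
    Returns an impulse response sampled at fs (illustrative sketch only)."""
    beta = np.sqrt(1.0 - alpha)              # wall reflection coefficient
    h = np.zeros(int(length * fs))
    rng = range(-max_order, max_order + 1)
    for n in product(rng, repeat=3):         # lattice of mirrored rooms
        for p in product((0, 1), repeat=3):  # mirror flags per axis
            refl = sum(abs(ni - pi) + abs(ni) for ni, pi in zip(n, p))
            if refl > max_order:             # total number of wall reflections
                continue
            img = [(1 - 2 * pi) * si + 2 * ni * Li
                   for pi, ni, si, Li in zip(p, n, src, dims)]
            d = np.linalg.norm(np.subtract(img, mic))
            k = int(round(d / c * fs))       # integer-sample propagation delay
            if k < len(h):
                h[k] += beta ** refl / (4.0 * np.pi * d)
    return h

# Room dimensions from the dataset; source/microphone positions are illustrative
h = shoebox_rir((5.400, 5.860, 2.640), (1.5, 2.0, 1.2), (4.0, 4.5, 1.3),
                alpha=0.206, fs=44100, max_order=2)
```

With max_order=0 only the direct path remains, which makes the geometry of the method easy to verify against the 1/(4πd) free-field attenuation.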

Room properties
The dataset was acquired in a cuboid-shaped room in LinkMenų fabrikas, Vilnius Gediminas Technical University. The dimensions of the room were 5.400×5.860×2.640 m. The origin of the coordinate system of the dataset coincided with a corner of the room. Three of the four walls of the room were made of painted masonry, while the fourth wall was a plaster wall. The volume of the room was V = 89.869 m³ and the total surface area of the room was S = 145.048 m². The furniture of the room consisted of three plywood tables, three chairs, several desktop computers, and computer monitors, which were not taken into account in order not to over-complicate the process of dataset acquisition.
The absorption coefficients of the walls were not directly measured but rather calculated from the measured T60 reverberation time value using Sabine's equation (Sabine & Egan, 1994):

T60 = (24·ln10 / c20) · V / (a·S), (1)

where c20 is the speed of sound at 20 °C, V is the volume of the room, S is its total surface area, and a is the average absorption coefficient of the surfaces of the room.
Reordering (1) gives

a = (24·ln10 / c20) · V / (T60·S). (2)

The reverberation time can be calculated using Schroeder's method of backward integration of the RIR (Schroeder, 1965).
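Equation (2) can be evaluated directly with the room parameters given above. This is a minimal numeric sketch: the speed-of-sound value is an assumption, and the plain Sabine form used here may differ from the authors' exact computation.

```python
import numpy as np

V, S = 89.869, 145.048   # room volume (m^3) and surface area (m^2)
T60 = 0.552              # average measured reverberation time (s)
c20 = 343.2              # assumed speed of sound at 20 deg C (m/s)

# Sabine's equation solved for the average absorption coefficient, eq. (2)
a = 24.0 * np.log(10.0) / c20 * V / (T60 * S)
print(round(a, 3))
```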
Schroeder's frequency Fc is calculated using an equation provided by Skålevik (2011):

Fc = 2000 · sqrt(T60 / V).

We have measured the impulse response of the room at three different combinations of the signal source and measurement microphone positions (the microphone positions M_RIR,i, the source positions S_RIR,i, and the Euclidean distances Δ_RIR,i(M, S) between them are presented in Table 1 and in Figure 1).
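Plugging the room volume and the average reverberation time reported in the Results section into Skålevik's formula gives a value close to the 156.76 Hz stated in the Conclusions (small rounding differences aside):

```python
import math

V = 89.869    # room volume (m^3)
T60 = 0.552   # average reverberation time (s)

# Skålevik's form of Schroeder's frequency
Fc = 2000.0 * math.sqrt(T60 / V)
print(round(Fc, 1))
```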
For the RIR measurements, a Mackie Thump12 powered loudspeaker was used as the sound source (with the axis of the loudspeaker directed at the capsule of the microphone). The measurement microphone was a Sonarworks XREF20. The RIR was captured using the MATLAB® tool Room Impulse Measurer. The tool provides the two most widely used IR measurement techniques: Maximum Length Sequence (MLS) and Swept Sine. The MLS technique is based on the excitation of the acoustical space by a periodic pseudo-random signal. The impulse response is obtained by calculating the circular cross-correlation between the measured output of the system and the excitation signal (Stan et al., 2002). The Swept Sine measurement technique uses an exponential time-growing frequency sweep as the excitation signal. The output of the system is recorded, and deconvolution is used to recover the impulse response from the swept sine tone (Farina, 2007). We have measured the impulse response using both techniques in all three source-microphone position combinations.
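The MLS identification step (circular cross-correlation between the system output and the excitation) can be sketched as follows. The "room" here is a hypothetical three-tap impulse response, not one of the measured RIRs:

```python
import numpy as np
from scipy.signal import max_len_seq

nbits = 14
N = 2 ** nbits - 1                       # MLS period length
x = 2.0 * max_len_seq(nbits)[0] - 1.0    # binary MLS mapped to +/-1

# Hypothetical room: a short FIR impulse response (direct path + two echoes)
h_true = np.zeros(64)
h_true[[0, 20, 45]] = [1.0, 0.5, -0.25]

# Steady-state response to the periodic excitation = circular convolution
y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h_true, N)).real

# Circular cross-correlation of output with excitation, computed via FFT;
# dividing by N+1 recovers h up to a small DC bias of sum(h)/(N+1)
r = np.fft.ifft(np.fft.fft(y) * np.conj(np.fft.fft(x))).real
h_est = r[:64] / (N + 1)
```

Because the periodic autocorrelation of an MLS is N at lag zero and -1 elsewhere, the recovered taps match the true response to within sum(h)/(N+1), which shrinks as the sequence gets longer.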

Microphone arrays
We have obtained the audio recordings using two tetrahedral microphone arrays with different distances between the microphones (baseline length B): ARRAY30 with B = 0.3 m and ARRAY60 with B = 0.6 m. This approach was chosen to allow the evaluation of the influence of the baseline length of the microphone array on the performance of sound source localization algorithms. Each tetrahedral array consists of four identical condenser microphones (RØDE M2). Since the directivity pattern of the RØDE M2 microphone is cardioid, we positioned the microphones with their acoustic axes oriented upwards, so that the directivity of the microphones would be close to omnidirectional in the horizontal plane. The position reference point of each microphone coincided with the center of its membrane.
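The baseline length directly bounds the largest time difference of arrival observable between any microphone pair, which is one reason it affects localization performance. A quick sketch of this bound for both arrays, assuming a speed of sound of 343 m/s:

```python
fs = 44100   # sampling rate (Hz)
c = 343.0    # assumed speed of sound (m/s)

for name, B in (("ARRAY30", 0.3), ("ARRAY60", 0.6)):
    tau_max = B / c   # maximum far-field TDoA across the baseline (s)
    print(name, round(tau_max * 1e6, 1), "us,",
          round(tau_max * fs, 1), "samples")
```

The larger array roughly doubles the TDoA range (about 39 vs 77 samples at 44.1 kHz), which improves angular resolution of TDoA-based methods.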

Sound sources
We have recorded the real-world audio data with each of the previously described arrays and with one or two simultaneously active sound sources. The sound sources were represented by two small loudspeakers: a battery-powered JBL GO loudspeaker (Source 1, SJ), mounted on a tripod to allow for convenient positioning, and a Yamaha MSP3 amplified two-way compact monitor loudspeaker (Source 2, SY), placed on a portable pedestal or a table.
The position of each sound source is determined by a reference point. For both sound sources, the reference points were located in the centers of the front grilles of the speakers.
The speech signals that were reproduced through the speakers were obtained from the AMI Corpus (Carletta et al., 2006), headset microphone mix (file ES2019a.Mix-Headset.wav). To allow the two simultaneously active sound sources to reproduce different signals, we selected two excerpts from the file, each with a duration of 60 s. The first excerpt (E1) began at the 70th second of the source audio file, and the second excerpt (E2) began at the 310th second.
Ten positions for Source 1 were randomly selected from a uniform distribution over the entire volume of the room. While all three coordinates were randomly chosen for the tripod-mounted Source 1, Source 2 could only be placed on the fixed-height pedestal or the table. Thus its z coordinate z2 is limited to two values, 0.85 m and 0.865 m above the floor, while the x and y coordinates are the same for both sources at a given position. The coordinates of Source 1 (x, y, z1) and Source 2 (x, y, z2) for the selected positions are presented in Table 3. As can be seen from Table 3, the average of the coordinates of all source positions is very close to the geometric center of the room and differs from it by no more than 8.25% (for the x coordinate). The positions of the sources and the centers of both arrays are also presented in Figure 3.
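Uniform sampling of candidate positions over the room volume can be sketched as below; the seed and the use of NumPy's generator are illustrative assumptions, not the authors' actual procedure (which also constrained z for Source 2):

```python
import numpy as np

rng = np.random.default_rng(42)               # illustrative seed
dims = np.array([5.400, 5.860, 2.640])        # room dimensions (m)

# Ten candidate positions, uniform over the room volume
positions = rng.uniform(low=0.0, high=dims, size=(10, 3))
```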
By converting the Cartesian coordinates of the sound source positions to polar coordinates, with the centers of the microphone arrays at the origin of the coordinate system, the DoAs of the sound sources were obtained (presented in Figure 4). A DoA with azimuth θ = 0 and elevation φ = 0 corresponds to the positive x axis of the Cartesian coordinate system.

Figure 3. Positions of the sound sources (Table 3) and the centers of ARRAY30 (MAC301) and ARRAY60 (MAC601) within the room

For the single active sound source case, only Source 1 was used, and it was placed at all ten positions (the coordinates of which are expressed as (x, y, z1)). For the two active sound source case, the ten positions of Source 2 were selected from Table 3 sequentially, while the positions of Source 1 were selected from Table 3 and randomly permuted, resulting in the 10 combinations presented in Table 4. The speech signal excerpts were assigned to the sound sources in an alternating manner.
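The Cartesian-to-DoA conversion described above can be sketched as follows; the coordinates in the usage example are illustrative, not taken from Table 3:

```python
import math

def doa(source, array_center):
    """Azimuth and elevation (degrees) of a source seen from the array center.

    Azimuth 0 and elevation 0 point along the positive x axis, matching the
    convention used in the paper."""
    dx, dy, dz = (s - c for s, c in zip(source, array_center))
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation

# Illustrative source position and array center (m)
az, el = doa((3.0, 3.0, 2.0), (2.0, 2.0, 1.0))
```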

Results
To obtain the average absorption coefficient a of the room, a value of the T60 reverberation time is needed. This value was calculated from the impulse response of the room: T60 was calculated for each of the obtained RIRs using Schroeder's backward integration method (Schroeder, 1965). The results are presented in Figure 5. The average value was T60 = 552 ms, with a standard deviation of 33.6 ms. The absorption coefficient was calculated using (2). The RIR measurements were compared to a computer simulation of a virtual room with the same dimensions and the same placement of the IR measurement sound source and microphone, implemented in the Python programming language with the pyroomacoustics package, which uses the image-source method for impulse response calculation (Scheibler et al., 2017). For the simulation, the absorption coefficient a calculated in (5) was used, and the maximum order of reflections was 10. By performing the Fast Fourier Transform (FFT) of the RIRs, the transfer functions of the room were obtained (the magnitude spectra of the transfer functions are presented in Figure 6).
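Schroeder's backward-integration estimate of T60 can be sketched as below. The synthetic exponentially decaying noise stands in for a measured RIR, and the -5 dB to -25 dB fitting range is one common choice, not necessarily the one used by the authors:

```python
import numpy as np

def t60_schroeder(h, fs):
    """Estimate T60 from an impulse response via Schroeder backward integration."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]      # energy decay curve (EDC)
    edc_db = 10.0 * np.log10(edc / edc[0])
    t = np.arange(len(h)) / fs
    i1 = np.argmax(edc_db <= -5.0)           # start of the linear fit
    i2 = np.argmax(edc_db <= -25.0)          # end of the linear fit (T20 range)
    slope, _ = np.polyfit(t[i1:i2], edc_db[i1:i2], 1)
    return -60.0 / slope                     # extrapolate decay to -60 dB

# Synthetic "RIR": white noise with a known 0.5 s decay to -60 dB
rng = np.random.default_rng(0)
fs, t60_true = 48000, 0.5
t = np.arange(fs) / fs
h = rng.standard_normal(fs) * 10.0 ** (-3.0 * t / t60_true)
```

Running the estimator on the synthetic response recovers the known decay time closely, since the backward integral smooths the random fluctuations of the noise.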
As can be observed from the magnitude spectra of the transfer functions at all RIR measurement positions, the simulation is relatively accurate only in the approximate frequency range from 60 Hz to 500 Hz. This range starts at a frequency more than twice lower than the Schroeder frequency of the room and does not encompass the widely used telephone band (ITU-T, Rec. P.342, 2009). Thus, auralization results using simulated RIRs might be inaccurate and unsuitable for a reliable evaluation of the performance of sound source localization algorithms using speech signals. For all three measurement positions, the amplitude of the simulated transfer function in the low-frequency range is significantly higher than in the measured RIRs. This can be attributed to a) the unsuitability of the image-source method for RIR simulation in the low-frequency range (wave-based phenomena, such as diffraction and interference, are not properly recreated (Siltanen et al., 2010)) and b) the inaccuracy of the real-world RIR measurements, as the measurement relies on the linearity of the transfer functions of the transducers (the measurement sound source and microphone), which are not linear. The diffraction effect is stronger at low frequencies, where the wavelength is longer than or comparable to the dimensions of the reflecting objects (Siltanen et al., 2010), that is, below the Schroeder frequency. The frequency response of the Thump12 loudspeaker exhibits a steep roll-off in sound pressure level below 70 Hz and above 6 kHz (Loud Technologies Inc., 2017), so it is impossible to obtain a fully accurate RIR with such a loudspeaker using either the Swept Sine or the MLS method. Considering these findings, it is advisable to evaluate SSL algorithms not only on synthetic or semi-synthetic audio data but also on real-world audio data, as the simulated audio signals might not accurately reflect the real-world situation.

Conclusions
A dataset of four different scenarios (two tetrahedral microphone arrays with different baseline lengths, and one or two active sound sources for each array) was created, with ten different source positions (in the case of two active sound sources, 10 two-source position combinations) for each scenario. The positions of the sound sources were distributed evenly in the room, with the average of the coordinates of all sources differing from the geometric center of the room by no more than 8.25% (for the x coordinate). A set of six room impulse responses was measured using three different combinations of source-microphone positions and two IR acquisition techniques: MLS and Swept Sine. The reverberation time T60 was estimated from the RIRs using Schroeder's method, and the average reverberation time was determined to be T60 = 0.552 s. The average surface absorption coefficient was derived from the reverberation time and the geometry of the room and was determined to be a = 0.206. The Schroeder frequency of the room was calculated to be 156.76 Hz.
A computer simulation of a virtual room with the same geometry and acoustical parameters as the real-world room was performed. The comparison of the results showed that the magnitude spectra of the real-world and simulated RIRs differ considerably in both the low- and high-frequency ranges, and that the simulation is relatively accurate only in the approximate frequency range from 60 Hz to 500 Hz.
Thus, if a sound source localization method or algorithm is being developed, its evaluation on real-world audio is crucial, as simulated audio signals might not accurately reflect the real-world situation.

Figure 6. Magnitude spectra of the transfer functions obtained from the RIR measurements (using the Swept Sine and MLS methods and the computer simulation) at positions 1 (a), 2 (b) and 3 (c) (positions of sources and microphones presented in Table 1)