One of the most challenging issues in computational auditory scene analysis is to identify source signals, with emphasis on modeling the auditory scene analyzer of a human listener, which is driven primarily by binaural cues. In Jiang et al. (2014), deep neural networks were developed for speech segregation in noisy and reverberant environments. Speech segregation is seen as a special type of source separation in which the segregated or separated speech is identified through a binaural auditory scene. This is comparable with the way a human listener attends to a single speaker in the presence of interferers or under adverse acoustic conditions. Binaural cues are generally more useful than monaural features for speech segregation. The goal is to build a binaural hearing model that hears a target speaker, or identifies his/her speech, in the presence of interference from a non-target speaker or ambient noise. Fig. 7.4 illustrates the procedure of speech segregation, which combines signal processing and deep learning in a hybrid procedure. The binaural hearing system with left and right ears (denoted by black blocks) receives sound sources in an auditory scene (represented by the dashed circle). Target and interfering signals are present simultaneously. In what follows, we address the four components of this procedure, which are implemented for deep speech segregation.




1.

Auditory filterbank: First of all, two identical auditory filterbanks are used to represent the left- and right-ear inputs by time–frequency (T–F) units. On this representation, the so-called ideal binary mask (IBM) is defined as a two-dimensional matrix of binary labels in which one indicates that the target signal dominates a T–F unit and zero otherwise. In Jiang et al. (2014), the gammatone filterbank was used as the auditory periphery, with 64 channels of filter order 4 for each ear model. The filter's impulse response was employed to decompose the input mixed signal into the time–frequency domain. This filter simulates the firing activity and saturation effect in an auditory nerve. The frame length of the T–F units is 20 ms with an overlap of 10 ms under a 16 kHz sampling rate. The left- and right-ear signals in a T–F unit in channel $c$ and at time $t$ are denoted by $x^L_{c,t}(k)$ and $x^R_{c,t}(k)$, respectively, where $k$ indexes the 320 samples in a frame associated with a channel. A sketch of this decomposition is given after this list.

2.

Binaural and monaural feature extraction: Next, binaural features are extracted according to the interaural time difference (ITD) and the interaural level difference (ILD) (Roman et al., 2003) by using the normalized cross-correlation function (CCF) between the two ear signals. The CCF is indexed by a time lag $\tau$ ranging between $-1$ ms and $1$ ms, which gives 32 CCF features for each pair of T–F units, denoted by $\mathrm{CCF}_{c,t,\tau}$. The interaural time difference in each T–F unit $(c,t)$ is calculated by
$$\mathrm{ITD}_{c,t} = \arg\max_{\tau}\, \mathrm{CCF}_{c,t,\tau},$$
which captures the time lag with the largest cross-correlation between the two ears. The interaural level difference (in dB) is defined as the energy ratio between the left and right ears for each T–F unit,
$$\mathrm{ILD}_{c,t} = 10\log_{10}\frac{\sum_k \big(x^L_{c,t}(k)\big)^2}{\sum_k \big(x^R_{c,t}(k)\big)^2}.$$
The ILD is extracted every 10 ms, i.e., two ILD features are calculated per T–F unit. At the same time, monaural features based on 36-dimensional gammatone frequency cepstral coefficients (GFCCs) are extracted as complementary features of the speech signal, which are helpful for speech segregation. For each two-ear T–F unit pair $(c,t)$, the 70-dimensional feature vector thus consists of 32 CCF features, 2 ILD features and 36 GFCC features. A sketch of this feature extraction is given after this list.

3.

DNN classification: The success of binary masking in audio signal processing implies that the segregation problem may be treated as a binary classification problem. Speech segregation can therefore be formulated as supervised classification using acoustically meaningful features. Here, the 70-dimensional binaural and monaural features are employed to detect whether a T–F unit $(c,t)$ is dominated by the target signal. A binary DNN classifier is trained by supervised learning. In the training stage, the labels for DNN supervised training are provided by the ideal binary mask. In the test stage, the posterior probability that a T–F unit is dominated by the target is calculated, and a labeling criterion is applied to this probability to estimate the ideal binary mask. In the experimental setup (Jiang et al., 2014), each subband or channel-dependent DNN classifier was composed of two hidden layers. The input layer had 70 units, and the output layer produced the posterior probability of detecting the target signal. The DNN was pretrained and initialized by restricted Boltzmann machines (RBMs). After RBM pretraining, the error backpropagation algorithm was run for supervised fine-tuning. The minibatch size was 256 and stochastic gradient descent with momentum 0.5 was applied. The learning rate was linearly decreased from 1 to 0.001 over 50 epochs. A training sketch is given after this list.

4.

Reconstruction: All the T–F units with target labels of one comprise the segregated target stream, from which the target speech is resynthesized (see the resynthesis sketch after this list).
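To make step 1 concrete, the following NumPy sketch decomposes a monaural input into T–F units with a 64-channel, fourth-order gammatone filterbank at 16 kHz. It is a minimal illustration rather than the exact front end of Jiang et al. (2014): the Glasberg–Moore ERB formula and the gammatone impulse response are standard, but the center-frequency range, the impulse-response length and the normalization are assumptions.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg-Moore) at center frequency fc (Hz)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.064, order=4, b=1.019):
    """Finite-length impulse response of a fourth-order gammatone filter."""
    t = np.arange(int(duration * fs)) / fs
    g = t ** (order - 1) * np.exp(-2.0 * np.pi * b * erb(fc) * t) * np.cos(2.0 * np.pi * fc * t)
    return g / np.max(np.abs(g))  # crude peak normalization (an assumption)

def tf_decompose(x, fs=16000, n_channels=64, frame_len=320, frame_shift=160):
    """Split x into 64 subbands, then into 20 ms frames with 10 ms shift."""
    # Log-spaced center frequencies 50 Hz - 7.2 kHz as a stand-in for ERB spacing.
    fcs = np.geomspace(50.0, 0.45 * fs, n_channels)
    subbands = [np.convolve(x, gammatone_ir(fc, fs))[: len(x)] for fc in fcs]
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    units = np.zeros((n_channels, n_frames, frame_len))
    for c, sb in enumerate(subbands):
        for t in range(n_frames):
            units[c, t] = sb[t * frame_shift : t * frame_shift + frame_len]
    return units  # units[c, t] holds the 320 samples of T-F unit (c, t)
```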
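Step 2 can be sketched likewise. The functions below compute the normalized CCF, the ITD and the ILD of one T–F unit pair and assemble the 70-dimensional feature vector. The exact 32-point lag grid and the half-frame ILD computation are assumptions consistent with the counts reported in the text, and the 36-dimensional GFCC vector is assumed to come from a separate monaural extractor.

```python
import numpy as np

def normalized_ccf(left, right, max_lag=16):
    """Normalized CCF over 32 integer lags (about +/-1 ms at 16 kHz)."""
    lags = np.arange(-max_lag, max_lag)  # -16 ... 15 -> 32 lags (assumed grid)
    ccf = np.empty(len(lags))
    for i, tau in enumerate(lags):
        if tau >= 0:
            l, r = left[tau:], right[: len(right) - tau]
        else:
            l, r = left[:tau], right[-tau:]
        ccf[i] = np.dot(l, r) / (np.sqrt(np.dot(l, l) * np.dot(r, r)) + 1e-12)
    return lags, ccf

def itd_seconds(left, right, fs=16000):
    """ITD: the lag (in seconds) maximizing the normalized CCF."""
    lags, ccf = normalized_ccf(left, right)
    return lags[np.argmax(ccf)] / fs

def ild_db(left, right):
    """ILD: left-to-right energy ratio in dB."""
    return 10.0 * np.log10((np.dot(left, left) + 1e-12) / (np.dot(right, right) + 1e-12))

def unit_features(left_unit, right_unit, gfcc36):
    """70-dim vector: 32 CCF + 2 ILD (one per 10 ms half-frame) + 36 GFCC."""
    _, ccf = normalized_ccf(left_unit, right_unit)
    half = len(left_unit) // 2
    ilds = np.array([ild_db(left_unit[:half], right_unit[:half]),
                     ild_db(left_unit[half:], right_unit[half:])])
    return np.concatenate([ccf, ilds, gfcc36])
```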
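For step 3, a minimal PyTorch sketch of one channel-dependent classifier follows. The input size (70), minibatch size (256), momentum (0.5) and the linear learning-rate decay from 1 to 0.001 over 50 epochs follow the text; the hidden-layer width of 200 units is an assumption since the text does not report it, and RBM pretraining is omitted in favor of default random initialization.

```python
import torch
import torch.nn as nn

# 70 inputs, two hidden layers (width 200 is an assumption), sigmoid output
# giving the posterior that a T-F unit is target-dominated.
model = nn.Sequential(
    nn.Linear(70, 200), nn.Sigmoid(),
    nn.Linear(200, 200), nn.Sigmoid(),
    nn.Linear(200, 1), nn.Sigmoid(),
)
bce = nn.BCELoss()
opt = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.5)

def train(loader, epochs=50):
    """Backpropagation fine-tuning; loader yields minibatches of 256
    (feature, IBM-label) pairs for this channel."""
    for epoch in range(epochs):
        # Linear learning-rate decay from 1 to 0.001 across the epochs.
        lr = 1.0 - (1.0 - 0.001) * epoch / (epochs - 1)
        for group in opt.param_groups:
            group["lr"] = lr
        for feats, labels in loader:
            opt.zero_grad()
            loss = bce(model(feats), labels)  # labels in {0, 1} from the IBM
            loss.backward()
            opt.step()
```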
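Finally, for step 4, a simplified resynthesis overlap-adds the target-labeled T–F units back into a waveform. A full system would additionally compensate the channel-dependent gammatone filter delays and apply smoothing windows across frames; those refinements are omitted here.

```python
import numpy as np

def resynthesize(units, mask, frame_shift=160):
    """Overlap-add the T-F units labeled one into a target waveform.
    units: (n_channels, n_frames, frame_len) from tf_decompose;
    mask: (n_channels, n_frames) binary labels from the classifier."""
    n_channels, n_frames, frame_len = units.shape
    out = np.zeros(frame_shift * (n_frames - 1) + frame_len)
    for c in range(n_channels):
        for t in range(n_frames):
            if mask[c, t] == 1:
                out[t * frame_shift : t * frame_shift + frame_len] += units[c, t]
    return out
```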



In the system evaluation, this approach was assessed for speech segregation under noisy and reverberant environments. The reverberant signals were generated by convolution with binaural impulse responses (BIRs); both simulated and real-recorded BIRs were investigated. A head-related transfer function was used to simulate room acoustics for a dummy head. The speech and noise signals were convolved with BIRs to simulate individual sources in two reverberant rooms of different sizes. The position of the listener in a room was fixed, and the reflection coefficients of the wall surfaces were uniform. The reverberation times T60 of the two rooms were 0.3 and 0.7 s, respectively. BIRs for azimuth angles between 0∘ and 360∘ were generated. For the simulated BIRs, the audio signals and BIRs were adjusted to have the same sampling rate of 44.1 kHz. The four real-recorded reverberant rooms had reverberation times of 0.32, 0.47, 0.68 and 0.89 s, respectively. The input SNR of the training data was 0 dB while that of the test data was varied. Babble noise was used, and the number of non-target sources was increased from one to three for comparison.

The performance of speech segregation was evaluated by the hit rate, which is the percentage of correctly classified target-dominated T–F units, as well as the false-alarm rate, which is the percentage of wrongly classified interferer-dominated T–F units. In addition, the SNR metric, calculated from the signals resynthesized from the IBM and from the estimated IBM, was examined. Experimental results show that the SNR decreased as the number of non-target sources increased. Merging monaural features with binaural features performed better than using binaural features alone. This method significantly outperformed the existing methods in terms of the hit and false-alarm rates and the SNRs of the resynthesized speech. The desirable performance was not limited to the trained target direction or azimuth but extended to other azimuths which were unseen in training.
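These evaluation metrics are straightforward to compute once the ideal and estimated binary masks and the resynthesized signals are available. The following is a minimal sketch, under the simplifying assumption that the masks are NumPy arrays of zeros and ones and that the IBM-resynthesized signal serves as the SNR reference, as in the evaluation above.

```python
import numpy as np

def hit_and_false_alarm(ibm, estimated_mask):
    """HIT: fraction of target-dominated units correctly labeled one;
    FA: fraction of interferer-dominated units wrongly labeled one."""
    hit = np.mean(estimated_mask[ibm == 1] == 1)
    fa = np.mean(estimated_mask[ibm == 0] == 1)
    return hit, fa

def output_snr_db(ref, est):
    """SNR (dB) of the signal resynthesized from the estimated mask,
    with the IBM-resynthesized signal as the reference."""
    n = min(len(ref), len(est))
    err = ref[:n] - est[:n]
    return 10.0 * np.log10(np.dot(ref[:n], ref[:n]) / (np.dot(err, err) + 1e-12))
```

In the next section, a number of advanced studies and extended works are introduced. Signal processing plays a crucial role in the implementation of learning machines.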