A Replay Speech Detection Algorithm Based on Sub-band Analysis

With the development of speech technology, spoofed speech of various kinds poses a serious challenge to automatic speaker verification systems. This paper addresses the detection of replay attacks, which are the most accessible form of spoofing and can be highly effective. We investigate how well each sub-band discriminates between replay speech and genuine speech. For the sub-bands that carry discriminative information, we propose a new filter design approach. Experiments on the ASVspoof 2017 data set show that the proposed algorithm achieves a 60% relative improvement in terms of equal error rate over the ASVspoof 2017 baseline.


Introduction
Automatic speaker verification (ASV) is a biometric authentication technique that recognizes people by analysing their speech. With its rapid development, ASV has been widely used in daily life, the judiciary, and finance. Unlike other biometric techniques based on fingerprints, irises, or faces, voiceprint authentication does not require face-to-face contact with the user. Speech is therefore more susceptible to spoofing attacks than other biometric signals [1,2]. In addition, high-quality audio capture devices and powerful audio editing software make it easier to produce spoofed speech for attacking ASV systems.
Spoofing attacks can be categorized as impersonation, replay, speech conversion, and speech synthesis [3]. Existing ASV techniques can already resist impersonation attacks effectively. Speech conversion and speech synthesis require the attacker to have specialized technical expertise, and these attacks can be effectively defended against by existing solutions [4,5]. Replay attacks, however, are the most accessible and can be highly effective. More importantly, the popularity and portability of high-fidelity audio equipment in recent years have greatly increased the threat that replayed speech poses to ASV systems.
In the past two years, replay attacks have received extensive attention from researchers. The ASVspoof 2017 Challenge uses Constant-Q Cepstral Coefficients (CQCC) to detect spoofing attacks, with an equal error rate (EER) of 24.55% [6]. On this database, multi-feature fusion methods and integrated classifier methods have been used for replay attack detection, with an EER of 10.8% [7]. Fusing the RFCC and LFCC features reduced the EER to 10.52% [8]. The I-MFCC feature has also been shown to be effective in detecting replay speech [9], as have high-frequency features obtained with the constant-Q transform (CQT) [10]. Recently, Delgado et al. applied Cepstral Mean and Variance Normalization (CMVN) to CQCC features [11], and their results show that this method is very effective for detecting replay attacks. Although the above work improves significantly on the baseline, its computational complexity is relatively high because of the CQT.
Recent work has focused on finding effective features rather than on analysing the differences between replay and genuine speech in each sub-band. This work analyses those differences and, based on how they vary across sub-bands, discusses feature extraction approaches.

Database
The ASVspoof 2017 corpus is used in our investigations. The corpus is partitioned into three subsets: training, development, and evaluation. A summary of their composition is presented in Table 1. This paper uses the training and development subsets to train the model and the evaluation subset to test its performance.

Sub-band analysis
First, the speech signal is transformed from the time domain to the frequency domain by a time-frequency transform. The entire frequency band is then divided into either 8 or 16 sub-bands. In each experiment, one sub-band is removed at a time, features are extracted from the remaining sub-bands, and a GMM is trained on them; the equal error rate (EER) is used as the metric of feature performance. Finally, a classification-level measure of the discriminative ability of each sub-band is estimated from the EER ratio of the sub-band based spoofing detection system.
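The leave-one-sub-band-out step can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that the DFT bins of a frame cover the analysed bandwidth uniformly, so that 256 bins split into 8 groups of 32 bins or 16 groups of 16 bins. The function names `subband_slices` and `remove_subband` are hypothetical.

```python
def subband_slices(n_bins, n_bands):
    """Split n_bins DFT bins into n_bands equal, contiguous groups."""
    if n_bins % n_bands != 0:
        raise ValueError("n_bins must be divisible by n_bands")
    width = n_bins // n_bands
    return [(b * width, (b + 1) * width) for b in range(n_bands)]


def remove_subband(frame, band_index, n_bands):
    """Return the bins of one frame with a single sub-band left out."""
    start, stop = subband_slices(len(frame), n_bands)[band_index]
    return frame[:start] + frame[stop:]


# Toy frame of 256 values (the bin index itself, so removals are easy to see).
frame = [float(k) for k in range(256)]
kept_8 = remove_subband(frame, band_index=0, n_bands=8)     # drops bins 0..31
kept_16 = remove_subband(frame, band_index=15, n_bands=16)  # drops bins 240..255
```

Repeating the call for every `band_index` yields one reduced feature set per removed sub-band, each of which would then be used to train and score a separate system.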

Sub-band division and analysis
The sub-band feature extraction process is shown in Fig. 1. For each frame of speech, the frequency bins are grouped into sub-bands based on DFT bin groupings. The number of DFT bins is 256, and a Hanning window is used. During the experiment, one sub-band is removed at a time. The DCT is applied to the log magnitudes of the remaining sub-bands to obtain the remaining sub-band features. The features have 150 dimensions, comprising 50 DCT coefficients along with their deltas and delta-deltas. Cepstral mean and variance normalization (CMVN) [12] is an efficient normalization technique for removing nuisance channel effects. The CMVN technique is therefore applied to the sub-band features.
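The chain from log magnitudes to normalized 150-dimensional features can be sketched as below. The sketch makes several assumptions that the text does not fix: an unnormalized DCT-II, deltas computed as a simple central difference between neighbouring frames, and CMVN applied per utterance. All function names are hypothetical, and the toy input is synthetic.

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II of a sequence, keeping the first n_coeffs terms."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]


def deltas(frames):
    """Central differences between neighbouring frames (edge frames repeated)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out


def cmvn(frames, floor=1e-8):
    """Normalize each dimension to zero mean and unit variance over the utterance."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        max(math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n), floor)
        for d in range(dims)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]


def subband_features(log_mag_frames, n_coeffs=50):
    """Static DCT coefficients + deltas + delta-deltas, followed by CMVN."""
    static = [dct_ii(f, n_coeffs) for f in log_mag_frames]
    d1 = deltas(static)
    d2 = deltas(d1)
    return cmvn([s + a + b for s, a, b in zip(static, d1, d2)])


# Synthetic input: 4 frames of 224 log magnitudes (256 bins minus one 32-bin band).
toy_frames = [[math.log(1.0 + k + 10.0 * t) for k in range(224)] for t in range(4)]
features = subband_features(toy_frames)
```

Each output frame has 3 x 50 = 150 dimensions, matching the feature size stated above.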
Let $EER$ denote the equal error rate obtained with all sub-bands, $EER_i$ the equal error rate obtained with the remaining sub-bands after removing the i-th sub-band, and $r_i$ the ratio of $EER_i$ to $EER$, which represents the contribution of the i-th sub-band. The ratio is defined as follows:

$$r_i = \frac{EER_i}{EER}$$

A value of $r_i$ greater than 1 means that the EER rises when the i-th sub-band is removed, that is, the sub-band carries discriminative information.

Two sub-band divisions are considered. The first approach divides the speech bandwidth into uniform 1 kHz wide sub-bands, and the second divides it into uniform 0.5 kHz wide sub-bands. The two approaches are referred to as the 8-band and 16-band divisions in the rest of the paper.
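As an illustration of the ratio $r_i = EER_i / EER$, the sketch below computes an EER from two lists of detection scores and then forms the ratio. It assumes that higher scores indicate genuine speech and approximates the EER at the threshold where the false acceptance and false rejection rates are closest; the function names and all numbers are made-up toy values, not results from this paper.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER for scores where higher means 'more likely genuine'."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds + [float("inf")]:
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < th for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoof trials scored at or above the threshold.
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer


def eer_ratio(eer_without_band, eer_all_bands):
    """r_i: EER after removing sub-band i, relative to the EER with all sub-bands."""
    return eer_without_band / eer_all_bands


# Toy scores only, chosen so that one genuine and one spoof trial overlap.
genuine = [0.9, 0.8, 0.7, 0.3]
spoof = [0.6, 0.4, 0.2, 0.1]
toy_eer = equal_error_rate(genuine, spoof)

# Toy EER values: removing the band raises the EER, so the ratio exceeds 1.
toy_ratio = eer_ratio(0.30, 0.20)
```

A production evaluation would normally interpolate between operating points; the nearest-threshold rule here is only meant to keep the example short.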

GMM Models and Performance Indicators
In Section 3.1, one sub-band was removed at a time. A 256-component GMM system trained on the remaining sub-bands is used to determine the discriminative ability of the removed sub-band. The process of GMM training and identification is shown in Fig. 2. The primary metric is the EER [13]. Specifically, Table 2 shows that with the 8-band division the 0-1 kHz and 7-8 kHz sub-bands are the most discriminative frequency regions, and Table 3 shows that with the 16-band division these are the 0-0.5 kHz and 7-8 kHz sub-bands. Compared with low frequencies, high frequencies contain more discriminative information.
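One common way to turn GMMs into a detection score is sketched below. The text only states that a 256-component GMM system is used, so the two-model setup (one GMM for genuine speech, one for replay speech, scored by the average log-likelihood ratio), the diagonal covariances, and the tiny toy models are all assumptions made for illustration. The function names are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a diagonal-covariance GMM, computed with log-sum-exp."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        log_exp = -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var))
        log_terms.append(math.log(w) + log_norm + log_exp)
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def utterance_score(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; higher favours genuine speech."""
    total = sum(
        gmm_log_likelihood(f, *genuine_gmm) - gmm_log_likelihood(f, *spoof_gmm)
        for f in frames
    )
    return total / len(frames)


# Toy models as (weights, means, variances); real models would have 256 components.
genuine_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
spoof_gmm = ([1.0], [[4.0, 4.0]], [[1.0, 1.0]])

score_a = utterance_score([[0.2, 0.1], [0.9, 1.1]], genuine_gmm, spoof_gmm)
score_b = utterance_score([[4.1, 3.9], [3.8, 4.2]], genuine_gmm, spoof_gmm)
```

Here `score_a` is positive (frames close to the genuine model) and `score_b` is negative (frames close to the spoof model); sweeping a threshold over such scores gives the EER used as the primary metric.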

Filter banks design
To make better use of the discriminative information in the 0-1 kHz and 7-8 kHz sub-bands, we propose two filter bank design approaches. The basic idea behind both is to allocate a greater number of filters within the discriminative sub-bands [3]. Both approaches assign the center frequencies of triangular filters across the speech bandwidth. The first approach allocates more linear filters in the discriminative frequency bands, based on the $r_i$ values from Section 3. The second approach is also based on $r_i$; it allocates Mel filter banks in the low-frequency bands, linear filter banks in the intermediate-frequency bands, and I-Mel filter banks in the high-frequency bands. The output of the filter bank is converted to cepstral coefficients with 46 dimensions, comprising 15 DCT coefficients along with their deltas, delta-deltas, and log-energy. The process of feature extraction is shown in Fig. 3.
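To illustrate how a list of frequencies becomes a bank of triangular filters, the sketch below builds the filter weights and applies them to one power spectrum. It assumes the usual convention in which each filter rises from the previous edge to its peak and falls to the next edge, and that the spectrum bins are spaced uniformly from 0 Hz to `max_hz`. The function names and the toy numbers are hypothetical.

```python
def triangular_filterbank(edges_hz, n_bins, max_hz):
    """Build triangular filters from a sorted list of edge frequencies.

    Filter j rises from edges_hz[j] to its peak at edges_hz[j + 1]
    and falls back to zero at edges_hz[j + 2].
    """
    bin_hz = [k * max_hz / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for lo, mid, hi in zip(edges_hz, edges_hz[1:], edges_hz[2:]):
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def filter_energies(bank, power_spectrum):
    """Weighted sum of the power spectrum under each triangular filter."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


# Toy setup: 9 bins from 0 to 4 kHz and three filters peaking at 1, 2, and 3 kHz.
bank = triangular_filterbank([0, 1000, 2000, 3000, 4000], n_bins=9, max_hz=4000)
energies = filter_energies(bank, [1.0] * 9)
```

Taking the logarithm of the energies, applying the DCT, and keeping 15 coefficients gives the static part of the feature; with deltas, delta-deltas, and log-energy this yields 15 + 15 + 15 + 1 = 46 dimensions.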

Linear filter design
The idea of this approach is to allocate a greater number of filters within the discriminative sub-bands. The number of linear filters allocated to each band is related to $r_i$. For example, in the 8-band experiment, $r_i$ is 1.5 at 0-1 kHz, around 1.0 between 1 and 7 kHz, and around 1.8 between 7 and 8 kHz. The 8-band filter design therefore allocates 6 linear filters in the 0-1 kHz band, 4 linear filters per 1 kHz in the 1-7 kHz band, and 7 linear filters in the 7-8 kHz band. The shape of the filter bank is shown in Fig. 4.

Fig. 4. 8 sub-band linear filter design

Following the 8 sub-band design idea, the 16-band filter bank allocates 3 linear filters in 0-0.5 kHz, 26 linear filters in 0.5-7 kHz, and 7 linear filters in 7-8 kHz. The shape of the filter bank is shown in Fig. 5.
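The two linear allocations can be written down as a small table of (band, filter count) entries. The sketch below places the filter centers evenly inside each band, half a step away from the band edges; this placement rule is an assumption, since the text specifies only how many filters each band receives. The name `linear_centres` is hypothetical.

```python
def linear_centres(bands):
    """Evenly spaced filter centers for a list of (low_hz, high_hz, n_filters)."""
    centres = []
    for low, high, n in bands:
        step = (high - low) / n
        # Half-step offset keeps centers away from the band edges.
        centres.extend(low + (i + 0.5) * step for i in range(n))
    return centres


# 8-band design: 6 filters in 0-1 kHz, 4 per kHz in 1-7 kHz (24), 7 in 7-8 kHz.
eight_band = [(0, 1000, 6), (1000, 7000, 24), (7000, 8000, 7)]
# 16-band design: 3 filters in 0-0.5 kHz, 26 in 0.5-7 kHz, 7 in 7-8 kHz.
sixteen_band = [(0, 500, 3), (500, 7000, 26), (7000, 8000, 7)]

centres_8 = linear_centres(eight_band)
centres_16 = linear_centres(sixteen_band)
```

Under these counts the 8-band design has 37 filters in total and the 16-band design has 36.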

Mel, Linear, and I-Mel filter design
This approach not only allocates a greater number of filters within the discriminative sub-bands but also assigns a more appropriate filter type to each sub-band. At low frequencies, we use Mel filters to enhance low-frequency detail. At high frequencies, we use I-Mel filters (the Mel scale inverted so that it runs from high frequency to low frequency) to enhance high-frequency detail, while the intermediate frequencies use linear filters. Accordingly, the 8-band filter bank allocates 6 Mel filters in the 0-1 kHz band, 4 linear filters per 1 kHz in the 1-7 kHz band, and 7 I-Mel filters in the 7-8 kHz band. The shape of the filter design is shown in Fig. 6.
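A possible way to generate the center frequencies for the mixed design is sketched below. It assumes the widely used Mel formula mel(f) = 2595 log10(1 + f/700) and obtains the I-Mel spacing by mirroring a Mel spacing about the upper band edge, so that filters crowd towards 8 kHz. The exact Mel variant and placement rule are not given in the text, so these choices and all function names are illustrative.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_points(low, high, n):
    """n centers between low and high, evenly spaced on the Mel scale."""
    m_low, m_high = hz_to_mel(low), hz_to_mel(high)
    step = (m_high - m_low) / n
    return [mel_to_hz(m_low + (i + 0.5) * step) for i in range(n)]


def inverted_mel_points(low, high, n):
    """Mirror a Mel spacing so that the centers are densest near `high`."""
    mirrored = mel_points(0.0, high - low, n)
    return [high - p for p in reversed(mirrored)]


def linear_points(low, high, n):
    step = (high - low) / n
    return [low + (i + 0.5) * step for i in range(n)]


# 8-band mixed design: 6 Mel, 24 linear (4 per kHz), and 7 I-Mel filters.
centres = (
    mel_points(0.0, 1000.0, 6)
    + linear_points(1000.0, 7000.0, 24)
    + inverted_mel_points(7000.0, 8000.0, 7)
)
```

With this construction the spacing between centers widens across 0-1 kHz, stays constant across 1-7 kHz, and narrows again across 7-8 kHz, which is the intended emphasis on the lowest and highest frequencies.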

Results and Discussion
This paper proposes a new filter design method that uses the EER ratio of each sub-band to determine the number and shape of the filters in that sub-band. To verify the validity of the designed filter banks, we compare the cepstral coefficients extracted with the proposed filter banks against those extracted with traditional filter banks, namely LFCC, MFCC, and I-MFCC; the extraction process is shown in Fig. 3. The cepstral coefficients have 46 dimensions, comprising 15 DCT coefficients along with their deltas, delta-deltas, and log-energy. In addition, we compare the proposed algorithm with algorithms proposed by other researchers. The experimental results show that our algorithm outperforms these methods to varying degrees.

Conclusions
In this paper, we have used the EER ratio to identify sub-bands that contain discriminative information between genuine and replay speech. Two such discriminative sub-bands were identified: 0-0.5 kHz and 7-8 kHz. We then proposed two approaches to designing banks of triangular filters that allocate a greater number of filters to the more discriminative sub-bands. The two approaches were experimentally validated on the ASVspoof 2017 corpus and outperform approaches proposed by other researchers. The number of filters in the filter bank is a key parameter that may have a significant effect on system performance; future work will therefore pay more attention to the choice of filters for each sub-band.

Fig. 2. GMM training process

The experimental results show that at low frequencies, 0-0.5 kHz contains more discriminative information than 0.5-1 kHz. Also, in the high-frequency region, 7.5-8 kHz contains more discriminative information.

Table 1. Statistics of the ASVspoof 2017 corpus.

Table 2 shows the $EER_i$ and $r_i$ for the 8 sub-bands. The experimental results demonstrate that the $r_i$ values of the 1st and 8th sub-bands are clearly greater than 1.

Table 2. Experimental results for the 8-band division

Table 3 shows the $EER_i$ and $r_i$ for the 16 sub-bands.

Table 3. Experimental results for the 16-band division

Table 4. Experimental results