Comparing Performance Differences between ITU-T Rec. P.863 v2.4 and v1.1 for Speech Quality Metrics
ITU-T Rec. P.863 v2.4 differs substantially from v1.1: many aspects of the algorithm have been modified, so its performance differs from v1.1 in some respects. In particular, POLQA v2.4 is more sensitive to speaker and language. Some examples are shown below.
Like codecs, speech quality metrics perform differently depending on the spectral content of the speech material. The background noise level and the dynamic range of the voice also affect metric performance.
Impact of Speaker and Language Sensitivity in POLQA v2.4: A Comparative Analysis with v1.1
These graphs show how the speech quality score differs between the POLQA versions when evaluating speech from a number of speakers with Gaussian white noise added at different levels in the analogue domain. The speech files were selected from material in ITU-T Rec. P.501 and filtered with the MIRS characteristic. The x-axis shows the level of added noise. The y-axis shows the scores from the narrowband model of P.862.1 PESQ, P.863 POLQA v1.1 and POLQA v2.4.
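As a rough digital illustration of this kind of test condition (the tests described here added the noise in the analogue domain), the following sketch scales Gaussian white noise to a chosen level relative to the speech and adds it to the samples. The helper `add_white_noise`, the expression of the noise level as a signal-to-noise ratio, and the sine tone standing in for a speech file are all assumptions made for illustration; none of them is taken from P.863 or from the test set-up.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_white_noise(speech, snr_db, seed=0):
    """Return speech plus Gaussian white noise at the requested SNR (dB).

    Hypothetical helper: the noise level is set relative to the RMS level
    of the whole signal, which is a simplification of a real test set-up.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    # Scale the noise so that 20*log10(rms(speech) / rms(noise)) == snr_db.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


# Stand-in for a speech file: one second of a 440 Hz tone sampled at 8 kHz.
speech = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noisy = add_white_noise(speech, snr_db=20)

# Check the level of the noise that was actually added.
added = [n - s for n, s in zip(noisy, speech)]
measured_snr_db = 20 * math.log10(rms(speech) / rms(added))
```

Repeating the call over a list of `snr_db` values would give one degraded file per noise level, each of which could then be scored by the metric under test.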
These three examples represent the best and worst cases of speaker sensitivity from a larger set of tests. POLQA v1.1 exhibited some speaker sensitivity, but v2.4 may show more. For the Japanese female speech there is almost no difference between the v1.1 and v2.4 scores, whereas the English male speech shows a greater range. The Russian male speech scores consistently 0.2 MOS lower when evaluated by v2.4.
Using many different speakers when testing speech transmission systems has always been best practice. To minimise error due to speaker dependency, MultiDSLA now includes male and female speakers in eight different languages (the ITU-T Rec. P.501 Annex C material).
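One simple way to summarise this kind of comparison for a single speaker is to take the score difference between versions at each noise level and look at its mean and spread. A consistent offset shows up as a non-zero mean with a small spread; a difference that varies with noise level shows up as a larger spread. The sketch below assumes this approach. The numbers are placeholders invented for illustration, not measured data, and the function name `version_offset` is hypothetical.

```python
from statistics import mean

# Placeholder scores only: NOT measured data. Each list holds one score
# per noise level for a single hypothetical speaker.
scores_v11 = [4.3, 4.0, 3.5, 2.9, 2.2]
scores_v24 = [4.1, 3.8, 3.3, 2.7, 2.0]


def version_offset(old, new):
    """Return (mean difference, spread of difference) for new minus old."""
    diffs = [n - o for o, n in zip(old, new)]
    return mean(diffs), max(diffs) - min(diffs)


offset, spread = version_offset(scores_v11, scores_v24)
print(f"mean offset {offset:+.2f} MOS, spread {spread:.2f} MOS")
```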
When these 32 speakers are used to evaluate PESQ and the two versions of POLQA NB on simple codecs, columns 1-64 show the wide range of scores that can be obtained; column 65 is the mean for the condition. The G.711 A Law scores are generally lower with v2.4 than with v1.1, while the G.711 µ Law scores are only slightly lower.
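The per-condition mean and the range of scores across speakers can be computed along the following lines. This is a minimal sketch: the dictionary layout and the function name `summarise` are assumptions, and the values are short placeholder lists invented for illustration, not the measured results.

```python
from statistics import mean

# Placeholder values only: NOT measured data. In practice each list would
# hold one score per test column (64 in the text above) for one condition.
scores = {
    "G.711 A Law": {"v1.1": [4.2, 4.0, 4.4, 4.1], "v2.4": [4.0, 3.7, 4.3, 3.8]},
    "G.711 µ Law": {"v1.1": [4.3, 4.1, 4.4, 4.2], "v2.4": [4.2, 4.1, 4.4, 4.1]},
}


def summarise(values):
    """Mean and range (max minus min) of the per-speaker scores."""
    return mean(values), max(values) - min(values)


summary = {
    condition: {version: summarise(vals) for version, vals in versions.items()}
    for condition, versions in scores.items()
}

for condition, versions in summary.items():
    for version, (avg, score_range) in versions.items():
        print(f"{condition} {version}: mean {avg:.2f}, range {score_range:.2f}")
```

Reporting the range next to the mean keeps the speaker dependency visible, since the mean alone hides how far individual speakers deviate from it.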
The mean scores for the different conditions are shown below.
|POLQA NB v1.1||POLQA NB v2.4|
G.711 µ Law
G.711 A Law