We added POLQA (ITU-T P.863) to our MultiDSLA product some seven months ago. Since then, we have been busy discovering what works and what doesn’t work quite so well in the new algorithm. Most of this learning comes from working with our early adopters of POLQA. We continue to gain further knowledge about using POLQA and how to understand results that don’t make sense. We thought it would be useful to share these odd behaviours over the coming weeks. Here is the first one.
1. Don’t use more than 10s of speech in narrowband mode
POLQA, like previous ITU voice quality metrics, has been calibrated and tested against a large quantity of subjective test material. The subjective tests, and hence the subjective test material, conform to ITU-T P.800 recommendations and contain speech recordings of around 8 to 10 seconds. This length of material was found optimal in achieving repeatable subjective listening quality scores.
The same signal length is recommended for objective measurements, and we have always promoted the use of 8s per measurement, because long recordings are difficult to correlate with subjective test data. However, there are times when both shorter and longer test sequences are useful. For example, long speech files can show up issues with jitter buffers or signal processing; in these cases we recommend that the score is used as an indicator only.
The problem is that in POLQA narrowband mode a long recording of, say, 32 seconds can score around 1.1 lower than the average of four discrete 8s measurements. This was not identified earlier because all tests were performed using 8s subjective test material. It does not happen in POLQA super-wideband mode, where scores remain consistent as the file length is increased.
The root cause of the problem has been identified, but a new release of the standard that includes a fix will take some time. Meanwhile, we recommend that you split long speech sequences into 8-10s sections, each containing two sentences. Obtain the score for each section and then take the mean of those scores to represent the overall quality of the longer sequence.
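As a rough illustration of this workaround, here is a minimal Python sketch. It assumes the reference and degraded recordings are already time-aligned sequences of samples at the same sample rate. The name `score_fn` is a hypothetical stand-in for whatever performs the actual narrowband measurement; it is not a POLQA or MultiDSLA API. The fixed 8s cut is also a simplification: in practice you would cut at pauses so that each section holds two complete sentences.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (reference, degraded, sample_rate) -> score
ScoreFn = Callable[[Sequence[float], Sequence[float], int], float]


def split_into_sections(samples: Sequence[float], sample_rate: int,
                        section_seconds: float = 8.0) -> List[Sequence[float]]:
    """Cut a recording into consecutive sections of section_seconds each."""
    section_len = int(sample_rate * section_seconds)
    if section_len <= 0:
        raise ValueError("section length must be positive")
    sections = [samples[i:i + section_len]
                for i in range(0, len(samples), section_len)]
    # Drop a trailing remainder shorter than half a section, since it is
    # too short to score in the same way as the full sections.
    if len(sections) > 1 and len(sections[-1]) < section_len // 2:
        sections.pop()
    return sections


def mean_section_score(reference: Sequence[float], degraded: Sequence[float],
                       sample_rate: int, score_fn: ScoreFn,
                       section_seconds: float = 8.0) -> float:
    """Score each section separately and return the mean of the scores."""
    ref_sections = split_into_sections(reference, sample_rate, section_seconds)
    deg_sections = split_into_sections(degraded, sample_rate, section_seconds)
    scores = [score_fn(ref, deg, sample_rate)
              for ref, deg in zip(ref_sections, deg_sections)]
    if not scores:
        raise ValueError("no sections to score")
    return mean(scores)
```

With a 32s recording sampled at 8 kHz, this produces four 8s sections and reports the mean of the four section scores instead of a single score for the whole file.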
Note: PESQ applies a lower weighting to impairments that occur early in a long recording when calculating its score.