We added POLQA (ITU-T P.863) to our MultiDSLA product some seven months ago. Since then, we have been busy discovering what works and what doesn’t work quite so well in the new algorithm. Most of this learning comes from working with our early adopters of POLQA. We continue to learn how best to use POLQA and how to interpret results that at first don’t seem to make sense. We thought it would be useful to share these odd behaviours over the coming weeks. Here is the third one.
3. Expect reporting of small delay variations which are not actually there
A telephone network introduces a delay between when we speak and when the other person hears us. This is the one-way bulk transmission delay, and it results from geographical distance as well as the processing delay introduced by network equipment. Delay variation, or jitter, describes how the transmission delay varies around the bulk delay. Jitter is a relatively new phenomenon, first seen when VoIP networks were introduced in the late 1990s; a circuit-switched network has a constant delay once a call is established.
Packet networks inherently exhibit jitter in the arrival time of packets. In a VoIP network this jitter must be removed from the arriving packet stream before the audio is played to the user, which is the job of the jitter buffer. Variable playback delay first appeared when VoIP developers realised that they could reduce the bulk delay of a connection by shrinking the jitter buffer whenever the observed jitter on the arriving stream was low. Rather than creating one large buffer to cope with all network types, the terminal monitors packet arrival times and increases or decreases the buffer size according to the level of jitter observed. This adaptation produces step changes in the delay of the received audio, preferably during silence periods, where they pass unnoticed by the listener, but sometimes during speech as well. PESQ is good at dealing with this type of delay change, and in MultiDSLA we provide a couple of analysis views to help show where it occurs. Analysing changes in delay can help determine whether a drop in speech quality is the result of a high level of jitter in the underlying packet network.
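To make the adaptive jitter buffer idea more concrete, here is a minimal Python sketch. The jitter estimator follows the running interarrival jitter formula from RFC 3550; the buffer sizing rule (`target_buffer_ms`, with its `factor`, `min_ms` and `max_ms` parameters) is purely our assumption for illustration and is not taken from any particular terminal or from MultiDSLA.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running jitter estimate in the style of RFC 3550 (gain of 1/16)."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # Difference between the spacing at the receiver and at the sender.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


def target_buffer_ms(jitter_ms, min_ms=20.0, max_ms=200.0, factor=4.0):
    """Hypothetical sizing rule: a multiple of the jitter, clamped to limits."""
    return max(min_ms, min(max_ms, factor * jitter_ms))


# One packet every 20 ms, sent over two illustrative networks.
send = [20 * i for i in range(100)]
steady_recv = [s + 50 for s in send]                      # constant 50 ms delay
jittery_recv = [s + (50 if i % 2 == 0 else 80)            # delay alternates
                for i, s in enumerate(send)]              # between 50 and 80 ms

steady_buffer = target_buffer_ms(interarrival_jitter(send, steady_recv))
jittery_buffer = target_buffer_ms(interarrival_jitter(send, jittery_recv))
print(steady_buffer, jittery_buffer)
```

With these made-up arrival times the steady stream settles on the minimum buffer, while the jittery stream asks for a much larger one. Each time a terminal resizes its buffer in this way, the received audio sees a step change in delay.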
Codec compression techniques have been developed that add a new form of delay variation, often called time-warping, time-stretching or time-scaling. It was found that stretching a signal by a number of milliseconds, or compressing it in time, without changing the pitch is almost imperceptible to the listener, while delivering a compression gain or improved packet loss concealment. This type of coding is not handled well by PESQ, which gives a lower score than is seen in a subjective test.
POLQA has a sophisticated time-alignment algorithm that accounts for both step delay changes and time-warping. It addresses the time-warping, time-stretching, noisy speech and other difficult signal conditions that caused PESQ to time-align inaccurately. The side effect is that POLQA reports much more delay variation than PESQ, even for signals that have constant delay. Much of the POLQA time-alignment processing consists of iterative searches for the best alignment, so it might find a better match a few milliseconds before or after the expected position. Consequently, the delay statistics and the time offset graphics show more frequent delay variation.
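The idea of searching around an expected position for the best match can be illustrated with a toy example. The Python sketch below is not the POLQA alignment algorithm, which is far more elaborate; it is a simple local search that maximises normalised correlation. The function `best_lag`, its `search` window and the chirp test signal are all our own assumptions, and one sample is treated as one millisecond purely for readability.

```python
import math


def best_lag(reference, degraded, expected_lag, search=8):
    """Return (lag, score) for the lag near expected_lag that gives the
    highest normalised correlation between reference and degraded."""
    best, best_score = expected_lag, -2.0
    ref_energy = sum(r * r for r in reference)
    for lag in range(max(0, expected_lag - search), expected_lag + search + 1):
        segment = degraded[lag:lag + len(reference)]
        if len(segment) < len(reference):
            break  # ran off the end of the degraded signal
        dot = sum(r * s for r, s in zip(reference, segment))
        norm = math.sqrt(ref_energy * sum(s * s for s in segment))
        score = dot / norm if norm else 0.0
        if score > best_score:
            best, best_score = lag, score
    return best, best_score


# A short chirp as the reference, delayed by 40 samples in the degraded signal.
reference = [math.sin(0.002 * n * n) for n in range(200)]
degraded = [0.0] * 40 + reference + [0.0] * 20

# The search starts from an expected position that is 3 samples out.
found_lag, score = best_lag(reference, degraded, expected_lag=37)
print(found_lag, round(score, 3))
```

On this clean signal the search recovers the true delay. On noisy or time-warped speech the correlation peak is less sharp, so a search of this kind can plausibly settle on a neighbouring position from one section of speech to the next, which would show up as small delay variations in the results.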
We recommend that you interpret small swings in delay of up to 10ms as periods of constant delay rather than as real changes in delay. In the example above, the bulk delay was 111ms and there were no actual changes in delay.
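If you post-process delay results yourself, this recommendation can be applied with a few lines of code. The Python sketch below groups consecutive delay values into segments whose spread stays within a tolerance and reports the median of each segment. The function name `constant_delay_segments`, the sample values and the reading of "up to 10ms" as a peak-to-peak spread are our assumptions; it is not a MultiDSLA feature.

```python
from statistics import median


def constant_delay_segments(delays_ms, tolerance_ms=10.0):
    """Group consecutive delay values whose peak-to-peak spread stays within
    tolerance_ms. Returns (start_index, end_index, median_delay) tuples,
    where end_index is exclusive."""
    segments = []
    start = 0
    lo = hi = None
    for i, d in enumerate(delays_ms):
        if lo is None:
            lo = hi = d
            continue
        new_lo, new_hi = min(lo, d), max(hi, d)
        if new_hi - new_lo > tolerance_ms:
            # Spread exceeded: close the current segment and start a new one.
            segments.append((start, i, median(delays_ms[start:i])))
            start, lo, hi = i, d, d
        else:
            lo, hi = new_lo, new_hi
    if lo is not None:
        segments.append((start, len(delays_ms), median(delays_ms[start:])))
    return segments


# Illustrative values only: small swings around a 111 ms bulk delay.
reported = [111, 113, 109, 111, 115, 110, 111, 108, 112]
print(constant_delay_segments(reported))

# Illustrative values with a genuine step change of about 40 ms.
stepped = [111, 112, 110, 151, 150, 152]
print(constant_delay_segments(stepped))
```

The first list collapses into a single segment with a median of 111 ms, while the second is split into two segments at the point where the delay genuinely steps.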
Part 1: Don’t use more than 10s of speech in narrowband mode
Part 2: Ensure reference files pass the transparency test