VoIP Basics: About Jitter

Vladimír Toncar

In this series, I have already mentioned jitter twice (in Codec Latency vs. Bandwidth Optimization and in the piece about the Real-time Transport Protocol), so I guess we should take a closer look at what jitter is and how to deal with it.

If you have ever experimented with the ping program, you probably know that if you send a sequence of packets from point A to some point B, each packet needs a slightly different time to reach the destination. The varying transit times are not an issue if you are downloading a web page, but they matter if you wish to transmit a stream of real-time data. For example, let's suppose that a VoIP device sends out one RTP packet every 20 milliseconds. Figure 1 shows what the stream might look like at the receiving end. Because the packets do not arrive precisely every 20 milliseconds, we cannot play them out as they arrive unless we are willing to accept poor audio quality.

Figure 1

Formally, jitter is defined as the statistical variance of the RTP data packet inter-arrival time. In the Real-time Transport Protocol (RTP), jitter is measured in timestamp units. For example, if you transmit audio sampled at the usual 8000 Hz, the unit is 1/8000 of a second (0.125 milliseconds).
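To make the unit concrete, here is a small sketch that converts between milliseconds and timestamp units. The function names are my own, and the 8000 Hz default is simply the sampling rate used in the example above.

```python
def ms_to_timestamp_units(milliseconds, clock_rate_hz=8000):
    """Convert a duration in milliseconds to RTP timestamp units."""
    return milliseconds * clock_rate_hz / 1000


def timestamp_units_to_ms(units, clock_rate_hz=8000):
    """Convert a value in RTP timestamp units back to milliseconds."""
    return units * 1000 / clock_rate_hz


# A 20 ms packet of 8000 Hz audio spans 160 timestamp units.
print(ms_to_timestamp_units(20))
# A jitter value of 12 timestamp units at 8000 Hz is 1.5 ms.
print(timestamp_units_to_ms(12))
```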

The first step in dealing with jitter successfully is to know how large it is. However, we do not need to compute the precise value. In RTP, the receiving endpoint computes an estimate using a simplified formula (a first-order estimator). The jitter estimate is sent to the other party using RTCP (the RTP Control Protocol).

The formula for estimating jitter is as follows (if you are not much into math, skip to the jitter buffer explanation):

J(i) = J(i-1) + ( |D(i-1,i)| - J(i-1) )/16

The estimator computes jitter iteratively. To estimate the jitter J(i) after we receive the i-th packet, we take the absolute change of the transit time |D(i-1,i)|, subtract the previous estimate J(i-1), divide the result by 16, and add it to the previous estimate. The division by 16 reduces the influence of large random changes: a change of the inter-arrival time needs to repeat several times to influence the jitter estimate significantly.

In the jitter estimator formula, the value D(i-1, i) is the difference of relative transit times for the two packets. The difference is computed as

D(i,j) = (R_j - R_i) - (S_j - S_i) = (R_j - S_j) - (R_i - S_i)

S_i is the RTP timestamp of packet i and R_i is the time of arrival of packet i, expressed in the same timestamp units.
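If you prefer code to formulas, the two equations can be sketched as a pair of small functions. The function and parameter names are my own; only the arithmetic comes from the formulas above.

```python
def transit_difference(s_prev, r_prev, s_cur, r_cur):
    """D(i-1, i): change of the relative transit time between two packets."""
    return (r_cur - r_prev) - (s_cur - s_prev)


def update_jitter(jitter_prev, d):
    """One step of the first-order estimator: J(i) from J(i-1) and D(i-1, i)."""
    return jitter_prev + (abs(d) - jitter_prev) / 16


# Two packets sent 160 timestamp units apart but received 200 units apart.
d = transit_difference(s_prev=0, r_prev=80, s_cur=160, r_cur=280)
jitter = update_jitter(0.0, d)
print(d, jitter)
```

The receiver only needs to remember the previous packet's two time values and the previous estimate, which is why this estimator is cheap enough to run on every packet.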

Still not very clear? Let's try to do the math with a few sample values. We will assume the sender sends one packet every 20 milliseconds, and that the ideal transit time is 10 milliseconds. To make the example a bit easier to grasp, we will use milliseconds instead of timestamp units. We also start from zero, not from a random value. The table below shows the calculation:

i	S_i	R_i	D(i-1, i)	J(i)
1	0	10	0	0
2	20	30	0	0
3	40	49	-1	0.0625
4	60	74	5	0.3711
5	80	90	-4	0.5979
6	100	111	1	0.6230
7	120	139	8	1.0841
8	140	150	-9	1.5788
9	160	170	0	1.4802
10	180	191	1	1.4501
11	200	210	-1	1.4220
12	220	229	-1	1.3956
13	240	250	1	1.3709
14	260	271	1	1.3477

As you can see in the table, the jitter estimate grows only slowly despite the large differences; this is the influence of the noise reduction. When the large differences disappear (i > 8), the estimate slowly starts to approach the mean of the recent |D| values.
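You can check the numbers yourself with a short script. This is only a sketch that replays the sample values from the table (in milliseconds); the function name jitter_table is my own.

```python
send_times = list(range(0, 280, 20))  # S_i: 0, 20, ..., 260
arrival_times = [10, 30, 49, 74, 90, 111, 139,
                 150, 170, 191, 210, 229, 250, 271]  # R_i


def jitter_table(send, recv):
    """Return rows (i, S_i, R_i, D(i-1, i), J(i)) for a packet sequence."""
    rows = []
    jitter = 0.0
    for k in range(len(send)):
        if k == 0:
            d = 0  # no previous packet to compare with
        else:
            d = (recv[k] - recv[k - 1]) - (send[k] - send[k - 1])
            jitter += (abs(d) - jitter) / 16
        rows.append((k + 1, send[k], recv[k], d, round(jitter, 4)))
    return rows


rows = jitter_table(send_times, arrival_times)
for row in rows:
    print(*row, sep="\t")
```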

Jitter Buffer

The network delivers RTP packets asynchronously, with variable delays. To be able to play the audio stream with reasonable quality, the receiving endpoint needs to turn the variable delays into constant delays. This can be done by using a jitter buffer.

The jitter buffer implementation is quite simple: you create a buffer to hold, say, 100 milliseconds of audio (with a sampling rate of 8000 Hz, 100 milliseconds corresponds to 800 samples). You place incoming audio frames into the buffer and start the playout when the buffer is, say, at least half full.
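Here is a minimal sketch of such a fixed-size buffer. The class and method names are my own, and I assume 20-millisecond frames that arrive in order; a real implementation would also have to reorder packets by RTP sequence number and handle lost packets, which this sketch ignores.

```python
import math
from collections import deque


class JitterBuffer:
    """Fixed-capacity FIFO of audio frames with a playout start threshold."""

    def __init__(self, capacity_ms=100, frame_ms=20, start_fill=0.5):
        self.capacity_frames = capacity_ms // frame_ms
        # Start playout once the buffer is at least start_fill full.
        self.start_frames = math.ceil(self.capacity_frames * start_fill)
        self.frames = deque()
        self.playing = False

    def push(self, frame):
        """Store an arriving frame; return False if it was dropped (overflow)."""
        if len(self.frames) >= self.capacity_frames:
            return False
        self.frames.append(frame)
        if len(self.frames) >= self.start_frames:
            self.playing = True
        return True

    def pop(self):
        """Called once per frame period by the audio output.

        Returns None before playout has started or on underflow.
        """
        if not self.playing or not self.frames:
            return None
        return self.frames.popleft()


buf = JitterBuffer()
buf.push("frame 1")
print(buf.pop())   # playout has not started yet
buf.push("frame 2")
buf.push("frame 3")
print(buf.pop())   # buffer reached the threshold, oldest frame is played
```

With the default values the buffer holds 5 frames and starts playing after the third one, so the signal is delayed by roughly 60 milliseconds.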

Once you start to play the audio, it's a bit of a gamble: you risk both buffer underflow (you need to play another frame but the buffer is empty) and buffer overflow (the buffer is full and you need to throw away the packet you just received). To reduce the risk, you can increase the size of the buffer, but you simultaneously increase latency: if you start playing when there's at least 50 milliseconds of audio, you delay the signal by those 50 milliseconds. To improve the odds, you can implement an adaptive buffer, that is, a buffer that changes its size based on the current jitter.
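How an adaptive buffer picks its size is up to the implementation. The sketch below shows one possible rule, not a standard one: aim for a playout delay of a few times the current jitter estimate, keep it within fixed limits, and round it up to whole frames. The function name, the multiplier of 4, and the 20 to 200 millisecond limits are all my own assumptions for illustration.

```python
import math


def target_delay_ms(jitter_ms, frame_ms=20, multiplier=4.0,
                    min_ms=20, max_ms=200):
    """Pick a playout delay from the jitter estimate (hypothetical rule)."""
    wanted = multiplier * jitter_ms
    clamped = min(max(wanted, min_ms), max_ms)
    # Round up to a whole number of frames.
    return math.ceil(clamped / frame_ms) * frame_ms


print(target_delay_ms(1.5))   # low jitter: the minimum delay is enough
print(target_delay_ms(12))    # moderate jitter: a few frames of delay
print(target_delay_ms(100))   # very high jitter: capped at the maximum
```

The trade-off is the same as with the fixed buffer: a larger multiplier means fewer underflows but more latency.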

Sources of Jitter

I would like to conclude this piece with an observation about the sources of jitter. In addition to varying transit times, jitter can sometimes originate right in the sending computer. This happens when the audio data is not read directly from a sound card (sound cards have a very stable clock, more precise than the computer's on-board clock) but comes from another source: for example, the audio stream is generated by text-to-speech software or simply read from a file. In other words, we are talking about applications like voice mail and interactive voice response (IVR) systems.

When run on a standard operating system, IVR and voice mail applications can have a problem with precise timing and thus cause high jitter. Quite often, the operating system's process scheduler works with quanta of 10 milliseconds. Consider an application that wants to send one RTP packet every 30 milliseconds. The application spends, say, 5 milliseconds doing some processing (e.g. text-to-speech synthesis). After that, it would need to sleep for precisely 25 milliseconds, so that the interval between packets is exactly 30 ms. But because of the 10 ms quantum, the length of the sleep is rounded up to the nearest multiple of 10 ms, that is, to 30 ms. In other words, the interval between packets ends up being 35 milliseconds. Should this happen between each pair of packets, you will get really poor audio quality.
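The arithmetic of this example can be written down in a few lines. This is a simplified model of the scheduler that only rounds the requested sleep up to the next multiple of the quantum; the function name is my own.

```python
import math


def actual_interval_ms(processing_ms, target_interval_ms, quantum_ms=10):
    """Packet interval when the sleep is rounded up to a scheduler quantum."""
    requested_sleep = target_interval_ms - processing_ms
    actual_sleep = math.ceil(requested_sleep / quantum_ms) * quantum_ms
    return processing_ms + actual_sleep


# 5 ms of work plus a 25 ms sleep that really lasts 30 ms.
print(actual_interval_ms(processing_ms=5, target_interval_ms=30))
```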

To overcome the issue, you can do two things:

Reconfigure the operating system or install a kernel module or driver that supports more precise timing.
Or, at the very least, use an adaptive sending algorithm that tries to compensate for the incorrect sleep lengths (see section 6 of the OpenH323 tutorial for more about how to do this).
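To illustrate the second option, here is a sketch of one common way to compensate: instead of always sleeping for a fixed 25 milliseconds, the sender computes each sleep from an absolute schedule (packet n is due at n times the interval), so an oversleep in one cycle shortens the next sleep. This is my own illustration and not necessarily the method described in the OpenH323 tutorial. The code runs on a simulated clock and reuses the simplified scheduler model from above (sleeps rounded up to a 10 ms quantum); all names are assumptions.

```python
import math


def simulate_send_times(packets, interval_ms=30, processing_ms=5, quantum_ms=10):
    """Simulated send times when sleeping until absolute deadlines."""
    now = 0
    send_times = []
    for n in range(packets):
        now += processing_ms              # prepare the packet
        send_times.append(now)            # send it
        deadline = (n + 1) * interval_ms  # start of the next cycle
        requested_sleep = deadline - now
        if requested_sleep > 0:
            # The scheduler rounds the sleep up to a whole quantum.
            now += math.ceil(requested_sleep / quantum_ms) * quantum_ms
    return send_times


send_times = simulate_send_times(7)
intervals = [b - a for a, b in zip(send_times, send_times[1:])]
print(send_times)
print(intervals)
```

In this model the individual intervals still alternate between 35 and 25 milliseconds, but the average stays at 30 milliseconds, so the stream no longer falls further and further behind. In a real sender you would take the current time from a monotonic clock (for example time.monotonic() in Python) rather than from a simulated counter.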
