VoIP Basics: Codec Latency vs. Bandwidth Optimization
Vladimír Toncar
As shown in the overview of codecs, the low-bandwidth codecs are quite efficient. For example, G.729 compresses 10 milliseconds of audio into 10 bytes, and G.723.1 encodes 30-millisecond frames into 24 or 20 bytes, depending on the bitrate.
However, compressed audio frames travel as payload in RTP packets, which are in turn sent over UDP, so we need to account for the IP, UDP, and RTP headers. With IPv4, these add up to 40 bytes per packet (20 for IP, 8 for UDP, and 12 for RTP). This is significant compared with the size of a compressed audio frame, and it matters whenever you are not on a local area network and bandwidth is limited. The table below shows the overhead for several low-bandwidth codecs. I did the calculation for one frame per packet for G.723.1 and GSM, and for three frames per packet for G.729, since this codec works with a frame length of only 10 milliseconds.
Codec | Nominal bitrate [kbit/s] | Frame length [ms] | Frame size [bytes] | Packet overhead [% of payload] | Actual bitrate [kbit/s] |
---|---|---|---|---|---|
G.723.1 | 6.4 | 30 | 24 | 167% | 17 |
G.723.1 | 5.3 | 30 | 20 | 200% | 16 |
G.729 | 8 | 3 × 10 | 3 × 10 | 133% | 18.7 |
GSM 06.10 | 13 | 20 | 33 | 121% | 29.2 |
Compare the actual bitrates in the table with the nominal ones: the headers more than double the required bandwidth in every case. Also note that the bitrate values cover only one direction of the call.
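The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text: 40 bytes of IPv4, UDP, and RTP headers per packet, with no link-layer framing counted. The names `HEADER_BYTES`, `packet_stats`, and `CODECS` are made up for this example.

```python
# Sketch: per-packet header overhead and one-way bitrate of an RTP audio stream.
# Assumption: IPv4 (20 bytes) + UDP (8 bytes) + RTP (12 bytes) = 40 header bytes;
# link-layer framing (Ethernet, PPP, ...) is ignored.

HEADER_BYTES = 20 + 8 + 12


def packet_stats(frame_ms, frame_bytes, frames_per_packet=1):
    """Return (overhead_percent, actual_kbit_per_s) for one direction."""
    payload_bytes = frame_bytes * frames_per_packet
    audio_ms = frame_ms * frames_per_packet
    overhead_percent = 100.0 * HEADER_BYTES / payload_bytes
    # Bits per millisecond is numerically the same as kbit/s.
    actual_kbps = (payload_bytes + HEADER_BYTES) * 8 / audio_ms
    return overhead_percent, actual_kbps


# (name, frame length in ms, frame size in bytes, frames per packet)
CODECS = [
    ("G.723.1 6.4", 30, 24, 1),
    ("G.723.1 5.3", 30, 20, 1),
    ("G.729", 10, 10, 3),
    ("GSM 06.10", 20, 33, 1),
]

for name, frame_ms, frame_bytes, frames in CODECS:
    overhead, kbps = packet_stats(frame_ms, frame_bytes, frames)
    print(f"{name:12s} overhead {overhead:4.0f}%  actual {kbps:5.1f} kbit/s")
```

The printed values should agree with the table to within rounding.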
If you want to improve bandwidth utilization, the obvious way to go is to send more frames in one RTP packet. However, as you do this, you also increase latency. If you decide to send, say, 100 milliseconds of audio in one packet, you add those same 100 milliseconds of latency. Simply put, the sender has to wait for the last frame before it can send the packet, so the first sample of the first frame arrives together with the last sample of the last frame, and the total delay equals the length of audio carried in the packet.
A common recommendation is that round-trip latency should not exceed approximately 300 milliseconds; beyond that, people start noticing the delay.
When calculating the latency, you need to consider the time it takes to send a packet from one end to the other (your mileage may vary; try "traceroute" to get a rough idea) and the size of the jitter buffer at the receiving end (which can hold 50-60 milliseconds of audio). Considering all this, I would say the reasonable maximum is to send 60 milliseconds of audio in one packet. This results in the following bitrates:
Codec | Nominal bitrate [kbit/s] | Frames in 60 ms of audio | Actual bitrate [kbit/s] |
---|---|---|---|
G.723.1 | 6.4 | 2 | 11.7 |
G.723.1 | 5.3 | 2 | 10.7 |
G.729 | 8 | 6 | 13.3 |
GSM 06.10 | 13 | 3 | 18.5 |
As in the previous table, the bitrates are for one direction only. As you can see, we managed to reduce the actual bitrates by roughly 29 to 37 per cent.
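To see where the savings come from, the following Python sketch recomputes the bitrate for 60 milliseconds of audio per packet and compares it with the packetization used in the first table. It assumes the same 40 bytes of headers per packet; the helper names `actual_kbps` and `compare` are illustrative only.

```python
# Sketch: effect of packing 60 ms of audio into each RTP packet.
# Assumption: 40 bytes of IPv4 + UDP + RTP headers per packet.

HEADER_BYTES = 40
TARGET_MS = 60  # audio carried per packet, as suggested in the text


def actual_kbps(frame_ms, frame_bytes, frames_per_packet):
    """One-way bitrate in kbit/s, headers included."""
    payload_bytes = frame_bytes * frames_per_packet
    return (payload_bytes + HEADER_BYTES) * 8 / (frame_ms * frames_per_packet)


def compare(frame_ms, frame_bytes, base_frames, target_ms=TARGET_MS):
    """Return (frames, kbps_before, kbps_after, saving_percent)."""
    frames = target_ms // frame_ms
    before = actual_kbps(frame_ms, frame_bytes, base_frames)
    after = actual_kbps(frame_ms, frame_bytes, frames)
    return frames, before, after, 100.0 * (before - after) / before


# (name, frame length in ms, frame size in bytes, frames per packet before)
CODECS = [
    ("G.723.1 6.4", 30, 24, 1),
    ("G.723.1 5.3", 30, 20, 1),
    ("G.729", 10, 10, 3),
    ("GSM 06.10", 20, 33, 1),
]

for name, frame_ms, frame_bytes, base_frames in CODECS:
    frames, before, after, saving = compare(frame_ms, frame_bytes, base_frames)
    print(f"{name:12s} {frames} frames  {before:5.1f} -> {after:5.1f} kbit/s"
          f"  ({saving:4.1f}% less)")
```

The gain is largest for GSM 06.10 because it goes from one frame to three frames per packet, while G.729 only doubles its frame count.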
In addition to latency, there are two more things you should consider when increasing the number of audio frames per RTP packet:
- If a packet with a larger number of frames gets lost, a longer stretch of audio goes missing, so the loss is more noticeable to the user.
- With greater end-to-end delay, possible echoes become more noticeable.
To wrap this piece up, you should always evaluate your situation and decide whether you need to optimize bandwidth consumption. If you need to, the parameters you can play with are the choice of codec and the number of frames per packet.
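As a rough aid for that decision, here is a minimal Python sketch of a delay budget. It is a simplification, not a measurement method: it assumes one-way delay is just the sum of the audio carried per packet, the network transit time, and the jitter buffer, and it takes half of the 300-millisecond round-trip figure as the one-way budget. The 30-millisecond network delay is a made-up example value, and `max_frames_per_packet` is a hypothetical helper; codec look-ahead and processing time are ignored.

```python
# Sketch: how many frames fit in one packet under a simple one-way delay budget.
# Assumed model: one-way delay = audio per packet + network transit + jitter buffer.


def max_frames_per_packet(frame_ms, one_way_budget_ms, network_ms, jitter_buffer_ms):
    """Largest whole number of frames whose audio fits in the remaining budget.

    Always returns at least 1, because a packet must carry one frame;
    in that case the budget may already be exceeded.
    """
    remaining_ms = one_way_budget_ms - network_ms - jitter_buffer_ms
    return max(1, int(remaining_ms // frame_ms))


ONE_WAY_BUDGET_MS = 300 / 2   # half of the 300 ms round-trip recommendation
NETWORK_MS = 30               # hypothetical value; measure your own path
JITTER_BUFFER_MS = 60         # upper end of the 50-60 ms range in the text

for name, frame_ms in [("G.723.1", 30), ("G.729", 10), ("GSM 06.10", 20)]:
    frames = max_frames_per_packet(
        frame_ms, ONE_WAY_BUDGET_MS, NETWORK_MS, JITTER_BUFFER_MS
    )
    print(f"{name:10s} {frames} frames per packet ({frames * frame_ms} ms of audio)")
```

With these example inputs, the remaining budget works out to 60 milliseconds of audio per packet; a slower network path or a larger jitter buffer leaves less room.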