About the Real-time Transport Protocol

Vladimír Toncar

In the previous parts of the Voice over IP Overview, we described how voice gets digitized and how it is encoded using codecs, and we also touched on latency and bandwidth optimization issues. Now it is time to learn more about how audio (and possibly video) streams are sent across the network.

The protocol used to send real-time streams of data across a network is called the Real-time Transport Protocol (RTP for short). RTP was originally defined by the IETF in RFC1889, and the up-to-date definition is in RFC3550.

When transmitting the streams of data, the protocol needs to handle the following conditions in the network:

  • The network can de-sequence packets
  • Some packets can be lost
  • Jitter is introduced (jitter is the variation in packet inter-arrival time; we will get to it later in greater detail).

Out of these three, RTP addresses only two: packet de-sequencing and jitter. It does so by carrying sequence numbers and timestamps that let the receiver restore the packet order and compensate for jitter. When it comes to packet loss, the protocol prefers "real-timeness" to reliability. A lost packet stays lost, because delivering the stream in real time matters more than delivering every packet. For this reason, RTP typically runs on top of UDP. TCP is poorly suited to real-time streams because its retransmission scheme holds back all later data until the lost segment has been resent.

RTP header

Let's have a look at the RTP packet header and point out the most important fields.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 |V=2|P|X|  CC   |M|     PT      |       sequence number         |
 |                           timestamp                           |
 |           synchronization source (SSRC) identifier            |
 |            contributing source (CSRC) identifiers             |
 |                             ....    (optional)                |
 |            data...                                            |
 |                                                               |

Figure 1: RTP Packet header

Figure 1 shows a simplified RTP packet structure (we left out the optional header extension; see RFC3550 for the full description). The important fields are as follows:

Payload type (PT): Identifies the format of the data carried in the packet. The PT field is 7 bits long, so it allows values between 0 and 127. Several static values are defined; for example, "0" represents G.711 µ-law, "8" represents G.711 A-law, and "18" stands for G.729. The interval between 96 and 127 is reserved for dynamic payload types. These dynamic payload types need to be negotiated by whatever signaling protocol is used to establish the VoIP call (e.g. SIP or H.323).
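
To make the split between static and dynamic values concrete, here is a minimal Python sketch. The table holds only the three static values mentioned above (the real registry is much longer), and the function name describe_payload_type is our own invention for illustration.

```python
# Static assignments mentioned in the text; the full registry is much longer.
STATIC_PAYLOAD_TYPES = {
    0: "G.711 u-law",
    8: "G.711 A-law",
    18: "G.729",
}


def describe_payload_type(pt: int) -> str:
    """Return a human-readable label for a 7-bit RTP payload type."""
    if not 0 <= pt <= 127:
        raise ValueError("payload type must fit in 7 bits (0..127)")
    if 96 <= pt <= 127:
        # The meaning of these values is agreed on by the signaling protocol.
        return "dynamic (negotiated by signaling)"
    return STATIC_PAYLOAD_TYPES.get(pt, "not listed in this sketch")


print(describe_payload_type(8))    # G.711 A-law
print(describe_payload_type(101))  # dynamic (negotiated by signaling)
```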

Sequence number: The sequence number starts at a random value and is incremented by one for each RTP packet sent. This lets the receiver identify packets received out of sequence and notice the gaps left by lost packets.
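
Because the sequence number field is only 16 bits wide (see Figure 1), it wraps around from 65535 to 0 during a long call, so a receiver cannot compare two values with a plain "less than". The following Python sketch shows one common way to handle this, under the assumption that packets are never displaced by more than half the number space. The helper names seq_delta and classify are hypothetical.

```python
SEQ_MOD = 1 << 16  # the sequence number field is 16 bits wide


def seq_delta(expected: int, received: int) -> int:
    """Signed distance from expected to received, allowing for wraparound."""
    d = (received - expected) % SEQ_MOD
    # Distances of half the number space or more are read as "behind".
    return d - SEQ_MOD if d >= SEQ_MOD // 2 else d


def classify(expected: int, received: int) -> str:
    """Describe a received packet relative to the one the receiver expected."""
    d = seq_delta(expected, received)
    if d == 0:
        return "in order"
    if d > 0:
        return f"gap of {d} packet(s)"
    return "late or duplicate"


print(classify(65535, 0))  # gap of 1 packet(s): 65535 itself is still missing
print(classify(10, 9))     # late or duplicate
```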

Timestamp: Like the sequence number, the timestamp is initialized with a random value. The clock frequency depends on the payload type. For the most common narrow-band audio, the frequency is 8000 Hz, and the timestamp is the tick count at the moment the first audio sample in the payload was sampled.
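
As a worked example, the sketch below computes how far the timestamp advances from one packet to the next. The 8000 Hz clock comes from the text; the 20 ms packet length is our assumption (a common choice, but not the only one), and the function names are made up for illustration. The timestamp field is 32 bits wide (see Figure 1), so the value wraps around.

```python
CLOCK_RATE_HZ = 8000  # narrow-band audio clock, as in the text
PACKET_MS = 20        # assumed packetization interval, not from the text


def ticks_per_packet(clock_rate_hz: int, packet_ms: int) -> int:
    """Number of clock ticks (samples) covered by one packet."""
    return clock_rate_hz * packet_ms // 1000


def timestamp_for(initial_ts: int, packet_index: int, ticks: int) -> int:
    """Timestamp of the n-th packet of a continuous stream (32-bit wraparound)."""
    return (initial_ts + packet_index * ticks) % (1 << 32)


ticks = ticks_per_packet(CLOCK_RATE_HZ, PACKET_MS)
print(ticks)                                # 160
print(timestamp_for(0xFFFFFF00, 2, ticks))  # 64, the counter has wrapped
```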

Synchronization Source Identifier (SSRC): A 32-bit identifier of the audio/video stream producer. In special situations, the stream is produced by a mixer that combines several source streams. The IDs of the contributing sources can then be listed in the CSRC fields, and the CC field gives the number of contributing sources. However, you will not see this used very often in practice.

In the most typical situation (no CSRC fields, no header extension), the RTP header consists of 12 bytes.
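
The field layout in Figure 1 maps directly onto a few bit operations. Here is a minimal Python sketch of a parser for the fixed 12-byte part of the header. The names RtpHeader, parse_rtp_header and header_length are ours, and the sketch does not handle the optional header extension.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header found at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # "!" selects network (big-endian) byte order: two bytes, a 16-bit
    # sequence number, a 32-bit timestamp and a 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


def header_length(header: RtpHeader) -> int:
    """Header size in bytes: 12 fixed bytes plus 4 per CSRC entry."""
    return 12 + 4 * header.csrc_count


# A hand-made packet: version 2, payload type 8, followed by 160 payload bytes.
sample = bytes([0x80, 0x08]) + struct.pack("!HII", 1234, 160, 0xDEADBEEF) + bytes(160)
print(parse_rtp_header(sample))
```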

RTP Control Protocol (RTCP)

RTCP accompanies RTP and is used to transmit control information about the RTP session. RTCP packets are sent only from time to time, since RFC3550 recommends that RTCP traffic be limited to about 5 percent of the session bandwidth.
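
To get a feeling for the numbers, the sketch below estimates the bandwidth of one G.711 stream and the resulting RTCP budget. The figures are our assumptions: 20 ms packets (160 payload bytes, 50 packets per second), a 12-byte RTP header, an 8-byte UDP header and a 20-byte IPv4 header without options. Link-layer overhead is ignored, and the function names are hypothetical.

```python
def session_bandwidth_bps(payload_bytes: int, packets_per_second: int,
                          header_bytes: int = 12 + 8 + 20) -> int:
    """Bits per second for one RTP stream, counting RTP, UDP and IPv4 headers."""
    return (payload_bytes + header_bytes) * 8 * packets_per_second


def rtcp_budget_bps(session_bps: int, percent: int = 5) -> int:
    """Share of the session bandwidth set aside for RTCP."""
    return session_bps * percent // 100


g711 = session_bandwidth_bps(payload_bytes=160, packets_per_second=50)
print(g711, rtcp_budget_bps(g711))  # 80000 4000
```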

The most important content types carried in RTCP packets include:

  • information about call participants (for example, name and e-mail address)
  • statistics about the quality of the transmission (for example, inter-arrival jitter and the number of lost packets). A participant that is sending RTP data, whether or not it also receives, reports them in a sender report (SR), while participants that only receive RTP streams send receiver reports (RR).
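
The inter-arrival jitter reported in RTCP is a running estimate that RFC3550 defines as J = J + (|D| - J)/16, where D is the change in transit time between two consecutive packets, measured in timestamp units. The Python sketch below follows that formula. It assumes the arrival times have already been converted to the same clock units as the RTP timestamps and it ignores timestamp wraparound; the function names are ours.

```python
from typing import Iterable, Optional, Tuple


def update_jitter(jitter: float, prev_transit: Optional[int],
                  arrival_ticks: int, rtp_timestamp: int) -> Tuple[float, int]:
    """Feed one packet into the estimator; return (new jitter, its transit)."""
    transit = arrival_ticks - rtp_timestamp
    if prev_transit is not None:
        d = abs(transit - prev_transit)
        # Move the estimate 1/16 of the way toward the newest deviation.
        jitter += (d - jitter) / 16.0
    return jitter, transit


def jitter_for(packets: Iterable[Tuple[int, int]]) -> float:
    """Jitter after a sequence of (arrival_ticks, rtp_timestamp) pairs."""
    jitter, prev = 0.0, None
    for arrival, ts in packets:
        jitter, prev = update_jitter(jitter, prev, arrival, ts)
    return jitter


# The third packet arrives 20 ticks later than its timestamp would suggest.
print(jitter_for([(1000, 0), (1160, 160), (1340, 320)]))  # 1.25
```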

By convention, RTP uses an even UDP port number (e.g. 5000) and the related RTCP stream uses the next higher odd port (e.g. 5001).
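
The port pairing can be expressed in a few lines of Python. The sketch assumes the application is handed a single base port; rounding an odd base down to the next lower even number follows the recommendation in RFC3550, and the function name port_pair is hypothetical.

```python
from typing import Tuple


def port_pair(base_port: int) -> Tuple[int, int]:
    """Return (rtp_port, rtcp_port) derived from a single base port."""
    if not 0 < base_port < 65535:
        raise ValueError("base port out of range")
    rtp_port = base_port - (base_port % 2)  # round an odd port down to even
    if rtp_port == 0:
        raise ValueError("no even port available below the given base")
    return rtp_port, rtp_port + 1


print(port_pair(5000))  # (5000, 5001)
print(port_pair(5001))  # (5000, 5001)
```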

Next section: VoIP Protocols: Introducing H.323

