Performance Issues in Mobile Video Telephony
Having given an overview of the standards for
mobile video telephony, including CS/PS terminals and call control
issues in 3GPP networks, we now turn to performance. In particular,
this section analyzes error resilience, QoS profiles for conversational
service, QoS metrics for video, video quality results for 3G-324M
terminals, SIP signaling delay, and RTCP reporting capability.
21.5.1 Error Resilience and QoS
In mobile video telephony special attention must
be paid to error resilience. Because an efficient system must
operate with minimal end-to-end delay, there is often not enough time
to repair media that has been hit by errors caused by the lossy
characteristics of the air interface. In most cases, forward
error correction (or redundancy coding) is the only means to provide
error resilience within the imposed delay limits. In addition, some
special precautions can be taken to reduce the impact of data
corruption and packet losses on the received media. Here we will focus
on PS video telephony systems.
Encoding a video signal with H.263 Profile 3 achieves
higher error resilience than baseline H.263. MPEG-4
Visual also offers advanced tools for error resilience, such as data
partitioning, RVLC (Reversible Variable Length Codes), and
resynchronization markers. To guarantee low delay, the specification [56]
recommends that video packets be no larger than 512 bytes. In
general, the smaller the packets, the smaller the amount of video data
lost (and the smaller the visual quality loss) when packets are lost. On
the other hand, packets that are too small produce excessive RTP/UDP/IP
header overhead. The choice of packet size is therefore a trade-off between
error resilience, delay, and bandwidth occupancy. The packet size can
also be changed dynamically, based on the condition of the
network link.
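The overhead side of this trade-off can be made concrete with a little arithmetic. The sketch below assumes uncompressed IPv4 (20 bytes), UDP (8 bytes), and RTP (12 bytes, no CSRC list or extensions) headers, i.e., 40 bytes per packet. The function name and the sample payload sizes are illustrative only and are not taken from the specification.

```python
# Fixed per-packet header cost, ASSUMING IPv4 without header compression.
IP_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12  # fixed RTP header, no CSRC list or extensions
HEADERS = IP_HEADER + UDP_HEADER + RTP_HEADER  # 40 bytes


def header_overhead(payload_bytes: int) -> float:
    """Fraction of each transmitted packet that is header rather than video."""
    if payload_bytes <= 0:
        raise ValueError("payload must be positive")
    return HEADERS / (HEADERS + payload_bytes)


if __name__ == "__main__":
    # Illustrative payload sizes (hypothetical, not from the specification).
    for payload in (50, 100, 250, 500):
        print(f"{payload:4d}-byte payload: {header_overhead(payload):.1%} header overhead")
```

Under these assumptions a 40-byte payload spends half of the link on headers, whereas payloads of several hundred bytes keep the overhead below 10 percent at the cost of losing more video data per lost packet.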
When encoding and packetizing speech data with AMR or AMR-WB, the specification [57] mandates (or forbids) the use of certain codec options:
- Speech data must be packetized using bandwidth-efficient operation. The encapsulation algorithm [58] offers both bit and byte alignment of data. The former is more efficient in terms of bandwidth usage.
- No more than one speech frame shall be encapsulated in an RTP packet, to keep the delay to a minimum. One AMR speech frame has a duration of 20 ms. This implies that the packet rate at the videophone terminal is 50 packets per second for both incoming and outgoing RTP flows.
- The multichannel session shall not be used.
- Interleaving shall not be used, because it would increase the delay.
- Internal CRC shall not be used. Data correction is performed in the lower layers of the protocol stack, so omitting the CRC saves bandwidth.
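The effect of these rules on packet rate and bit rate can be sketched as follows. The example assumes 40 bytes of uncompressed IPv4/UDP/RTP headers, a bandwidth-efficient payload made of a 4-bit codec mode request followed by a 6-bit table-of-contents entry per frame, and the 12.2-kbps AMR mode with 244 speech bits per frame. The function names are hypothetical, and the figures are a rough guide rather than values taken from the specification.

```python
import math

FRAME_MS = 20      # one AMR frame covers 20 ms of speech
HEADER_BYTES = 40  # ASSUMED: IPv4 (20) + UDP (8) + RTP (12), uncompressed


def packets_per_second(frames_per_packet: int = 1) -> float:
    """Packet rate of one RTP flow for a given number of frames per packet."""
    return 1000 / (FRAME_MS * frames_per_packet)


def payload_bytes(speech_bits: int, frames_per_packet: int = 1) -> int:
    """Bandwidth-efficient payload size: 4-bit CMR, then a 6-bit ToC entry and
    the speech bits for each frame, padded to a whole number of bytes."""
    bits = 4 + frames_per_packet * (6 + speech_bits)
    return math.ceil(bits / 8)


def ip_bit_rate_kbps(speech_bits: int, frames_per_packet: int = 1) -> float:
    """Bit rate of one flow at the IP level, headers included."""
    size = HEADER_BYTES + payload_bytes(speech_bits, frames_per_packet)
    return size * 8 * packets_per_second(frames_per_packet) / 1000


if __name__ == "__main__":
    AMR_122_BITS = 244  # speech bits per frame in the 12.2-kbps mode
    print(packets_per_second())            # 50 packets per second
    print(payload_bytes(AMR_122_BITS))     # bytes of RTP payload per packet
    print(ip_bit_rate_kbps(AMR_122_BITS))  # kbps including headers
```

Packing two frames per packet would halve the packet rate and the header cost, but it would add one frame duration (20 ms) of delay, which is the trade-off the one-frame rule resolves in favor of delay.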
For the transmission of real-time text using T.140,
redundancy coding is recommended to provide better error
resilience. [59]
At the network level, 3GPP specifications offer the
possibility to configure the QoS profile for a conversational
multimedia application running over a conversational PDP context. The
specification [60] defines the recommended target figures for error rates and delays:
In CS networks, errors in the air interface
produce single bit errors in the video packet payload. A video decoder
is generally resilient to bit error rates up to 10⁻³. In PS networks,
errors in the air interface produce erroneous packets that generally
are not forwarded to the protocol layers above IP, so they are
regarded as lost packets. In this case, SDU error ratios as indicated
previously can be used to provide enough media resilience against packet
losses.
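The difference between the two cases can be illustrated with a rough calculation. Under the simplifying assumption that bit errors are independent (real radio channels produce bursty errors, so this is only a guide), the probability that a packet of n bytes contains at least one bit error is 1 - (1 - BER)^(8n). The function name below is hypothetical.

```python
def packet_error_probability(ber: float, packet_bytes: int) -> float:
    """Probability that a packet holds at least one bit error,
    ASSUMING independent bit errors with probability `ber`."""
    if not 0.0 <= ber <= 1.0:
        raise ValueError("ber must be between 0 and 1")
    return 1.0 - (1.0 - ber) ** (8 * packet_bytes)


if __name__ == "__main__":
    # Illustrative packet sizes (hypothetical values).
    for size in (50, 100, 500):
        p = packet_error_probability(1e-3, size)
        print(f"BER 1e-3, {size}-byte packet: {p:.1%} of packets hit")
```

A bit error rate that a decoder tolerates when it sees the corrupted bits directly would therefore translate into a very high loss ratio if every corrupted packet were discarded, which is why PS bearers are dimensioned in terms of SDU error ratio instead.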
21.5.2 Video QoS Metrics
Adequate techniques for objective and subjective
speech and video quality assessment must be adopted to guarantee that a
given mobile videophone implementation fulfills a minimum set of QoS
requirements. This section focuses on video quality metrics.
When developing mobile video telephony applications, it is
necessary to decide which fundamental quality parameters should be
selected as key parameters in the QoS assessment
of video. For this purpose both subjective and objective metrics must
be used, because they are complementary. See
Curcio [61]
for details about subjective metrics. Regarding objective quality
metrics, standardization bodies have defined some methods. For example,
the ANSI [62] and ITU [63]
standards describe some metrics. However, some of the metrics
described in these documents are not
straightforward to implement. Despite the effort of standardization bodies to define
common video quality metrics, the most widely used objective method for
video quality assessment is still the PSNR (Peak Signal-to-Noise Ratio),
because it is the easiest of the available metrics to apply. However,
other useful quality metrics can be put to use when developing mobile
videophone terminals. For further details on the metrics computation
methods, please refer to Curcio. [64], [65]
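As an illustration of why PSNR is easy to apply, the following minimal sketch computes it for frames represented as flat sequences of 8-bit samples. Averaging the per-frame values over the sequence is one common convention and is an assumption here; the exact computation used in the cited references may differ, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized frames (flat sequences of samples)."""
    if len(original) != len(decoded) or not original:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)


def sequence_psnr(originals, decodeds):
    """Mean and standard deviation of the per-frame PSNR over a sequence.
    ASSUMES no frame is reproduced perfectly (no infinite values)."""
    values = [psnr(o, d) for o, d in zip(originals, decodeds)]
    return mean(values), pstdev(values)


if __name__ == "__main__":
    # Tiny two-sample "frames", purely for demonstration.
    originals = [[10, 10], [10, 10]]
    decodeds = [[11, 9], [12, 8]]
    print(sequence_psnr(originals, decodeds))
```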
The quality metrics are categorized into six classes, depending on the type of information they can provide:
- Frame-based: This set of metrics gives information about the number of frames that have been processed end-to-end. The metrics are
- Bit rate-based: The objective of these metrics is to provide information about how the channel bandwidth is shared. This information is valuable for optimizing system performance. The metrics are
- Packet-based: These metrics give information about the packets that are generated by the RTP packetizer or the H.223 multiplexer:
- Loss- or corruption-based: These metrics provide information about the number of packets lost, or the amount of correctly/incorrectly delivered data:
- PSNR-based: PSNR is a measure of the difference between the original frame and the corresponding encoded (or decoded) frame. PSNR-based metrics are
  - PSNR of the video sequence
  - Standard deviation of PSNR
  - PDF (Probability Density Function) and CDF (Cumulative Distribution Function) of PSNR
  - Representative run for subjective evaluation (when multiple simulation runs are considered)
- Delay-based: Delay is a very critical issue in mobile video telephony. Because end-to-end delay is made up of different components, one approach is to measure the different delays and try to optimize them separately. A set of measurable delays is
  - Capturing delay
  - Initial video encoding delay (time required to encode the first INTRA frame)
  - Encoding delay for video frames (minimum, average, and maximum)
  - Packetization delay
  - Transmission delay (related to the network)
  - Depacketization delay
  - Decoding delay for video frames (minimum, average, and maximum)
  - Display delay
  - End-to-end delay
  - PDF and CDF for any of the delay components above
  - Out of delay constraints rate (to measure the percentage of delay violations over a fixed time threshold T)
  - Delay jitter computed for the different delays above (a particularly interesting value is the frame rate jitter)
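Several of the delay-based metrics reduce to simple statistics over a list of measured delays. The sketch below is one possible way to compute them. The function names are hypothetical, and the jitter definition used here (mean absolute difference between consecutive samples) is only one of several in use and is an assumption, since the text does not fix one.

```python
from statistics import mean


def delay_summary(delays_ms):
    """Minimum, average, and maximum of a list of delay samples."""
    return min(delays_ms), mean(delays_ms), max(delays_ms)


def out_of_constraints_rate(delays_ms, threshold_ms):
    """Percentage of samples whose delay exceeds the threshold T."""
    late = sum(1 for d in delays_ms if d > threshold_ms)
    return 100.0 * late / len(delays_ms)


def mean_jitter(delays_ms):
    """ASSUMED definition: mean absolute difference between consecutive samples."""
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return mean(diffs) if diffs else 0.0


def empirical_cdf(delays_ms):
    """List of (delay, fraction of samples with delay <= that value)."""
    ordered = sorted(delays_ms)
    n = len(ordered)
    return [(d, (i + 1) / n) for i, d in enumerate(ordered)]


if __name__ == "__main__":
    samples = [100, 120, 160, 110]  # hypothetical end-to-end delays in ms
    print(delay_summary(samples))
    print(out_of_constraints_rate(samples, threshold_ms=150))
    print(mean_jitter(samples))
    print(empirical_cdf(samples))
```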
21.5.3 Video Quality Results for 3G-324M
To give an idea of the performance of a
videophone in a mobile environment, we have implemented a PC version of
a 3G-324M terminal and made mobile-to-mobile calls between two 3G-324M PC
terminals through a simulated circuit-switched WCDMA network at 64
kbps. Table 21.2 summarizes the main simulation parameters used in our tests.
Table 21.2: 3G-324M Simulation Parameters
| Parameter | Value |
|---|---|
| Speech | Preencoded AMR speech stream with average bit rate of 4.9 kbps (silence suppression is used) |
| Video codec | H.263+ with Annex F, I, J, T |
| Input frame rate | 30 fps |
| Frame size | QCIF (176 × 144 pixels) |
| Original video sequence | Carphone concatenated three times; original: 382 frames, 12.7 seconds; concatenated: 1146 frames, 38.1 seconds |
| WCDMA channel bit rate | 64 kbps |
| Mobile speeds | 3 and 50 kmph |
| Bit error rates (BERs) | 64 kbps, 3 kmph: 2*7E-05 and 2*2E-04; 64 kbps, 50 kmph: 2*6E-05 and 2*2E-04 |
| Frequency | 1920 MHz |
| Chip rate | 4.096 Mbps |
| Transmission direction | Uplink |
| Interleaving depth | 40 ms |
| Coding | 1/3-rate turbo code, 4 states |
| Duration of each error pattern | 180 seconds |
| Multiplexing | H.223 Level 2 |
| Number of simulations | 10 for each error pattern file (each time starting from a different random position of the file) |
The error patterns were injected twice into the bit
stream to simulate a mobile-to-mobile connection, where two
radio links are involved (this is why the bit error rates are
doubled in Table 21.2).
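The double injection can be sketched as follows. This is not the simulator used for the tests: the function names are hypothetical, and reading the error pattern cyclically when the random start position is close to the end of the file is an assumption.

```python
import random


def inject(bits, pattern, start):
    """XOR an error pattern (1 = flipped bit) onto a bit stream, reading the
    pattern cyclically from position `start` (wrap-around is an ASSUMPTION)."""
    n = len(pattern)
    return [b ^ pattern[(start + i) % n] for i, b in enumerate(bits)]


def two_link_channel(bits, pattern, rng=random):
    """Apply the pattern twice, once per simulated radio link, each time
    starting from an independently chosen random position."""
    first = inject(bits, pattern, rng.randrange(len(pattern)))
    return inject(first, pattern, rng.randrange(len(pattern)))


def bit_error_rate(sent, received):
    """Fraction of bits that differ between the sent and received streams."""
    return sum(s != r for s, r in zip(sent, received)) / len(sent)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical pattern with a BER of 1e-3 on a single link.
    pattern = [1 if rng.random() < 1e-3 else 0 for _ in range(200_000)]
    sent = [0] * 100_000
    received = two_link_channel(sent, pattern, rng)
    print(bit_error_rate(sent, received))
```

For sparse patterns the resulting bit error rate is close to twice that of a single link; it is slightly lower whenever the same bit happens to be flipped on both links and is thereby restored.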
The results were measured in terms of average
PSNR over 10 runs, standard deviation of PSNR, frame rate, delay,
bandwidth usage, and visual quality. Readers interested in details
about the performance of 3G-324M terminals at different bit rates with
service flexibility for WCDMA and HSCSD networks may refer to Curcio
and coworkers, [66] Hourunranta and Curcio, [67] and Curcio and Hourunranta. [68] Average PSNR results are shown in Table 21.3.
Table 21.3: PSNR for Video over 3G-324M
| Speed/BER | Average PSNR (dB) | Standard Deviation of PSNR (dB) |
|---|---|---|
| Error free | 32.12 | 0.02 |
| 3 kmph 2*7E-05 | 32.01 | 0.07 |
| 3 kmph 2*2E-04 | 31.65 | 0.19 |
| 50 kmph 2*6E-05 | 31.89 | 0.14 |
| 50 kmph 2*2E-04 | 31.64 | 0.19 |
The maximum quality loss observed at the highest BER is
below 0.5 dB relative to the error-free case, with a maximum standard
deviation below 0.2 dB. The
average encoding frame rate was 10.2 frames per second. The end-to-end
delay from encoding to display (excluding capturing and network delay)
was 140 ms, of which 98 ms was shaping delay [69] and 42 ms was processing delay. The bandwidth usage is reported in Table 21.4.
Table 21.4: Bandwidth Repartition for 3G-324M
| Type of Data | Share of Total Bandwidth (%) |
|---|---|
| Video data | 84 |
| Audio data | 8 |
| H.223 multiplexer overhead | 8 |
Finally, Figure 21.9
shows the average visual quality under the worst of the conditions
tested (BER = 2*2E-04 at 3 kmph). The sample picture was selected so
that its PSNR is as close as possible to the average PSNR
(31.65 dB). As can be seen, the picture does not show
critical degradations, and its quality is fairly good.
21.5.4 SIP Signaling Delay
One factor that influences the overall QoS perceived by the user,
in addition to media quality, is the call setup delay. It is
important when evaluating overall user satisfaction with a
service. This section addresses the issue by
evaluating the signaling performance of a SIP user agent (UA) with
video telephony capabilities that we have implemented.
When SIP is used over UDP on a mobile network, the call
setup time between two terminals can vary because of the following
factors:
- Lossy nature of the channel: If SIP packets are lost during call establishment, they are retransmitted.
- Size of the channel: Smaller network bandwidths yield higher call setup delays than larger bandwidths.
- Processing delays in the network: Each network element takes some time to process the requests made by the endpoints.
- Congestion in the network path between the two endpoints.
The message reliability mechanism defined in SIP [70]
is designed to cope with packet losses and unexpected
delays within the network. The basic idea is that if a SIP message is
not received within a specified time, it is retransmitted by
the protocol itself. The retransmission rules for the
different SIP messages exchanged in a session between two SIP UAs, such
as the one in Figure 21.6, are explained below:
- INVITE method. A SIP UA should retransmit an INVITE request at an interval that starts at T1 seconds and doubles after each packet transmission. T1 is an estimate of the round-trip time (RTT). The client stops retransmitting if it receives a provisional (1xx) or definitive (2xx) response, or once it has sent a total of seven request packets. A UA client may send a BYE or CANCEL request after the seventh retransmission (i.e., after 64*T1 seconds). In our implementation the value of T1 is set to 0.5 seconds.
- BYE method. In this case, a SIP client should retransmit requests with an exponential backoff for congestion control reasons. For example, if the first packet sent is lost, the second packet is sent T1 seconds later, the next one 2*T1 seconds after that, then 4*T1 seconds, and so on, until the interval reaches a value T2. Subsequent retransmissions are spaced by T2 seconds. T2 represents the amount of time a BYE server transaction will take to respond to a request, if it does not respond immediately. If the client receives a provisional response, it continues to retransmit the request, but with an interval of T2 seconds (this is done to ensure reliable delivery of the final response). Retransmissions cease when the client has sent a total of 11 packets (i.e., after 64*T1 seconds), or when it has received a definitive response. Responses to BYE are not acknowledged via ACK. In our implementation the values of T1 and T2 are set to 0.5 and 4 seconds, respectively.
- ACK method. ACK is not retransmitted, but if it is lost the UA server retransmits the 200/OK.
- Informational (provisional) responses (1xx). UA servers do not transmit informational responses reliably. For instance, our implementation does not retransmit informational responses (100/TRYING, 180/RINGING) on its own initiative. However, a UA server that has transmitted a provisional response will retransmit it upon reception of a duplicate request.
- Successful responses (2xx). A UA server does not retransmit responses to BYE. In all other cases a UA server that transmits a final response should retransmit it with the same spacing as the BYE. Response retransmissions cease when an ACK request is received or the response has been transmitted 11 times (i.e., after 64*T1 seconds). The value of a final response is not changed by the arrival of a BYE or CANCEL request.
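The two request schedules described above can be written down in a few lines. This is a minimal sketch of the timing rules only, with hypothetical function names; it assumes that no response is ever received and it does not model the switch to T2 spacing after a provisional response.

```python
def invite_send_times(t1=0.5, max_packets=7):
    """Send times (in seconds) of an unanswered INVITE: the interval starts
    at T1 and doubles after each transmission."""
    times, t, interval = [], 0.0, t1
    for _ in range(max_packets):
        times.append(t)
        t += interval
        interval *= 2
    return times


def bye_send_times(t1=0.5, t2=4.0, max_packets=11):
    """Send times (in seconds) of an unanswered BYE: exponential backoff
    starting at T1, with the interval capped at T2."""
    times, t, interval = [], 0.0, t1
    for _ in range(max_packets):
        times.append(t)
        t += interval
        interval = min(interval * 2, t2)
    return times


if __name__ == "__main__":
    print("INVITE:", invite_send_times())
    print("BYE:   ", bye_send_times())
```

With T1 = 0.5 seconds and T2 = 4 seconds, both schedules place their last packet 31.5 seconds after the first one, just inside the 64*T1 = 32-second limit mentioned in the text.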
In 3GPP Release 5 networks, the timers T1 and T2 are set to different and more-conservative values.
Our tests were performed over a 3GPP
Release '99 network emulator. The results are expressed in terms of
the following metrics:
- Postdialing delay (PDD). It is also called postselection delay or dial-to-ring delay. This is the time elapsed between the caller pressing the button on the terminal to make the call and hearing the terminal ringing. In our case the PDD corresponds to the time T1 (see Figure 21.6).
- Answer-signal delay (ASD). This is the time elapsed between the phone being picked up and the caller receiving indication of this. In our case the ASD corresponds to the time T2 (see Figure 21.6). It must be noted that the caller is notified that the callee has picked up the phone when the caller receives the 200/OK. However, the call-signaling handshake is completed only when the callee receives the ACK from the caller. This is the reason we have considered the ASD in this way.
- Call-release delay (CRD). This is the time elapsed between the phone being hung up by the releasing party (the caller in our example in Figure 21.6) and the moment a new call can be initiated or received by the same party. In our tests the CRD corresponds to the time T3.
Results of simulations are shown in Table 21.5.
No signaling compression algorithms were used. For comparisons between
calls over 3GPP Release '99 networks and calls in Intranet or WLAN
environment, and for further details about SIP signaling delays, the
reader can refer to Curcio and Lundan. [71], [72]
Table 21.5: Call Setup Times for SIP Signaling
| Call Setup Metric | Delay (ms) |
|---|---|
| Postdialing delay | 62 |
| Answer-signal delay | 45 |
| Call-release delay | 50 |
Table 21.6
contains results for SIP call setup times with restricted
bandwidths. In this case, too, no signaling compression was used. The
results in Table 21.5
assumed a bandwidth of 384 kbps; however, in many cases it is
more realistic to assume, as we did here, that the bearer reserved for SIP signaling
is a dedicated one of smaller size. In addition, when running the tests
with restricted bandwidth we injected 2 percent packet losses
using the NISTNET [73]
simulator. The figures show that the postdialing
delay increases to almost one second for network bandwidths as narrow as 2
kbps.
Table 21.6: Postdialing Delay for SIP Signaling with Limited Bandwidth
| Network Bandwidth (kbps) | Delay (ms) |
|---|---|
| 2 | 981 |
| 5 | 427 |
| 9.2 | 287 |
| 16 | 164 |
| 32 | 119 |
| 64 | 78 |
The results presented in this section are not
related to SIP signaling within Release 5 of 3GPP specifications, where
SIP is part of the call control in the IMS. In this case, the SIP
signaling delays are estimated to be larger than those shown, due to
the increased complexity of the whole network system.
21.5.5 RTCP Performance
The basics of RTCP were introduced in Chapter 4, Section 4.5.
Recall that RTCP is used by a receiver to provide QoS
information to the transmitting party, so that the sender can repair the
media transmitted or, in general, take some (possibly prompt) action to
adjust or improve the QoS toward the receiver.
RTCP packets are normally sent with a minimum interval
of 5 seconds. However, some applications may benefit from sending
more-frequent feedback. Video telephony can certainly benefit from
faster feedback, because it allows the sender
terminal to react faster and provide a better QoS to the receiving terminal. One
possible action is to change the encoding parameters on the fly when
the packet loss rate increases. This action should be taken as early as
possible in the transmitting terminal, and a 5-second interval could be
too long a time window to allow a fast reaction, especially if the
media to repair is a speech stream.
The new RTP specifications [74], [75]
define a more-flexible use of the RTCP data flow, allowing
more-frequent feedback by reducing the transmission interval to a value
lower than 5 seconds or by fixing the percentage of the RTP session
bandwidth reserved for RTCP traffic.
We have run some tests for 1-minute speech and video
streams. The former was encoded using the AMR codec at 12.2 kbps with
silence suppression. The latter was encoded using the H.263+ video
codec at 64 kbps. For the speech session the maximum RTCP packet length
was 168 bytes (including UDP/IP headers, a sender report and full
SDES), while for the video session the RTCP packet length was 88 bytes
(including UDP/IP headers, a receiver report and SDES). Results for
different RTCP bandwidth percentages are shown in Tables 21.7 and 21.8.
Table 21.7: Results for Different RTCP Bandwidth Percentages (AMR Speech)
| RTCP Bandwidth (%) | RTCP Bandwidth (kbps) | Average RTCP Interval (ms) | Number of RTCP Packets |
|---|---|---|---|
| 1.0 | 0.19 | 5000 | 12 |
| 1.6 | 0.28 | 3158 | 19 |
| 1.9 | 0.35 | 2609 | 23 |
| 2.2 | 0.39 | 2308 | 26 |
Table 21.8: Results for Different RTCP Bandwidth Percentages (H.263+ Video)
| RTCP Bandwidth (%) | RTCP Bandwidth (kbps) | Average RTCP Interval (ms) | Number of RTCP Packets |
|---|---|---|---|
| 0.2 | 0.14 | 5000 | 12 |
| 0.3 | 0.18 | 4000 | 15 |
| 0.5 | 0.34 | 2069 | 29 |
| 1.0 | 0.67 | 1053 | 57 |
| 1.2 | 0.82 | 857 | 70 |
| 1.9 | 1.31 | 536 | 112 |
| 2.4 | 1.63 | 432 | 139 |
| 2.8 | 1.88 | 375 | 160 |
The leftmost column in Tables 21.7 and 21.8
contains the RTCP bandwidth as a percentage of the RTP session bandwidth,
which includes media and header overhead (including RTP/UDP/IP
headers). The second column contains the RTCP bandwidth in kilobits per
second. The third column is the computed average RTCP interval
between two QoS reports. The last column contains the number of RTCP
packets sent by a receiver during 1 minute of data reception.
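The relationship between the columns can be checked with a short calculation. The sketch below assumes that every RTCP packet in the video session has the stated length of 88 bytes, which reproduces the bandwidth and packet-count columns of Table 21.8 from the interval column. For the speech session only the maximum packet length (168 bytes) is given, so the same calculation would yield an upper bound there. The function names are hypothetical.

```python
def rtcp_bandwidth_kbps(packet_bytes, interval_ms):
    """Average RTCP bandwidth for one packet of the given size per interval.
    Bits per millisecond are numerically equal to kilobits per second."""
    return packet_bytes * 8 / interval_ms


def reports_per_minute(interval_ms):
    """Number of RTCP packets sent in one minute, rounded to the nearest integer."""
    return round(60_000 / interval_ms)


if __name__ == "__main__":
    VIDEO_RTCP_PACKET_BYTES = 88  # from the text: receiver report plus SDES
    for interval_ms in (5000, 4000, 2069, 1053, 857, 536, 432, 375):
        kbps = rtcp_bandwidth_kbps(VIDEO_RTCP_PACKET_BYTES, interval_ms)
        print(f"{interval_ms:5d} ms  {kbps:.2f} kbps  {reports_per_minute(interval_ms):3d} packets")
```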
The minimum bandwidth
occupied by RTCP is below 0.2 kbps when a 5-second transmission
interval is used. In this case the receiver sends only 12 QoS reports.
When the bandwidth reserved for RTCP is increased, more-frequent feedback
can be sent. For example, for speech traffic, a feedback message every
2.3 seconds would allow 26 QoS reports in one minute. In the same way,
for video traffic, a feedback message every 375 ms would allow 160 QoS
reports in one minute. This would give the transmitting terminal
more opportunities to adjust the error resilience properties of the
video stream. Theoretically, it would be possible to have 160 QoS
reports for the speech stream as well. However, this would imply an
RTCP bandwidth of about 2.4 kbps, i.e., over 13 percent of the RTP
session bandwidth for speech. The reader interested in details about RTCP traffic can refer to Curcio and Lundan. [76], [77]
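The 2.4-kbps figure can be reconstructed from the last row of Table 21.7. The sketch below derives the average speech RTCP packet size and the RTP session bandwidth from that row and then scales to 160 reports per minute. Treating the average packet size as constant is an assumption, and the variable and function names are hypothetical.

```python
# Last row of Table 21.7 (AMR speech).
RTCP_KBPS = 0.39     # RTCP bandwidth
INTERVAL_MS = 2308   # average RTCP interval
RTCP_SHARE = 0.022   # 2.2 percent of the RTP session bandwidth

# kbps multiplied by ms gives bits: the average size of one RTCP packet.
avg_packet_bits = RTCP_KBPS * INTERVAL_MS
# RTP session bandwidth implied by the same row, in kbps.
session_kbps = RTCP_KBPS / RTCP_SHARE


def required_rtcp_kbps(reports_per_minute, packet_bits=avg_packet_bits):
    """RTCP bandwidth needed to send the given number of reports per minute,
    ASSUMING a constant average packet size."""
    return reports_per_minute * packet_bits / 60 / 1000


if __name__ == "__main__":
    target = required_rtcp_kbps(160)
    print(f"average packet size: {avg_packet_bits / 8:.0f} bytes")
    print(f"RTCP bandwidth for 160 reports/minute: {target:.2f} kbps")
    print(f"share of session bandwidth: {target / session_kbps:.1%}")
```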