Hierarchical resource allocation for robust in-home video streaming

Peter van der Stok1,2, Dmitri Jarnikov1,2, Sergei Kozlov1, Michael van Hartskamp2, Johan Lukkien1

1 Eindhoven University of Technology, Eindhoven, Netherlands; 2 Philips Research, Eindhoven, Netherlands

Abstract. High-quality video streaming puts high demands on network and processor resources. The bandwidth of the communication medium and the timely arrival of frames necessitate tight resource allocation. Given the dynamic environment, in which videos are started and stopped and electromagnetic perturbations affect the bandwidth of the wireless medium, a framework is needed that reacts in a timely manner to changes in network load and network operating conditions. This paper describes a hierarchical framework that handles dynamic network resource allocation in a timely manner.

1. Introduction

Today, TVs, video recorders and set-top boxes are mostly interconnected with SCART cables. The advent of broadband and the introduction of a second PC in the home make digital home networks a more realistic way to interconnect Consumer Electronics (CE) devices. This tendency moves video away from dedicated media to video streaming across an open, shared network that connects multiple types of devices (e.g. phones, PCs, and CE devices). This new multimedia environment introduces the problem of sharing limited network resources between video and other applications. Both the need for resources, expressed as bit rate, and the availability of resources, expressed as bandwidth, fluctuate within intervals of tens of milliseconds. In addition, the video must be delivered at the destination in a timely manner. The source of the video can be either live (broadband TV or a camera) or taken from a storage medium. For a live transmission, a low overall delay from generation to display is mandatory, which imposes strict timeliness requirements. The source can be located inside the home, outside it, or connected through some gateway (e.g. a broadband connection). The quality of the source video can vary from relatively poor - in the order of tens of kbit/s - for use on a small display, to high quality - in the order of 6 to 40 Mbit/s - for use on a large flat-screen TV.

Sharing the network medium among several applications lowers the bandwidth available to a given video. In addition, a significant part of the home network will be based on wireless technology. The consequences of wireless connections are reduced security and bandwidth, as well as increased fluctuation of the bandwidth caused by interference from other transmission sources and by moving objects.

Not meeting the resource and timeliness requirements leads to non-optimal viewing experiences in the form of distortion, hiccups, delayed viewing or stalling. To avoid such severe quality changes (which would make people reluctant to buy networks and TVs), we propose a scheme that allocates the resources in such a manner that under overload a tolerable quality degradation occurs and recognizable video is provided at all times. The scheme combines the video source, the video coding, and the transport protocol, and is especially advantageous for live broadcasts. It distinguishes fast fluctuations at the frame level (≈ 40 ms) from structural fluctuations. Fast fluctuations are caused by variations in frame sizes and by distortions of the bandwidth. Structural fluctuations last longer and come, for example, from the starting or stopping of another application.

It is not sufficient to come up with technologically viable solutions. Because many manufacturers provide network equipment and CE devices, interoperability must be assured. The situation in the home is very different from telephone or service-provider networks. The latter are under the control of one operator, who decides on the resource-management procedures and technology. In the home there is no such authority, and standardization must assure that the provided technologies collaborate to support the policies desired by the users of the home network.

2. Related work

Most CE devices (TV, DVD) are resource-constrained to keep them in an acceptable price range. For telephones, similar resource constraints are mostly driven by battery power. The work on video streaming most related to ours can be divided into four areas: (1) scheduling of packets on the link, (2) adaptation of processing power to video requirements in the renderer, (3) multicasting video with a bit rate adapted to the individual capacities of the receivers, and (4) transporting video over a network.

Scheduling packets. The packets of all applications that share the network need to be scheduled. It is assumed that a network authority (as foreseen by UPnP QoS standardization [32][33]) allocates bandwidth to the individual applications. To prevent buffer overflows and provide a balanced loading of the network, packets are scheduled such that the allocated bandwidth is not exceeded and the network load is evenly distributed over time. Unscheduled packets are viewed as interference to the scheduled applications. Leaky-bucket and token-bucket techniques are two well-known examples [2]. Generalized Processor Sharing (GPS) is an idealized scheduling mechanism that has been applied to networks. Two examples show how scheduling techniques can allocate bandwidth to streams or separate asynchronous traffic from synchronous (video) traffic [22][23].
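To illustrate the token-bucket idea mentioned above, the following minimal sketch (names and parameters are ours and not taken from [2]) admits a packet only when enough byte credit has accumulated, which bounds the average rate while still allowing short bursts:

    import time

    class TokenBucket:
        """Minimal token-bucket shaper: a packet is admitted only if enough
        byte credit has accumulated since the last check."""

        def __init__(self, rate_bps, burst_bytes):
            self.rate = rate_bps / 8.0        # refill rate in bytes per second
            self.capacity = burst_bytes       # maximum burst size in bytes
            self.tokens = burst_bytes         # start with a full bucket
            self.last = time.monotonic()

        def allow(self, packet_len):
            now = time.monotonic()
            # Refill proportionally to the elapsed time, capped at the burst size.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= packet_len:
                self.tokens -= packet_len     # admit the packet and consume credit
                return True
            return False                      # hold the packet until credit accrues

A shaper for a 4 Mbit/s stream that tolerates bursts of one Ethernet frame would, for instance, be created as TokenBucket(4_000_000, 1500).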

Processing power in CE devices. In [7] and [17] the observation is made that the CPU needs of a video stream fluctuate from frame to frame and from scene to scene. A distinction is made between fast fluctuations at the frame level and slower fluctuations at the video-scene level. The concept of a Scalable Video Algorithm (SVA) is introduced, which adapts the quality of the decoding process to control the CPU requirements. In this framework it is possible to provide the highest quality while still meeting the deadline of each frame. In [8] the authors describe the allocation of priorities to video frames such that important frames, on which other frames depend, have a higher probability of acquiring CPU resources, maintaining the highest possible video quality under processing overload.
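As a rough illustration of the SVA idea (our own sketch; the frameworks in [7] and [17] are more elaborate), a decoder could pick, per frame, the highest quality level whose estimated decoding cost still fits the CPU budget of one frame period:

    def choose_decode_quality(frame_budget_ms, cost_per_level_ms):
        """Pick the highest decoding-quality level whose estimated cost still
        fits the per-frame CPU budget (e.g. 40 ms at 25 frames/s).
        cost_per_level_ms is ordered from lowest to highest quality."""
        best = 0
        for level, cost in enumerate(cost_per_level_ms):
            if cost <= frame_budget_ms:
                best = level                  # higher level = higher quality
        return best

    # With a 40 ms budget and levels costing 10, 25 and 45 ms,
    # level 1 (medium quality) is the highest that still meets the deadline.
    print(choose_decode_quality(40, [10, 25, 45]))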

Multicasting. Devices have different processor and memory capabilities, so not every device may be capable of processing all video data streamed by the sender. To ensure that all devices process video according to their capacity, a sender sends to each destination the amount of data that the device can successfully process. In [16] the sender adapts the content. There are several strategies for content adaptation; the three foremost are the simulcast model, transcoding, and the scalable video coding model. With the simulcast approach [16] the sender produces several independently encoded copies of the same video content, which differ in e.g. bit rate, frame rate, and spatial resolution. The sender delivers these copies to the destinations in agreement with the specifications coming from the destinations.
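A simulcast sender essentially performs a selection like the following sketch (our own illustration, not the mechanism of [16]): among the independently encoded copies, it picks the best one that the destination reports it can handle:

    def pick_simulcast_copy(copy_bitrates, receiver_capacity):
        """From independently encoded copies of the same content (bit rates in
        bit/s), pick the highest-quality copy the receiver can still process."""
        feasible = [r for r in copy_bitrates if r <= receiver_capacity]
        return max(feasible) if feasible else None   # None: even the lowest copy is too heavy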

Video transport. Two protocols are generally deployed for the transport of video over a network with the Internet Protocol (IP) facilities defined by the Internet Engineering Task Force (IETF): the Transmission Control Protocol (TCP) [20] and the Real-time Transport Protocol (RTP) [19]. RTP is very successful on the Internet for supporting live video, but the quality displayed on the PC is far below the quality accepted by buyers of digital TVs. RTP promotes timely arrival of packets by allowing packet loss. Efforts are ongoing to extend RTP with retransmission facilities like those provided by TCP [34]. TCP provides loss-less transport of packets but supports live streaming over the Internet poorly. The TCP-RTM protocol is an effort to provide timeliness to audio packets by skipping late packets [21].

A consequence of the fluctuating bandwidth of the wireless medium is that the quality of the rendered video fluctuates as well. In [14] a controller at the receiver side smooths these fluctuations by selecting the transmitted video parts such that the quality variations are reduced.

Standardization for the home. Only a few years ago, IEEE 1394 [25] and HiperLAN [26] were considered promising candidates for connectivity in consumer-electronics home networking, as they offered timely delivery of packets and are therefore suited for multimedia transport. Now wired and wireless Ethernet are recognized as the predominant connectivity standards. The advantage is that only one technology is used for all networking applications in the home (e.g. file transfer, chatting, audio, and video). However, Ethernet offers only best-effort service, with no timeliness guarantees. The IEEE 802.11e standard [27], which provides extensions to wireless Ethernet, offers prioritized and scheduled access, as well as several other enhancements. Before the IEEE 802.11e standard was completed in 2005, the Wi-Fi Alliance had started a certification program for wireless multimedia based on IEEE 802.11e. The program for prioritized access, called Wi-Fi Multimedia (WMM) [28], was completed in 2004. The Wi-Fi Alliance is currently working on a certification program for scheduled access. Several other connectivity technologies that include scheduled access have recently been developed: WiMedia [29], HomePlug AV [30], etc. Even for wired Ethernet, the Ethernet AV initiative [31] intends to improve its timeliness properties.

As home networks are expected to remain heterogeneous in their connectivity technologies, middleware solutions are being developed to deal with this heterogeneity. One of the more popular middleware technologies for the home is UPnP [32]. The UPnP Forum standardizes so-called device control protocols for e.g. AV applications and Internet Gateway Devices, but also for Quality of Service (QoS). The UPnP-QoS v1 and ongoing v2 [33] specifications define the use of priority-based policies. Work is currently ongoing in UPnP-QoS to develop version 3 on parameterized QoS and scheduled access. UPnP-QoS makes it possible to share bandwidth in accordance with application quality criteria. As such it becomes possible to share the network between different types of applications; for example, the real-time aspects of audio and video streaming can be guaranteed at the expense of delays for file transfers.

The Digital Living Network Alliance (DLNA) is an industry forum that provides interoperability guidelines for implementing digital media servers, players and renderers [34]. Guidelines are written on the use of wired and wireless Ethernet and Bluetooth, TCP/IP and UPnP. TCP is the mandatory transport protocol for AV content. The DLNA also defines various profiles for AV media formats [35].

3. Video transport

This section motivates our choice of TCP and the deployment of scalable video coding. Figure 1 shows the most important features of the network configurations we consider.

Figure 1 Example network configuration

The sender contains a video application that sends stored or live video to a destination on the network. The application invokes a transport protocol, which packs the video frames into packets, and a traffic shaper sends the packets to the network at a regular pace. The network is composed of a wired (switched Ethernet) part and a wireless part. Packets are buffered and forwarded by the switch and the Access Point (AP). Losses due to buffer overflow may occur at the sender, the switch, the AP and the receiver. On today's wired segments there is almost no packet loss. In addition, measurements showed that losses over the wireless segment do not occur when the retry counter is set to 4 or higher [24]. Consequently, we may assume that packet loss over the communication media in the home is negligible and that losses mostly occur due to buffer overflow.

3.1. Transport protocols

When unreliable transport protocols (such as RTP [19]) are used for sending multimedia streams over a large network (the Internet), it is very difficult to control the data losses caused by congestion in the routers or by the low reliability of the medium, e.g. a wireless link. The usual practice for handling such losses is to use error-recovery mechanisms at the receiver and/or redundancy coding at the sender. However, these mechanisms are often combined with a content-adaptation technique that uses feedback from the network or the receiver to inform the sender about the losses. This makes the system difficult to implement.

TCP, being a reliable transport protocol, eliminates uncontrollable losses of data. Applications built upon TCP see the network as a reliable transport means with varying throughput. Nevertheless, if at some moment the application needs to send more data than TCP can deliver (the network bandwidth drops below the bit rate of a live encoded stream), data can be lost due to application/TCP buffer overflows. Introducing larger buffers may decrease the probability of buffer over- and under-flows, because the network throughput often drops below the video bit rate only for a short time, after which the “recovery” takes place. The larger the buffers, the longer the periods of insufficient bandwidth that can go unnoticed by the end-user. The cost of large buffers is increased latency (the time needed for a unit of data - e.g. a video frame - to be transferred from the sender to the receiver). A latency of more than 200 ms, which corresponds to buffering 5-7 frames, is not acceptable in real-time video applications. Keeping the buffers small limits the amplitude and duration of the bandwidth variations that can be handled. However, the resulting losses are easy to control by applying buffer-management techniques that differ from the default tail-drop technique. Techniques such as Partial Buffer Sharing (PBS) and Triggered Buffer Sharing (TBS) [4], as well as Push-Out Buffers (POB) [5], are based on dropping lower-priority data to accommodate higher-priority data when the buffer cannot accommodate both. Implementations of buffer-management techniques in the video-streaming domain address frame-skipping approaches and scalable video coding methods (see below).
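The threshold-based admission behind PBS can be sketched as follows (a minimal illustration under our own naming; the algorithms in [4] and [5] differ in detail): low-priority packets are accepted only while the queue occupancy is below a threshold, so the remaining space stays reserved for high-priority packets:

    class PartialBufferSharing:
        """Bounded queue with threshold-based admission: low-priority packets
        are accepted only while the occupancy is below a threshold, reserving
        the remaining space for high-priority packets."""

        def __init__(self, capacity, low_priority_threshold):
            self.capacity = capacity
            self.threshold = low_priority_threshold
            self.queue = []

        def enqueue(self, packet, high_priority):
            limit = self.capacity if high_priority else self.threshold
            if len(self.queue) < limit:
                self.queue.append(packet)
                return True
            return False                      # dropped within its own priority class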

TCP selection. The major drawback of TCP is its stalling behavior and its slow start-up after an end-to-end packet loss. However, our measurements (Section 5.1) indicate that end-to-end packet losses occur rarely in home networks and are recovered by retransmission within 20 ms. Consequently, stalling behavior is almost completely eliminated. On the positive side, the flow control of TCP adapts the bit rate of the packets to the available bandwidth. For live video, packets are lost in the sender buffer of the application, but the same application has access to this buffer and can decide which parts to remove. In contrast, RTP just goes on sending packets, leading to uncontrolled losses.

3.2. Video frames

An MPEG-2 video stream is built up of I, P and B frames. Each frame represents one picture. An I-frame contains enough information to be decoded independently. A P-frame needs additional information from the directly preceding I- or P-frame; motion vectors describe how a part of the referenced picture must be moved for a correct visualization of the frame to be decoded. B-frames need additional information from two frames: a succeeding P- or I-frame and a preceding P- or I-frame. The video is structured in Groups of Pictures (GOPs), each containing one I-frame followed by a sequence of B- and P-frames. Examples of legal GOPs are I(I), IPP(I), IBPBPB(I) and IBBPBBPBB(I), where (I) denotes the start of the next GOP.
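The dependency rules can be made explicit with a small sketch (our own simplification of the MPEG-2 model described above): for a GOP written as a string of frame types, it lists which frames each frame needs for decoding.

    def reference_frames(gop):
        """For a GOP given as a string of frame types (e.g. 'IBBPBBPBB'), list
        for each frame the frames it needs for decoding: P references the
        previous I/P frame, B references the surrounding I/P frames.
        'next-I' stands for the I frame that starts the following GOP."""
        anchors = [i for i, t in enumerate(gop) if t in 'IP']
        deps = []
        for i, t in enumerate(gop):
            if t == 'I':
                deps.append([])                              # decodable on its own
            elif t == 'P':
                deps.append([max(a for a in anchors if a < i)])
            else:  # 'B'
                prev = max(a for a in anchors if a < i)
                nxt = [a for a in anchors if a > i]
                deps.append([prev, nxt[0] if nxt else 'next-I'])
        return deps

    # For 'IBPB' this yields [[], [0, 2], [0], [2, 'next-I']].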

A scalable video coding scheme describes an encoding of video frames into multiple layers: a Base Layer (BL) of basic quality and several Enhancement Layers (ELs) containing increasingly more video data that enhance the quality of the base layer, thus resulting in video of increasingly higher quality [18]. Scalable video coding is represented by a variety of methods that can be applied to many existing video coding standards [10][11][12]. These methods are based on the principles of temporal, signal-to-noise ratio (SNR), spatial and data-partition scalability [15]. In our framework we use a specific form of temporal scalability that we call I-Frame Delay (IFD) and a form of SNR scalability that is resistant to packet losses.
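To make the layered structure concrete, the following sketch (our own illustration, not an algorithm from [18]) greedily adds enhancement layers on top of the mandatory base layer as long as they fit in the available bandwidth:

    def select_layers(layer_bitrates, available_bw):
        """Greedy layer selection for a BL + EL stream: always send the base
        layer (index 0) and add enhancement layers while they fit."""
        chosen, used = [0], layer_bitrates[0]
        for i, rate in enumerate(layer_bitrates[1:], start=1):
            if used + rate > available_bw:
                break                         # an EL is only useful on top of all lower layers
            chosen.append(i)
            used += rate
        return chosen, used

    # A 2 Mbit/s BL with two 1 Mbit/s ELs over a 3.5 Mbit/s link:
    # only the first EL fits, giving ([0, 1], 3_000_000).
    print(select_layers([2_000_000, 1_000_000, 1_000_000], 3_500_000))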

I-Frame Delay. IFD is a temporal scaling technique. When the network bandwidth drops below the bit rate of the video, temporal scalability decreases the bit rate of the video by dropping video frames without influencing the quality of the surviving frames. A reasonably low number of dropped frames might not be noticeable to the end-user. However, dropping frames arbitrarily (as happens with tail-drop buffer handling) is not a good idea, because the impact of dropping an MPEG frame on the perceived quality depends on the frame type (I, P, or B). We use the frame type to guide the frame-skipping process as follows: when the sender buffer gets full, IFD first pushes the B frames out of the buffer and only then (i.e. when the bandwidth has dropped significantly for a longer period) the P and I frames.
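A minimal sketch of this push-out policy (data structures and names are ours, not the authors' implementation): when the sender buffer is full, the least important queued frame, B before P before I, is pushed out to make room for the new arrival.

    # Frame importance under IFD: I frames matter most, B frames least.
    IFD_PRIORITY = {'I': 3, 'P': 2, 'B': 1}

    def ifd_enqueue(buffer, frame_type, frame, capacity):
        """Enqueue a frame into the bounded sender buffer (a list of
        (frame_type, frame) pairs); on overflow, push out the queued frame
        with the lowest IFD priority."""
        if len(buffer) < capacity:
            buffer.append((frame_type, frame))
            return True
        # Buffer full: locate the least important queued frame.
        victim = min(range(len(buffer)), key=lambda i: IFD_PRIORITY[buffer[i][0]])
        if IFD_PRIORITY[buffer[victim][0]] <= IFD_PRIORITY[frame_type]:
            del buffer[victim]                # push the victim out
            buffer.append((frame_type, frame))
            return True
        return False                          # the new frame itself is the least important: skip it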

The cumulative weight of the B frames in an MPEG-2 stream often amounts to 50% or more. This means that, by dropping all B frames, we can make the resulting video stream fit into a bandwidth that is half the bit rate of the original stream, while still preserving the inter-frame dependencies. The price is a decreased frame rate: in the case of an IBBPBB(I) GOP structure, dropping all B frames leads to 1/3 of the original frame rate and 1/2 of the original bit rate.
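The arithmetic of this example can be checked with a small computation (the 50% share of B-frame bits is the assumption stated above):

    def after_dropping_b(gop, b_bit_share=0.5):
        """Fractions of the original frame rate and bit rate that remain when
        all B frames of a GOP are dropped; b_bit_share is the assumed share of
        the bits carried by B frames."""
        surviving = sum(1 for t in gop if t != 'B')
        return surviving / len(gop), 1.0 - b_bit_share

    print(after_dropping_b("IBBPBB"))   # -> (0.333..., 0.5): 1/3 frame rate, half the bit rate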