Facilitating open standards and open source software to assemble a platform for Networked Music Performance

Introduction

Teleconferencing and Voice over IP (VoIP) technologies now have a history of more than thirty years and are widely used for an abundance of daily communication activities, including project meetings, social gatherings, e-learning and so on. They are possibly the most prominent type of groupware in computer-supported cooperative work. Besides proprietary software offering synchronous audio and video communication (e.g. Skype, Google Talk, MSN, XLite, Polycom software), a number of open source software initiatives are currently available to facilitate synchronous collaboration among remotely located peers. For example, BareSIP, Linphone, Jitsi and sipML5 are some of the currently available soft-phones, while Asterisk and FreeSWITCH are examples of VoIP servers widely offered under an open source license.

This chapter focuses on a distinct category of teleconferencing applications, that of Networked Music Performance (NMP). NMP allows remotely located musicians to engage in synchronous music performances using computer networks and dedicated software tools. NMP systems are inherently related to teleconferencing systems, as they are a form of groupware helping musicians collaborate on building up a musical work. In contrast with conventional teleconferencing, however, NMP endeavours to encourage emotional and artistic expression. This demands a substantially different development approach, one that can ensure uninterrupted communication both with respect to the technical efficiency of the communication platform and in terms of the collaboration affordances offered to music performers.

For instance, in teleconferencing, delay requirements are dictated by the needs of speech-based human interaction and are of the order of 150 ms (Wu et al. 2007). Compared to teleconferencing, studies of ensemble performance (Driessen et al. 2004; Chew et al. 2005; Chafe et al. 2004; Schuett 2002) dictate a significantly lower latency tolerance threshold, of the order of 30 ms, known as the Ensemble Performance Threshold (EPT). Furthermore, high-fidelity audio requirements demand substantial network bandwidth as well as the elimination of distortions caused by network packet loss. For example, in telephony, speech signals are sampled at a rate of 8 kHz with 8 bits per sample, while music quality is generally considered unacceptable when sampled at a rate below 44.1 kHz; this corresponds to more than ten times as much information in the case of monophonic audio encoded with 16 bits per sample.

Nevertheless, despite these differences, teleconferencing technologies present some interesting features that are not commonly available in existing software infrastructures for NMP. For example, signalling protocols, such as the Session Initiation Protocol (SIP), alleviate a number of problems, including network port and firewall configuration, which may be particularly burdensome when attempting to establish an NMP link. Signalling messages serve the purpose of easing user contact by automatically configuring various connection parameters transparently to users. For example, signalling removes the need for users to know each other’s IP address and available network ports, overcomes firewall issues such as NAT traversal and automatically configures media codec parameters. As a result, signalling allows offering user functionalities such as maintaining a list of contacts, showing the availability status of other users (e.g., online, busy, etc.) and initiating audio-visual communications without the need for sophisticated technical configuration. Signalling is mostly used in videoconferencing systems, but has also been considered in the context of NMP research (e.g. Lazzaro and Wawrzynek 2001). It is important to acknowledge that the user capabilities offered by signalling protocols have significantly contributed to the widespread use of teleconferencing software in our daily lives.

The goal of this chapter is to investigate the prospect of tailoring open source software components (i.e. both clients and servers) for teleconferencing applications, so as to accommodate the requirements of NMP. The remaining part of this chapter is structured as follows: The next section presents a short overview of early experimental attempts and current progress in NMP research. The section that follows discusses fundamental issues that need to be taken into consideration when developing an NMP system, concerning software applications, architectural topologies and issues that are inherently related to computer network communications. The subsequent section then provides an overview of the network protocols and Quality of Service approaches that are commonly used in videoconferencing and VoIP systems, distinguishing between signalling, QoS and media transport protocols. The chapter then presents our review of open source software initiatives implementing these protocols, i.e. initiatives that can be customised to accommodate the requirements of NMP; the section concludes by comparing these initiatives with respect to their suitability for efficient networked music collaboration. Subsequently, the chapter presents our pilot developments for an NMP software architecture based on freely available software components for low-latency and high-fidelity media communication. Finally, the chapter summarizes the contributions of this chapter and outlines perspectives for further investigation.

NMP: Early attempts and follow-up advancements

Physical proximity of musicians and co-location in physical space are typical pre-requisites for collaborative music performance. Nevertheless, the idea of music performers collaborating across geographical distance has been remarkably intriguing since the early days of computer music research.

The relevant literature attributes the first experimental attempts at interconnected musical collaboration to John Cage. Specifically, the 1951 piece “Imaginary Landscape No. 4 for twelve radios” is regarded as the earliest attempt at remote music collaboration (Carôt et al. 2007). The piece used interconnected radios, which influenced each other with respect to their amplitude and timbre variations (Pritchett 1993). A further example, and the first attempt at performing music over a computer network, in fact a Local Area Network (LAN), was the networked music performed by the League of Automatic Music Composers, a band/collective of electronic music experimentalists active in the San Francisco Bay Area between 1977 and 1983 (Barbosa 2003; Follmer 2005). The League conceived of the computer network as an interactive musical instrument made up of independently programmed automatic music machines, producing music that was noisy, difficult, often unpredictable, and occasionally beautiful (Bischoff, Gold, and Horton 1978).

These early experimental attempts were predominantly anchored in exploring the aesthetics of musical interaction within a conceptually ‘dissolved and interconnected’ musical instrument. The focus seems to have been placed on machine interaction rather than on the absence of co-presence, as in both of these initiatives the performers were in fact co-located. Telepresence across geographical distance initially appeared in the late 1990s (Kapur et al. 2005), either as control data transmission, notably using protocols such as the Remote Music Control Protocol (RMCP) (Goto et al. 1997) and later OpenSound Control (Wright and Freed 1997), or as one-way transmissions from an orchestra to a remotely located audience (Xu et al. 2000) or a recording studio (Cooperstock and Spackman 2001).

True bidirectional audio interaction across geographical distance became possible with the advent of broadband academic network infrastructures in 2001, namely Internet2 in the US and later the European GEANT. In music, these networks enabled the development of frameworks that allow remotely located musicians to collaborate as if they were co-located. As presented on Wikipedia[1], currently known systems of this kind are the Jacktrip application developed by the SoundWire research group at CCRMA, Stanford University (Cáceres and Chafe 2009), the Distributed Immersive Performance (DIP) project at the Integrated Systems Center of the University of Southern California (Sawchuck et al. 2003) and the DIAMOUSES project conceived and developed at the Dept. of Music Technology and Acoustics Engineering of the Technological Educational Institute of Crete (Alexandraki and Akoumianakis 2010).

These systems permitted the realization of distributed music collaborations across distance. In practical terms, this translates to audio signals generated at one site arriving at a remote network location within an acceptable time interval and with acceptable sound quality; such behaviour effectively resembles co-located music performance. Unfortunately, widely available Digital Subscriber Lines (xDSL) are not capable of coping with the requirements of live music performance, and thus most musicians are not offered the possibility to experiment with such setups.

NMP Implementation fundamentals

In NMP research, the introduced communication latency is often thought of as comprising local latencies and network latencies. Correspondingly, two distinct entities are studied when developing NMP systems: the software used by music performers and the communication medium, i.e. the computer network. In the majority of cases, NMP progress concentrates on software development, whereas issues that are inherently related to the communication medium are less often addressed (e.g. Kurtisi and Wolf 2008; Lazzaro and Wawrzynek 2001). The next section proposes some criteria that need to be met in NMP communications, while the subsequent subsections provide an overview of the characteristics of NMP systems, both with respect to software applications and in terms of network infrastructures.

Design Criteria

As already discussed in the introduction, the most prominent difference between NMP and videoconferencing systems relates to the quality of audio transmission. Compared to speech, music has high-fidelity requirements that considerably increase the demands on network bandwidth availability. Hence, widely available network infrastructures call for appropriate audio compression schemes. Such schemes should significantly reduce data rates without perceivable degradation of audio quality and, moreover, without introducing noticeable latency in the overall process of capturing, encoding and transmitting audio. Fortunately, at present the development of new audio codecs increasingly takes into account the real-time requirements of NMP systems. The royalty-free Constrained-Energy Lapped Transform (CELT) codec (Valin et al. 2009) provides evidence of this tendency. The CELT codec, which represents the present state of the art in ultra-low-latency and low-bit-rate communication, has recently been integrated into the Opus codec, which, at the time of this writing, provides a de facto standard for interactive audio applications over the Internet (IETF 2012). Consequently, in this study, the most important criterion for customising videoconferencing software for the purposes of NMP is the ability to support such ultra-low-latency and high-quality audio codecs.
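
To make the latency criterion concrete, the following minimal Python sketch adds up the main one-way delay contributions of an NMP audio chain and compares the total against the approximately 30 ms EPT discussed earlier. The frame duration, codec look-ahead, network delay and playback buffer values are illustrative assumptions rather than measurements of any particular codec or system (although 2.5 ms is indeed the smallest frame size offered by Opus).

```python
# Illustrative one-way latency budget for an NMP audio chain. All values
# are example assumptions, not measurements of any particular system.

EPT_MS = 30.0  # approximate Ensemble Performance Threshold (see text)

def one_way_latency_ms(frame_ms, codec_lookahead_ms, network_ms, playback_buffer_ms):
    """Sum the main one-way delay contributions of the audio path."""
    return frame_ms + codec_lookahead_ms + network_ms + playback_buffer_ms

# Example: 2.5 ms frames (the smallest frame size offered by Opus), a
# hypothetical 2.5 ms codec look-ahead, 15 ms of network delay and a
# 5 ms playback buffer.
total = one_way_latency_ms(2.5, 2.5, 15.0, 5.0)
print(f"estimated one-way latency: {total:.1f} ms "
      f"({'within' if total <= EPT_MS else 'exceeds'} the ~{EPT_MS:.0f} ms EPT)")
```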

On the other hand, as music performance is a highly complex cognitive activity, emphasis must be placed on developing software applications that are easy to use. This goal refers both to the usability and intuitiveness of the Graphical User Interfaces to be developed and to the ability of the final software to automatically cope with diverse network settings, such as communication behind firewalls, NAT traversal and so on.

Moreover, with respect to communication topologies, an NMP system must support multistream and multipoint connections. Multistream support refers to the capability of the client application to activate multiple simultaneous capture and playback streams of audio and video. Multipoint connections refer to the possibility of involving multiple network peers, i.e. more than two, in the same NMP session, hence allowing for multiparty conferencing communications.

A further design criterion concerns the possibility of offering efficient Quality of Service (QoS). The term QoS describes the means of prioritizing network traffic, so as to help ensure that the most important data gets through the network as quickly as possible. Without properly configured QoS, the quality of audio and video will be reduced due to the overall demands of general traffic. In general, however, both the network and the communicating hosts contribute to delay; hence they all need to be considered as part of the QoS puzzle. In the following paragraphs, these aspects are described in more detail.
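
As a simple illustration of traffic prioritization at the sending host, the sketch below marks outgoing UDP audio packets with the Expedited Forwarding (EF) DiffServ code point, which QoS-enabled routers may use to prioritise latency-sensitive traffic. This is only one piece of the QoS puzzle: the setsockopt call shown here is Linux/Unix-specific, and intermediate networks are free to ignore or rewrite the marking.

```python
import socket

# Mark outgoing UDP packets with DSCP "Expedited Forwarding" (EF, value 46).
# DSCP occupies the upper six bits of the legacy TOS byte, hence the shift.
DSCP_EF = 46
TOS_VALUE = DSCP_EF << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)

# The socket can now be used as usual for media transmission, e.g.:
# sock.sendto(audio_packet, (peer_address, peer_port))
```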

Software applications

Setting aside their focus on musical interaction, NMP systems could be categorized as “ultra-low-delay” and “ultra-high-quality” teleconferencing systems. As in teleconferencing, NMP commonly necessitates the use of video in addition to audio communication. Evidently, musicians establish eye contact to synchronize their performance, especially after a pause, and such visual communication is also time-critical in musical interactions (Sawchuck et al. 2003). If the software application used by music performers does not support video communication, then an external teleconferencing application (e.g. Skype) is often used for their visual communication (Chafe 2011).

Client application

Typically, the NMP client, i.e. the software application used by music performers to engage in distributed performances, implements the functionalities depicted in Figure 1.

In most cases, a dedicated Graphical User Interface (GUI) is used by musicians to activate the different communication channels. Communication is achieved by means of media (i.e. audio and video) transmission, media reception and signalling. As already mentioned in the Introduction, signalling protocols alleviate a number of configuration burdens that would otherwise require computer networking expertise not commonly available to musicians, and they have significantly contributed to the intuitive and widespread use of teleconferencing systems. With respect to media communication, Figure 1 depicts the processes that need to take place prior to network transmission and subsequent to network reception. Each of these processes adds to the local latency, hence contributing to the total mouth-to-ear latency, a common term in audio telecommunications.

Figure 1: Typical components of an NMP client application.
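
To fix ideas, the transmission side of Figure 1 can be sketched as a short chain of processing stages. All stage implementations below are trivial stand-ins introduced purely for illustration; in a real client each stage contributes its own share of the mouth-to-ear latency.

```python
import struct

def capture_block(n_samples=64):
    """Stand-in for audio capture: returns n_samples of 16-bit silence.
    The blocking/buffering delay discussed below is incurred at this stage."""
    return bytes(2 * n_samples)

def encode(pcm):
    """Stand-in for optional compression encoding (identity here)."""
    return pcm

def packetize(payload, sequence_number):
    """Stand-in for packetization: prepend a toy 4-byte header."""
    return struct.pack("!I", sequence_number) + payload

if __name__ == "__main__":
    packet = packetize(encode(capture_block()), sequence_number=0)
    print(f"toy packet of {len(packet)} bytes ready for transmission")
```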

Focusing on audio communication in the transmission direction, the delay introduced by the audio capturing process can be further broken down into: the delay due to the physical distance of the performer from the microphone, the delay of analogue-to-digital conversion and, more importantly, the buffering delay. Before further processing, a sufficient portion of the signal needs to be obtained. The size of this portion corresponds to a time interval commonly referred to as buffering or blocking delay. For example, in the case of capturing monophonic 44.1 kHz audio, a buffer of 64 samples corresponds to approximately 1.45 ms, 256 samples correspond to 5.8 ms and 1024 samples correspond to 23.2 ms. Hence, the size of the audio buffer should be kept appropriately small, so that it corresponds to latencies that are sufficiently lower than the EPT.
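
These figures follow directly from the ratio of buffer size to sampling rate, as the short calculation below (assuming monophonic audio at 44.1 kHz, as in the example above) confirms.

```python
# Blocking (buffering) delay for the buffer sizes mentioned in the text,
# assuming monophonic audio sampled at 44.1 kHz.

SAMPLE_RATE_HZ = 44100

for buffer_samples in (64, 256, 1024):
    delay_ms = 1000.0 * buffer_samples / SAMPLE_RATE_HZ
    print(f"{buffer_samples:4d} samples -> {delay_ms:5.2f} ms blocking delay")
```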

In some cases, compression encoding follows audio capturing. Audio compression aims at reducing the size of the information to be transmitted, hence reducing the required network bandwidth. It is straightforward to estimate that raw monophonic CD-quality audio (44.1 kHz/16 bit) corresponds to a bit rate of 705.6 kbps, while the stereo signal requires 1.41 Mbps. Clearly, requiring more audio channels or higher-quality audio in terms of sampling rate or bit resolution further increases the data rate and hence the required network throughput.
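
The uncompressed data rate is simply the product of sampling rate, bit depth and channel count, as illustrated below; the 48 kHz/24-bit stereo case is an arbitrary example of a higher-quality format, not a requirement stated elsewhere in this chapter.

```python
# Raw PCM data rates for the audio formats mentioned in the text.

def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    return sample_rate_hz * bits_per_sample * channels / 1000.0

print(pcm_bitrate_kbps(44100, 16, 1))  # mono CD quality:    705.6 kbps
print(pcm_bitrate_kbps(44100, 16, 2))  # stereo CD quality: 1411.2 kbps (~1.41 Mbps)
print(pcm_bitrate_kbps(48000, 24, 2))  # 48 kHz/24-bit stereo example: 2304.0 kbps
```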

Such bit rates cannot be guaranteed to be continuously available during NMP. Thus, some NMP approaches employ compression encoding to reduce the required bandwidth (e.g. Polycom 2011; Kurtisi and Wolf 2008; Kraemer et al. 2007). Nevertheless, some NMP systems, and especially those intended for use over academic networks (Alexandraki and Akoumianakis 2010; Cáceres and Chafe 2009; Sawchuck et al. 2003), do not use audio compression. The choice of whether or not to use audio compression is primarily determined by the latencies introduced by the compression codec. Encoding latencies comprise both delays caused by the algorithmic complexity of the encoder and buffering delays. Compression schemes conventionally require a sufficient amount of data (hence increasing the buffer size) in order to effectively encode data streams and offer substantial data reduction.

Further to compression encoding, an NMP client may optionally perform multiplexing. Multiplexing serves the purpose of combining multiple data streams into a single stream, so as to eliminate the need for an additional network port, and hence a separate configuration at the receiving end, for each individual stream. For example, multiple streams of audio or video can be combined into a single stream, as sketched below. Multiplexing is generally a lightweight process that does not significantly add to the overall latency.
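
A minimal illustration of stream multiplexing follows: each chunk is prefixed with a one-byte stream identifier and a two-byte payload length, so that the receiver can split the combined stream back into its constituents. This framing is purely illustrative and not a standard container format.

```python
import struct

def mux(chunks):
    """chunks: iterable of (stream_id, payload_bytes) pairs."""
    out = bytearray()
    for stream_id, payload in chunks:
        out += struct.pack("!BH", stream_id, len(payload)) + payload
    return bytes(out)

def demux(data):
    """Inverse of mux(): yields (stream_id, payload_bytes) pairs."""
    offset = 0
    while offset < len(data):
        stream_id, length = struct.unpack_from("!BH", data, offset)
        offset += 3
        yield stream_id, data[offset:offset + length]
        offset += length

# Example: one audio chunk and one video chunk combined into a single stream.
combined = mux([(0, b"audio-frame"), (1, b"video-frame")])
print(list(demux(combined)))
```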

Finally, before being sent to the network, the possibly encoded and multiplexed audio chunks are wrapped in network packets. Apart from the main data, i.e. the payload, network packets include header information, which is determined and structured according to the network protocol used for media transmission. Header information is necessary and, among other things, defines the destination of each network packet. It is important to note that header information adds to the total data rate, hence increasing the required network bandwidth. A research initiative attempting to reduce header overhead in NMP has been presented by Kurtisi and Wolf (2008).
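
The relative weight of the headers grows as audio buffers shrink, which is precisely the low-latency regime of NMP. The calculation below assumes a typical RTP-over-UDP-over-IPv4 stack (12 + 8 + 20 header bytes per packet) and 16-bit monophonic payloads; other protocol stacks will of course differ. With 64-sample packets, roughly a quarter of the transmitted bytes are headers, which illustrates why header reduction is attractive in this regime.

```python
# Per-packet header overhead for small audio frames, assuming an
# RTP-over-UDP-over-IPv4 stack (other protocol stacks will differ).

RTP_HEADER = 12    # bytes, fixed RTP header
UDP_HEADER = 8     # bytes
IPV4_HEADER = 20   # bytes, without options
HEADERS = RTP_HEADER + UDP_HEADER + IPV4_HEADER

for buffer_samples in (64, 256, 1024):
    payload = buffer_samples * 2                 # 16-bit monophonic samples
    overhead = HEADERS / (HEADERS + payload)
    print(f"{buffer_samples:4d}-sample packets: {payload:4d} payload bytes, "
          f"{100 * overhead:4.1f}% header overhead")
```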

It can be seen from Figure 1 that the inverse processes take place in the direction of media reception. Although processes such as audio decoding and audio rendering are more lightweight than encoding or capturing, media reception is not necessarily more efficient than media transmission. This is due to the fact that, when multiple network peers participate in an NMP session, a separate reception thread is instantiated for each of the remaining collaborators.

Server application

Although a number of NMP systems employ peer-to-peer communication topologies (e.g. Jacktrip), some approaches use a server so as to ease media communications. The server may undertake various duties, such as media transcoding, media synchronization or media mixing (Kurtisi and Wolf 2008; Alexandraki and Akoumianakis 2010). As each of these functionalities entails a certain amount of computational complexity, hence requiring increased processing resources that may further add to the overall latency, it is most often preferred to reduce server functionality to merely forwarding the incoming media streams to the intended recipients. This mechanism is known as media relaying.
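
The essence of media relaying can be conveyed by the following minimal UDP sketch, in which every packet received from one peer is forwarded unmodified to all other peers. Peer registration, signalling and error handling are omitted, and the hard-coded addresses are placeholders for illustration only.

```python
import socket

# Hypothetical set of participating peers (documentation addresses only).
PEERS = {("192.0.2.10", 50004), ("192.0.2.11", 50004), ("192.0.2.12", 50004)}

def relay(listen_port=50004):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", listen_port))
    while True:
        packet, sender = sock.recvfrom(2048)
        for peer in PEERS:
            if peer != sender:          # do not echo the stream back to its source
                sock.sendto(packet, peer)

if __name__ == "__main__":
    relay()
```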

Figure 2: Peer-to-peer vs. centralised media communication in NMP.

As shown in Figure 2 and experienced in the DIAMOUSES architecture (Alexandraki and Akoumianakis 2010), if N network nodes participate in an NMP session, then a peer-to-peer topology requires each node to transmit its locally produced media streams to the remaining N-1 nodes and, at the same time, to receive the streams of the remaining N-1 participants. This is particularly burdensome, even more so in widely available network infrastructures (i.e. xDSL), in which the uplink suffers from serious bandwidth limitations. An alternative is to use the star topology depicted in Figure 2. In this case, each network node transmits its locally produced streams to a single network location, i.e. the server. The responsibility of this server is to relay each received stream to the remaining nodes in a single-source-multiple-destination communication scheme. This topology offers the advantage of significantly relieving the client node of high outbound bandwidth requirements.
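
The difference in outbound bandwidth per client can be quantified with a short calculation; the figures below assume one uncompressed monophonic CD-quality stream (705.6 kbps) per node and ignore header overhead.

```python
# Outbound (uplink) bandwidth per client for the two topologies of Figure 2,
# assuming one uncompressed monophonic CD-quality stream per node.

STREAM_KBPS = 705.6

def uplink_kbps(n_nodes, topology):
    if topology == "peer-to-peer":
        return (n_nodes - 1) * STREAM_KBPS   # one copy per remaining peer
    if topology == "star":
        return STREAM_KBPS                   # a single copy sent to the server
    raise ValueError(topology)

for n in (2, 3, 4, 5):
    print(f"N={n}: P2P {uplink_kbps(n, 'peer-to-peer'):7.1f} kbps, "
          f"star {uplink_kbps(n, 'star'):6.1f} kbps per client")
```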