Gareth Tyson

“Peer-to-Peer Content Distribution”

B.Sc. (Hons) Computer Science

March 2005

“I certify that the material contained in the dissertation is my own work and does not contain a significant portion of unreferenced or unacknowledged material. I also warrant that the above statement applies to the implementation of the project and all associated documentation.

Regarding the electronically submitted version of this submitted work, I consent to this being stored electronically and copied for assessment purposes, including the Department’s use of plagiarism detection systems in order to check the integrity of assessed work.

I agree to my dissertation being placed in the public domain, with my name explicitly included as the author of the work.”

Signed,

______

Gareth Tyson

ABSTRACT

Since the emergence of the Peer-to-Peer revolution, the proliferation of the technology has been unstoppable. Its many benefits have helped to numb the pain of its inefficiencies, but without severe modification to the scalability of current designs it will simply be impossible to service the requirements of the growing P2P community.

This project aims to improve the search algorithms of current P2P technology whilst still retaining its better features. This report outlines the design, development and evaluation of such a P2P network, with the emphasis on evaluating the chosen design and its possible derivatives.

ABSTRACT

CONTENTS

1. INTRODUCTION
   General Overview of Peer-to-Peer
   Project Aims
   Report Overview
2. BACKGROUND
   Terminology of P2P
   Analysis of P2P Architectures
      Gnutella
      Kazaa
      Napster
      Pastry
   Legal Background
   Data Resource Management (DRM)
   The Generations of DRM
   Encrypted Shell
   Marking
   Centrally Controlled DRM
   Problems with DRM
   Choices in Software and Technology
   C
   Java
   Java RMI
   JXTA
   Justification for Building a New Type of Network
   Summary
3. DESIGN
   General Overview of the System
   Overall Structure
   Network
   Connecting New Nodes to the Network
   Creating the Indexed Structure of the Network
   Creating New Supernodes
   Dealing with the Loss of Supernodes
   Keeping File Lists up to Date
   Distributing File Lists to Supernodes
   Searching
   Routing of Search Queries
   Search Aggregation
   How Supernodes Search their own Databases
   How Supernodes Perform Fuzzy Searches
   Downloads/Uploads
   System Architecture
   Class Overview
   Design Patterns
   User Interface
   Summary
4. IMPLEMENTATION
   The GazNet Protocol
      000-PRESENT – 001-PRESENT? – Pinging a Node
      030-CONNECT? – The Connection Process
      080-SEARCH=search string – The Searching Process
      120-DOWNLOAD=filename=x=y – The Download Process
      190-NEW_SUPERNODE – Supernode Creation
      400-ALPHA_LOCATIONS_UPDATE
   The ProtocolSocket
   Data Structures
   Range
   Group
   The FileDatabase
   Important Algorithms
   The Splitting Algorithm
   Multi-Source Downloads
   Aggregating Search
   Summary
5. THE SYSTEM IN OPERATION
   A Typical Session
   Summary
6. PROCESS DESCRIPTION
   The Connection Process
   The Searching Process
   The File Transfer Process
   The Splitting Process
   Summary
7. TESTING
   The Test Bed
   How the System was Tested
   Testing Utilities
   The Simulators
   System Test Results
   Summary of Errors
   Networking
   File Transfers
   The User Interface
   Incremental Testing
   Summary
8. EVALUATION
   Evaluation of Architecture
   Scalability
   Search Algorithms
   Reliability and Robustness
   File Transfers
   Miscellaneous
   Evaluation of User Interface
   The Welcome Panel
   The Search Panel
   The Download Panel
   The Upload Panel
   The My Files Panel
   Alternative Solutions
   Scalability
   Robustness and Reliability
   File Transfers
   Critical Summary of Evaluation
9. CONCLUSION
   Review of Aims
   Review of Original Project Proposal
   Suggested Revisions to Design
   Suggested Revisions to Implementation
   Future Work
   Lessons Learnt
   Final Conclusion
REFERENCES
Appendix – Original Project Proposal

The project’s working documents are available at “

1. Introduction

1.1 General Overview of Peer-to-Peer

One of the current major issues in computer science is the concept of Peer-to-Peer (P2P) computing. The purpose behind P2P is to allow a community of users to carry out services such as sharing files without the need for a central server to run the network. In recent years the proliferation of P2P technology has been immense, allowing massive distribution of digital media by parties that previously would simply not have had the resources to perform such a large-scale task. The idea behind P2P is that a number of peers can form themselves into a network without the need to be managed by another party; instead they manage themselves, dynamically changing to deal with the addition and removal of peers. This is obviously an attractive idea and has been picked up by many companies and programmers wishing to create such networks.

The first major P2P network was Napster, which allowed peers to share and download music files. It fulfilled many of the end requirements of this project, primarily fast search times; however, Napster’s reign of supremacy was not to last, as it was shut down for legal reasons, leaving a gaping cavern to be filled by other, less efficient networks such as Gnutella and Kazaa. The inefficiency of these networks is exactly the reason for this project, which strives to improve upon the glaring problems with performance and scalability that resonate throughout current P2P technology.

The intention of this project is to build a new P2P network, titled ‘GazNet’, that will improve on existing networks; in particular, it aims to speed up the process by which a peer searches the network to find content. Currently the majority of P2P networks have extremely slow search times, because in many cases every peer in the network must be contacted to complete a search. This is both highly frustrating for the user and a heavy burden on the underlying network. The majority, not the entirety, of P2P networks suffer from this: networks do exist that can perform high speed searches, but this performance comes at a price. These networks require the user to enter the exact name of the file they are looking for; the slightest deviation from the correct name will result in no results being returned. This fallibility has meant that there has been little popular interest in them, simply because the average user would rather wait a long time and receive a plethora of results than wait a short time and receive none. This project aims to increase search performance without sacrificing ‘fuzzy’ searching, i.e. the ability to enter a search word and receive results even though the filename doesn’t exactly match that word; for example, a search for ‘Help’ would return the file ‘The Beatles – Help!.mp3’. By fulfilling this requirement, a network will have been devised that no longer forces the user to choose between search time and the effectiveness of searches.
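As an illustration, one simple interpretation of fuzzy matching is a case-insensitive substring test against the shared filenames. The sketch below is illustrative only; the class and method names are not taken from GazNet’s implementation.

import java.util.ArrayList;
import java.util.List;

public class FuzzyMatch {

    // Returns every filename that contains the search word, ignoring case,
    // so a search for 'Help' matches 'The Beatles - Help!.mp3'.
    public static List<String> search(String word, List<String> filenames) {
        List<String> hits = new ArrayList<String>();
        String needle = word.toLowerCase();
        for (String name : filenames) {
            if (name.toLowerCase().contains(needle)) {
                hits.add(name);
            }
        }
        return hits;
    }
}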

1.2 Project Aims

1) To design, develop and evaluate a P2P network to allow the sharing of files.

2) To allow fast, efficient searching of the network

  1. To decrease search time compared to current P2P technologies
  2. To allow fuzzy searches that don’t require specific details to be entered, e.g. the exact name of the file
  3. To limit the effect network size has on search time
  4. To keep search time low even when the overlay network is heavily loaded
  5. To keep searches as accurate as possible, so that out-of-date file references (i.e. references to files that have already been removed from the network) aren’t returned in search results

3) To make the network as scalable as possible, both in terms of its performance and in the maximum number of nodes that can be concurrently connected to the network.

4) To make the system as reliable and robust as possible

  1. The network must be able to recover from the loss of supernodes
  2. The network should, to a reasonable degree, be able to withstand targeted attacks by malicious users
  3. If transfers are interrupted, downloads must be able to be resumed without having to re-download the data already received
  4. The system should deal with errors in an elegant way

5) To make the network as independent as possible, i.e. avoiding the use of centrally managed servers, removing any single point of failure and making it near impossible to shut the network down.

6) To allow sharing of files over the network

  1. To allow file transfers to be performed as quickly as possible, hiding the underlying actions from the user

7) To minimise the time taken to connect to the network

  1. To avoid the use of a centralised server to aid connection to the network

8) To provide a simple, easy-to-use user interface

  1. To let the user quickly and easily search for files without having to understand the underlying processes
  2. To provide the user with some form of discriminatory information about specific downloads so they may make an informed decision on which files to download
  3. To allow the user to modify his/her shares without difficulty
  4. To allow the user to view what he/she is currently sharing
  5. The interface must be responsive at all times, even when actions such as a search are being carried out behind the scenes
  6. The user must be provided with progress details of downloads and be told when a file has finished downloading
  7. The user must be provided with an indicator (i.e. bit rate) of how quickly a file is downloading

Abandoned Aims

Due to time constraints, some aims discussed in the project proposal had to be dropped; these are listed below. The topics are touched on in the report but received no implementation or in-depth analysis.

1) To provide the user with an incentive to share his/her files.

2) To create a P2P network to run over a wireless environment.

3) To allow the protection of files using DRM

  1. To provide all users with the functionality to protect their own files
  2. To allow the safe transfer of already licensed shares
  3. To enforce the controlled distribution and sale of licences, i.e. to stop users from selling mp3 files whilst also keeping a copy themselves

1.5 Report Overview

Chapter 2 (Background): This section gives an in-depth review of current P2P technologies and provides the foundation needed to read the rest of the report.

Chapter 3 (Design): This section gives the reader an overview of the system architecture and of how several key processes are carried out.

Chapter 4 (Implementation): This section provides a greater insight into the workings of the system and shows the system that was actually produced.

Chapter 5 (The System in Operation): This short section shows a typical session and the interaction between the system and the user.

Chapter 6 (Process Description): This section explains the underlying processes that are carried out during a typical session, including both user-initiated requests and automatic, system-initiated processes.

Chapter 7 (Evaluation): This section evaluates the effectiveness of the finished system and how well it has fulfilled the aims set out in this chapter.

Chapter 8 (Testing): This section lays out the testing schemes used to ensure the correct working of the system and describes how well the system dealt with the testing.

Chapter 9 (Conclusion): This section sums up how well the project aims have been fulfilled and what went wrong.

2. Background

2.1 Terminology of P2P

Architecture – The topology of the network and how information is routed through the network.

Connecting to the network – The process by which a node becomes part of the network; this is typically carried out by finding an already connected node and then tunnelling information through that node.

Distributed Hash Table (DHT) – A type of P2P network that allocates an amount of hash space to each node in the network; these nodes then look after all files that reside in that hash space.

File list – A list of files and their locations in the network (IP addresses); extra details can also be kept about the files, such as their size and type.

Hash algorithm – A mathematical process performed on a string to create a number that is as unique as possible, for the purpose of storing data. If a hash algorithm returns 5, the data could then be put in an array at position 5; all that is then required to locate the data is to perform the hash algorithm again to find out where it is stored. The simplest hash algorithm is simply to use the first letter of the string being hashed (a short illustrative sketch is given at the end of this section).

Hash Key – The value returned from a hash algorithm

Hash space – The hash keys that a particular area of storage holds, e.g. using a hashing algorithm that hashes on the first letter in a string, position 0 in an array could look after the hash space ‘A’.

Node – Any computer that is connected to the network

P2P Generations – P2P architectures can be split into three generations. The first is centralised P2P structures such as Napster, which used a central file server to search the network. Second generation networks use decentralised file lists. The third generation is based on Distributed Hash Tables.

Servent – A node that is neither purely a server nor purely a client; instead it performs both duties. This is the norm in P2P networks, as every node should be the same.

Supernode – A high performance node that is used in the network to reduce the burden on both the overall network and on individual nodes. This is usually done by caching file lists so that a smaller number of nodes need to be searched to find files; however, supernodes can also be used in other ways, such as caching actual files to reduce download time.
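To make the hashing terms above concrete, the following sketch (illustrative only; the class and method names are not taken from GazNet) hashes a filename on its first letter and stores it in one of 26 buckets, so that bucket 0 looks after the hash space ‘A’, bucket 1 the hash space ‘B’, and so on.

import java.util.ArrayList;
import java.util.List;

public class FirstLetterHash {

    // 26 buckets: bucket 0 holds hash space 'A', bucket 1 'B', and so on.
    private final List<List<String>> buckets = new ArrayList<List<String>>();

    public FirstLetterHash() {
        for (int i = 0; i < 26; i++) {
            buckets.add(new ArrayList<String>());
        }
    }

    // The hash algorithm: the hash key is derived from the first letter.
    private int hashKey(String name) {
        char first = Character.toUpperCase(name.charAt(0));
        return (first >= 'A' && first <= 'Z') ? first - 'A' : 0;
    }

    public void store(String filename) {
        buckets.get(hashKey(filename)).add(filename);
    }

    // To locate a file, the same hash algorithm is applied again.
    public List<String> lookup(String filename) {
        return buckets.get(hashKey(filename));
    }
}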

2.2 Analysis of P2P Architectures

2.2.1 Gnutella

The Gnutella protocol is probably the most robust P2P protocol available; it is fully distributed and therefore very resilient to faults. The first process a Gnutella servent must carry out is connecting to GnutellaNet, which is done by connecting to an already connected node; currently the most prevalent method of finding connected nodes is through node caches that maintain databases of existing nodes on the network. Once an already connected node has been found, the new node must attempt to connect to it: it initiates a TCP connection to the servent and then sends the following request string (encoded in ASCII),

GNUTELLA CONNECT/<protocol version string>\n\n

To this, the servent will reply with

GNUTELLA OK\n\n

if it wishes to accept the connection; any other reply indicates that the connection has been refused.
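A minimal sketch of this handshake, seen from the connecting node’s side, is shown below. The host name and port are placeholders, and ‘0.4’ is used as the protocol version string since that is the classic Gnutella protocol revision; this is an illustration rather than actual servent code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class GnutellaHandshake {

    public static void main(String[] args) throws Exception {
        // Placeholder address of an already connected servent.
        Socket socket = new Socket("host.example.com", 6346);

        Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
        out.write("GNUTELLA CONNECT/0.4\n\n");
        out.flush();

        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), "US-ASCII"));
        String reply = in.readLine();

        // "GNUTELLA OK" means the servent accepted; anything else is a refusal.
        if (reply != null && reply.startsWith("GNUTELLA OK")) {
            System.out.println("Connection accepted");
        } else {
            System.out.println("Connection refused: " + reply);
            socket.close();
        }
    }
}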

When a servent wishes to perform a search it will channel the query through the node it is connected to. It should be noted that a servent can have multiple connections to other servents; in such a situation, queries are forwarded to all of the connected servents. If a servent receives a Query that matches content in its shared folder, it will reply with a QueryHit Descriptor, informing the searching node that it has content of interest to it. The Gnutella protocol is summarised in Table 2.1,

Descriptor / Description
Ping / Used to discover other nodes on the network; a Ping should be answered with a Pong.
Pong / The response to a Ping; it tells the pinging node that the servent is present and provides other information, such as its IP address and statistics on the data it is sharing.
Query / Used to search the network; if a servent receives a Query that matches one or more files it owns, it will reply with a QueryHit.
QueryHit / The response to a Query; it informs the searching servent of information about the host, such as its connection speed, and provides a list of the files that match the Query.
Push / A mechanism that allows a firewalled servent to contribute file-based data to the network.

Table 2.1 Overview of Gnutella Protocol
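As a sketch of how the Descriptor types in Table 2.1 could be represented in code (the enum itself is illustrative and not part of GazNet), the payload type codes below are those defined by the Gnutella v0.4 specification.

public enum GnutellaDescriptor {
    PING((byte) 0x00),
    PONG((byte) 0x01),
    PUSH((byte) 0x40),
    QUERY((byte) 0x80),
    QUERY_HIT((byte) 0x81);

    // The Payload Descriptor byte carried in the Descriptor header.
    private final byte payloadType;

    GnutellaDescriptor(byte payloadType) {
        this.payloadType = payloadType;
    }

    public byte payloadType() {
        return payloadType;
    }
}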

Every protocol Descriptor from Table 2.1 is encapsulated in a Descriptor packet, as shown in Diagram 2.1. There are five fields in the Descriptor header: the Descriptor ID, a uniquely identifiable string; the Payload Descriptor, which indicates the type of Descriptor; a TTL field, which limits the packet’s ‘reach’ in the network; the Hops field, which records the number of nodes the Descriptor has traversed; and a Payload Length field, which contains the length of the payload. This header is followed by the payload, which will be one of the Descriptors from Table 2.1.

The TTL field is used to limit the reach that the message will have over the network. Every servent that receives a Descriptor packet decrements the TTL before forwarding it on; if a servent receives a Descriptor with a TTL of 0 it will not forward it, thereby deleting the packet from the network.

Bytes 0–15: Descriptor ID
Byte 16: Payload Descriptor
Byte 17: Time to Live (TTL)
Byte 18: Hops
Bytes 19–22: Payload Length

Diagram 2.1 – Descriptor Header
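As a minimal sketch, the 23-byte header in Diagram 2.1 could be unpacked from a received byte array as follows. This is illustrative only; it assumes the Payload Length is encoded in little-endian byte order, as the Gnutella specification states.

public class DescriptorHeader {

    public static final int HEADER_LENGTH = 23;

    public byte[] descriptorId = new byte[16]; // bytes 0-15
    public byte payloadDescriptor;             // byte 16
    public byte ttl;                           // byte 17
    public byte hops;                          // byte 18
    public int payloadLength;                  // bytes 19-22

    public DescriptorHeader(byte[] header) {
        System.arraycopy(header, 0, descriptorId, 0, 16);
        payloadDescriptor = header[16];
        ttl = header[17];
        hops = header[18];
        // Assumed little-endian encoding of the four payload length bytes.
        payloadLength = (header[19] & 0xFF)
                      | (header[20] & 0xFF) << 8
                      | (header[21] & 0xFF) << 16
                      | (header[22] & 0xFF) << 24;
    }
}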

The TTL is the most important field in the Descriptor, because in practical terms it controls how large the network is. The total number of nodes in the network holds little significance for an individual user, because a single node only has a certain degree of ‘reach’ in the network; this ‘reach’ is set by two factors, the TTL and the number of nodes it is connected to. If there were no TTL field, a Descriptor would simply circle the network, being forwarded by node after node; as the GnutellaNet topology is not a spanning tree, Descriptors would be forwarded back to nodes that had already received them and therefore never leave the network. It is consequently impossible to search the whole network: unless the searching node knows the entire topology of the network, giving a Descriptor a TTL large enough to extend its reach to the furthest nodes is likely simply to cause messages to loop through the network uncontrollably.
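The forwarding rule described above can be sketched as follows. This is a simplified illustration rather than actual servent code; it also remembers previously seen Descriptor IDs so that looping copies can be dropped.

import java.util.HashSet;
import java.util.Set;

public class DescriptorForwarder {

    // Descriptor IDs already seen, so looping copies are discarded.
    private final Set<String> seenIds = new HashSet<String>();

    // Decides whether a Descriptor with the given ID and TTL should be
    // forwarded; returns the TTL to forward it with, or -1 to drop it.
    public int forwardTtl(String descriptorId, int ttl) {
        if (!seenIds.add(descriptorId)) {
            return -1;   // already seen: drop to prevent looping
        }
        if (ttl <= 0) {
            return -1;   // TTL expired: delete the packet from the network
        }
        return ttl - 1;  // forward with the TTL decremented (the Hops field
                         // would be incremented correspondingly)
    }
}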

Downloads are performed using HTTP; this is because HTTP is a well-established file transfer protocol that carries out the required purpose perfectly well. It is more suited to the application than a protocol such as FTP, as it requires no login process to be carried out.
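Because transfers are plain HTTP, a resumable download can be illustrated with an ordinary HTTP GET carrying a Range header. The URL below is a placeholder rather than a real servent address, and the request shown is a generic HTTP example rather than the exact request format a Gnutella servent issues.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangedDownload {

    public static void main(String[] args) throws Exception {
        // Placeholder URL of a file offered by another servent.
        URL url = new URL("http://host.example.com:6346/get/1/example.mp3");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ask for the file from byte 1024 onwards, e.g. to resume a
        // transfer that was interrupted after the first 1024 bytes.
        conn.setRequestProperty("Range", "bytes=1024-");

        InputStream in = conn.getInputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            // append the received bytes to the partially downloaded file here
        }
        in.close();
        conn.disconnect();
    }
}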