GDC Class Number: Xxx

GDC Class Number: Xxx

GDC Class ID Number 5501

Half-Life and TeamFortress Networking:

Closing the Loop on Scalable Network Gaming Backend Services

Yahn W. Bernier

Valve

520 Kirkland Way, Suite 200

Kirkland, WA 98033

(425) 889-9642

e-mail:
Overview:

This presentation focuses on some of the various back-end services you might wish to provide to your users as part of your game platform. The purpose of the presentation is to provide a sense for how these services can be designed, how they can be deployed, and hopefully, how you can avoid making incredibly painful mistakes in either their design or deployment.

More and more, having an on-line component to gaming is essential to the success of a title. In addition, as games become more of a consumer entertainment experience, making things easier on gamers becomes essential. The days of needing to do such things as manually type in IP addresses to connect to remote servers are coming rapidly to a close. Therefore, it is important for you to provide a seamless user experience for your gamers as they go on-line. To do this, you will possibly need to provision a variety of backend services. Doing so will dramatically increase the usability of your games, hopefully leading to increased sales and recognition.

This discussion will give you not only a sense of the many different backend services you should consider, but will also provide you with information about how some of these systems can be designed. Although much of the discussion is focused on a few specific kinds of backend services, the examples are relevant for any backend services you might need to deploy.

The main focus for this presentation will be on network messaging that is not directly related to the in-game flow of messages, though some design issues applicable to in-game flows will be covered. At the end, there will be a brief discussion of the PowerPlay industry initiative. PowerPlay is particularly relevant since it addresses Internet infrastructure problems and deployment problems that effect the ability of the Internet to handle both in-game and backend server traffic.

Game Master Server:

For any game having servers spread around the Internet and hosted by end-users, it will be important to have a way for your client software to find these servers. While this example fits the client / server model of first person shooter style games quite well, the general rules are applicable to the back-end services that other kinds of games could use. With this in mind, an interesting back-end service that you will probably want to deploy is a game master server.

A game master server is used to collect a database of active game servers available on the Internet. Your game client (or external “server browser” applications) can then query this collection when users are searching for on-line games to join. The basic networking requirements for a game master server are:

Ability to receive initiation / keepalive and termination of services messages from game servers; and

Ability to receive queries from game clients and return appropriate server identification information to those clients

Although this basic functionality requirements list is brief, there is a further set of additional services you might also choose to provide in deploying your game master server. For instance, you might want to support the capability to restrict the search with a set of search criteria.

Initiation / keepalive messages can be kept fairly small. The game server simply sends a packet to the master server using an agreed upon protocol. For instance, many games have chosen to signal “out of band” traffic by pre-pending a 32 bit integer negative one (0xffffffff) as a header to each query packet. The simplest initiation message in this scenario is then to send a single byte payload along with the packet header to signify that the game server is online. We used this mechanism in Half-Life. Thus, a packet having a five byte payload is about as small as these packets are going to get.

Because UDP packets also have a 28-byte packet header each, the simplest initiation / keepalive message weighs in at about 33 bytes (28 byte header plus 5 byte payload). Unless you are sending additional data only at startup of a server, it may be convenient to collapse keepalive and initiation messages into the same exact message. Also, rather than sending the server's IP address in the packet payload, we can just look at the IP address from which we received the packet to determine that information. This saves a few bytes and makes it marginally tougher to spoof the "from" address (more on that below). For Half-Life, the initiation and keepalive messages were the same. With this base message size in mind, traffic statistics can be estimated as follows (assuming a fairly popular on-line game having up to about 2500 active game servers reporting to the master server at any one time):

2500active game servers

300seconds per keepalive

33bytes per keepalive

= 275 bytes/second load and 8.33 transactions/second load

Using the above data, a game platform supporting this number of servers will create approximately (2500 * 33)/300 bytes/second in traffic to the game master server. This works out to about 275 bytes/second of inbound load on the server in the form of approximately 2500/300 or 8.33 transactions per second. Even a relatively low bandwidth connection at the master server should easily be able to accommodate the traffic coming in from game servers. In addition, termination messages are as simple as the basic keepalive message described so far (moreover, termination messages don't necessarily require sending any additional data even if initiation or keepalive messages do), and their frequency can be assumed to be quite low. Therefore, we'll assume that termination messages probably will not add much to the bandwidth requirements for the game master server.

Purging Servers From the List:

In the real world, game servers are known to crash or become disconnected fairly regularly. Therefore, the game master server must be able to purge non-responsive game servers from its list occasionally. Assuming that game servers send keepalive messages every five minutes, as in the example above, it is probably safe to discard the address of any game server that has not sent a keepalive message to the game master server over the last few multiples of the keepalive interval. For instance, if your keepalive messages come once every five minutes, then you would discard the server from the list if a keepalive message is not received at least once every fifteen minutes. Of course, receiving a termination packet causes the server to be discarded immediately. Because not all servers terminate cleanly, there are often a few non-responsive servers in the list of servers returned to users.

Denial Of Service Attacks on Game Master Server:

Unfortunately, the basic game master server just described (which is directly responsive to small keepalive messages from servers) is open to several straightforward Denial of Service (DoS) attacks. For instance, the hacking community is quite adept at spoofing the “from” address field in IP headers. Thus, the following operations could cause a lot of problems for a game master server:

while (< we think the gaming master is still alive >)

{

< create a random from address >

< send a fake "keepalive" message to master >

}

This kind of attack hurts the game master server by causing it to store a bunch of bogus IP addresses for purported game servers. One worry is that the attack would cause the machine to run out of memory. Assuming that each such record on the master occupies:

6 bytes IP address
4 bytes pointer to next server; and
4 bytes time of last server keepalive (for removing outdated servers)

Fortunately, in this example, at 14 bytes per record, it would be difficult, though not impossible, to force the game master server to die from running out of memory. This is especially true since you will almost certainly want to provision a high-end machine to act as the master server. Also, using this kind of attack, the malicious person might not be able to send enough keepalive messages before the master server starts removing old servers. Then again, the CPU load of iterating through a few hundred thousand servers and comparing timestamps every few seconds could still be a major problem. On the other hand, if your master server stores fairly substantial data about each server, then it is quite possible that the game master server could run out of resources.

The nastier problem with this attack, though, is that it makes the server lists returned to your game clients virtually useless. Not to mention that if the malicious user has added 100,000 or more bogus servers to your master server list that it would mean that every time a user asked for the server list that the master server would try to send them approximately 600Kb (100 K * 6 bytes per server) of data. The master server probably will not be able to serve up that amount of data if there are more than a few users querying it.

Challenge/Response System for Game Master Server:

This is obviously a major problem. One solution to this kind of attack is to implement a challenge/response system that servers must go through in order to be listed on the game master server. To implement a basic challenge/response system, the game master server must store, for each potential server to be listed:

IP address of the server
the time of the request to send a keepalive; and
a random number that must be returned to the server to allow the keepalive message

The master server creates this record (or updates the one for the same from IP address if it already exists) and then sends a packet to the game server containing the random number. The game server then must send a keepalive message to the master and must include the random number in that message. If the challenge request came in from a bogus IP address, then the requester would never receive the random number back from the game master server. In addition, if a keepalive message is received and the random number is wrong or the message is out of date (i.e., the challenge/response record is more than a couple of seconds old), then the keepalive message can be easily ignored.

Of course, even this system still has some vulnerability to a DoS attack where the attacker has sufficient bandwidth to occupy all of the “challenge” slots (depending on data structure used and number of slots allowed to be active and the timeout period on such slots) or simply to overwhelm the master server’s connection itself, but those kinds of attacks are, at least, fairly traceable. For example, id Software, creators of the popular Quake series of games, experienced such an attack in January 2000. The attackers were able to saturate two full DS3 (T3) lines of 45Mbps capacity. I.e., the attackers must have had access to a better than T3 capacity connection.

Additional Data:

The game master server described so far only encompasses the most basic functionality. Additional server specific data can be stored at the master server, especially as a way to streamline the amount of data sent back to users based on queries. This kind of tradeoff of processing and storage for bandwidth can often be a good idea. For instance, if the master server were to store current and maximum players, probably encoded as a byte or short each for most games, then the memory overhead wouldn’t go up very much. Nor would the size of the keepalive packets from game servers containing this data. With this data, queries from users could request that the game master server filter out empty or full servers. In this fashion, the size of the server list returned by the master server is reduced, thereby lowering the server's outgoing bandwidth requirements. In addition, by not always requiring the end-user to talk to each game server to discover the number of players, the load placed on your game servers in responding to information queries could also be reduced.

Finally, the design of the master server should be looked at from a network reliability point of view. For example, is there much consequence to having a keepalive message packet dropped? In the Half-Life case, we decided that there probably wasn't much of a consequence.

Master Server Responses to Clients:

The more important part of the game master server is the client query response portion. Unlike traffic from keepalive messages from servers, the load from querying by users can by quite staggering. The main purpose of the game master server is to send each requester a list of the IP addresses of the active or relevant game servers. The request for the list of servers can be as simple as the keepalive message, except that instead of a keepalive code, the client sends a “list servers” code. Estimating the frequency and number of such list requests to the game master server is a little more difficult.

For instance, if we assume a community having 10,000 users simultaneously on-line at peak times and that each such user requests a new list of all servers from the master server every 15 minutes, we can generate some useful load statistics. Based on these numbers, the game master server must respond to about 11.1 such requests per second. If the requests are about 33 bytes each, then that’s only 367 bytes or so per second of requests coming into the server. Where things become interesting is when we look at how expensive it is to the server to send out the responses.

Handling Multiple Packet UDP Responses

Assuming again that there are 2500 active game servers, sending back the server list to the client will require 2500 * 6 bytes per server, or approximately 15Kb of data per query request. Now, at 11.1 such requests to service per second, the bandwidth required to service this many queries is about 166.5Kb/sec (1.33Mbps). That data rate would nearly saturate a T1 line. Of course, there is a major flaw in this scenario as well, at least if we are using UDP. UDP packets of 15K in size are not supported. In general, anything above 1450 bytes or so is considered bad form for IP based packets such as UDP packets. Routers or other equipment often discard packets larger than about 1450 bytes. This raises the question of how to solve this problem.

There are at least two straightforward solutions to this problem. Both end up sending the server list to the requester in batches. In the first method, the master server stores off the requester’s IP address and over the next few seconds, sends batches of server IP addresses to the client. Assuming these packets will be at most 1450 bytes long, there is enough room for about 240 or so IP addresses per batch. For the example above, this would require about eleven such batches to communicate all 2500 servers to the client. How quickly should these packets be sent back to the client? That depends on two factors, the client’s inbound bandwidth capacity and the master server’s outbound bandwidth capacity. How do we know the bandwidth capacity of the requester? We don’t, but perhaps part of the request is the bandwidth capacity of the requester. Then the spacing of the packets is easy to figure out. For example, if the requester states that it can receive up to 2500 bytes per second on it’s link, then the master would determine the time for sending the next request as follows:

Next Packet Time = Current Time + Size of Current Packet / Remote Receiving Rate

Thus, for 1450 byte packets on a link that can handle 2500 bytes / s of data:

Next Packet Time = Current Time + (1450 bytes + 28 byte UDP header) / 2500

= Current Time + .5912 seconds

The update to this client will take place over several seconds. Of course, if the client’s link speed is stated incorrectly (especially overstated), the game master server could easily flood out the client’s connection. In addition, the requester would also require a mechanism for knowing how long to listen for additional data packets (e.g., embedding “packet # 1 of 6” in the responses, or signaling how many total servers will be forthcoming).

The second alternative is a bit simpler and avoids some of the potential problems noted above because it does not require that the master server remember any state info about the requester and it does not require that the game master server have any knowledge of the requester’s link speed. The tradeoff here is that it can take a bit longer to get the full list from the server (depending on the ping from the requester to the game master server). The second method is more or less a batch request-response method. In this method, the master server stores the servers in sequential order (an implementation detail) with some sort of sequence number associated with each server and the requester simply requests the next batch of servers starting with the server after the last sequence numbered server it received. In other words, the requester code is something like this: