Project Proposal: Dynamic Web Clustering with Apache 2

Advanced Distributed Systems

February 2007

Brian Duddie

Overview

Problem: Web sites can experience rapid fluctuations in traffic, which can spell disaster in a shared hosting environment, and obtaining dedicated resources capable of handling these traffic levels is often prohibitively expensive. There is currently no good solution for sharing the resources of many servers among many web sites efficiently and cost-effectively.

Goals: The primary goal of this project is to develop a smart, efficient load balancing solution for Apache 2.x with support for automatic instantiation and migration of web sites across cluster nodes, effectively scaling a web site's serving capacity on the fly. A secondary goal is to learn about the techniques used today to create high-performance, high-availability server clusters for internet applications.

Background of the Problem

Most often, web sites are hosted in a shared environment, with many web sites on the same machine. For the sake of cost reduction, no replication or load balancing is used. Each server is completely independent of the others, and is a single point of failure for all users hosted on it. Resources are oversold, and as a result a web site that uses more resources than average, even within its plan limits, is often suspended or removed.

One attempt at solving this problem is clustering through server mirrors and hardware load balancers. The main problem with this approach is cost. To provide reliability, at least two servers must mirror each other, reducing overall capacity. In addition, a hardware load balancer costs as much as a high-end server and must itself be duplicated to avoid becoming a single point of failure. To make up for the high cost of such a setup, resources are oversold even further. Web sites still only have access to the resources of their mirror group, even though the host may operate many more servers. This is not the best solution, and it does not sufficiently address scalability.

Today, clustering techniques do not take into account the dynamic nature of web traffic. Resource demand can explode in a moment, yet the allocation of resources must be done manually, statically, and well in advance of any demand. This approach assumes that web site administrators know in advance how much traffic their site will generate, and that they have the money, time, and technical knowledge required to set up a system capable of handling it. The reality is that web sites are often not ready for a traffic spike, and the spike can bring down the entire server and all services running on it. The site thereby loses the potential benefits that an increase in traffic brings, such as more repeat visitors, customers, and ad revenue.

Application of a Solution (Why it is important)

One notable example of traffic fluctuation is a link to a small site reaching the front page of a high-traffic news site such as Slashdot or digg. Under this situation, it is common for the number of unique visits per hour to increase more than 100-fold within minutes, then hold a sustained level at least 10 times greater than the norm for a week or more. This sudden influx of traffic often cripples the server, rendering it unresponsive for several hours. The phenomenon happens so often it is known as the "Slashdot effect" or "digg effect", respectively. With on-demand scaling, a web site experiencing this phenomenon would automatically expand capacity to meet demand, and no service interruption would occur.

Approach

To develop a solution, I will be pursuing a software-only approach to load balancing, replication, and dynamic scaling. A successful solution will allow any web site access to the resources of the entire cluster without requiring that it exist locally on all servers at all times. There are many issues that would need to be addressed in such a configuration, such as coordinating database servers; these are outside the scope of my project. My focus is limited to implementing software load balancing with the Apache 2 HTTP server, along with the issues associated with dynamically scaling the services Apache provides.
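As a starting point, Apache 2.2's own mod_proxy_balancer can distribute requests across backend nodes in software. The fragment below is a minimal sketch of such a setup; the node hostnames, port, and choice of the "bytraffic" method are illustrative assumptions, not part of this proposal's design.

```apache
# Hypothetical front-end balancer configuration (requires mod_proxy
# and mod_proxy_balancer, available in Apache 2.2).
<Proxy balancer://webcluster>
    # Backend nodes; names and port are placeholders.
    BalancerMember http://node1.internal:8080
    BalancerMember http://node2.internal:8080
    # Distribute by bytes transferred rather than request count.
    ProxySet lbmethod=bytraffic
</Proxy>

ProxyPass        / balancer://webcluster/
ProxyPassReverse / balancer://webcluster/
```

A static member list like this illustrates the balancing half of the problem only; the dynamic-scaling half would require adding and removing members as sites are instantiated or migrated.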

Solution Characteristics

Some desired characteristics of a practical solution, beyond the core requirements, are outlined below. In addition to being characteristics of a good solution, these are potential barriers to its feasibility, and must be overcome.

  • Efficient Algorithms: Efficient algorithms must be found or developed to perform decision making for load balancing and scaling. These algorithms must cause minimal overhead so as not to impact the latency of serving web requests. In addition, they must operate on information that can be gathered without impacting performance.
  • Responsiveness: A solution must be fast enough to respond to load quickly and effectively, so that little or no slowdown is experienced by the user. This must be true of both the load balancing aspect and the scaling aspect of the project.
  • Security: Web servers are under frequent attack. Common exploits involve taking advantage of directories with lax permissions, allowing a malicious user to place a script of their choosing in a user’s web directory. Since instantiation requires write access to web directories, care must be taken to ensure that this does not become an avenue for compromising a server or entire cluster.
  • Session Handling: Many web applications make use of server-side sessions. This issue must be addressed somehow, either by ensuring that once a user has started a session on one server, their subsequent requests are always routed to that server, or by making session data accessible to multiple servers.
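To make the routing decisions above concrete, the following sketch combines a least-loaded selection rule with hash-based session affinity. All names (Node, pick_node, the load scores) are hypothetical illustrations, not part of the proposed implementation; note also that hashing a session ID modulo the node count reassigns sessions whenever nodes are added or removed, which a real dynamic-scaling solution would need to handle (e.g. with consistent hashing).

```python
import hashlib

class Node:
    """A hypothetical cluster node with a periodically updated load score."""
    def __init__(self, name, load=0.0):
        self.name = name
        self.load = load  # e.g. a normalized mix of CPU use and connections

def pick_node(nodes, session_id=None):
    """Route one request.

    Requests carrying a session ID are hashed to a fixed node so that
    a session always lands on the same server; sessionless requests go
    to the currently least-loaded node.
    """
    if session_id is not None:
        digest = hashlib.md5(session_id.encode()).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]
    return min(nodes, key=lambda n: n.load)

nodes = [Node("web1", 0.7), Node("web2", 0.2), Node("web3", 0.5)]
print(pick_node(nodes).name)            # least-loaded node: "web2"
print(pick_node(nodes, "abc123").name)  # sticky: same node every time
```

The sessionless branch is where the "efficient algorithms" requirement bites: the load scores it consults must be cheap to gather and propagate, or the balancer itself becomes the bottleneck.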