ITC308 Group 6
Project Demonstration and Documentation
High-performance web crawler
Alexander
10/4/2015

Table of Contents

Demonstration of required functionality

Evaluating accomplishment of business goal №1: Design reliable database structure.

Evaluating accomplishment of business goal №2: Design software which automates gathering of price information

Evaluating accomplishment of business goal №3: Deliver a fast and reliable way of accessing the database

Evaluating accomplishment of business goal №4: Prevent unauthorized access/modification of data stored in the database

Evaluating achievement of goal №5: Secure the database from accidental crashes, provide a way to quickly recover the database

Test documentation

Crawler reliability test

Database reliability test

Database performance test

Crawler performance test

User training manual

Crawler software

General information

Crawler configuration

Crawler launch parameters

Crawler output

MySQL Database

Connection to Amazon AWS instance

Maintenance manual

References

Demonstration of required functionality

To demonstrate that the project has achieved all of the business requirements, we evaluate the requirements established during the ITC306 subject:

BUSINESS GOAL / EVIDENCE OF GOAL ACHIEVED
Design reliable database structure / The database has a correct structure that keeps complete information about products while remaining free of redundant information that could lead to data anomalies
Design software which automates gathering of price information / Reliable, fast software automatically collects price information and stores it in the database, eliminating the need to update this information manually
Deliver a fast and reliable way of accessing the database / The database is stored on a high-performance, reliable server with constant internet access and maximum uptime, minimizing the risk of the database becoming unavailable
Prevent unauthorized access/modification of data stored in the database / Correct database user access rights are set up to prevent unauthorized access or accidental modification of information by a client
Secure the database from accidental crashes, provide a way to quickly recover the database / The database is backed up at constant time intervals, with automatic recovery in the event of a disaster

Evaluating accomplishment of business goal №1: Design reliable database structure.

To create a reliable database structure free of redundant information, we used database normalization techniques. By analysing what kind of information needs to be stored in the database, we arrived at the optimal database structure (Fig. 1).

Figure 1

The product information is stored in two tables: products and prices. The two tables are related via a one-to-many relationship, and this design choice is intentional: because prices change over time, the database can keep a price-change history by creating multiple entries in the prices table associated with the same product. This design also prevents the entry of duplicate product or price information, which results in a smaller database and better performance. Combined, these factors produce a reliable database structure, thus meeting the established business goal.
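
To illustrate the structure, a simplified SQL sketch of the two tables is given below. The column names follow those used in the queries later in this document; ProductName and the exact data types are assumptions made for illustration only and may differ from the deployed schema (refer to Fig. 1 and Fig. 5 for the authoritative structure).

CREATE TABLE products (
    ProductId   INT UNSIGNED NOT NULL,
    ProductName VARCHAR(255) NOT NULL,          -- illustrative column; see Fig. 5 for the real schema
    PRIMARY KEY (ProductId)
);

CREATE TABLE prices (
    PriceId             INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- illustrative surrogate key
    ProductId           INT UNSIGNED NOT NULL,
    ProductPrice        DECIMAL(10,2),
    ProductSpecialPrice DECIMAL(10,2),
    ProductMultiPrice   DECIMAL(10,2),
    PriceTimestamp      DATETIME NOT NULL,
    PRIMARY KEY (PriceId),
    FOREIGN KEY (ProductId) REFERENCES products (ProductId)    -- one product, many price rows
);

Each crawl that detects a price change adds a new row to prices, so the full price history of a product is preserved without duplicating product records.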

Evaluating accomplishment of business goal №2: Design software which automates gathering of price information

During the project programming stage, we created a high-performance automated web crawler. The crawler accomplishes several goals: it creates a list of all the products present on the supermarket’s website, and it gathers, analyses and updates product and price information by comparing data received from the web server with the data stored in the database. A lot of effort was put into maximizing the speed of the crawling process; high performance was achieved by using multiple threads that access the website simultaneously and pass the information to another set of threads that work with the database itself (Fig. 2).

Figure 2
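
The sketch below illustrates this two-stage threading model in C# (the crawler is a .NET 4.5 application). It is a minimal sketch only: the class name, the helpers FetchAndParse and UpdateDatabaseIfChanged, and the queue size are assumptions for illustration, not the actual crawler source.

using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class CrawlerPipelineSketch
{
    // Hypothetical helpers standing in for the real crawler internals.
    static string FetchAndParse(int productId) { return "product data for " + productId; } // HTTP request + HTML parsing
    static void UpdateDatabaseIfChanged(string record) { /* SELECT, compare, INSERT */ }

    static void Run(int[] productIds, int fetchThreads, int dbThreads)
    {
        var ids = new ConcurrentQueue<int>(productIds);
        var parsed = new BlockingCollection<string>(boundedCapacity: 1000);

        // Stage 1: several threads access the website simultaneously.
        var fetchers = Enumerable.Range(0, fetchThreads).Select(_ => Task.Run(() =>
        {
            int id;
            while (ids.TryDequeue(out id))
                parsed.Add(FetchAndParse(id));
        })).ToArray();

        // Stage 2: a second set of threads works with the database.
        var writers = Enumerable.Range(0, dbThreads).Select(_ => Task.Run(() =>
        {
            foreach (var record in parsed.GetConsumingEnumerable())
                UpdateDatabaseIfChanged(record);
        })).ToArray();

        Task.WaitAll(fetchers);
        parsed.CompleteAdding();   // tell the database threads that no more records are coming
        Task.WaitAll(writers);
    }
}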

The crawling process is fully automated and requires no human intervention (other than setting up the configuration file, which is covered in the User training manual part of this document). Therefore, we can conclude that business goal №2 has been achieved.

Evaluating accomplishment of business goal №3: Deliver a fast and reliable way of accessing the database

At the current project stage, all product information is hosted on MySQL Community Server 5.7.8 running on Amazon AWS. The MySQL server is paired with a slave MySQL server used for data replication (Fig. 3).

Figure 3

The slave MySQL server is important for several reasons: it replicates the data, creating a live backup of the MySQL master; it distributes the server load, improving performance; and it provides flexibility (the client can pull data from several servers). Therefore, by using MySQL (which offers great performance and reliability) in a master-slave configuration running on Amazon AWS, we have created a fast and reliable method of accessing the database.
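
For reference, configuring a replication slave in MySQL 5.7 amounts to statements of the following form; the host, account and binary log coordinates below are placeholders, not the actual values used on our servers.

CHANGE MASTER TO
    MASTER_HOST = 'master.example.internal',   -- placeholder address of the master instance
    MASTER_USER = 'repl',                      -- placeholder replication account
    MASTER_PASSWORD = '********',
    MASTER_LOG_FILE = 'mysql-bin.000001',      -- binary log coordinates taken from the master
    MASTER_LOG_POS = 4;
START SLAVE;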

Evaluating accomplishment of business goal №4: Prevent unauthorized access/modification of data stored in the database

To achieve this business goal, we used strong passwords (20 randomly generated characters drawn from alphanumeric and special characters) for both the administrative and client accounts. On top of that, we disabled all unencrypted connections to the MySQL server, which prevents man-in-the-middle attacks. Lastly, we set up correct access rights for the client’s MySQL account, which allow the client to read the whole database while denying any kind of data modification, preventing accidental corruption of the database by the client’s software. Thus, the current goal has been achieved.
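
As an illustration of this access-rights setup (the account name and password shown are placeholders only), a read-only, SSL-only client account in MySQL 5.7 is created along the following lines:

-- Placeholders only: the real account name and the 20-character random password are not shown in this document.
CREATE USER 'client'@'%' IDENTIFIED BY '<random 20-character password>' REQUIRE SSL;
GRANT SELECT ON woolworths.* TO 'client'@'%';   -- read access to the whole database, no modification rights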

Evaluating achievement of goal №5: Secure the database from accidental crashes, provide a way to quickly recover the database

This goal is related to the previous business goals: as established above, the database is protected from accidental crashes and can be recovered quickly using the MySQL replication feature, while the Amazon AWS backup system provides an additional level of security.
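
In addition to replication and the AWS backups, a logical dump of the database can be taken manually at any time; a typical command pair (the account name is a placeholder) is shown below. The first command creates a consistent dump of the woolworths schema, the second restores it into an empty schema.

mysqldump --single-transaction -u admin -p woolworths > woolworths_backup.sql
mysql -u admin -p woolworths < woolworths_backup.sql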

As established above, the project meets all of the client’s business goals.

Test documentation

To evaluate the system, we used several reliability and performance tests.

Crawler reliability test

To test the reliability of the core of this project – the crawler – the process of discovering product IDs and retrieving product information was repeated several times. Refer to Table 1 below for the results:

Date of the test / Number of discovered products / Number of products saved to the database / Number of crawling errors
25.09.2015 / 65197 / 65197 / 0
27.09.2015 / 65197 / 65197 / 0
28.09.2015 / 65197 / 65197 / 0
29.09.2015 / 65197 / 65197 / 0
30.09.2015 / 65197 / 65197 / 0

Table 1

As one can see, the current crawler implementation is completely reliable. The crawler design anticipates crawling errors (for example, if the website goes down for a short period of time): the crawler recognizes errors and retries until the product information is retrieved.
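
The retry behaviour can be pictured with the following C# sketch. It is illustrative only: the real crawler may use a different HTTP client and a different delay between attempts.

using System.Net;
using System.Threading;

static string DownloadWithRetry(string productUrl)
{
    // Keep requesting the product page until it succeeds, pausing briefly between
    // attempts so that a short website outage does not abort the crawl.
    var client = new WebClient();
    while (true)
    {
        try
        {
            return client.DownloadString(productUrl);
        }
        catch (WebException)
        {
            Thread.Sleep(5000);   // wait five seconds, then retry
        }
    }
}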

Database reliability test

As stated previously in this report, what makes a database reliable is proper design. The current database structure prohibits duplicate entries, which ensures database consistency. During the development of the crawler, database reliability was tested multiple times using the ‘black box’ method. According to Mariani, Pezze, Riganelli, & Santoro (2012), this method consists of verifying data being added to the database (INSERT operations) as well as verifying the results of SQL SELECT queries. Because the crawler compares price data fetched from the website with the data stored in the database by constantly executing SELECT queries, and writes updated information by executing INSERT operations, the crawler in this case acts as a black-box testing device.

Date of the test / Number of INSERT operations / Number of SELECT operations / Number of database errors
25.09.2015 / 130394 / 65197 / 0
27.09.2015 / 3 / 65197 / 0
28.09.2015 / 0 / 65197 / 0
29.09.2015 / 0 / 65197 / 0
30.09.2015 / 2 / 65197 / 0

Table 2

As one can see from the table, each crawling session results in a certain number of INSERT and SELECT queries. INSERT queries occur when new information is added to the database; SELECT queries are used to find and verify information stored in the database. No errors were reported by the MySQL engine, which establishes that the database reliability test was passed successfully.
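
As an illustration of this pattern, a single product update amounts to a query pair of the following form (column names as in Fig. 5; the literal values are made up for the example):

-- Read the most recent stored price for the product being crawled.
SELECT ProductPrice, ProductSpecialPrice, ProductMultiPrice
FROM prices
WHERE ProductId = 705717
ORDER BY PriceTimestamp DESC
LIMIT 1;

-- If the freshly crawled price differs, record the change as a new history row.
INSERT INTO prices (ProductId, ProductPrice, ProductSpecialPrice, ProductMultiPrice, PriceTimestamp)
VALUES (705717, 4.50, NULL, NULL, NOW());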

Database performance test

To test database performance, we used the MySQL Workbench software to evaluate the server load in real time. Refer to Fig. 4 for details:

Figure 4

Using 30 threads, the database executes about 25 queries per second. Because MySQL caches frequently used tables by keeping them in RAM, the database shows a 95% cache read efficiency rate, which means only 5% of reads actually go to the hard drive. Currently, the database system shows no sign of slow or sluggish performance, which means it can easily be scaled to accommodate more clients.
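
A comparable figure can be cross-checked directly on the server from the InnoDB buffer pool counters:

-- Innodb_buffer_pool_read_requests counts logical reads; Innodb_buffer_pool_reads counts
-- reads that had to go to disk. Efficiency = 1 - reads / read_requests.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';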

Crawler performance test

To evaluate crawling performance, a performance indicator called “products per second crawled” was built in. Because the crawler can run several threads at the same time, we set out to find the optimal number of threads (Table 3; the test was run during evening hours):

Number of threads / Products processed per second (avg)
1 / 4
2 / 9
3 / 11
4 / 12
5 / 14
10 / 14
15 / 16
20 / 18
25 / 19
30 / 20
50 / 20

Table 3

As one can see from the table, the crawling speed increases as the number of threads rises, peaking when the number of threads is around 30. At this point we suspected that the performance of the crawler is largely affected by how busy the website is at different times of day. To test this assumption, we ran the crawling process at different times of day using the same number of threads (30) (Table 4):

Date and time of the test / Crawling speed
25.09.2015, 16:55 / 18 products/second
27.09.2015, 22:18 / 17 products/second
28.09.2015, 13:45 / 18 products/second
29.09.2015, 1:34 / 21 products/second
30.09.2015, 6:32 / 23 products/second

Table 4

As one can see from the table, the crawling speed never drops below 17 products/second, reaching its maximum during late night/early morning hours. Therefore, it is recommended that the client run the crawling process at night in order to maximize performance.

User training manual

Crawler software

General information

The crawler is the central piece of the project: it gathers information from the web server and saves it to the database. The crawler module is a Windows console executable, compiled with Microsoft Visual Studio 2013 and targeting the Microsoft .NET Framework 4.5. The .NET Framework is required to run this software. All other necessary libraries are supplied with the crawler.

Crawler configuration

The crawler uses a config.ini file stored in the same directory for its configuration options. A sample config.ini file is provided below:

[Crawler]
mysqlhost=185.24.63.85
mysqlport=50000
mysqldatabase=woolworths
mysqluser=username
mysqlpassword=password
mysqlmode=secure

The config.ini file is used to set up the database connection options. At start-up, the crawler reads the configuration file and uses this information to connect to the database. The configuration options are described below:

mysqlhost – specifies the address (IP or domain name) of the MySQL Server

mysqlport – specifies the port number the MySQL Server is listening on

mysqldatabase – specifies the database schema used to access the database

mysqluser – specifies the username to be used while working with MySQL

mysqlpassword – specifies password for that specific user in order to log in to the account

mysqlmode – specifies whether the crawler should use an encrypted connection to the database. If this field is set to ‘secure’, an encrypted connection is used; if it is set to ‘insecure’, a regular unencrypted connection is used.

Crawler launch parameters

The crawler can be launched with the following parameters by specifying them at the command prompt after crawler.exe, separated by spaces, for example: crawler.exe /fetch /time=3:00 /threads=15. The following list describes each launch parameter; a combined example is given after the list:

/discovery – starts the crawler in product ID discovery mode. The crawler generates a list of the product IDs present on the website and saves it to a file called productid.txt in the crawler directory.

/fetch – starts the crawler in product information fetch mode. This mode uses the productid.txt file generated in discovery mode to retrieve/update product and price information and submit it to the MySQL database.

/threads – followed by a number, specifies how many threads to use for the discovery/fetch process. Refer to Table 3 to find the optimal number of threads.

/time – delays the crawling process until a specified point in time. Use the notation hh:mm (24-hour format), where hh stands for hours and mm for minutes. For example, /time=3:00 will start the crawling process at 3 am.

/schedule – sets up a scheduler to run the crawling process repeatedly, without the need to re-launch the crawler every time. Use the notation nnp, where nn is a number and p is either h (hours) or d (days). For example, /schedule=3d will set up the scheduler to run the crawling process every 3 days.
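
For example, to generate the product ID list and then fetch price information every night at 2 am using 30 threads, the crawler could be launched as follows (assuming the parameters combine as in the example at the start of this section):

crawler.exe /discovery /threads=30
crawler.exe /fetch /threads=30 /time=2:00 /schedule=1d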

Crawler output

During the discovery/fetch process, the crawler displays information about the ongoing operation, as shown in Fig. 2. The following information is displayed: the number of products processed so far; the number of products left to process; the current crawling speed, expressed in products processed per minute; the approximate time remaining to complete the process, in minutes; and the MySQL query being executed at the present moment.

During the discovery process, the discovered product IDs are saved to the productid.txt file located in the same directory as the crawler executable. During the fetch process, the crawler uses the product IDs stored in productid.txt to retrieve product information and store it in the MySQL database.

Any errors encountered during the discovery/fetch process are logged to the error.txt file located in the same directory as the crawler executable; this file can be sent to the developer team to help troubleshoot problems.

MySQL Database

Product information is stored in MySQL Community Server 5.7.8 running on an Amazon AWS instance. In order to use the database in applications, one should examine the database structure depicted in Fig. 5:

Figure 5

Please note the data types used for storing product information so that your application handles them correctly (e.g. all prices are stored as decimals with 2 digits after the decimal point). To connect to the database, use the server address, username and password supplied to you. To retrieve information from the database, a valid MySQL query must be created. If you are having problems creating a particular query, contact our team for support. Some example SQL queries are provided below:

SELECT ProductPrice, ProductSpecialPrice, ProductMultiPrice FROM prices WHERE ProductId = 705717; – retrieves all prices for a specific product ID

SELECT * FROM products ORDER BY ProductId; – retrieves all products from the database (without prices)

SELECT ProductPrice, ProductSpecialPrice, ProductMultiPrice FROM prices WHERE ProductId = 705717 ORDER BY PriceTimestamp DESC LIMIT 1; – retrieves the latest price information for a specific product ID
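
For client applications built on .NET, a minimal example of running one of these queries with MySQL Connector/NET is sketched below. The server address, port, schema, username and password are the placeholders from the config.ini sample earlier in this manual; substitute the details supplied to you.

using System;
using MySql.Data.MySqlClient;   // MySQL Connector/NET

class PriceLookupExample
{
    static void Main()
    {
        // Placeholder connection details; use the server address, username and password supplied to you.
        const string connectionString =
            "Server=185.24.63.85;Port=50000;Database=woolworths;" +
            "Uid=username;Pwd=password;SslMode=Required";

        using (var connection = new MySqlConnection(connectionString))
        {
            connection.Open();
            const string sql =
                "SELECT ProductPrice FROM prices " +
                "WHERE ProductId = @id ORDER BY PriceTimestamp DESC LIMIT 1";

            using (var command = new MySqlCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@id", 705717);
                object latestPrice = command.ExecuteScalar();   // null if the product has no price rows
                Console.WriteLine("Latest price: {0}", latestPrice);
            }
        }
    }
}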

Connection to Amazon AWS instance

To connect to the Windows instance, use the provided .RDP file containing the connection properties to set up a connection with the Remote Desktop app (Fig. 6). After the connection is established, you can edit the crawler options and/or launch a crawler instance.

Figure 6

Please note that we do not provide access details for the Linux AWS instances running the MySQL server; those instances will be managed by our team.

Maintenance manual

Because the entire crawler infrastructure (the Windows and Linux Amazon AWS instances) will be managed by our team, no user maintenance is required. Any problems encountered while working with the crawler or the database should be reported to our team. Depending on the severity of the issue, our team may take up to 24 hours to fix it.

References

Mariani, L., Pezze, M., Riganelli, O., & Santoro, M. (2012, April). AutoBlackTest: Automatic black-box testing of interactive applications. In Proceedings of the 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation (ICST) (pp. 81-90). IEEE.