Predicting Customers’ Decision on Purchasing in Web Stores

Abstract

Understanding customers’ purchase decisions is an important part of understanding the buying process in web stores. Predicting whether a customer will buy a product in a web store provides useful information to service providers. As the e-commerce market grows, many studies of customers’ buying behavior have been conducted, but few of them address the prediction of purchase decisions on the Internet. In this study, we use clickstream data, the record of users’ activity on the Internet, and propose an approach that predicts customers’ purchase decisions. Clickstream data, also called web log data, carries important information about on-line transactions. On-line business differs from off-line business in that it can easily collect information on customers’ behavior, and this clickstream data is useful for understanding users’ behavior on web sites.

Keywords:

Predicting purchase decision, clickstream data, web log data

Introduction

The trade in goods and services has diversified substantially since the appearance of the Internet. Many companies, regardless of size and type, manage on-line channels along with their traditional outlets. Web stores (also known as on-line virtual stores, on-line shopping malls, etc.) are expected to add a new distribution channel, especially for B2C retailers.

Customer behavior on the Internet differs from that in a traditional retail market. Consumer behaviors on the Internet are more dynamic and have more distinct features. The behaviors of visitors to Internet stores can be categorized into several types: buying, searching, browsing, and learning (Bucklin et al., 2002; Van den Poel and Buckinx, 2005).

Although the Internet is a virtual space, there have been many efforts to capture the otherwise unseen behavior of Internet users. Among them, much research has paid attention to clickstream data. Clickstream data, also called web log data, is the history of web users’ behavior on the Internet. It provides information not only about purchase behavior but also about the user’s trajectory through a web store, which is otherwise difficult to observe (Moe and Fader, 2004). Clickstream data can be applied to understanding visitors’ patterns or designing product recommendation systems, and it has the potential to enhance CRM (Customer Relationship Management).

In this paper, we present a model to predict whether a user will buy products from the web store he or she is visiting. Specifically, the purpose of this study is to classify users’ behavior into two modes (search mode and purchase mode). It examines the possibility of identifying the moment when a user decides to buy a chosen product after searching for, collecting, and learning information about it.

Users’ Behavior in Internet

The three salient characteristics of on-line user behavior are 1) low contact costs, 2) high switching rate, and 3) shopping behaviors of all sorts.

First, in the traditional market customers incur transportation costs and travel time to reach a store. Internet users, however, can visit web stores easily, even without any intention to buy, because entering a web store takes little time and effort. Therefore, on-line customers often defer their purchases and revisit later to buy (Moe and Fader, 2004).

Second, users can move easily from store to store on the Internet. These low search costs result in frequent comparison of prices, quality, and distances among many websites. Customers are therefore highly likely to change their regular store or patronage.

Third, on-line consumers show a wide range of shopping behaviors. Traditional shopping behaviors are divided into exploratory searches and goal-directed searches (Janiszewski, 1998). Moe (2003) extends this typology to four shopping behaviors: directed buying, hedonic browsing, search/deliberation, and knowledge building.

This range of Internet user behaviors can be applied to modeling the products and customers of web stores.

Clickstream Data

Traditionally, research has focused on purchase datasets. Clickstream data, however, also holds non-purchase data such as visit characteristics, types of individual behavior, and demographics. Such datasets are difficult to handle because they are very large and complicated, with many components.

There are four sources of clickstream data (Shin et al., 2002). First, retailer data comes from the log files on the web servers of on-line vendors. Second, shopbot data is users’ trajectory data recorded by a shopbot that mediates consumers’ movements. Third, experimental data is collected for analysis through an artificial experimental design. Finally, panel data is collected by Internet survey companies through software embedded in panelists’ PCs.

Data Adjacency

A decision space contains associations and relationships that can be represented as adjacent data. If item A and item B are purchased at the same time, they can be regarded as adjacent. This concept of adjacency can be applied to product recommendation or object visualization on websites (Błażewicz et al., 2005; Condon et al., 2002).

In graph theory, a graph and its matrix represent the relationship structure among objects. This representation is especially advantageous in combinatorial problems, such as handling large datasets or data associations.

A graph consists of two components, vertices and arcs. A simple graph is given in Figure 1. The graph is a set of adjacent data and can be transformed into an adjacency matrix. An adjacency matrix is an n × n square matrix, A = (αij), where αij = 1 if item i is adjacent to item j, and αij = 0 otherwise. Figure 2 presents the adjacency matrix for the graph in Figure 1.

Figure 1 – A graph example

Item / A / B / C / D / E
A / 0 / 1 / 1 / 1 / 1
B / 1 / 0 / 1 / 1 / 0
C / 1 / 1 / 0 / 0 / 1
D / 1 / 1 / 0 / 0 / 1
E / 1 / 0 / 1 / 1 / 0

Figure 2 – An adjacency matrix of Figure 1

Figures 1 and 2 show a basic graph and a table of the relationships among the objects in that graph. More developed types of graphs are either directed or undirected; in a directed graph, the arcs are replaced with arrows. In addition, frequency can be added to data adjacency graphs.
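The transformation from a graph to its adjacency matrix can be sketched in a few lines of Python. This is only an illustration: the edge list is read back from the matrix in Figure 2, and the names `items`, `edges`, and `adjacency_matrix` are our own and do not come from the original study.

```python
# Undirected edges of the example graph, read off the matrix in Figure 2.
items = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
         ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]


def adjacency_matrix(items, edges):
    """Return an n x n list of lists with 1 where two items are adjacent."""
    index = {name: k for k, name in enumerate(items)}
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected graph: mirror the entry
    return matrix


A = adjacency_matrix(items, edges)
for name, row in zip(items, A):
    print(name, row)
```

For a directed graph, the mirrored assignment would be dropped so that only the entry for predecessor row and successor column is set.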

Extended Data Adjacency Matrix (EDAM)

On-line customers are likely to move sequentially from item to item in web stores. Most web stores also arrange their pages into categories and products (Figure 3). We assume a web store with two categories, each containing its own products.

* Cn = Category n, Pm = Product m

Figure 3 – An assumed web store

Given a session on this website, we use our model to express the detail of the clickstream data in the session (a session means the entire process from the initial visit to a site to the exit from it). Now assume that a user acts in this web store as in Figure 4. In Figure 4, the number in parentheses is the duration (in seconds), which indicates how long the user stayed on the web page. The matrix in Figure 5 is grounded in graph theory and data adjacency, and we call it the Extended Data Adjacency Matrix (EDAM).

* The number in parentheses is the duration (in seconds)

Figure 4 – A session sample

Item / C1 / P1 / P2 / C2 / P4 / P5 / P6 / Sum
C1 / 0, 1 / 1, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 1, 1
P1 / 0, 0 / 0, 5 / 1, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 1, 5
P2 / 0, 0 / 0, 0 / 0, 5 / 1, 0 / 0, 0 / 0, 0 / 0, 0 / 1, 5
C2 / 0, 0 / 0, 0 / 0, 0 / 0, 1 / 1, 0 / 0, 0 / 0, 0 / 1, 1
P4 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 10 / 1, 0 / 0, 0 / 1, 10
P5 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 10 / 1, 0 / 1, 10
P6 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 0 / 0, 15 / 0, 15
Sum / 0, 1 / 1, 5 / 1, 5 / 1, 1 / 1, 10 / 1, 10 / 1, 15 / 6, 47

* Each cell is written as upper value, lower value (link frequency, duration in seconds)

Figure 5 – Extended Data Adjacency Matrix (EDAM) based on Figure 4

Each element αij of EDAM has two parts: the upper part is the frequency of the link between two items, and the lower part is the duration. As Figure 4 shows, the session is a directed graph, so direction must be represented in EDAM. We assume that, for two items i and j, the row i is the predecessor and the column j is the successor. For example, in the element in row C2 and column P4, the upper value (1) is the link frequency between C2 and P4, where C2 is the predecessor and P4 the successor (C2 → P4). The duration of an item appears on the diagonal of EDAM; e.g., a diagonal element whose lower value is 10 indicates that the user stayed on that item for ten seconds in the session. The last row and the last column of EDAM hold the sums. Every session in a clickstream dataset has its own EDAM, and the EDAMs of all sessions are added together to form one large EDAM for the whole web store.
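The construction described above can be sketched in Python as follows. This is a hedged illustration rather than the implementation used in the study: the session is the one in Figure 4 as recovered from Figure 5, and keeping the frequency and duration parts in two separate matrices (`freq`, `dur`) is our own simplifying assumption, as are the function names `build_edam` and `add_matrices`.

```python
# Session of Figure 4 as (page, duration in seconds), in visit order.
session = [("C1", 1), ("P1", 5), ("P2", 5), ("C2", 1),
           ("P4", 10), ("P5", 10), ("P6", 15)]
pages = ["C1", "P1", "P2", "C2", "P4", "P5", "P6"]


def build_edam(session, pages):
    """Return (freq, dur): link frequencies and diagonal durations."""
    idx = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    freq = [[0] * n for _ in range(n)]  # upper part of each EDAM cell
    dur = [[0] * n for _ in range(n)]   # lower part of each EDAM cell
    for page, seconds in session:
        dur[idx[page]][idx[page]] += seconds  # duration sits on the diagonal
    for (src, _), (dst, _) in zip(session, session[1:]):
        freq[idx[src]][idx[dst]] += 1  # row = predecessor, column = successor
    return freq, dur


def add_matrices(a, b):
    """Element-wise sum, used to merge session EDAMs into a store EDAM."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


freq, dur = build_edam(session, pages)
total_links = sum(map(sum, freq))    # bottom-right upper value in Figure 5
total_seconds = sum(map(sum, dur))   # bottom-right lower value in Figure 5
print(total_links, total_seconds)
```

Merging the EDAMs of several sessions into a store-level EDAM then amounts to calling `add_matrices` on the frequency parts and on the duration parts of each session in turn.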

We need to know whether a session leads to a purchase or not. For this purpose, we need two EDAMs, one representing “to buy” sessions and the other representing “not to buy” sessions. Then, given a session, we estimate which of the two EDAMs represents it better. The degree of linkage in the graph of a session is measured by its density. For the session in Figure 4, the set of pairs of adjacent data is given in Figure 6.

Order / Adjacent pair / Sharing time
1st / C1 → P1 / 6
2nd / P1 → P2 / 10
3rd / P2 → C2 / 6
4th / C2 → P4 / 11
5th / P4 → P5 / 20
6th / P5 → P6 / 25

Figure 6 – Pairs of adjacent data in the session in detail

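In an undirected graph with vertices {v1, v2, …, vn}, the largest possible number of arcs is n(n-1)/2, whereas in a directed graph it is n(n-1). Because a session is a directed graph, its summed sharing time is normalized by n(n-1). The adjacency points (AP) of a session such as the one in Figure 4 are therefore given by equation (1).

AP = \frac{\sum_{(i,j)} s_{ij}}{n(n-1)}    (1)

* n is the number of all vertices, and \sum_{(i,j)} s_{ij} is the summation of the sharing time s_{ij} between two adjacent vertices i and j, i.e., the sum of the durations of the two items (Figure 6).

For example, the adjacency points of Figure 4 are {(1+5)+(5+5)+(5+1)+(1+10)+(10+10)+(10+15)} / {7×(7-1)} = 78/42 ≈ 1.86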
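The worked example can be reproduced with a short Python sketch. It rests on two assumptions of ours that the text does not state explicitly: n is counted as the number of distinct pages in the session, and the sharing time of a pair is the sum of the durations of its two consecutive visits, as in the arithmetic above. The function name `adjacency_points` is illustrative.

```python
# Session of Figure 4 as (page, duration in seconds), in visit order.
session = [("C1", 1), ("P1", 5), ("P2", 5), ("C2", 1),
           ("P4", 10), ("P5", 10), ("P6", 15)]


def adjacency_points(session):
    """Summed sharing time of consecutive pairs divided by n(n-1)."""
    n = len({page for page, _ in session})  # assumption: distinct pages
    if n < 2:
        return 0.0  # no directed arc is possible with fewer than two vertices
    sharing = [a + b for (_, a), (_, b) in zip(session, session[1:])]
    return sum(sharing) / (n * (n - 1))


ap = adjacency_points(session)
print(round(ap, 2))  # 78 / 42, i.e. about 1.86
```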

Results and Discussion

We acquired a clickstream dataset and divided it into two datasets, one of sessions that ended in a purchase and the other of sessions that did not. The two datasets contain the same number of sessions. We then split the data into training data (80%) and testing data (20%). To validate the suggested model, we compared it with three other methods on these datasets.

① AvgVisitTime: Total time of a session/No. of items visited

② SwitchCat: Total number of switching categories/No. of items visited

③ Re-visit: Total number of specific pages which are revisited

④ AP: Adjacency Points(in our model)
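The three comparison measures can be sketched in Python as below. The definitions leave some details open, so this sketch makes explicit assumptions: "items visited" is taken as the number of page views in the session, a category switch is counted whenever two consecutive pages belong to different categories, and Re-visit counts the distinct pages seen more than once. The mapping `category_of` is hypothetical and is inferred from the visit order in Figure 4.

```python
from collections import Counter

# Session of Figure 4 as (page, duration in seconds), in visit order.
session = [("C1", 1), ("P1", 5), ("P2", 5), ("C2", 1),
           ("P4", 10), ("P5", 10), ("P6", 15)]

# Hypothetical page-to-category mapping (an assumption for illustration).
category_of = {"C1": "C1", "P1": "C1", "P2": "C1",
               "C2": "C2", "P4": "C2", "P5": "C2", "P6": "C2"}


def avg_visit_time(session):
    """Total session time divided by the number of page views."""
    return sum(seconds for _, seconds in session) / len(session)


def switch_cat(session, category_of):
    """Category switches between consecutive pages per page view."""
    cats = [category_of[page] for page, _ in session]
    switches = sum(1 for a, b in zip(cats, cats[1:]) if a != b)
    return switches / len(session)


def revisit(session):
    """Number of distinct pages that appear more than once."""
    counts = Counter(page for page, _ in session)
    return sum(1 for c in counts.values() if c > 1)


print(avg_visit_time(session), switch_cat(session, category_of), revisit(session))
```

For the session in Figure 4 this gives 47/7 seconds per page view, one category switch (P2 to C2) over seven page views, and no revisited pages.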

To investigate the effectiveness of these methods, we conducted experiments with each of the four methods using the test data. Table 1 shows the results.

Table 1 – Comparison of Prediction Accuracy (%) of four methods

Methods / Data / 1st trial / 2nd trial / 3rd trial / 4th trial / 5th trial / Average of Prediction (%)
AvgVisitTime / Train / 69.1 / 69.1 / 70.3 / 70.5 / 70.7 / 69.9
AvgVisitTime / Test / 71 / 75 / 70 / 67 / 68 / 70.2
SwitchCat / Train / 59 / 58.4 / 56.8 / 58.6 / 56.6 / 57.9
SwitchCat / Test / 58 / 58 / 64 / 59 / 65 / 60.8
Re-visit / Train / 59.6 / 60.8 / 61 / 59.2 / 59.4 / 60.0
Re-visit / Test / 62 / 56 / 55 / 64 / 63 / 60.0
AP / Train / 69.9 / 70.5 / 70.1 / 70.5 / 70.7 / 70.3
AP / Test / 74 / 74 / 75 / 71 / 67 / 72.2

From the experimental results, we conclude that data adjacency based on graph theory predicts whether a customer will buy a product in an Internet shop better than the other methods, with the highest average test accuracy of 72.2%.
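The averages behind this comparison can be rechecked with a small Python sketch. It uses only the test accuracies reported in Table 1; the variable names are ours, and the per-trial comparison is an additional reading of the same numbers rather than an analysis from the original study.

```python
# Test accuracies (%) per trial, copied from Table 1.
test_accuracy = {
    "AvgVisitTime": [71, 75, 70, 67, 68],
    "SwitchCat": [58, 58, 64, 59, 65],
    "Re-visit": [62, 56, 55, 64, 63],
    "AP": [74, 74, 75, 71, 67],
}

# Average test accuracy of each method over the five trials.
averages = {m: round(sum(v) / len(v), 1) for m, v in test_accuracy.items()}
best_on_average = max(averages, key=averages.get)

# Method with the highest test accuracy in each individual trial.
trial_winners = [
    max(test_accuracy, key=lambda m: test_accuracy[m][t]) for t in range(5)
]

print(averages)
print(best_on_average, trial_winners)
```

AP has the highest average, while AvgVisitTime scores higher in the 2nd and 5th trials.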

Conclusions

As the B2C market grows, web stores focus more on their relationships with customers. Customers’ purchase intention has traditionally been an important research topic, so distinguishing customers who buy from those who do not is a major issue. In this article, we confirm that data adjacency based on graph theory is a useful approach in Internet applications as well.

Our model uses EDAM (extended data adjacency matrix) to produce the AP (adjacency points) of a session. The results indicate that our model performs better on average than the other methods. Therefore, predicting whether a customer will buy with EDAM and AP can help practitioners in their work, and it can also help researchers develop their studies.

There are some limitations in our study. First, our model predicts better than the others overall, but its accuracy is lower in some individual predictions. Second, to generalize our model, the appropriate data size and the types of business to which it applies need to be examined.