Some Simple Data Mining on Project 2 with SPSS

Some Simple Data Mining on Project 2 with SPSS

Project Idea from MRKT 350 (courtesy Dwayne Ball)

Some simple data mining on project 1 with SPSS

This describes some simple analyses you can do with SPSS to explore the differences between people in your test sample that ordered and people that didn’t.

  1. DOWNLOAD the SPSS data file

There are two data files (Master Customer File1 and 2) in the “Marketing 350” folder of the public “P” drive. Simply double-click on either of them and the data will be opened in SPSS.

  1. Bring up the data window in SPSS

Now that a dataset is on your hard drive, double-click on it in order to open the data window, which will show you the data window. Examine each variable and look at the individual data points in order to understand what each variable is and what each value means.

  1. Look at the frequencies or distributions of various variables.

Click on the Analyze pull-down menu, and go to Descriptive Statistics, then to Descriptives. When the dialogue box comes up, select all the variables into the “variables” list. This will give you some basic statistics – mean, standard deviation, minimum, and maximum. Note that the mean of a binary variable coded as either 0 or 1 is simply the proportion with a response of 1.

When you specify an analysis, the output window will automatically open, showing the results of your analyses. As you do more analyses, the output is added to the bottom of the output window. You can save this output, and even copy tables from an SPSS output window into MS Word.

  1. Explore relationships with the response variable.

Again click on the Analyze pull-down menu, then on Compare Means, and then on Means. Try looking at all the variables (other than CUSTID) averages against the promotional response, WINETAST.

Select all the variables except CUSTID and WINETAST and put them in the “Dependent variables” box. You can do this by clicking on the first variable you want, then holding down the shift key and clicking on the last variable you want, highlighting all of them. Then click the arrow next to the dependent variables box. Put WINETAST in the independent variables box, and click OK.

As with a lot of SPSS procedures, the first table that comes up is a “case processing summary” to check and make sure that there were no problems with the analysis. Just make sure that it appears that all 10,000 cases were processed for each variable chosen, except for the questionnaire items, which will have in the neighborhood of 1,000 cases.

Now, look at the second table, which will be much too long to fit on the page, but you can use the bottom scroll bar to see the rest of it. The first row represents the people who did not respond to the invitation, the second row represents the people who did respond, and the last row is the total. When a variable, like KIDHOME is binary (0,1), the mean in the box for the first row, for example, represents the fraction of people that did not accept the invitation who have a child under 13 at home, and the number in the box below represents the fraction of people who DID accept the invitation who have a child at home. It is the contrast between these two numbers – the mean in the first row and the mean in the second row - that is important in generating ideas about what differentiates those who accept the invitation from those who don’t.

Look for big differences. For example, if the mean household income for people who did not accept the invitation was $80,000, and the mean income for those that did accept the invitation was $120,000, and the standard deviations for those two means were around $15,000, that means that the difference between the two groups is almost 3 standard deviations. That’s a big difference. On the other hand, if the mean difference was only between $85,000 and $80,000, that would only be 1/3 of a standard deviation – not very promising as a variable to separate the accepters from the non-accepters.

After you’ve gone through the whole table – ignoring the questionnaire items, since only 10% of the sample has those variables – you will have a set of hypotheses about what separates the accepters from the non-accepters.

  1. Create the variable that defines the target market (those to be invited), or “The Logic.”:

Use the “Compute” alternative in the Transform pull-down menu. This allows you to create a new variable out of other variables. You want to create a variable called “INVITE” to indicate which people you’d like to inviteto the wine tastings – in other words, the target market. Let’s suppose you’d like to invite people who have spent at least $100 with you in terms of REVLSTYR (revenue last year), but that you’d not like to invite people who are lower income (INCGRP1=1 or INCGRP2=1). Here’s what you would do.

First, click on the Transform pull-down menu, then on compute. In the target variable box, type the name of your new variable, such as “INVITE.” Then, in the “numeric expression box” type (or select from the symbols in the box below) =0. Then, press the OK button. After a few seconds, this will create a new variable called “INVITE,” for which everyone has a value of 0. Check the far right side of the data in “Data View” to make sure you have the new variable and everyone has a zero value for it. Now, you can set about altering that value for selected people.

Again, go to the Transform pull-down menu and select “compute.” You will notice that your previous definition is still there. Wipe out the ‘=0’ in the numeric expression box, and substitute =1. That means, INVITE will =1 for everyone, unless you specify only certain people. To specify only those people who fit your criteria will have a value of 1 forINVITE, hit the “If…” button. At the top of the dialogue box that opens, select “include if case satisfies condition.” Then, type in your logical statement, making sure you get the variable names correct (variable names can be sent over to the box with a mouseclick). In this case, your logical statement in the box would be:

REVLSTYR >100 and INCGRP1=0 and INCGRP2=0

Or, an equivalent way to say it (there are many Boolean expressions that will result in the same target market) would be:

REVLSTYR >100 and (INCGRP3=1 or INCGRP4=1 or INCGRP5=1 or INCGRP6=1)

You can use all sorts of Boolean operators:

and or > < >= <= ( ) ~ (not equal to)

and many more, many of which can be found in the dialogue box that opens to create your “if” statement.

I recommend that you open up a word processor or spreadsheet and copy this logical statement onto a page for later reference. In fact, a good place to keep track of this is on the spreadsheet that estimates total profit for your logic for the entire customer base. When you estimate the profit for this logic, store that in a nearby cell of the spreadsheet so you can keep track of the logical statements you have used and how successful you were.

Now click on “continue,” and when the main dialogue box appears, click on “OK.” You will be asked if you want to change the existing variable – and you do – so click yes. The computer will think for a moment, and then the new values of “INVITE” will appear. Check a few people to make sure that INVITEis being computed correctly. Now, you have a value of INVITE that indicates who you would like to inviteto the wine tasting in the larger database, and you can check your test datasets to see how effective and profitable that promotion is likely to be.

  1. Estimating the profit from the promotion to the entire remaining customer base, given your logic

When you have decided on a possible logic for whom to invite promotions to, you can now crosstabulate the variable indicating who receives the promotion (“INVITE”) with whether or not they responded to it in the test data set (‘WINETAST”). Then, you can estimate how profitable or unprofitable your promotion would be in the real world. This will be done using a spreadsheet discussed in class.

To do this, you must go to the ANALYZE pull-down menu, and select DESCRIPTIVE STATISTICS, CROSSTABS. The dialog box that appears allows you to choose a row variable and a column variable. Select WINETAST as the row variable and INVITE as the column variable, and then click the “cells” button in the upper right of the dialog box. Under “percentages” select “column” and then “Continue.” Then click “OK,” and you will produce a table like this:

Did customer attend (and pay for) wine tasting party as part of program test? * INVITE Crosstabulation
INVITE
0 / 1 / Total
Did customer attend (and pay for) wine tasting party as part of program test? / no / Count / 6454 / 2743 / 9197
% within INVITE / 94.9% / 85.7% / 92.0%
yes / Count / 347 / 456 / 803
% within INVITE / 5.1% / 14.3% / 8.0%
Total / Count / 6801 / 3199 / 10000
% within INVITE / 100.0% / 100.0% / 100.0%

Now, you will note that the number of people you would invite is 3199 in this table, or 31.99% of the population. That percentage goes in the spreadsheet for calculating profit from a promotion.

You will also note that, of the 3199 people you would have invited by this logic, 456 of them, or 14.3%, would have responded by accepting the invitation. This is the response rate. This percentage also goes in the spreadsheet to calculate profit.

When you put those two numbers, 31.99% and 14.3%, into the spreadsheet at the correct points, you will see that the calculated dollar profit is $9,589. Not at all impressive. I’m sure you can do better.

  1. Try different logic until you are satisfied or decide to stop.

Now, suppose you want to change the logic? IMPORTANT POINT! Go back to transform – compute and set the value of INVITE to zero, and click the If button, and set the “include all cases” condition. Click OK, and OK to change existing variable, and then make sure that everyone in the data set now has a value of zero for INVITE. Only then can you return to trying new logic.

Keep trying different logical statements until you are happy with the amount of profit you have made, or at least have used up your patience with this problem and have decided to stop.

  1. Generate a profit calculation spreadsheet for the project 1 report.

At that point, to estimate what you will get in the entire remaining customer database, apply your logic to Master Customer File 2. Just create the INVITE variable in that dataset, according to your logic, run the crosstabulation, and see what the profit is on that data set. It will be somewhat different than what you got in dataset 1, but probably (not always) a better estimate of what will happen if the promotion is “rolled out” to the entire remaining customer base that is selected by the logic.

  1. Send me your logic and target market percentage to check

For a check, send me an e-mail with the logic and the percent of the target market that would be INVITEd by that logic, and I’ll tell you if I get something close to that when I apply the logic to the entire remaining dataset. Of course, I cannot tell you what your response rate or profit will be, just as the keeper of your customer data base in real life could not tell you what the results of the promotion would be until after the results were in. However, he or she could tell you if your logic actually INVITEs the fraction of the database that you think it does. That’s all the check you’d get in real life. If I get just about the same target market size that you do, at least there are no obvious errors in what you’ve done.

  1. Write and turn in project 1

After you hear from me that your target market seems to be the correct size, take a deep breath and write your report, which is essentially a proposal to management to spend some money to run this promotion on a certain target market, and a promise that a certain amount of profit is likely. Turn in your project and keep your fingers crossed!