Transcript of Andreas Weigend

Data Mining and E-Business: The Social Data Revolution

Stanford University, Dept. of Statistics

Andreas Weigend

Data Mining and Electronic Business: The Social Data Revolution

STATS 252

April 20, 2009

Class 3 - Data: (1 of 2)

Next Transcript: (Part 2 of 2)

Andreas: Welcome to class number three. I want to start by giving you a picture of what’s coming. Reid Hoffman agreed to come in to class number six. As some of you might know, Reid was Co-Founder of PayPal. Prior to that, he was an undergrad student and a grad student, co-term, here at CSLI. He ran the speaker series in 1989-1990. He now runs a company called LinkedIn, which is a professional networking site, basically the world standard. He’s going to come to class with a guy who runs data mining at LinkedIn, for class six, which is May 11th. If any of you are not on LinkedIn, familiarize yourself a little bit with what’s happening. Keep the following question in mind.

We talked about the economics of data. The problem with LinkedIn is that the people who are eager to make contact tend to be the ones who don’t have that much to offer; for example, they want to get a job. On the other hand, the people who have their plates full and have no time are the ones everyone wants to reach. There is this intrinsic asymmetry: those who have the time may not be the desirable ones, and those who don’t have the time are the ones people want to reach. How do you think LinkedIn could address this fundamental asymmetry, this imbalance?

Student: I think having the notion of currency – you are interested in people being happy….

Andreas: The problem is that the people who actually have the money are the ones who build shields around themselves, and those who don’t have the money can’t afford to get the currency. It has to be a currency where we introduce some artificial scarcity, like having one golden bullet message you could send per week. We could have something like people ranking, not in a discriminatory way, but where people could actually show their reputation: some kind of reputation system. The question we would discuss then is what should we surface? For instance, if you are a super reliable guy, when somebody hits you up you always respond within a day. Maybe we should actually show that information about you, so people know he might not have the biggest influence in the universe, but at least he’s going to respond to us.
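[Editor’s sketch, not part of the spoken lecture] The two ideas above, an artificially scarce "golden bullet" message and a surfaced responsiveness signal, can be illustrated in a few lines of Python. Everything here is an assumption made for illustration: the function names, the one-per-week quota, the one-day window, and the shape of the message history are hypothetical and do not describe how LinkedIn works.

```python
from datetime import datetime, timedelta

def response_rate_within(messages, window=timedelta(days=1)):
    """Fraction of received messages that were answered within `window`.

    `messages` is a list of (received_at, replied_at) pairs; replied_at is
    None when the message was never answered. (Hypothetical data shape.)
    """
    if not messages:
        return None  # no history, so there is nothing to surface
    on_time = sum(
        1 for received, replied in messages
        if replied is not None and replied - received <= window
    )
    return on_time / len(messages)

def can_send_priority(sent_times, now, quota=1, period=timedelta(weeks=1)):
    """Artificial scarcity: allow at most `quota` priority messages per `period`."""
    recent = [t for t in sent_times if now - t < period]
    return len(recent) < quota

# Made-up example history for one member.
base = datetime(2009, 4, 20, 9, 0)
history = [
    (base, base + timedelta(hours=3)),   # answered the same day
    (base, base + timedelta(days=2)),    # answered, but too late
    (base, None),                        # never answered
    (base, base + timedelta(hours=23)),  # answered just inside a day
]
print(response_rate_within(history))
print(can_send_priority([base - timedelta(days=2)], base))
```

The design point is that the scarce resource is assigned per sender rather than bought, so it does not favor whoever has more money.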

I’m very happy and grateful to Reid. He’s a super busy guy who has also invested in a whole bunch of startups, but he’s going to give us three hours of his time. I am very much looking forward to that.

The week afterwards, we’ll have the CEO of Twine, Nova Spivack, coming. He confirmed, as well. Until then, we are among ourselves. Maybe we will get [0:02:51.9 Ida Marosen], from Facebook, to show up for the next class or the class afterwards, to help us a little bit with our metrics.

0:03:03.6 In terms of homework assignments, I will talk about homework in the second half of class today. I will put things in perspective here, which is further down. The main worry you had was about not getting much traffic to your website for assignment two, part B; don’t worry about that. The point is to show you what is simple and what is difficult to get to in terms of data. Don’t worry if you don’t have more than you and your girlfriend visiting your site twice a day; the purpose is really to make it clear to you what is easily gotten and what is very hard to get. That was the philosophy behind the second homework set.

It was not meant to be a very hard homework set. The first one was what you can get by simply understanding what Google [0:03:46.4 unclear] provides you with. The second one was how easy it is to set something up and what you can measure: HTTP, referrer, and so forth. The third one was how you use two off-the-shelf things, Yahoo Pipes and Craigslist, to find places you want to live or whatever your heart desires. Don’t worry too much about homework.

Homework 1, we will discuss after break, today. There is a lot of feedback. I love some of the things I saw from you. I’m actually very excited about it. Before I show you that excitement, let’s talk about what we’re doing in the first half of class.

The first half consists of three parts. First, about five minutes on the warm-up thought experiment that I announced at the end of the last class. Second, a conversation about data mining in e-business: what are good data mining problems for e-business? What are their characteristics? In the last part before the break, I’m going to talk about one problem in more detail. That will be one action people and companies often take, namely, figuring out who they should give resources to. The traditional statistical term for that is “customer lifetime value”. I will construct with you what customer lifetime value is for, and then deconstruct the current term to reconstruct one that takes into account the network we have between people, plus the historical component, where we sometimes say we move from transaction economics to relationship economics. That’s the plan for today.

Brian Knutson, a friend of mine who used to be a grad student with me in Psych and is now a professor, does these wonderful fMRI experiments. You show somebody a stimulus and within seconds you measure what gets activated, the oxygen content in the brain, and the rest is history; the rest is mechanics. If they like Pepsi over Coke, at that moment, they will always buy Pepsi over Coke.

I want to ask you: suppose you had devices, like I am wired up here with all kinds of devices, which recorded everything conceivably possible right now, even if somebody walks by and you decide to turn your head toward that person, or if your concentration slacks off. What I am saying here is, if you could measure those things, how would your behavior change, if at all? Say we put a set of stuff around you; it might be heavy but we won’t worry about that. If we recorded it all and backed it up in some salt mine, how would your behavior change?

0:06:34.9

  • Case one is if only you have access, maybe with some sign of life, like a retina that is still alive or something, some blood pulsing or DNA samples, whatever it might need to show that it’s really you, alive.
  • Case two is after your death, maybe other people in your will have been given access to this data.
  • Case three is it would be password protected. You could give Enrique the password and he can check out who makes your head turn.
  • Case four would be that law enforcement agencies and other friends would be able to get into those data.

How would your actions change if everything you did was recorded? What would you do differently? Would it not change? Maybe you’re so used to having everything recorded that it makes no difference.

Student: I would think I would do things less spontaneously.

Andreas: For how long, for a day, two days?

Student: No, it might be … what’s being recorded….

Andreas: Okay, what do other people think? Would there be things you wouldn’t be doing or doing differently? Would you take shorter showers, longer showers?

Student: I would probably be aware of things the first few days, but then I think I would adapt to it, just like having to wear eyeglasses. Probably once a day or once a week, I would look at the parameters that they record… have this device alert me while I’m performing them so I can actually change my behavior.

Andreas: So, self insight would be something of interest to you, knowing that it’s not only seven minutes a day that you spend on Facebook, but actually seven hours a day.

Student: Yeah, I would essentially have this – it’s actually a program I’ve started writing. I would have this thing telling me to stop, or jotting this down. If it were recording my daily actions and it were to actually analyze my daily actions and not just web pages that I view, I would also want it…. For example, let’s say that I don’t want to talk in a loud voice in public places. It would detect that I’m talking very loudly and I’m getting too loud; it would actually give me input and tell me to lower my voice or something like that.

0:09:40.3

Andreas: I heard that in Israel they now have banned phones at funerals. They thought it was not the right thing for people to be talking very loudly there. Is this true?

Student: I’m very sensitive to people talking too loudly, I guess, because I’m from Israel.

Student: I think it would also depend on who can see what action you take. If you had control over who sees a particular action, for example, you don’t want your boss to know that you did something stupid on Facebook…. If you could control those permissions at a fine-grained level, you wouldn’t have to lose that spontaneity.

Andreas: What about if it got indexed? I think one dimension that makes a huge difference is not just whether it’s recorded or not, but whether, with a few keystrokes, you can find and align all those moments where you did something. Is that where the main value or danger lies? How do you see the value of recording versus the value of indexing and making it searchable?

Student: Indexing that is available, is that what you mean?

Andreas: If you did video indexing or [0:11:05.6 unclear] indexing, knowing when that part of your brain is firing. It would be interesting, whenever certain people walked by…

Student: I think if two out of a hundred people do it versus ninety out of a hundred people do it, they create different results. Two people doing it … increase awareness, attention, intention, so that it becomes the isolated… in that case. Whereas, if many people out of hundreds start doing it, that becomes the norm. There is less attention… habits within that crowd….

Andreas: That’s an interesting one. In Germany, in the mid-1980s, [0:11:54.0 unclear] era, there was a big movement against passports that could be read by machine. People felt that if our passports or ID cards could be machine read, then that would give the police too much power compared to if they had to manually write down name after name.

Now, the German passport has an RFID chip in it, so if I just walk through a shopping mall and I don’t have it protected, other people can read out my fingerprint. In the last twenty or thirty years, things have really changed a lot in terms of what people find acceptable. 9/11 has also had its part in that, as has the percentage of people doing it and creating norms, de facto norms, which people then adhere to.

Student: I think indexing makes it a lot easier to run ad hoc… show me every time such and such happens. If it’s not indexed, assuming you’re still able to… that will require you to say, “What are the really interesting things I am curious about in my life,” and I think that could work to decrease abuse of data mining, where you can’t make… periods…. You could say, “I want to know how I am spending my time,” and it would shift queries towards interesting things better.
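[Editor’s sketch, not part of the spoken lecture] The distinction discussed above between data that is merely recorded and data that is indexed can be made concrete with a toy example. The event labels and timestamps below are invented for illustration; the point is only that without an index every question costs a full pass over the recording, while with an index "show me every time such and such happens" becomes a single lookup.

```python
from collections import defaultdict

# Hypothetical recorded events: (timestamp in seconds, label).
log = [
    (0, "facebook"),
    (40, "head_turn"),
    (95, "facebook"),
    (130, "loud_voice"),
    (200, "head_turn"),
]

def scan(log, label):
    """Recorded but not indexed: each query walks the entire log."""
    return [t for t, event in log if event == label]

def build_index(log):
    """Indexed: one pass up front, after which a query is a dictionary lookup."""
    index = defaultdict(list)
    for t, event in log:
        index[event].append(t)
    return index

index = build_index(log)
print(scan(log, "head_turn"))      # full scan of the recording
print(index.get("head_turn", []))  # same answer from the index
```

Both calls return the same timestamps; what changes is how cheap it is to ask, which is where the value and the potential for abuse both come from.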

0:13:22.3

Andreas: Who of you would be interested in the company in San Francisco that I mentioned before, called Fitbit? I have no contact there myself, but it’s not hard to get one. Who would be interested in hitting them up to see if they are willing to give all of us a $99 device that we would be willing to carry around for a couple of weeks, to see what sort of data on us we can draw out of it? It’s not a class assignment. I could do it myself, but if one of you were willing to take it… Would you want to do it together? Talk to them, make the contact, and I will be happy to show up and go with you if that helps. If it hurts, you can do it yourself.

For them, they have eighty some-odd smart students helping to debug their device before it’s released into the market. It’s a great deal for them. For us, it would be interesting to have a device to collect stuff. It’s not easy to find a lot of information about them on the web, so I think the right way is to try to contact them. If you need help, I’ll help you. If you don’t need help, maybe you could come with fifty to eighty devices in the next class and that would be great.

That was the warm-up exercise. Note that nowhere in this discussion did any business aspect come up. I was curious about that. Nobody said somebody would be able to provide services to me, somebody would be able to sell me stuff, somebody would be able to find me matches, or to suggest people I should be talking to. That was interesting; the e-business aspect, the money opportunities, were not primarily on your mind here. It was more the worry about having it indexed, who would see it, and what would happen to it after we die.

I now want to spend the next ten minutes or so talking with you about some of the key traditional data mining problems that we have in e-business. One problem that everybody always mentions is the problem of recommendations. The way I want to phrase these problems is: what is scarce, and what is abundant? Recommender systems have the property that your attention is scarce, and a company such as Amazon.com or Netflix is trying to give you something in return for your attention, something you will eventually buy or find useful.

We will have a class in about four weeks, two weeks after Reid, where I will talk solely about recommender systems. In the first class, I said recommender systems account for between 20% and, in some cases, 50% of the revenues of e-business companies. It is a super important ingredient. I am advising a couple of companies in that space.

0:16:26.1 What other problems can you think of, besides recommender systems in the traditional sense of recommending products to people? What other data mining problems are there? Put Jeff Bezos’ hat on. Jeff Bezos has three hats. One hat is the guy who sells books, i.e. he has a retail store. His second hat is the guy who has a platform that enables others to sell stuff. The third hat is an amazing technologist who just changed the world by providing cloud computing and so forth. Right now, just take the third of Jeff that runs an e-business company, Amazon.com as most of you know it, as retail customers.

When I was Chief Scientist at Amazon.com, what data mining problems do you think I would have been grappling with?

Student: A system that gives actions that are relevant to a user. Instead of recommending something, just having … for instance, buying tickets for this week’s …

Andreas: Cross-selling, basically. Say you always buy certain mp3s and the artist is in your neighborhood. Amazon has to know what your neighborhood is; otherwise, how would they be sending physical items to you? That artist is going to play, so, “How about a 5% discount on the tickets?” It’s interesting; apparently more money is made on merchandise sold in relation to concerts than on the tickets, I’ve heard.

What other problems? Think about it: if you were to run Amazon, you would get all the data in the universe, every click, every call to the call center.

Student: …

Andreas: In [0:18:32.8 Double E] here… you know very well the distinction between prediction and control. Prediction is a key ingredient: I say this is going to happen. But the real money is being made by taking actions in response to the prediction. Stephen Boyd has this wonderful example. He teaches [0:18:59.9 unclear] information. He says, “What are the most expensive double-integer… ever computed?” Any idea? The most expensive, like dollars per bit ever computed.

Student: …

Andreas: That’s in the right direction, what other ideas do you have? Think more commercial than NASA. That’s the right direction.

Student: GPS

Andreas: No, it is, according to Steve…

Student: …

Andreas: Nothing compared to his example, which is the coefficients for the Airbus controller. There are about a hundred parameters and he makes the argument that virtually hundreds of millions of dollars went into each of these coefficients because if you get those wrong, the cost function is pretty high. Some numbers take a lot of money to be computed. That is only the controller aspect.

The point I want to make is: prediction is good; control is better. If you can predict that something is happening but you don’t know what action to take, you are not in as good a shape as you would be if you actually knew what action to take in order to influence what the person is doing.
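[Editor’s sketch, not part of the spoken lecture] One way to read "prediction is good, control is better" is that a prediction only becomes valuable once it is attached to a choice among actions. The sketch below assumes a hypothetical model that predicts a purchase probability under each candidate action, and then picks the action with the highest expected profit. The action names, probabilities, margins, and costs are all made up for illustration and do not come from the lecture or from Amazon.

```python
def expected_profit(p_buy, margin, cost):
    """Expected profit of an action: purchase probability times margin, minus the action's cost."""
    return p_buy * margin - cost

# Hypothetical actions: name -> (predicted purchase probability under the
# action, margin in dollars if the customer buys, cost of taking the action).
actions = {
    "do_nothing": (0.02, 20.0, 0.0),
    "send_email": (0.03, 20.0, 0.05),
    "offer_5pct_discount": (0.06, 19.0, 0.05),
}

def best_action(actions):
    """Control step: choose the action that maximizes expected profit."""
    return max(actions, key=lambda name: expected_profit(*actions[name]))

for name, params in actions.items():
    print(name, round(expected_profit(*params), 2))
print("chosen:", best_action(actions))
```

The prediction alone (a 2% chance of a purchase if nothing is done) changes nothing; the value comes from comparing what is predicted to happen under each action and then acting on the comparison.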