Andreas Weigend

Social Data Revolution (SDR), INFO 290A-03

UC Berkeley, School of Information, Fall 2014

Class 2 – October 7, 2014

Andreas: Welcome to the Social Data Revolution, class 2. Today we're looking at the social graph. In the last class we looked at different data sources, data sources that characterize the individual. Now, we're going to look at data sources that characterize the relationship between individuals.

In network language, we call those edges or arcs. If Michael and I have a connection, then that is a property of both him and me. It could be a confirmed connection, where I send him a friend request and he confirms it. Or it could be a one-sided connection, where I'm just following him.

Strength can matter. There could be strong connections, somebody that you message every day, dozens of times; or weak connections, people you maybe met just once and with whom you don't have much of an ongoing relationship.

You can explicitly show those connections, or you can infer connections. For instance, if Simon and I are in a picture together and somebody else tags us, we might not be friends, but we can infer from that, in the phygital world, in that world which is physical and digital, that we were at the same party at the same time. Otherwise we wouldn't be in the picture, unless somebody is really good at Photoshop.
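As a rough illustration of the edge types described above (confirmed, one-sided, weighted by strength, and inferred from a shared photo), here is a minimal Python sketch. The class names, the `kind` labels, and the weight values are assumptions made for illustration; they are not taken from any real network's data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str      # "confirmed", "follow", or "inferred" (hypothetical labels)
    weight: float  # hypothetical strength, e.g. messages per day


class SocialGraph:
    def __init__(self):
        # source -> {target: Edge}
        self._out = defaultdict(dict)

    def add_follow(self, follower, followee, weight=1.0):
        # One-sided connection: only the follower -> followee direction exists.
        self._out[follower][followee] = Edge(follower, followee, "follow", weight)

    def add_confirmed(self, a, b, weight=1.0):
        # Confirmed connection: stored in both directions.
        self._out[a][b] = Edge(a, b, "confirmed", weight)
        self._out[b][a] = Edge(b, a, "confirmed", weight)

    def add_photo_tags(self, tagged_people, weight=0.1):
        # Inferred connection: everyone tagged in the same picture was
        # presumably in the same place. Existing explicit edges are kept.
        people = sorted(set(tagged_people))
        for i, a in enumerate(people):
            for b in people[i + 1:]:
                self._out[a].setdefault(b, Edge(a, b, "inferred", weight))
                self._out[b].setdefault(a, Edge(b, a, "inferred", weight))

    def edge(self, a, b):
        # Returns the Edge from a to b, or None if there is none.
        return self._out.get(a, {}).get(b)


graph = SocialGraph()
graph.add_follow("Andreas", "Michael")
graph.add_photo_tags(["Simon", "Andreas"])
```

In this sketch a follow exists in one direction only, while a photo tag creates a weak edge in both directions without overwriting a stronger explicit edge.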

There are other graphs, for instance the taste graph, which captures the things I'm interested in. But for us here, when we're talking about the social graph, we're talking about connections between people.

There are a number of ways of shedding light onto the social graph. For me, there's one social graph in the world. Different companies like LinkedIn, WeChat, and Facebook shine different lights onto the social graph. For example, if you're interested in the social graph in China, you do better by looking at the WeChat graph than the Facebook graph, because Facebook is blocked in China.

If you're interested in a professional graph, you do better looking at the LinkedIn graph than at the Facebook graph. But my view is there's one graph and there are different ways of looking at it.

Once we have that graph, then we are ready to dive into what we're focusing on in today's class, namely LinkedIn. In the last class, we looked at the different focus of analysis we could have. We could say we can look at an item, we can look at the manufacturer of an item, we can look at the basket -- the bunch of things people put into a basket. We could look at the store, we could look at the cash register. We can look at the truck.

Last week we decided the item we wanted to look at was the individual consumer. Similarly here, there are a number of different matrices we can form, based on the items we have. What would be possible item sets for talking about LinkedIn?

It could be the individual and their job. Or it could be the job description. We could look at all the people that have a job called data scientist. Or it could be the country, or a company. There are different levels of analysis, different focuses of analysis, for LinkedIn.

There are some questions we want to ask, and in order to set the stage, I want to remind you, as a quick summary, of last class, when we talked about collaborative filtering. Remember I had this matrix here, and I said here is (indiscernible) and that cell gets incremented when a person buys both item I and item J in the same session.

Now, I invite you to think about what else it might mean. Here, it could be that a person is at company I and then goes to company J. Then the same machinery of projecting out applies: given that I, what are the paths? Where could they be going? What are the possible J's, properly normalized? The exact same machinery applies here. That would be how people travel across companies.
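The matrix idea above can be sketched in a few lines of Python: count moves from company I to company J, then normalize row I so the possible J's sum to one. The function names and the toy career paths (companies "A", "B", "C") are invented for illustration and do not describe real data.

```python
from collections import Counter, defaultdict


def transition_counts(careers):
    """Cell [i][j] is incremented each time a person moves from i to j."""
    counts = defaultdict(Counter)
    for path in careers:
        for src, dst in zip(path, path[1:]):
            if src != dst:
                counts[src][dst] += 1
    return counts


def next_company_probabilities(counts, company):
    """Row-normalize: given company i, how likely is each next company j?"""
    row = counts.get(company, {})
    total = sum(row.values())
    if total == 0:
        return {}
    return {dst: n / total for dst, n in row.items()}


# Toy career paths, one list per person (made-up data).
careers = [
    ["A", "B"],
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "C"],
]
counts = transition_counts(careers)
probs = next_company_probabilities(counts, "A")
```

With these toy paths, two of the three moves out of "A" go to "B", so "B" gets two thirds of the probability mass in that row. The same counting works for items bought in one session; only the meaning of I and J changes.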

Another application is how information travels across the network. Another question: at Google we have the notion of PageRank. PageRank is a property of the page, and it tells us how trustworthy that page is. So in other words, if I have a bunch of pages that contain a certain search term, in which order should we show those pages? PageRank is a property of the page, not of a specific term in the page, but just of the page. That's the only way it's computationally feasible.
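To make concrete the point that the rank is a property of the page and not of a search term, here is a simplified, textbook-style power iteration in Python. It is a sketch under stated assumptions, not Google's production method: the damping factor of 0.85 is the commonly quoted value, the iteration count is arbitrary, and the three-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Each page passes its rank on in equal shares.
                new_rank[t] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Made-up link graph: "a" and "b" both link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
ordering = sorted(ranks, key=ranks.get, reverse=True)
```

Note that no search term appears anywhere in the computation. The scores are computed once from the link structure, and a query only selects which pages to sort by them.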

Facebook has EdgeRank. So now my question is, does LinkedIn have something like people rank? If you're looking for data scientists, is there a way of ranking the data scientists that seem to have the same qualifications? How would you build that?

Or if you're on the other side, if you're a person looking for a job as a data scientist, is there something like a job rank? Is there a way of ranking potential employers? What are the dimensions of that? How would you compute that?

Whether it's Amazon where you have zillions of products in response to certain search terms, whether it's Google where you have zillions of pages in response to certain search terms, or whether it's LinkedIn, we need to use information in order to actually sort things, in order to rank things, in order to show the most important things on top.

I think we are probably best off, rather than me telling you more of the general things, if we now bring in Simon Zheng. Simon, why don't you come to the front, and we'll give you 15-20 minutes to share with us some of your ideas. And my job in that time is to protect him from your questions. You can jot your questions down on the same whiteboard we used before. Then we'll have an exercise where we'll use the two sides of the room, and we'll tell them about that.

Simon, we met only recently, a few months ago at a workshop, because a former student of mine worked for him. He said, "Andreas, you really should get Simon because that guy is so full of insights." Thank you for coming to class, and thank you for sharing your insights with us here.

Simon: Thank you very much. It's a pleasure being here. Dr. Weigend has been inviting me for a long time. I'm not a good student. He assigned homework to me as well, and followed up with me a couple of times. I was the last to submit my answers.

My topic today: I was going to talk about some of the use cases derived from LinkedIn social graph data. But I think we only have 20 minutes. I will give you a couple of cases. We'll jump to the top of what we are trying to solve.

The first question is how many people use LinkedIn. Way higher than I expected. How many people know LinkedIn is making money? How many people know how LinkedIn makes money? Like a funnel, the Christmas tree. Today I'd like to share how LinkedIn uses the social graph and profile data to derive incremental value and provide more to our LinkedIn members and customers.

A very brief introduction, my name is Simon. I've worked for LinkedIn for five years. I started as a data scientist. A long time ago I was a brain surgeon working in a cancer hospital, twelve years ago. That's why Max mentioned the dream. I said that's exactly what we should do, by using data science and the data product. That was twelve years ago. I was in a hospital working in the brain tumor center.

I had a passion for computers and the Internet. At that time there were no social sites, just Internet forums. I think data is still my passion. I left the lab and started in this amazing new technology. I've been very lucky in the Valley, working with a lot of great people. That's briefly about me.

Right now, I lead the LinkedIn data analytics team. I'm a part of the data science group, and we support all the (indiscernible) related areas, including sales and marketing, and engineering operations.

LinkedIn business model briefly, I want to highlight three things. LinkedIn member engagement drives massive amounts of data. From the data we retrieve, innovate, and create new services and solutions for our LinkedIn members and customers. It encourages another round of engagement that grows. Then we have better data, more solutions, more product, and better customer experience. That's what we call a healthy loop.

You can see here, data is one of the most important elements for LinkedIn's business as of today. I will keep these slides about (indiscernible) support and data science.

The reason I want to highlight this: my feeling is that in the last five or six years the data volume has been increasing significantly. For example, how many people know what ERP stands for? Great, it stands for Enterprise Resource Planning. SAP has a very successful business based on this. That data is more like transactional information. For example, let's say we go to Amazon and shop online; we have a lot of transactional information. How many items you buy, what's the price, when, who. Traditionally this is measured in megabytes.

Beyond that we have CRM data, customer relationship management data. It's more like marketing-related customer profiles, location, demographic information. This dataset is measured in gigabytes. Beyond that we have data like PayPal's. I used to work at eBay, seven or eight years ago. I used to support a PayPal account for the eBay payment flow. The volume of data is significant. At a bare minimum, we use terabytes as the unit of measurement.

Stage four we call social data, which means connections, follows, comments, shares, likes, all of these social gestures. You would ask me why social data is so big. Actually, people-to-people connections -- if you measure a LinkedIn connection table, maybe we have 200 billion records in that table. However, compared with the impressions, clicks, or volume of engagement, it's much smaller.

One example of why we say social data is so large: let's assume I don't know any of you. I only know Dr. Weigend, and I know a couple of students here. However, all of us are connected on LinkedIn, Facebook, or somewhere else. So can I use these two students, whose behavior, demographics, and transactions I know, to predict the transactional behavior, engagement behavior, and demographic information of the rest of you? I think we can. We can use only these two gentlemen's engagement behavior to calculate all the rest, just because of the social connections. That gives us a big challenge with the volume.

Another question is connections. Person A knows B, B knows C. Let's say there's a theory that after six degrees of connection you connect with the whole world. Take me, Simon: as of today, I have about 4,000 LinkedIn connections. My third-degree connections reach out to 80 million LinkedIn members. As of today we only have 340 million LinkedIn members. You can see that with my third degree, I cover almost 25% of the people on LinkedIn.
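Counting how many members sit at each degree of separation is a breadth-first search. The Python sketch below runs it on a tiny made-up network and then redoes the arithmetic from the figures quoted above (80 million out of 340 million). The function name and the toy connections are assumptions for illustration; at the scale of the real graph this would need a very different implementation.

```python
from collections import deque


def reach_by_degree(connections, start, max_degree=3):
    """Count people at distance 1, 2, ..., max_degree from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in connections.get(person, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[person] + 1
                queue.append(neighbor)

    counts = {d: 0 for d in range(1, max_degree + 1)}
    for d in distance.values():
        if d > 0:
            counts[d] += 1
    return counts


# Toy network (made-up data); "f" is four steps away and is not counted.
connections = {
    "simon": ["a", "b"],
    "a": ["simon", "c"],
    "b": ["simon", "c", "d"],
    "c": ["a", "b", "e"],
    "d": ["b"],
    "e": ["c", "f"],
    "f": ["e"],
}
reach = reach_by_degree(connections, "simon")

# Share of members within three degrees, using the figures from the talk.
share = 80_000_000 / 340_000_000
```

Here `share` comes out at roughly 0.235, which matches the "almost 25%" in the talk.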

The volume is significant because of the social gestures and the growth of the graph. Before we talk about the use cases of social data, let me share a bit of the basic, core concept of LinkedIn Analytics today. That's why our team can grow from one person five years ago, myself, into here in less than (indiscernible) people. Hopefully it hasn't grown too fast.

We must set up a very solid foundation from the beginning. The beginning is to understand the business. If we support marketing, we should understand marketing. If we support sales, we should understand sales. We should understand our members and customers, and how they use LinkedIn. That's the fundamental beginning for every single data scientist.

Start from there. Next, we need to fix the tracking. We need to make sure we track the data in the right way. This is more important than just storing random data and starting to analyze it.

Number three, the third step: we need to understand how the data is deployed in the data systems, Teradata, data warehousing, ETL flows. Above that, we run analysis, and then we build reports, what they call BI, business intelligence. Then we have deep analytics, including machine learning, statistics, and strategic-level analysis, the McKinsey, consulting type of business analytics. Above that, we (indiscernible) insights, and then we take action by making decisions.

However, the pyramid here is extremely large and slow. I worked for several companies in the past. Most data scientists spend most of their time on the bottom of the pyramid. That's how we started at LinkedIn. 95% of the data scientists' time was used on the bottom stuff, nothing fancy, no social graph, just a mass of data, summarizing some basic information and reporting it. This took us 95% of the time.

How did we change this? We decided to use technology to keep collapsing the bottom of the pyramid, to make this like a diamond shape. We keep doing this time after time. May I ask a question? What's the (indiscernible) the triangle versus the diamond? You can say 100 is the triangle or pyramid. How large is the diamond area?

100% right, we collapse this many times. Why do we need to do that, keep collapsing the big pyramid? Because we want to scale. We want to leverage data science and (indiscernible) impact a lot of people's lives. For example, the first year when I joined LinkedIn, the first year I worked (indiscernible) Reid Hoffman (indiscernible), all those great people. (Indiscernible) to 500 different types of products, including building models, building reports, answering questions, and also creating the first LinkedIn map based on social graph models. 500 of them, working day and night.

However, I support 200 people directly at the moment. You can see if you use 500 (indiscernible) 200, we only answered two questions on average per employee per year. That's not data driven. We decided we'd go to this model. We want to encourage every single employee in LinkedIn to use data to help them make better decisions. That's the change, why we transformed the pyramid into a diamond.

We shrink the bottom tier and move our resources, our time spent, to the top of the pyramid. The higher we go, the greater the value we provide. The first one I want to share with you -- maybe I need to jump here -- is very business driven. Think about LinkedIn: it has two most valuable data assets. Number one is the members' profiles, almost like a resume. Number two is our social graph, the social network. How can we use these two data sources to make money? I'll give one example. By the way, LinkedIn makes 80% of its revenue from enterprises and services, like B2B sales.

Sales people ask a simple question. In LinkedIn database we have five million companies. Which company should I sell this to, how much money (indiscernible) pay LinkedIn every year? LinkedIn is a sales model, annualized subscription. You pay every year to renew. Which company, how much will they buy?

Number two, who will make the decision of buying this service? Number three, how can we reach out? If a lot of people randomly call people, it doesn't make sense. No one would pick up the phone. Number four, which sales person should call this decision maker? Number five, what story should LinkedIn tell this decision maker to buy their service?

Before, answering these five questions typically took us two to three months. Today, we have transformed this all into a button. We use all the LinkedIn data to find out how many (indiscernible) they have. LinkedIn's major business is recruiting, the talent business. How much attrition do they have? How many recruiters do they have? How many decision makers do they have? All of this -- how many -- is very basic, based on the LinkedIn members' profiles.

We use this to build a model to calculate the dollar value for each of the companies. Then we compute at a member level what's the likelihood this person, with the right title, right network, and right timing to buy a LinkedIn product. What we found, it's not always the VP of HR that will buy a LinkedIn product. We found that actually the recruiting managers, the head of sourcing who has no budgeting power on this, who uses LinkedIn quite a lot and will be very well connected on LinkedIn, they have the highest likelihood of buying. We found this secret. Then we started targeting those users, by providing our sales team a very brief list.

How do we reach out? Our internal sales team analyzes the social network: who has the most influence on and connection to this person? I remember checking the wiki page Dr. Weigend built online. Dr. (indiscernible) over you guys, so how do you measure influence? That's what we're doing on LinkedIn. We calculate the influence of our indirect connections on the person we are trying to connect with to do business.

One more point, how about distance between people? Obama is extremely influential. When he says something, would I do it right now? How about my wife calling and saying you need to come home, I have something urgent. I would come home now. The distance between people.
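One way to picture this point is to discount a person's overall influence by how far away they are in the graph. The Python sketch below is purely illustrative: the decay formula, the `decay` value, and the influence numbers are invented for this example and are not LinkedIn's actual model.

```python
def effective_influence(global_influence, distance, decay=0.1):
    """Discount influence for each extra step of distance in the graph.

    distance=1 means a direct connection, which is not discounted.
    The exponential decay and its rate are hypothetical choices.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return global_influence * decay ** (distance - 1)


# Made-up numbers: a very influential public figure four steps away,
# versus a spouse who is a direct connection.
public_figure = effective_influence(global_influence=100.0, distance=4)
spouse = effective_influence(global_influence=1.0, distance=1)
```

Under these made-up numbers the spouse's effective influence is higher than the public figure's, even though the public figure's overall influence is a hundred times larger. That is the trade-off between influence and distance described above.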