Lecture 3 - CSCI494 - Anatomy of Search Engine (Coding a Basic Crawler)

January 26th, 2012 admin No comments

CSCI494 Lect. 3 Jan. 25 2012 (Slides)

Last lecture (1/25/12) we discussed the elements of a search engine including crawlers, spiders, indexers, repositories, lexicons, ranking modules, and query processors. We also talked about fetching URLs and reviewed initial code for a crawler for assignment 2.
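The fetch-and-follow loop at the heart of a basic crawler can be sketched roughly as follows. This is a minimal illustration and not the assignment 2 code: the names `LinkParser`, `extract_links`, `fetch`, and `crawl` are hypothetical, and the sketch ignores robots.txt, politeness delays, and content types, all of which a real crawler needs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (fragments removed) for the links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def fetch(url):
    """Download a page as text (hypothetical helper, no robots.txt handling)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from seed; returns {url: [outgoing links]}."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid revisiting
    graph = {}
    while frontier and len(graph) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        links = extract_links(url, html)
        graph[url] = links
        for link in links:
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return graph
```

Passing a different `fetch_page` function makes it possible to exercise the crawl loop against canned pages without touching the network. The returned link graph is the kind of structure a ranking module would consume later.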

Here are the slides:
Anatomy of A Search Engine. Assignment 2 Building a Basic Crawler.

Next week we will derive “The Google Matrix” formula.


It’s Nonsense to Claim that Google + is the Fastest Growing Social Network in History.

January 19th, 2012 admin No comments

Summary
Google + announced that it has hit 90 million users. By setting reasonable “initial conditions” we can show that there is not sufficient evidence to conclude that Google + is “the fastest growing social network in history”. If we use Facebook’s published historical growth rate as a simple predictive model for true “viral adoption rates”, we would expect Google + to announce that its user base is at approximately 165 million users by the end of March 2012. Data for Google + adoption is skewed by a forced-usage approach, free advertising, and an initial seeding from a large database that was not available to other established social networks when they launched. Until we have more official data points there is no evidence that a viral (exponential) adoption rate, comparable to Facebook’s, is really taking place. Data from Google Insights for Facebook and Google + show large discrepancies.

Google’s Announcement
Last week Google announced their fourth quarter earnings. Included in their quarterly report was a statement about Google + adoption rates. Larry Page, CEO of Google, said: “I am super excited about the growth of Android, Gmail, and Google+, which now has 90 million users globally – well over double what I announced just three months ago.”

So, in other words, the user count doubled over the period since Google last reported a number. Google + was announced ~8 months ago. Since then, there has been a lot of discussion about how many people have signed up to use this new service. We are seeing a lot of public articles using headlines with phrases like “explosive growth” and “the fastest growing social network in history”. While the numbers are large, the comments refer to absolute numbers with no reference to any expected growth rate. Absolute numbers by themselves are not very interesting. We need to look at numbers relative to something else to get meaning. Google has now published two official data points. To test the claim (hype) that Google + is the fastest growing social network, we can compare data points where the “critical masses” of the user bases are the same. We now have a few valid historical models against which to compare the viral growth rates of social networks. In addition, the initial hype surrounding the announcement of Google + has worn off.

The initial claims that Google + was “the fastest growing network” are nonsense for a couple of reasons. The first reason is that Google + started with a “seed set” of subscribers who are willing to sign up for any new Google product. The conversion rate from users of Google products to any new product is high and can be approximated from past product launch adoption rates. Facebook and Twitter did not have this advantage when they launched. The second reason is that, until now, we have only seen a doubling in the number of users in the last six months. If true viral adoption of Google + is occurring then we would expect to see growth of an exponential form similar to 2^t. The difficult question is “When can we reasonably expect exponential growth to form?” Google also has other factors, like forced adoption, that are skewing the data.

Exponential Curves and Social Networks
The beautiful thing about social networks is their ability to create exponential curves (similar to 2^t). It is difficult to produce exponential curves in other types of marketing. The effect is really caused by one simple mechanism: the connections that your connections have, and the fact that those connections can observe the interactions. Without that mechanism, growth generally looks more linear, as marketing in other media usually does. Of course the type of curve we can generate depends on many variables, including how long a brand or product has been in existence. For example, you will not typically see exponential curves in a market where the product is a commodity. In the case of Google + we are talking about the viral adoption of “the next greatest social network”. Yes, we have been told that it is now “not a social network” and that it is “so much more”. This may be the long-term intent. However, in order for it to be what “it is now not”, and to provide the kind of personalized search Google wants to provide, people have to provide social data to this network on a consistent basis. So we need to call it what it is. Google + is a social network with some neat features.

Thus far we have heard a lot of big numbers but little discussion about the shape of the graphs or numbers relative to something else. Shown below is a graph of the estimated growth rate of Google + over the last six months. Paul Allen lists his estimated statistics but there is no analysis of what this data means. There are really only two confirmed data points. The red line, below, represents linear growth. From this graph we can’t make a claim that Google + is going to grow faster than Facebook. We also can’t claim that it will grow at the exponential rate we would expect for true viral adoption. A forced adoption rate would start to appear more linear and begin to slow down.
[Figure: estimated growth of Google + with a linear trend line]

The only exponential curve I can find is the relative search interest in Google + (below), and it has moved in the wrong direction. It is not what we see for Facebook.
[Figure: search interest trend for Google +]

Notice that search volume for Facebook grew in proportion to the adoption rate and only recently leveled off (see below). The search volume has also never fallen off exponentially. Twitter shows an interest graph similar to Facebook’s.
[Figure: search interest trend for Facebook]

If we take a look at adoption rates on Facebook and Twitter we see exponential curves. This means that the growth rate depends on the current value of the function or the “system”. The rate of growth is not a constant. For a social network this means that as the number of users increases we expect the adoption rate to grow, because more people are using it. This will happen if the viral mechanisms are working correctly. This new announcement from Google + is only our second data point.
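The difference between the two growth shapes can be made concrete with a small sketch. The function names below are hypothetical and the numbers in the usage note are illustrative, not official figures. Under linear growth a fixed number of users is added each month. Under exponential growth the user base multiplies by a fixed factor, which is the same as saying the growth rate is proportional to the current size.

```python
import math


def linear_growth(n0, users_per_month, months):
    """Users after `months` when a constant number is added each month."""
    return n0 + users_per_month * months


def exponential_growth(n0, doubling_months, months):
    """Users after `months` when the base doubles every `doubling_months`."""
    return n0 * 2 ** (months / doubling_months)


def doubling_time(n_start, n_end, months):
    """Doubling time implied by two data points `months` apart,
    assuming (and this is only an assumption) exponential growth between them."""
    return months * math.log(2) / math.log(n_end / n_start)
```

For example, a base that goes from 45 to 90 (in millions) over three months has an implied doubling time of three months, but a straight line adding 15 per month passes through the same two points. The two models only separate at a third data point: six months on, the line gives 180 while the exponential gives 360. That is why two data points cannot settle the question.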

Shown below is a plot I created of Facebook’s and Google +’s growth since launch. The red graph is Facebook’s growth since launch. The blue graph shows Google +’s growth superimposed on Facebook’s graph at the point where Facebook hit a user base of 90 million. The time axis is in months. The growth for Google + was shifted to the right in order to line up with Facebook’s growth after it hit 90 million users. This is to ensure the curves have similar “initial conditions”.

[Figure: Facebook growth since launch (red) with Google + growth shifted to the 90 million user mark (blue); time axis in months]

At this point we need to see the next announcement from Google at the end of Q1 2012 (March). If we are going to claim that Google + is growing at a faster rate than Facebook we would expect Google + to hit ~165 million users by the end of March 2012. Until we have more official data points there is no evidence that a viral adoption, comparable to Facebook, is really taking place.
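As a rough sanity check on what that target implies, the sketch below computes the compound monthly growth factor needed to move from 90 million to 165 million users. The interval of about 2.4 months (mid-January to the end of March) is my assumption, and the function names are hypothetical. This is not the calculation used to produce the 165 million estimate, which comes from Facebook’s historical curve.

```python
def required_monthly_factor(current, target, months):
    """Constant monthly multiplier that takes `current` to `target` in `months`."""
    return (target / current) ** (1.0 / months)


def project(current, monthly_factor, months):
    """User count after compounding `monthly_factor` for `months`."""
    return current * monthly_factor ** months


# Assumed interval: roughly 2.4 months between the announcement and end of March.
factor = required_monthly_factor(90.0, 165.0, 2.4)
projected = project(90.0, factor, 2.4)  # recovers the 165 million target
```

Under these assumptions the required factor works out to somewhere between 1.25 and 1.32 per month, meaning sustained month-over-month growth of well over 25%.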

Categories: Social Media, Uncategorized Tags:

Computer Science 494 Search Engines & Social Networks MSU- Spring 2012

January 12th, 2012 admin No comments

I was invited by MSU to teach CSCI 494-01 as an adjunct professor. The course is a senior-level course on search engines & social networks for computer science majors. Here is a description of the course and the topics to be covered:

Syllabus- CSCI 494 01 Spring 2012 - Search Engines and Social Networks

Description:
This course will cover important topics related to search engines, social networks, information retrieval, and data science. Students will study papers, patents, and algorithms written by search engineers and computer scientists. At the end of the course students will understand the algorithms and technology behind modern search engines and social networks.

Prerequisites
Calculus, completion of at least one course in a programming language (Java/C++), and HTML/CSS.

Date   Lecture Topic
1.11   Introduction & Overview
1.18   History of Search
1.25   Overview of Search Marketing
2.01   Organic Search: Search Algorithms
2.08   Information Retrieval
2.15   Patents/Papers: Organic Search
2.22   Papers: Organic Search
2.29   Paid Search Algorithms
3.07   Social Networks & Algorithms
3.14   Spring Break – No Class
3.21   Social Networks & Algorithms
       Paper titles for presentation due (Groups of three)
3.28   Analytics, Metrics, and Data Science
4.04   Semantic Web Technology (TBD)
4.11   Presentations 20 min TBD
4.18   Presentations 20 min TBD
4.25   Presentations 20 min TBD

Textbook
• Required papers and patents to be handed out during lecture.

Course Outcomes
• Understand algorithms behind modern search engines.
• Understand paid search marketing algorithms.
• Understand the history of search engines.
• Understand important patents related to search engines.
• Be prepared for basic interview questions from companies that develop search engines and/or social networks.
• Discover potential Thesis Topics for Graduate School

Grading
25% of a student’s grade will come from attendance of regular lectures and attendance of the final presentations and 75% will come from the student’s final presentations and papers.
Final presentations will be graded as follows:
50% on the group’s 20-minute final presentation
50% on the individual’s 1-2 page paper

At the end of the semester, grades will be determined based on your class average as follows:
• 93+: A
• 90+: A-
• 87+: B+
• 83+: B
• 80+: B-
• 77+: C+
• 73+: C
• 70+: C-
• 67+: D+
• 63+: D
• 60+: D-
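The grading arithmetic can be sketched as below, assuming each component is scored from 0 to 100. The function names are hypothetical, and treating anything under 60 as an F is my assumption, since the scale above stops at D-.

```python
# Letter-grade cutoffs from the list above, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]


def class_average(attendance, presentation, paper):
    """25% attendance, 75% final work (half group presentation, half paper)."""
    final_work = 0.5 * presentation + 0.5 * paper
    return 0.25 * attendance + 0.75 * final_work


def letter_grade(average):
    """Map a class average to the first cutoff it meets."""
    for cutoff, letter in CUTOFFS:
        if average >= cutoff:
            return letter
    return "F"  # assumption: the syllabus does not list a grade below D-
```

A student with full attendance, a 90 on the group presentation, and an 80 on the paper would average 88.75, which falls in the B+ band.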

Categories: Uncategorized Tags:

Steps to Google +’s failure & Why do a limited release on Google +?

July 19th, 2011 admin No comments

My shortest post ever. It’s also to document (for fun) my prediction, from day one, that Google + will fail, an argument I am having with a few friends. The first 20-30 million users are expected. It’s not a “revolution”. Getting to 100 million is the challenge. I predict this will not happen based on conversion rates from past product launches and Google’s available user base. Here are the steps:

1. Launch a “limited release” (see below). Get 20-30 million users because this is a reasonable conversion rate from the existing Google products user base to this new network (the people willing to try it out). Adoption appears to be fast because the people willing to try out new Google products fall in this expected range.

2. Google announces business accounts. All businesses sign up because it’s free and they were told it’s a good idea. Because it’s “news”, SEOs report the usual “we are seeing businesses in search results from Google +”. More businesses sign up.

3. Google ends up with another version of Google local with a very limited amount of social interaction.
People do not want to talk to businesses. It’s called a social network not a commercial network.

4. Subtract out all business accounts from the network. Mass adoption does not occur because no one was asking for new features inside Facebook. Google + slowly dies off. By the end of 2011 no one is really using it as a social network. People do not mass adopt because of new “features”.

5. The rate of growth by individual users dramatically slows from end of 2011 into 2012. Google fails to hit 100 million individual user accounts. The share rate slows and is unimpressive.

6. The usual social media blogs pronounce it a failure. Google changes focus again.

The power of branding wins again. Don’t underestimate it.

So why do a “limited release” on things like Google +? Not because they are “testing and don’t want to grow too fast”. There is a bootstrapping issue with social networks. You log in... *No one is there*. Then you go back because you were one of the special “privileged first users” (pawns). You keep going back to check if anyone else is there because you were told “the others are coming”. Otherwise you go in once and you never return, because no one else uses it. The network might even pick up free advertising from its pawns along the way. There are many ways to seed a network and this is just one technique.

Categories: Social Media Tags:

How I Reverse Engineered Klout Score to an ~ R2 = 0.94

June 27th, 2011 admin No comments

Disclaimer: This is my personal blog and the opinions presented here are my own. I founded SEMJ.org, write for Semanticweb.com, and I am the president of Future Farm Inc. The opinions expressed here are not the opinions of the companies I currently represent.

Summary: The formulas I present here showed an R^2 value of 0.94 with 99 data points. A perfectly fit curve would have an R^2 value of 1.0. This means that the variables shown in these equations account for 94% of the variance you would see in a Klout score formula. This project was initially an independent attempt to measure influence on social networks. It turns out the formulas presented here have extremely high correlation with calculations of “influence” published on Klout’s public interface. About 94% of the variation in Klout score can also be estimated using log(retweets) or log(followers). Since Klout score is a “made up” number the absolute value of these equations does not matter. We do not need to match a Klout score value even though one could with great accuracy. What matters is the relative difference between scores. The intent of this analysis is not to determine if influence scores like Klout are useful or not. The intent is to show that, in their current state, they can be reduced to simple formulas. The equations shown here are sufficient.

Definition of R2:
“R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R2 of 1.0 indicates that the regression line perfectly fits the data.”
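For readers who want to compute this statistic themselves, here is a minimal sketch in plain Python (the function name `r_squared` is my own). It uses the standard definition: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values.

```python
def r_squared(observed, predicted):
    """Coefficient of determination for paired observed/predicted values."""
    mean_obs = sum(observed) / len(observed)
    # Variation left unexplained by the model.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A model that predicts every point exactly scores 1.0, while a model that always predicts the mean of the observations scores 0.0.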

Social media metrics are a heavily debated and evolving topic. An evolution of social media analytics is expected, since an estimated 40% of data being pushed onto the web is coming from social networks and scientists are flooding into the field because, as Peter Norvig said, “This is where the data is”. Within the community, and at conferences, there is a lot of discussion on social media metrics. However, published work has failed to gain consensus. Of course, some agencies have their own methods for measurement and they may not wish to publish their techniques. Not all techniques will be published because, after all, marketing is business. Any attempts to improve our ability to measure social media efforts, and correlate them to improved business metrics, should come from what I call “fundamental units” based on “things we can measure.” In general, measurements without units are not useful or adopted in science. For example, if I asked you to “run one” you might immediately ask “One what? One mile or one block?” You need the units to make this meaningful.

Arguments have been made against the use of IQ, for example, because IQ has no units. IQ is just a number. Therefore, to many scientists, it is not a credible measurement. If IQ was reported in brains-per-square-inch, something physical, it might be of interest to a physicist. IQ is interesting on the extreme ends of its allowable range. Really low IQ scores and really high IQ scores seem to correlate with behavior we would expect from people in real life. So not all abstract measurements are completely useless. For example, Google’s page rank is reported on a scale from 0-10. Google’s page rank formula, however, is not advertised as a public number for marketers to use. It is used for internal purposes to improve the quality of search results.

Now consider a Klout score. Klout is a proprietary formula that outputs a public number with no units. This has little value to someone working in analytics. A proprietary formula for public use is not going to advance the field of social media metrics. Why? Let’s suppose one tracks Klout scores for their clients and finds strong correlation between Klout scores and other known marketing metrics. If the input variables to the formula are not published, we do not know what to do next. I can’t complete the feedback loop of my measurement efforts because the inputs to the equation, for the Klout score, are not known. If you were to ask people familiar with Klout score “How do you make your Klout score increase?” they would probably respond with “I am not sure.” The likely first guess would probably be followers. Which, as you will see, turns out to be correct.

In 2008, I looked at ways to measure social media and I started looking at “ratios of things” we can measure on networks. Ratios are important because ratios report variables relative to other variables. Metrics should be composed of fundamental units so we can continue to use them across platforms and take them with us. For example, if Twitter does not exist in 10 years we would like to continue to use our metrics to improve our marketing elsewhere. Any attempt to define a fundamental set of social media metrics will have to include ratios of things we can gather from networks. In 2008, we started looking at metrics like [followers/following] and combinations of other interesting ratios. When I first saw Klout scores being discussed, my first thought was “a Klout score is what I would call an ego metric.” Ego metrics are really only useful to the person checking the value. Klout.com has strong marketing on the home page to promote an ego metric feel. It is full of pictures of people with their Klout scores posted above them.

Recently, there has been an increased interest in Klout scores. I decided to look into the values being reported and see how well they lined up with a formula we were working on internally for a forthcoming paper. A very simple measure of influence might look like this:

Influence = K1*(followers/following) + K2*(listed/followers)

You could do something interesting like take the log of both terms to get more dynamic range. Initially I wrote the equations out by hand by looking at what types of variables we could plot that make sense for influence. For example, influence being proportional to followers divided by the number of people a user is following makes intuitive sense. I decided to run some regression testing on a basic version of the formulas and known Klout scores. Using an open source statistical program called “R,” and Matlab, we ran some regression tests. Later, while looking for data sets, we ran into a post by Alex Braunstein and I decided to use his data for efficiency. I noticed that @klout’s actual Klout score, as reported, was coming in high in our models so we took it out of the data set. In these tests we are working with 99 points. Klout reports that they use machine learning and a host of other variables to determine a Klout score. Klout may be using machine learning, but if the equation reduces to something where 94% of the variation in the scores is covered by a simple equation then there is really no need for all the 2nd and 3rd order effects. Klout score is also quantized into 100 bins so we are losing information anyway.
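The simple ratio-based measure, with the optional log of both terms, might be sketched like this. The function name, the default weights of 1.0, and the natural-log base are placeholders of mine, not fitted or published values.

```python
import math


def ratio_influence(followers, following, listed, k1=1.0, k2=1.0, use_log=False):
    """K1*(followers/following) + K2*(listed/followers).

    All three counts must be positive: the ratios divide by `following`
    and `followers`, and the log variant also needs `listed` > 0.
    """
    follow_ratio = followers / following
    listed_ratio = listed / followers
    if use_log:
        # Taking logs compresses very large ratios (more dynamic range).
        follow_ratio = math.log(follow_ratio)
        listed_ratio = math.log(listed_ratio)
    return k1 * follow_ratio + k2 * listed_ratio
```

With the default weights, an account with 1,000 followers, 100 following, and 50 list memberships gets 10 + 0.05 = 10.05, so the first term dominates unless K2 is scaled up.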

I will skip all statistical details and show the final plot for my formula, compared to Klout’s reported scores. The Y-axis is the observed value (Klout score) and the X-axis is our formula’s value. From this graph you can see that our R^2 value is about 0.939. This means that approximately 94% of the variation in Klout score is accounted for by this simple formula. If my formula were perfect you would see all the points line up on the plotted line through the middle of Fig. 1.1.

Figure 1.1 – Observed score (Klout score) vs. predicted score (calculated score). Ideally all points would lie on the line in the middle going from 0 to 100.

If you want a formula that measures influence and gives you directional information about your social networks you can use one of the two formulas reported below. Eq. 1.1 has an R^2 value of about 0.75 and Eq. 1.2 has an R^2 value of about 0.94. I will just call these basic formulas “Influence”. Equation 1.1 does not follow Klout as closely as 1.2. However, this does not matter. Equation 1.1 might be “good enough” and give you directional information about influence.

Eq. 1.1 Influence = 16.338 + 4.490*log(TwitterFollowers)

To get estimates with an R^2 value near that shown in Figure 1.1 you can use:

Eq. 1.2 Influence = 23.474 - 0.109*log(TwitterFollowers) + 4.838*log(Retweets)
(The coefficients can be changed to match Klout. However, the absolute value does not matter. What matters is the relative value between two users.)
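Equations 1.1 and 1.2 translate directly into code. The sketch below assumes the natural logarithm, since that is what `log()` means in R by default. The base is not stated above, so treat it as an assumption. The function names are mine.

```python
import math


def influence_eq_1_1(twitter_followers):
    """Eq. 1.1: followers-only estimate (natural log assumed)."""
    return 16.338 + 4.490 * math.log(twitter_followers)


def influence_eq_1_2(twitter_followers, retweets):
    """Eq. 1.2: followers plus retweets estimate (natural log assumed)."""
    return (23.474
            - 0.109 * math.log(twitter_followers)
            + 4.838 * math.log(retweets))


def more_influential(user_a, user_b):
    """Compare two (followers, retweets) pairs by Eq. 1.2.

    Only the relative ordering is meaningful, not the absolute score.
    """
    return influence_eq_1_2(*user_a) > influence_eq_1_2(*user_b)
```

Both counts must be at least 1 for the logarithm to be defined. Because the followers coefficient in Eq. 1.2 is small and negative, the retweets term does almost all of the work when two users are compared.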

It turns out log(retweets) also gives an R^2 value near 0.94. Figure 1.2 shows how accurate using log(retweets) is when estimating Klout score. Retweets are available using Twitter’s API.

Fig. 1.2 - Observed score (Klout score) vs. predicted score (calculated score), using only log(retweets).

An R^2 value of 0.94 is high and of great interest to anyone who has studied statistics. For example, in sociology you might see values of 0.3 to 0.4 and these would be considered “interesting.” In physics you will see high values, like those being reported here, because physicists are usually dealing with “less noise” and they are at a more fundamental level in nature.

I saw one outlier, when analyzing this data set, and it was from a Twitter user named @vicgundotra. I believe the reason his score is off, by only 11, is that he has only Tweeted 44 times. This is the only variable we are not accounting for in our formula. A third term, that accounted for the number of Tweets, would make this curve more accurate. However, there is no need to match the Klout score exactly.

As another approximation for influence, I would propose looking at total followers plus the number of followers a user’s followers have. Thus, measuring a “network effect,” something like Equation 1.3, below, might be of interest and may also correlate with Klout scores:

Eq. 1.3 Influence = K1* log (followers) + K2*log (netfollowing)

Where netfollowing is the total number of people following a user’s followers.

Feel free to verify this work or email me for my data and the R source code. In the end, there is no need to match a Klout score. If the intent is to get information about influence then equations 1.1 or 1.2 work well and give you directional measures of influence. Or you can modify them with other terms like those suggested in equation 1.3.
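The netfollowing term in equation 1.3 could be computed from a follower graph along these lines. This is one possible reading: it sums the follower counts of each follower, so people who follow several of a user’s followers are counted more than once. The graph layout, the function names, the default weights, and the +1 inside each log (added so that zero counts do not break the logarithm) are all assumptions of this sketch.

```python
import math


def net_following(user, followers_of):
    """Sum of follower counts over everyone who follows `user`.

    `followers_of` maps each account name to the set of accounts following it.
    """
    return sum(len(followers_of.get(f, ())) for f in followers_of.get(user, ()))


def influence_eq_1_3(user, followers_of, k1=1.0, k2=1.0):
    """K1*log(followers) + K2*log(netfollowing), with +1 smoothing (assumed)."""
    followers = len(followers_of.get(user, ()))
    net = net_following(user, followers_of)
    return k1 * math.log(followers + 1) + k2 * math.log(net + 1)


# Toy graph, made up purely for illustration.
toy_graph = {
    "a": {"b", "c"},
    "b": {"c", "d", "e"},
    "c": {"a"},
}
```

In the toy graph, user "a" has two followers ("b" and "c") who themselves have three followers and one follower, giving a netfollowing of 4.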

If equations reduce to something much simpler, like those presented here, then there is no need to calculate the secondary effects of other variables. The resolution of the system is from 1-100 so there appears to be no practical value in more complicated formulas using this system.

For readers with more advanced statistical knowledge you can do a scatterplot matrix on all the variables reported by Braunstein. Figure 1.3 shows the scatterplot matrix. In this you can see some of the highly linear plots. In fact, log (number_of_retweets) has a variance near that of Figure 1.1.

Fig. 1.3 – Matrix scatter plot of Braunstein’s data using log scales.

Categories: Uncategorized Tags: