How I Reverse Engineered Klout Score to an R^2 of ~0.94
Disclaimer: This is my personal blog and the opinions presented here are my own. I founded SEMJ.org, write for Semanticweb.com, and am the president of Future Farm Inc. The opinions expressed here are not the opinions of the companies I currently represent.
Summary: The formulas I present here showed an R^2 value of 0.94 with 99 data points. A perfectly fit curve would have an R^2 value of 1.0. This means that the variables shown in these equations account for 94% of the variance in Klout score. This project was initially an independent attempt to measure influence on social networks. It turns out the formulas presented here have extremely high correlation with calculations of “influence” published on Klout’s public interface. About 94% of the variation in Klout score can also be estimated using log(retweets) alone; log(followers) alone gives a weaker fit (see Eq. 1.1). Since Klout score is a “made up” number, the absolute value of these equations does not matter. We do not need to match a Klout score value, even though one could with great accuracy. What matters is the relative difference between scores. The intent of this analysis is not to determine whether influence scores like Klout are useful. The intent is to show that, in their current state, they can be reduced to simple formulas. The equations shown here are sufficient.
Definition of R^2:
“R^2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R^2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R^2 of 1.0 indicates that the regression line perfectly fits the data.”
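To make the definition concrete, here is a minimal Python sketch of how R^2 can be computed from observed and predicted values, using the standard form 1 - SS_res/SS_tot. The function name `r_squared` is my own, and the numbers are made up for illustration; they are not from the data set used in this post.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: how far predictions are from the observations.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: how far observations are from their own mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up scores for illustration only.
observed = [40.0, 55.0, 61.0, 72.0, 80.0]
perfect = list(observed)                           # a model that predicts every point exactly
mean_only = [sum(observed) / len(observed)] * 5    # a model that always predicts the mean

print(r_squared(observed, perfect))    # perfect fit
print(r_squared(observed, mean_only))  # no better than guessing the mean
```

A perfect prediction gives 1.0, and a model that only ever predicts the mean gives 0. Everything in this post sits between those two ends.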
Social media metrics are a heavily debated and evolving topic. An evolution of social media analytics is expected, since an estimated 40% of data being pushed onto the web is coming from social networks and scientists are flooding into the field because, as Peter Norvig said, “This is where the data is.” Within the community, and at conferences, there is a lot of discussion on social media metrics. However, published work has failed to gain consensus. Of course, some agencies have their own methods for measurement and may not wish to publish their techniques; after all, marketing is business. Any attempt to improve our ability to measure social media efforts, and correlate them to improved business metrics, should come from what I call “fundamental units” based on “things we can measure.” In general, measurements without units are not useful or adopted in science. For example, if I asked you to “run one” you might immediately ask “One what? One mile or one block?” You need the units to make this meaningful.
Arguments have been made against the use of IQ, for example, because IQ has no units. IQ is just a number. Therefore, to many scientists, it is not a credible measurement. If IQ were reported in brains-per-square-inch, something physical, it might be of interest to a physicist. IQ is interesting on the extreme ends of its allowable range. Really low IQ scores and really high IQ scores seem to correlate with behavior we would expect from people in real life. So not all abstract measurements are completely useless. For example, Google’s PageRank is reported on a scale from 0 to 10. Google’s PageRank formula, however, is not advertised as a public number for marketers to use. It is used for internal purposes to improve the quality of search results.
Now consider a Klout score. Klout is a proprietary formula that outputs a public number with no units. This has little value to someone working in analytics. A proprietary formula for public use is not going to advance the field of social media metrics. Why? Suppose I track Klout scores for my clients and find strong correlation between Klout scores and other known marketing metrics. If the input variables to the formula are not published, I do not know what to do next. I can’t complete the feedback loop of my measurement efforts because the inputs to the Klout score equation are not known. If you were to ask people familiar with Klout score “How do you make your Klout score increase?” they would probably respond with “I am not sure.” The likely first guess would be followers, which, as you will see, turns out to be correct.
In 2008, I looked at ways to measure social media and I started looking at “ratios of things” we can measure on networks. Ratios are important because ratios report variables relative to other variables. Metrics should be composed of fundamental units so we can continue to use them across platforms and take them with us. For example, if Twitter does not exist in 10 years we would like to continue to use our metrics to improve our marketing elsewhere. Any attempt to define a fundamental set of social media metrics will have to include ratios of things we can gather from networks. In 2008, we started looking at metrics like [followers/following] and combinations of other interesting ratios. When I first saw Klout scores being discussed, my first thought was “a Klout score is what I would call an ego metric.” Ego metrics are really only useful to the person checking the value. Klout.com has strong marketing on the home page to promote an ego metric feel. It is full of pictures of people with their Klout scores posted above them.
Recently, there has been an increased interest in Klout scores. I decided to look into the values being reported and see how well they lined up with a formula we were working on internally for a forthcoming paper. A very simple measure of influence might look like this:
Influence = K1*(followers/following) + K2*(listed/followers)
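As a quick illustration of the ratio formula above, here is a small Python sketch. The weights K1 and K2 are not given in this post, so the defaults of 1.0 below are placeholders, and the function name and the account numbers are hypothetical.

```python
def ratio_influence(followers, following, listed, k1=1.0, k2=1.0):
    """Influence = K1*(followers/following) + K2*(listed/followers).

    k1 and k2 are placeholder weights (assumed, not fitted values).
    """
    if followers <= 0 or following <= 0:
        raise ValueError("followers and following must be positive")
    return k1 * (followers / following) + k2 * (listed / followers)


# Two hypothetical accounts with the same follower count.
broadcaster = ratio_influence(followers=1000, following=500, listed=50)
follow_back = ratio_influence(followers=1000, following=2000, listed=5)
print(broadcaster > follow_back)
```

The point of using ratios is visible here: the two accounts have identical follower counts, but the one that follows far fewer people and is listed more often scores higher.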
You could do something interesting like take the log of both terms to get more dynamic range. Initially I wrote the equations out by hand by looking at what types of variables we could plot that make sense for influence. For example, influence being proportional to followers divided by the number of people a user is following makes intuitive sense. I decided to run some regression testing on a basic version of the formulas and known Klout scores. Using an open source statistical program called “R,” and Matlab, we ran some regression tests. Later, while looking for data sets, we ran into a post by Alex Braunstein and I decided to use his data for efficiency. I noticed that @klout’s actual Klout score, as reported, was coming in high in our models, so we took it out of the data set. In these tests we are working with 99 points. Klout reports that they use machine learning and a host of other variables to determine a Klout score. Klout may be using machine learning, but if the equation reduces to something where 94% of the variation in the scores is covered by a simple equation, then there is really no need for all the second- and third-order effects. Klout score is also quantized into 100 bins, so we are losing information anyway.
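The regression itself was run in R and Matlab, and that code is not reproduced here. As a rough Python sketch of the same kind of fit, ordinary least squares of a score on log(followers) can be written as follows. The function name `fit_line` is my own, and the scores are synthetic: they are generated from a known line purely to show that the fit recovers it, and they are not Klout data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data: scores built from a known line, score = 10 + 5*ln(followers).
followers = [100, 1_000, 10_000, 100_000, 1_000_000]
scores = [10.0 + 5.0 * math.log(f) for f in followers]

# Regress the score on the log of the follower count.
intercept, slope = fit_line([math.log(f) for f in followers], scores)
print(round(intercept, 3), round(slope, 3))
```

With real scores the points will not fall exactly on a line, and the leftover scatter is what R^2 measures.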
I will skip all statistical details and show the final plot for my formula, compared to Klout’s reported scores. The Y-axis is the observed value (Klout score) and the X-axis is our formula’s value. From this graph you can see that our R^2 value is about 0.939. This means that approximately 94% of the variance in Klout score is accounted for by this simple formula. If my formula were perfect, you would see all the points line up on the plotted line through the middle of Fig. 1.1.
Figure 1.1 – Observed Score (Klout score) vs. predicted score (calculated score). Ideally all points would lie on the line in the middle going from 0 to 100.
If you want a formula that measures influence and gives you directional information about your social networks, you can use one of the two formulas reported below. Eq. 1.1 has an R^2 value of about 0.75, and Eq. 1.2 has an R^2 value of about 0.94. I will just call these basic formulas “Influence”. Equation 1.1 does not follow Klout as closely as 1.2. However, this does not matter: equation 1.1 might be “good enough” and give you directional information about influence.
Eq. 1.1 Influence = 16.338 + 4.490*log(TwitterFollowers)
To get estimates with an R^2 value near that shown in Figure 1.1, you can use:
Eq. 1.2 Influence = 23.474 - 0.109*log(TwitterFollowers) + 4.838*log(Retweets)
(The coefficients can be changed to match Klout. However, the absolute value does not matter. What matters is the relative value between two users.)
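Equations 1.1 and 1.2 are simple enough to try directly. Here is a short Python sketch using the coefficients exactly as given above. One assumption: the post does not state the base of the logarithm, so the sketch uses the natural log, which is what R's log() computes by default. The function names and the two example accounts are hypothetical.

```python
import math


def influence_eq_1_1(followers):
    """Eq. 1.1: 16.338 + 4.490*log(TwitterFollowers), natural log assumed."""
    if followers <= 0:
        raise ValueError("followers must be positive to take a log")
    return 16.338 + 4.490 * math.log(followers)


def influence_eq_1_2(followers, retweets):
    """Eq. 1.2: 23.474 - 0.109*log(TwitterFollowers) + 4.838*log(Retweets)."""
    if followers <= 0 or retweets <= 0:
        raise ValueError("followers and retweets must be positive to take a log")
    return 23.474 - 0.109 * math.log(followers) + 4.838 * math.log(retweets)


# Two hypothetical accounts; only the relative order is of interest.
quiet = influence_eq_1_2(followers=5_000, retweets=20)
amplified = influence_eq_1_2(followers=5_000, retweets=2_000)
print(amplified > quiet)
```

Note how small the followers coefficient is in Eq. 1.2 compared with the retweets coefficient: once retweets are in the model, they carry almost all of the estimate. Accounts with zero retweets need special handling, since the log of zero is undefined; the sketch simply rejects them.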
It turns out log(retweets) also gives an R^2 value near 0.94. Figure 1.2 shows how accurate using log(retweets) is when estimating Klout score. Retweets are available using Twitter’s API.
Fig. 1.2 - Observed Score (Klout score) vs. predicted score (calculated score). Using only log(retweets)
An R^2 value of 0.94 is high and of great interest to anyone who has studied statistics. For example, in sociology you might see R^2 values of 0.3 to 0.4 and these would be considered “interesting.” In physics you will see high R^2 values, like the one being reported here, because physicists are usually dealing with “less noise” and they are at a more fundamental level in nature.
I saw one outlier, when analyzing this data set, and it was from a Twitter user named @vicgundotra. I believe the reason his score is off, by only 11, is that he has only Tweeted 44 times. This is the only variable we are not accounting for in our formula. A third term, that accounted for the number of Tweets, would make this curve more accurate. However, there is no need to match the Klout score exactly.
As another approximation for influence, I would propose looking at total followers plus the number of followers a user’s followers have. Thus, measuring a “network effect,” something like Equation 1.3, below, might be of interest and may also correlate with Klout scores:
Eq. 1.3 Influence = K1*log(followers) + K2*log(netfollowing)
where netfollowing is the total number of people following a user’s followers.
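To show what netfollowing might look like in practice, here is a hedged Python sketch on a toy follower graph. Several things are assumptions on my part: the graph and user names are invented, K1 and K2 default to 1.0 as placeholders, natural log is used, and netfollowing is computed as a plain sum of follower counts, so a person who follows several of a user's followers is counted more than once. Counting distinct people instead would be an equally reasonable reading of Eq. 1.3.

```python
import math


def net_following(user, followers_of):
    """Sum of the follower counts of each of the user's followers."""
    return sum(len(followers_of.get(f, ())) for f in followers_of.get(user, ()))


def influence_eq_1_3(followers, netfollowing, k1=1.0, k2=1.0):
    """Eq. 1.3: K1*log(followers) + K2*log(netfollowing), placeholder weights."""
    if followers <= 0 or netfollowing <= 0:
        raise ValueError("both counts must be positive to take a log")
    return k1 * math.log(followers) + k2 * math.log(netfollowing)


# Toy graph: maps each user to the set of accounts that follow them.
followers_of = {
    "alice": {"bob", "carol"},
    "bob": {"carol", "dave", "erin"},
    "carol": {"dave"},
}

alice_followers = len(followers_of["alice"])
alice_net = net_following("alice", followers_of)
print(alice_followers, alice_net, influence_eq_1_3(alice_followers, alice_net))
```

In this toy graph alice has two followers, and those two followers have four followers between them, so the second term rewards her for being followed by people who are themselves followed.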
Feel free to verify this work or email me for my data and the R source code. In the end, there is no need to match a Klout score. If the intent is to get information about influence then equations 1.1 or 1.2 work well and give you directional measures of influence. Or you can modify them with other terms like those suggested in equation 1.3.
If equations reduce to something much simpler, like those presented here, then there is no need to calculate the secondary effects of other variables. The resolution of the system is from 1-100 so there appears to be no practical value in more complicated formulas using this system.
For readers with more advanced statistical knowledge, you can do a scatterplot matrix on all the variables reported by Braunstein. Figure 1.3 shows the scatterplot matrix. In this you can see some of the highly linear plots. In fact, log(number_of_retweets) on its own fits Klout score nearly as well as the formula plotted in Figure 1.1.
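A scatterplot matrix is a visual tool, but its numeric counterpart is a table of pairwise Pearson correlations. Here is a minimal Python sketch of that table. The function names are my own, and the three columns are made-up values chosen so that one pair is exactly linear; they are not Braunstein's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns for illustration only.
columns = {
    "log_retweets": [1.0, 2.0, 3.0, 4.0],
    "score": [2.0, 4.0, 6.0, 8.0],          # exactly linear in log_retweets
    "log_followers": [3.0, 1.0, 4.0, 2.0],  # only loosely related
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

For a single predictor, squaring the correlation with the score gives the R^2 of the one-variable linear fit, which is how a scatterplot matrix hints at which variable does most of the work.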