Analysis of SEOmoz’s Published Data on Ranking Correlations, Latent Dirichlet Allocation (LDA), and Spearman’s Correlation Coefficient. Google Engineers are Sleeping Well at Night.
Abstract: In this article we discuss confidence intervals for sampled data sets and analyze the statistics published in a recent SEOmoz article. We conclude that, from this particular data set and this published experiment, nothing can confidently be concluded about the relationship between calculated Latent Dirichlet Allocation (LDA) values and a web document’s ranking position in a commercial search engine’s results. The published data forces us to conclude that, for any particular search query, the SEO practitioner cannot tell whether a web document should have a low or a high LDA value. Using Matlab simulations and the published data, we show that the standard deviation reported for the “mean Spearman’s coefficient” is nearly equal to the mean coefficient derived from the data sets. A scientific presentation of SEOmoz’s results would include, at a minimum, an assumed confidence level and the calculated number of degrees of freedom.
Analysis of Published Data
After discussing the SEOmoz data referenced on Dr. Garcia’s blog, we agree that the published statistical results looked incorrect, and many readers were asking for more information (i.e. the published data). We do not need to see all the data to show the misapplication, but it helps us see whether the distribution is two-tailed, one-tailed, monotonic, etc. Publishing the exact method of calculation is usually needed to verify the results of a measurement. A colleague emailed this post from SEOmoz: a correlation study using Spearman’s coefficient and LDA calculations. The formulas for calculating LDA are found in the paper by Blei et al. (2003). Techniques for gathering sample sets are described in the paper “Distributed Query Sampling: A Quality Conscious Approach.” In this article we focus on the techniques used to analyze the sampled data, not on the chosen sets of data.
The statistical formula used in SEOmoz’s article is Spearman’s rank correlation coefficient. The formula is straightforward, but its application and interpretation can lead to misleading conclusions. Spearman’s coefficient is calculated as:

r’ = 1 − (6 Σ dᵢ²) / (N(N² − 1))

Where r’ is Spearman’s coefficient, N is the size of the sample set, and dᵢ is the difference between the statistical ranks of corresponding variables. At issue are the significance of the published coefficients, the sample sizes, and the validity of reporting a mean of calculated correlation coefficients. It is well known that correlation coefficients are not additive: we cannot normally take samples from DIFFERENT experiments and total the correlation values. In this case the sampling takes place over the SERPs, and the results are generated from a list of “random” search queries. The sampling for each set (each search phrase for which LDA is calculated) in SEOmoz’s latest experiment appears to be based on sample sets of between 6 and 10 results.
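The formula above can be sketched directly in code. This is a minimal illustration using made-up rankings (not SEOmoz’s data), and it assumes no tied values; the function name is our own.

```python
def spearman_no_ties(x, y):
    """Spearman's r' = 1 - 6*sum(d_i^2) / (N*(N^2 - 1)); assumes no ties."""
    n = len(x)
    # rank 1 = smallest value; valid only when all values are distinct
    rx = [sorted(x).index(v) + 1 for v in x]
    ry = [sorted(y).index(v) + 1 for v in y]
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perfectly concordant rankings give +1; perfectly reversed rankings give -1.
print(spearman_no_ties([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(spearman_no_ties([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```

Anything between these extremes indicates a weaker monotonic relationship, and its significance depends on the sample size, as discussed below.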
Examining the published data set, we see a two-tailed distribution because there are both positive and negative values. A strong negative correlation means that an increase in one variable is associated with a decrease in the other. In this case it means that web documents with lower LDA scores are “mostly” ranking above documents with higher LDA scores in the SERPs. The possible outcomes from the numbers that result from Spearman’s coefficients are:
1. No correlation between the variables
2. Positive correlation between the variables (indicated by positive values).
3. Negative correlation between the variables (indicated by negative values).
When we sample a data set we must deal with that data set. We cannot create new data sets, showing new correlations, and add them to previous data sets. Typically, the minimum number of samples needed to compute a Spearman’s coefficient is 4. If we issue a new search query we have created a new data set, and we must analyze the results of that query and determine the correlation on that data set alone. If this were not the case, we would need a mathematical proof using the known properties of Spearman’s coefficient.
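The non-additivity point can be made concrete. In this hypothetical illustration (made-up scores, not SEOmoz data), the average of two per-set coefficients is 0, while the single coefficient computed over the pooled data is strongly negative, so per-set coefficients cannot simply be totaled or averaged:

```python
# Averaging per-query coefficients is not the same as computing one
# coefficient over the pooled data; illustrative values only.

def spearman(x, y):
    # assumes no tied values
    n = len(x)
    rx = [sorted(x).index(v) + 1 for v in x]
    ry = [sorted(y).index(v) + 1 for v in y]
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

set_a = ([0.1, 0.2, 0.3, 0.4], [10, 20, 30, 40])   # r' = +1
set_b = ([0.5, 0.6, 0.7, 0.8], [5, 4, 3, 2])       # r' = -1
avg = (spearman(*set_a) + spearman(*set_b)) / 2
pooled = spearman(set_a[0] + set_b[0], set_a[1] + set_b[1])
print(avg, round(pooled, 3))  # average is 0.0, pooled is about -0.762
```

The two summaries disagree because pooling re-ranks all values jointly, which is a different statistical question from correlating within each query.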
Once we take a sample and calculate Spearman’s coefficient, we then use critical value tables to determine the significance of the sample. See Table 1.0, critical values:
In Table 1.0 (for a two-sided distribution), n is the number of samples taken for the CURRENT set under examination. So if we have 8 samples (i.e. 8 search results), we must have an r’ of .738 (the column under 5%) to be confident that 95 times out of 100 the data occurred because a relationship exists and not because of pure chance. With the same 8 samples, Table 1.0 shows that an even larger r’ (the column under 1%) is required to be confident 99 times out of 100. Notice that with only 4 samples we cannot achieve the 5% confidence level. Most of the sets in the published data appear to have 8 samples. Technically, n in Table 1.0 indicates degrees of freedom.
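Critical values like those in Table 1.0 can be approximated by simulation. The sketch below (our own setup, not SEOmoz’s code) permutes ranks under the null hypothesis of no relationship and reads off the 95th percentile of |r’| for n = 8:

```python
import random

# Monte Carlo estimate of the 5% two-tailed critical value for n = 8.
random.seed(0)
n, trials = 8, 20000
base = list(range(1, n + 1))
abs_coeffs = []
for _ in range(trials):
    perm = random.sample(base, n)  # a random ranking under the null
    d2 = sum((a - b) ** 2 for a, b in zip(base, perm))
    abs_coeffs.append(abs(1 - 6 * d2 / (n * (n ** 2 - 1))))
abs_coeffs.sort()
crit = abs_coeffs[int(0.95 * trials)]
print(round(crit, 3))  # close to the .738 figure cited above
```

Coefficients below this threshold arise by pure chance more than 5% of the time, which is exactly why weak coefficients from 8-sample sets are not significant.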
In the SEOmoz study the sample set for each search query is small; on average most appeared to have 8 or fewer samples. This means (not factoring in degrees of freedom) we need Spearman values around 0.75 to be 95% confident that our measurements did not occur by chance. These numbers tell us only how confident we are that a relationship exists between the two variables. R values around .33 with only 6 to 8 samples indicate very weak relationships. It is up to the observer to define reasonable confidence intervals. When comparing with other papers, one must also examine the number of samples taken in those studies. In the Stanford paper referenced by Garcia (cited as a case for publishing extremely low coefficients), the authors state that 0.12 is a very low correlation, not a high one. The following graph shows that the required “significance level” depends on the number of samples in the data set: as we approach 80 samples, the correlation coefficient required to reach a desired significance level drops.
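The shape of that curve can be sketched with a common large-sample approximation (our sketch, not the article’s method): under the null hypothesis, Spearman’s r’ is roughly normal with standard deviation 1/sqrt(n − 1), so the two-tailed 5% cutoff is about 1.96/sqrt(n − 1). The approximation is loose at small n but shows the trend:

```python
import math

def approx_critical_r(n):
    # two-tailed 5% cutoff under the normal approximation to the null
    return 1.96 / math.sqrt(n - 1)

for n in (6, 8, 10, 20, 40, 80):
    print(n, round(approx_critical_r(n), 3))
# n = 8 gives about 0.74, in line with the .738 table value;
# n = 80 gives about 0.22
```

With only 6 to 8 results per query, coefficients near .33 fall far below the cutoff; only at much larger sample sizes would such values become significant.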
Shown below is a histogram of the published data, showing that the mean and the standard deviation are approximately equal.
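That spread is consistent with chance alone. The simulation below (our own, analogous in spirit to the Matlab runs mentioned in the abstract) draws randomly permuted rankings for n = 8 and measures the standard deviation of the resulting coefficients, which is roughly 1/sqrt(n − 1) ≈ 0.38, the same order as a weak mean coefficient:

```python
import random
import statistics

# Spread of Spearman coefficients produced by pure chance at n = 8.
random.seed(1)
n = 8
base = list(range(1, n + 1))
coeffs = []
for _ in range(10000):
    perm = random.sample(base, n)
    d2 = sum((a - b) ** 2 for a, b in zip(base, perm))
    coeffs.append(1 - 6 * d2 / (n * (n ** 2 - 1)))
print(round(statistics.stdev(coeffs), 2))  # about 0.38
```

When the reported mean coefficient and its standard deviation are of this same size, the data are indistinguishable from noise at the per-query level.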
Using Matlab we created the graph below, which shows the average LDA ranking versus position in the search engine results. The plot uses ~61 different data sets. It shows an obvious skew in LDA values for the number one position in the SERP; the other positions seem unaffected by LDA values.
Looking at the data from the report, we see a specific example: for the search query “dining room sets,” the published Spearman’s coefficient for this sample set of 8 documents is -0.619. Table 1.1, below, shows the published values from the study:
This means that for this particular query, “dining room sets,” documents with lower LDA values (.43 is the LDA value for overstock.com’s site) rank higher than other documents, and we are fairly confident (a Spearman value of -0.619) that, to rank higher on this search phrase, a document should have a lower LDA value rather than a higher one. If we choose to discard this sample, along with all other negative values, we skew the mean of the coefficients toward higher positive values.
After a sample set is taken, Spearman’s value should be calculated on that data set alone. If the value does not fall within the confidence interval, the hypothesis is rejected. In the previously referenced papers, the authors average final results only from sampled data sets that met the required confidence levels. Declining to average in the data sets that produce insignificantly low coefficients and negative correlations is, in effect, an attempt to push the “mean coefficient” positive.
Calculation of Spearman’s Coefficient Using Tied Ranks
If we were to assume that the published SEOmoz data comes from a “single system” (not exactly true, but in effect what averaging all their values does) and that all search queries are treated without particular bias, we would expect to be able to use so-called “tied rank” calculations to arrive at a single correct Spearman’s coefficient for the entire data set. To do this we could assume about 6 samples per search query across the 555 sample sets (not all sets had the same number of data points). Tied-rank calculations require new rank positions: one sums the tied rank positions and divides by the number of tied cases. One would also have to determine the number of degrees of freedom. The formula for Spearman’s coefficient using tied ranks is Pearson’s correlation applied to the tie-averaged ranks:

r’ = Σ(xᵢ − x̄)(yᵢ − ȳ) / sqrt( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

where xᵢ and yᵢ are the tie-averaged ranks and x̄, ȳ are their means.
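The tied-rank procedure can be sketched as follows. The values here are illustrative (hypothetical LDA-like scores with one tie, not the SEOmoz data): ties receive the mean of the rank positions they span, and the coefficient is Pearson’s r computed on those ranks:

```python
def tie_averaged_ranks(values):
    # tied values share the mean of the positions they would occupy
    s = sorted(values)
    # first occupied position is s.index(v) + 1, last is s.index(v) + s.count(v)
    return [(2 * s.index(v) + s.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

lda = [0.43, 0.55, 0.55, 0.61, 0.70]  # hypothetical scores with one tie
pos = [1, 2, 3, 4, 5]                 # SERP positions
r = pearson(tie_averaged_ranks(lda), tie_averaged_ranks(pos))
print(round(r, 3))  # 0.975
```

When there are no ties, this reduces to the 1 − 6Σd²/(N(N² − 1)) formula given earlier.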
If we can treat the outputs of all sampled sets as coming from a “single system,” then we should be able to take all the data and arrive at one coefficient, rather than calculating each set’s Spearman’s coefficient independently.