Archive

Author Archive

Steps to Google +’s failure & Why do a limited release on Google +?

July 19th, 2011 admin No comments

My shortest post ever. It’s also to document (for fun) my prediction, from day one, that Google + will fail, an argument I am having with a few friends. The first 20-30 million users are expected. It’s not a “revolution”. Getting to 100 million is the challenge. I predict this will not happen based on conversion rates from past product launches and Google’s available user base. Here are the steps:

1. Launch a “limited release” (see below). Get 20-30 million users because this is a reasonable conversion rate from the existing Google products user base to this new network (the people willing to try it out). Adoption appears to be fast because the number of people willing to try out new Google products falls in this expected range.

2. Google announces business accounts. All businesses sign up because it’s free and they were told it’s a good idea. Because it’s “news”, SEOs report the usual “we are seeing businesses in search results from Google +”. More businesses sign up.

3. Google ends up with another version of Google local with a very limited amount of social interaction.
People do not want to talk to businesses. It’s called a social network not a commercial network.

4. Subtract out all business accounts from the network. Mass adoption does not occur because no one was asking for new features inside Facebook. Google + slowly dies off. By the end of 2011 no one is really using it as a social network. People do not mass adopt because of new “features”.

5. The rate of growth by individual users dramatically slows from end of 2011 into 2012. Google fails to hit 100 million individual user accounts. The share rate slows and is unimpressive.

6. The usual social media blogs pronounce it a failure. Google changes focus again.

The power of branding wins again. Don’t underestimate it.

So why do a “limited release” on things like Google +? Not because they are “testing and don’t want to grow too fast”. There is a bootstrapping issue with social networks. You log in… *No one is there*. Then you go back because you were one of the special “privileged first users” (pawns). You keep going back to check if anyone else is there because you were told “the others are coming”. Otherwise you go in once and you never return, because no one else uses it. You might even pick up some free advertising from your pawns along the way. There are many ways to seed a network and this is just one technique.

Categories: Social Media Tags:

How I Reverse Engineered Klout Score to an ~ R2 = 0.94

June 27th, 2011 admin No comments

Disclaimer: This is my personal blog and the opinions presented here are my own. I founded SEMJ.org, write for Semanticweb.com, and I am the president of Future Farm Inc. The opinions expressed here are not the opinions of the companies I currently represent.

Summary: The formulas I present here showed an R^2 value of 0.94 with 99 data points. A perfectly fit curve would have an R^2 value of 1.0. This means that the variables shown in these equations account for 94% of the variance you would see in Klout scores. This project was initially an independent attempt to measure influence on social networks. It turns out the formulas presented here have extremely high correlation with calculations of “influence” published on Klout’s public interface. About 94% of the variation in Klout score can also be estimated using log(retweets) or log(followers). Since Klout score is a “made up” number the absolute value of these equations does not matter. We do not need to match a Klout score value even though one could with great accuracy. What matters is the relative difference between scores. The intent of this analysis is not to determine if influence scores like Klout are useful or not. The intent is to show that, in their current state, they can be reduced to simple formulas. The equations shown here are sufficient.

Definition of R2:
“R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R2 of 1.0 indicates that the regression line perfectly fits the data.”
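For reference, the usual textbook form of this statistic can be written as below. The symbols are notation chosen for this sketch and do not come from Klout: y_i is an observed value (here, a reported Klout score), \hat{y}_i is the value predicted by the fitted formula, and \bar{y} is the mean of the observed values.

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

When the predictions match the observations exactly the numerator is zero and R^2 equals 1.0, which is the "perfect fit" case mentioned in the definition.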

Social media metrics are a heavily debated and evolving topic. An evolution of social media analytics is expected since an estimated 40% of data being pushed onto the web is coming from social networks and scientists are flooding into the field because, as Peter Norvig said, “This is where the data is”. Within the community, and at conferences, there is a lot of discussion on social media metrics. However, published work has failed to gain consensus. Of course, some agencies have their own methods for measurement and they may not wish to publish their techniques. Not all techniques will be published because, after all, marketing is business. Any attempts to improve our ability to measure social media efforts, and correlate them to improved business metrics, should come from what I call “fundamental units” based on “things we can measure.” In general, measurements without units are not useful or adopted in science. For example, if I asked you to “run one” you might immediately ask “One what? One mile or one block?” You need the units to make this meaningful.

Arguments have been made against the use of IQ, for example, because IQ has no units. IQ is just a number. Therefore, to many scientists, it is not a credible measurement. If IQ was reported in brains-per-square-inch, something physical, it might be of interest to a physicist. IQ is interesting on the extreme ends of its allowable range. Really low IQ scores and really high IQ scores seem to correlate with behavior we would expect from people in real life. So not all abstract measurements are completely useless. For example, Google’s page rank is reported on a scale from 0-10. Google’s page rank formula, however, is not advertised as a public number for marketers to use. It is used for internal purposes to improve the quality of search results.

Now consider a Klout score. Klout is a proprietary formula that outputs a public number with no units. This has little value to someone working in analytics. A proprietary formula for public use is not going to advance the field of social media metrics. Why? Let’s suppose one tracks Klout scores for their clients and finds strong correlation between Klout scores and other known marketing metrics. If the input variables to the formula are not published, we do not know what to do next. I can’t complete the feedback loop of my measurement efforts because the inputs to the equation, for the Klout score, are not known. If you were to ask people familiar with Klout score “How do you make your Klout score increase?” they would probably respond with “I am not sure.” The likely first guess would be followers, which, as you will see, turns out to be correct.

In 2008, I looked at ways to measure social media and I started looking at “ratios of things” we can measure on networks. Ratios are important because ratios report variables relative to other variables. Metrics should be composed of fundamental units so we can continue to use them across platforms and take them with us. For example, if Twitter does not exist in 10 years we would like to continue to use our metrics to improve our marketing elsewhere. Any attempt to define a fundamental set of social media metrics will have to include ratios of things we can gather from networks. In 2008, we started looking at metrics like [followers/following] and combinations of other interesting ratios. When I first saw Klout scores being discussed, my first thought was “a Klout score is what I would call an ego metric.” Ego metrics are really only useful to the person checking the value. Klout.com has strong marketing on the home page to promote an ego metric feel. It is full of pictures of people with their Klout scores posted above them.

Recently, there has been an increased interest in Klout scores. I decided to look into the values being reported and see how well they lined up with a formula we were working on internally for a forthcoming paper. A very simple measure of influence might look like this:

Influence = K1*(followers/following) + K2*(listed/followers)

You could do something interesting like take the log of both terms to get more dynamic range. Initially I wrote the equations out by hand by looking at what types of variables we could plot that make sense for influence. For example, influence being proportional to followers divided by the number of people a user is following makes intuitive sense. I decided to run some regression testing on a basic version of the formulas and known Klout scores. Using an open source statistical program called “R,” and Matlab, we ran some regression tests. Later, while looking for data sets, we ran into a post by Alex Braunstein and I decided to use his data for efficiency. I noticed that @klout’s actual Klout score, as reported, was coming in high in our models so we took it out of the data set. In these tests we are working with 99 points. Klout reports that they use machine learning and a host of other variables to determine a Klout score. Klout may be using machine learning but if the equation reduces to something where 94% of the variation in the scores is covered by a simple equation then there is really no need for all the 2nd and 3rd order effects. Klout score is also quantized into 100 bins so we are losing information anyway.
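A minimal sketch of the kind of regression test described above, written in Python rather than the R and Matlab actually used. Everything in it is an assumption for illustration: the data are synthetic (not Braunstein's data set), the helper name `fit_line` is hypothetical, and the natural log is assumed.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic stand-in data: 99 made-up users, NOT the data set used in the post.
random.seed(0)
retweets = [random.randint(1, 10000) for _ in range(99)]
scores = [20.0 + 5.0 * math.log(r) + random.gauss(0.0, 2.0) for r in retweets]

# Regress the (synthetic) score on log(retweets), as in the single-variable case.
log_retweets = [math.log(r) for r in retweets]
intercept, slope, r_squared = fit_line(log_retweets, scores)
print(f"score ~ {intercept:.2f} + {slope:.2f}*log(retweets), R^2 = {r_squared:.3f}")
```

With real data one would replace the synthetic lists with the observed counts and reported scores; the two-variable case needs a multiple regression, which R's `lm` handles directly.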

I will skip all statistical details and show the final plot for my formula, compared to Klout’s reported scores. The Y-axis is the observed value (Klout score) and the X-axis is our formula’s value. From this graph you can see that our R^2 value (called simply the “R value” in the rest of this post) is about 0.939. This means that approximately 94% of the variation in your Klout score is accounted for by this simple formula. If my formula were perfect you would see all the points line up on the plotted line through the middle of Fig. 1.1.

Figure 1.1 – Observed Score (Klout score) vs. predicted score (calculated score). Ideally all points would lie on the line in the middle going from 0 to 100.

If you want a formula that measures influence and gives you directional information about your social networks you can use one of the two formulas reported below. Eq. 1.1 has an R value of about 0.75 and Eq. 1.2 has an R value of about 0.94. I will just call these basic formulas “Influence”. Equation 1.1 does not follow Klout as closely as 1.2. However, this does not matter: equation 1.1 might be “good enough” and give you directional information about influence.

Eq. 1.1 Influence = 16.338 + 4.490*log(TwitterFollowers)

To get estimates with an R value near that shown in Figure 1.1 you can use:

Eq. 1.2 Influence = 23.474 - 0.109*log(TwitterFollowers) + 4.838*log(Retweets)
(The coefficients can be changed to match Klout. However, the absolute value does not matter. What matters is the relative value between two users.)
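Equations 1.1 and 1.2 can be sketched as two small Python functions. The coefficients are the ones given above; the function names are hypothetical, the two example users are made up, and the natural log is an assumption (it is what R's `log` computes by default, but the post does not state the base).

```python
import math


def influence_eq_1_1(twitter_followers):
    """Eq. 1.1: Influence = 16.338 + 4.490*log(TwitterFollowers)."""
    return 16.338 + 4.490 * math.log(twitter_followers)


def influence_eq_1_2(twitter_followers, retweets):
    """Eq. 1.2: Influence = 23.474 - 0.109*log(TwitterFollowers) + 4.838*log(Retweets)."""
    return 23.474 - 0.109 * math.log(twitter_followers) + 4.838 * math.log(retweets)


# Two made-up users with the same follower count but different retweet counts.
# Counts must be positive, since log(0) is undefined.
user_a = influence_eq_1_2(twitter_followers=5000, retweets=200)
user_b = influence_eq_1_2(twitter_followers=5000, retweets=2000)

# Only the relative ordering is meaningful, not the absolute numbers.
print(f"user_a={user_a:.1f}, user_b={user_b:.1f}, b ranks higher: {user_b > user_a}")
```

Because the absolute value is arbitrary, the useful output is the comparison between two users rather than either number on its own.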

It turns out log(retweets) also gives an R^2 value near 0.94. Figure 1.2 shows how accurate using log(retweets) is when estimating Klout score. Retweets are available using Twitter’s API.


Fig. 1.2 - Observed Score (Klout score) vs. predicted score (calculated score). Using only log(retweets)
An R value of 0.94 is high and of great interest to anyone that has studied statistics. For example, in sociology you might see R values of 0.3 to 0.4 and these would be considered “interesting.” In physics you will see high R values, like being reported here, because physicists are usually dealing with “less noise” and they are at a more fundamental level in nature.

I saw one outlier, when analyzing this data set, and it was from a Twitter user named @vicgundotra. I believe the reason his score is off, by only 11, is that he has only Tweeted 44 times. This is the only variable we are not accounting for in our formula. A third term, that accounted for the number of Tweets, would make this curve more accurate. However, there is no need to match the Klout score exactly.
As another approximation for influence, I would propose looking at total followers plus the number of followers a user’s followers have. Thus, measuring a “network effect,” something like Equation 1.3, below, might be of interest and may also correlate with Klout scores:

Eq. 1.3 Influence = K1* log (followers) + K2*log (netfollowing)

Where netfollowing is the total number of people following a user’s followers.
Feel free to verify this work or email me for my data and the R source code. In the end, there is no need to match a Klout score. If the intent is to get information about influence then equations 1.1 or 1.2 work well and give you directional measures of influence. Or you can modify them with other terms like those suggested in equation 1.3.

If equations reduce to something much simpler, like those presented here, then there is no need to calculate the secondary effects of other variables. The resolution of the system is from 1-100 so there appears to be no practical value in more complicated formulas using this system.

For readers with more advanced statistical knowledge you can do a scatterplot matrix on all the variables reported by Braunstein. Figure 1.3 shows the scatterplot matrix. In this you can see some of the highly linear plots. In fact, log (number_of_retweets) has a variance near that of Figure 1.1.


Fig. 1.3 – Matrix scatter plot of Braunstein’s data using log scales.

Categories: Uncategorized Tags:

Analysis of SEOmoz’s Published Data on Ranking Correlations, Latent Dirichlet Allocation (LDA), and Spearman’s Correlation Coefficient. Google Engineers are Sleeping Well at Night.

September 7th, 2010 admin 20 comments

Abstract: In this article we discuss confidence intervals from sampled data sets and analyze the published statistics from a recent SEOmoz article. We conclude that from a “particular data set”, in this published experiment, we can confidently conclude nothing about the relationship between calculated Latent Dirichlet Allocation (LDA) values and the ranking of a web document’s position in the commercial search engine’s result. The published data forces us to conclude that for any particular search query the SEO practitioner is not sure if a web document should have a low or high LDA number. Using Matlab simulations and the published data we show that the standard deviation reported for the “mean Spearman’s coefficient” is nearly equal to the average coefficient derived from the data sets. To interpret SEOmoz’s data correctly a scientific presentation of the results would include, at a minimum, an assumed confidence level and the calculated number of degrees of freedom.

Analysis of Published Data
After discussing published data from SEOmoz that was referenced on Dr. Garcia’s blog, we are in agreement that the statistical results published looked incorrect and many were asking for more information (i.e. published data). We do not need to see all the data to show the misapplication, but it is helpful so that we can see if the distribution is two-tailed, one-tailed, monotonic, etc. Publishing the exact method of calculation is usually needed to verify the results of a measurement. A colleague emailed this post from SEOmoz. The post is a correlation study using Spearman’s coefficient and LDA calculations. The formulas for calculating LDA are found in this paper by Blei et al. (2003). Techniques for gathering sample sets are described in the paper: “Distributed Query Sampling: A Quality Conscious Approach.” In this article we will focus on the techniques used to analyze the sampled data and not the chosen sets of data.

The statistical formula used in the SEOmoz article is Spearman’s rank correlation coefficient. The formula is straightforward but the application and interpretation of it can result in misleading conclusions. Spearman’s coefficient is calculated as:

$$ r' = 1 - \frac{6 \sum d^2}{N(N^2 - 1)} $$

Where r’ is Spearman’s coefficient, N is the size of the sample set, and d is the difference between the statistical ranks of corresponding variables. At issue is the significance of the published coefficients, the sample sizes, and the validity of reporting a mean from calculated correlation coefficients. It is well-known that correlation coefficients are not additive, meaning we can’t normally take samples from DIFFERENT experiments and total the correlation values. In this case the sampling that is taking place is from the SERPs and the results are generated from a list of “random” search queries. The sampling for each set (search phrase in which we are calculating LDA), in SEOmoz’s latest experiment, appears to be based on sample sets of between 6 and 10 results.
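As a rough illustration, the standard Spearman formula for untied ranks, r' = 1 - 6*sum(d^2) / (N*(N^2 - 1)) with d the difference between the two ranks of each item, can be sketched in Python as below. The function names are hypothetical and the LDA scores are made-up numbers, not values from the SEOmoz study.

```python
def ranks(values):
    """Rank values from 1 (smallest) to N; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's coefficient for untied data: 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical single query: SERP positions 1..8 and made-up LDA scores.
positions = [1, 2, 3, 4, 5, 6, 7, 8]
lda_scores = [0.43, 0.61, 0.55, 0.70, 0.48, 0.66, 0.74, 0.59]
print(f"r' = {spearman(positions, lda_scores):.3f}")
```

Each search query yields one such coefficient from its own small sample, which is why the sample size per query matters so much in what follows.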

When examining the values in the published data set we see a two-tailed curve because there are both positive and negative values. A strong negative correlation means that an increase in one variable is associated with a decrease in the other variable. In this case it means that web documents with lower LDA scores are “mostly” ranking above documents in SERPs with higher LDA scores. Following are the possible outcomes from the numbers that result from Spearman’s coefficients:

1. No correlation between the variables
2. Positive correlation between the variables (indicated by positive values).
3. Negative correlation between the variables (indicated by negative values).

When we perform a sample of a data set we must deal with that data set. We cannot create new data sets, showing new correlations and add them to previous data sets. Typically, the minimum number of samples needed to create a Spearman’s coefficient is 4. However, if we create a new search query we have created a new data set and we must analyze the results of that query and determine the correlation on that data set. If this is not the case we would need mathematical proof of this using the known properties of Spearman’s coefficients.
Once we take a sample and calculate Spearman’s coefficient, we then use critical value tables to determine the significance of the sample. See Table 1.0, critical values:


Table 1.0

In Table 1.0 (for a two-sided distribution) n is the number of samples we have taken for the CURRENT set we are looking at. So if we have 8 samples (i.e. 8 search results), we must have an r’ of .738 (the column under 5%) to be confident that 95 times out of 100 the data occurred because a relationship exists and not because of pure chance. With the same 8 samples, Table 1.0 also gives the larger r’ needed to be confident 99 times out of 100 that the data occurred because a relationship exists and not because of pure chance. Notice that with only 4 samples we can’t achieve the 5% significance level. Most of the data sets appear to have 8 samples. Technically, n in Table 1.0 is indicating degrees of freedom.
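The table lookup described above can be sketched as follows. Only the one critical value quoted in the text (n = 8 at the 5% level) is included; the function and variable names are hypothetical, and any other sample size would need values from a full critical-value table.

```python
# Two-tailed 5% critical value quoted in the text for n = 8.
# Values for other sample sizes are deliberately not filled in here.
CRITICAL_5_PERCENT = {8: 0.738}


def is_significant(r, n, table=CRITICAL_5_PERCENT):
    """Return True if |r| meets or exceeds the critical value for n samples."""
    if n not in table:
        raise KeyError(f"no critical value available for n={n}")
    return abs(r) >= table[n]


# Illustrative coefficients (not taken from the study).
print(is_significant(0.80, 8))
print(is_significant(-0.50, 8))
```

The absolute value is used because the table is two-tailed: a strongly negative coefficient is as significant as a strongly positive one.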

In the study done by SEOmoz there is a low sample set for each search query. On average most appeared to have 8 or fewer. This means (not factoring in degrees of freedom) we need Spearman values around 0.75 to be 95% confident that our measurements did not occur by chance. These numbers tell us only how confident we are that there is a relationship between the two variables. R values around .33 with only 6 to 8 samples are considered very weak relationships. It is up to the observer to define reasonable confidence intervals. When comparing to other papers one also has to examine the number of samples taken in those studies. In the Stanford paper referenced by Garcia (an example referenced as a case for publishing extremely low coefficients) they are stating that 0.12 is a very low correlation, not a high one. The following graph shows that our required “significance level” depends on the number of samples in the data set. As we approach 80 samples the correlation coefficient required to put us at a desired significance level drops.


Figure 1.1: Published by geographyfieldwork.com

Shown below is a histogram of the published data showing the mean and the standard deviation being about equal.

[Figure: histogram of the published data]

Using Matlab we created the graph below which shows the Average LDA Ranking versus the position in the search engine. This plot uses ~61 different data sets. It shows an obvious skew for LDA values and the number one position in the SERP. The other positions seem unaffected by the LDA values.

[Figure: average LDA ranking versus position in the search engine results]

Looking at the data from the report we see a specific example for the search query “dining room sets”. The published Spearman’s coefficient for this sample set of 8 documents is -0.619. Table 1.1, below, shows the published values from the study:

Table 1.1

This means that for this “particular” query, “dining room sets”, we see documents with lower LDA values (.43 is the LDA value for overstock.com’s site) ranking higher than other documents, and we are fairly confident (Spearman value of -0.619) that, if we want to rank higher on this search phrase, we should have a lower LDA value rather than a higher one. If we choose to discard this sample of data, along with all negative values, we will skew the mean of the coefficients to higher positive values.

After a sample set is taken Spearman’s value should be calculated on the current data set. If the value is not within the confidence interval, the hypothesis is rejected. In previous discussions, the referenced papers are averaging the final results of previously sampled data sets from acceptable data sets that met required confidence levels. Not averaging data sets that produce significantly low coefficients and negative correlations is an attempt to make a “mean coefficient” positive.
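The skew from discarding negative coefficients is easy to see with a toy example. The list below is made up for illustration and is NOT the SEOmoz data; the variable names are hypothetical.

```python
from statistics import mean, stdev

# Made-up per-query Spearman coefficients (one value per search query).
coefficients = [0.62, -0.45, 0.10, 0.33, -0.62, 0.71, -0.12, 0.05, 0.48, -0.30]

# Mean and sample standard deviation over every query.
all_mean = mean(coefficients)
all_std = stdev(coefficients)

# Dropping the negative coefficients before averaging inflates the mean.
positive_only = [c for c in coefficients if c >= 0]
filtered_mean = mean(positive_only)

print(f"all queries:   mean={all_mean:.3f}, std={all_std:.3f}")
print(f"positive only: mean={filtered_mean:.3f}")
```

In this toy set the standard deviation is several times larger than the mean, so the "mean coefficient" by itself says very little about any individual query.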

Calculation of the Spearman’s Coefficient Using Tied Ranks

If we were to assume that we can use the published data from SEOmoz as coming from a “single system” (not exactly true but is in effect what has happened by averaging all their values) and that all search queries are treated without particular bias, we expect to be able to use so-called “tied rank” calculations to arrive at a correct Spearman’s coefficient for the entire data set. To do this we could assume about 6 samples per search query across 555 sample sets (not all sets had the same number of data points). Using “tied rank” calculations we have to calculate new rank positions: one sums the rank positions and divides by the number of tied cases. One would also have to determine the number of degrees of freedom. The formula for Spearman’s coefficient using tied ranks is:

[Equation image not recoverable: Spearman’s coefficient using tied ranks]

If we can treat the outputs of all sampled sets as coming from a “single system” then we should be able to take all the data and arrive at a coefficient before calculating each Spearman’s coefficient independently.
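The rank-averaging step for ties described above (sum the rank positions and divide by the number of tied cases) can be sketched like this. It covers only the assignment of tied ranks, not the full tied-rank coefficient, and the function name `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sum of positions start+1 .. end+1, divided by the number of tied cases.
        shared = sum(range(start + 1, end + 2)) / (end - start + 1)
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


# The two tied values would occupy positions 2 and 3, so each gets 2.5.
print(average_ranks([10, 20, 20, 30]))
```

Pooling all queries into a "single system" would produce many ties of this kind, since every query contributes its own position 1, position 2, and so on.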

Categories: SEO Tags:

Does Your Search Marketing Company Need An R&D Department?

November 15th, 2009 admin No comments

Editing for an SEM research journal reveals a great deal about the industry and what agencies are experiencing. This past year I attended some great conferences and made a few observations about keeping up with change in the industry.

A few general observations regarding the industry in 2009:

1. The rate of change of information that search marketing agencies must keep up with is accelerating.
2. The rate of change of people learning about SEM and then entering this industry is accelerating.
3. The number of companies wanting search marketing services is accelerating.
4. The number of engineering papers, behind the scenes, being published is accelerating.
5. The number of patents being filed in the industry is accelerating.

The combination of the above items will cause further exponential growth of information in the search marketing space. Is it possible, in the near term, that the number of agencies that can provide effective full-service search marketing services may not keep up with the demand coming from the number of companies that will be requesting these services? For established companies it may appear like there are too many search marketing companies and too many newcomers entering the industry. I think there’s plenty of room for effective full-service search marketing agencies. As businesses become more educated through books, conferences, and online information they naturally will begin to filter out the good from the bad and find the appropriate company for their business. So education is a positive factor within any market. It removes the inefficiencies of doing business because good companies do not have to work as hard to land new clients. Most likely your clients have already spoken with others or done some reading before they arrive at yours.

To keep up with these changes search marketing agencies should begin to plan and budget for an R&D person or team on the front end of their company going forward. A search marketing agency has a difficult task to perform. It needs to look at a moving target, which it has no control over (the WWW), and break it up into tasks the company can perform for its clients and show real progress to them based on agreed upon, high-level key-performance indicators (KPIs). Taking a highly creative, and always-changing task, and then breaking it into tasks (that you can perform each day so that you can make progress) is difficult. It requires an independent, creative, and analytical type of person. There is little room for people within the company that expect to be told what to do next: People that behave like “traditional employees.”

How a search marketing company defines measurements, what the tasks are, and the tools and techniques they use to accomplish the tasks ultimately determines the company’s real value to its clients. All of these elements are typically the intellectual property (IP) of that company. Thus, the sometimes secretive nature behind the exact implementation of the tasks they define internally. When clients know about the general elements and terminology this will make the SEM agency’s job much easier to explain. There is a balance between educating clients versus jumping into implementation strategies, tools, and techniques. In some cases too much education will cause analysis paralysis in the decision making process. In most consulting agencies, to gain new business, companies will typically highlight their experience, the clients they have worked with, and the results. If these are impressive enough then this will likely be the beginning of a working relationship, assuming the client’s budget, timeline, KPIs, and other elements are in agreement.

If small search marketing companies are struggling to keep up with industry changes can you imagine their clients’ confusion? In publicly traded companies analysts usually expect to see a certain amount of a company’s revenue going into R&D. Investors like to see this because it means their investment, long-term, is being protected in a sense. It indicates that a company is staying on the edge and that it will not become stale and easily replaced by a new company or emerging technologies. If you value the long-term success of your company then you would naturally have the same concerns as an outside investor.

Having this R&D team on the front-end will also enable your internal team members to stay focused on what they do: Implementation of the tasks you are doing for clients. Eventually, of course, the people implementing the company’s strategies will need to be educated on the changes and why the decisions are being made. To stay ahead however, having the right type of person on the front end looking at new opportunities will go a long way in helping the company keep up with and stay ahead of changes.

An R&D team typically looks for strategic instead of tactical ideas. Your clients will most likely want you to perform strategic actions for their company. Strategic ideas are usually ones that will have a long-term effect and are broad in nature. Tactical ideas are typically short, narrow, and easier to discover. Many SEO techniques are tactical in nature. You could spend a lot of time and energy implementing them and then have their utility vanish overnight. There are countless examples of failed tactical ideas in the SEM community.

Going forward I encourage you to consider how much time your company is devoting to R&D and how this process works. With the continued acceleration of information in this space you may be wasting valuable company time instead of making real progress. It is easy to become distracted in your attempt to keep up with it. Yes, even for something as undefined as R&D you can put some type of process in place to keep pace with the changes in technology and get the discoveries pushed into the internal workings of your company, a difficult task. Taking time to think about this process may leapfrog your company ahead of the competition.

Categories: Internet Research Tags:

Serendipitous Discovery Quotient (SDQ): The Future of SEO? Or an Abstract Concept?

June 24th, 2009 admin No comments

While attending the Semantic Technology Conference in San Jose last week I was fortunate to spend time with Jamie Taylor of Freebase.com, Kingsley Idehen (OpenLink Software Founder and CEO), Martin Hepp (Creator of Good Relations Ontology), Mark Birbeck (practically one of the creators of RDFa), Aldo Bucchi, and Peter Mika from Yahoo!. Jamie Taylor has volunteered to edit for SEMJ.org and contribute a paper covering the fundamentals of linked data, RDFa, and microformats. The forthcoming paper will attempt to clarify some issues surrounding the different types of structured markup.


Left to right: Aldo Bucchi, Peter Mika (Yahoo!), Sean Golliher, Kingsley Uyi Idehen, Mark Birbeck (creator of RDFa), Juan Sequeda

The differences between microformats and RDF are fairly significant; more on their history and future later. (A separate discussion is required to explain why RDF makes more sense than standard database tables and descriptions of data.) A collision of sorts appears to be coming. Or some type of merger? It’s hard to tell at this point. Understanding who the players are and their motivations is key to understanding the decisions that are being made. Microformats are centralized, meaning developers have to get approval for their microformat from microformats.org. RDF, on the other hand, does not require approval from a central organization. RDF namespace is based on XML namespacing. This split appears to be making markup confusing for developers and may slow progress for the developer community. Hopefully decisions are based on what is best for the future of the web. A search engine built off the linked-data concept would cost, perhaps, thousands of dollars rather than the millions or billions it would take to compete now. The DIP for building a SE was created a long time ago.

So far, in the SEM community, there has been only basic mention of microformats but no real debate or discussion on their meaning and what is driving them. There is much more to the story than appears on the surface.

What Google and Yahoo! are doing now to support this markup is great. But they miss the point about open-linked data and what innovators in this space are pushing for. Showing snippets from RDFa and microformats in search results ultimately misses the concept behind the linked-data movement. The SEs are still linking blobs of unstructured text to other blobs of unstructured text through hyperlinks. This works well: statistically based search works. People are happy with it. Like Thomas Tague said in his recent keynote, “Is this the answer to the question that no one is asking?” There were enough practical examples at this year’s conference to show that this research is producing real applications, and one can see the benefits when speaking with developers about their projects and viewing presentations like the one given by Martin Hepp.

These small efforts by the SEs to present structured markup in the SERPs show that the SEs are paying attention and beginning to invest serious resources to participate in these new breakthroughs. Google and Yahoo! had 6 to 8 researchers at the Semantic Technology Conference. Ask and Bing also brought researchers. Many were disappointed with the semantic search panel. It was apparent that defining semantic search was an issue. RDF may be a construct for helping to create semantic search, but it is not semantic search.

Watch the video by Tim Berners-Lee to gain understanding of the concepts behind linked data:

Tim Berners-Lee Speaking on Linked Data at TED conference

Notice how simple concepts about the web were hard to understand then. Similar movements are boiling up now, and we have to spend time thinking about these ideas before we can grasp what linked data is all about.

One can think of WWW research and web science like this: Internet researchers can fundamentally change the way the web works. It’s like manipulating atoms versus molecules. If you’re manipulating atoms you can change the potential for end products that get realized for consumers. You are changing the platform that people operate with. But the ideas associated with this type of thinking are hard to communicate.

Try predicting what the web will be like in 20 years. For example, Apple created the following video in 1987 (Google Beta was in 1998). It seems obvious now, but try to think of a concept that may be 20+ years out. What you think of is probably closely tied to current technology. The video was presented by Tom Gruber of www.siri.com. It shows great vision by Apple in 1987. (I also noticed they are predicting the future existence of Bill Nye the Science Guy.)

Apple’s Concept of the Knowledge Navigator from 1987.

Following is a depiction of participation in the linked-data movement:

linked-open-data

State of the linked data cloud as of March 2009.

I recommend looking at Kingsley Idehen’s company. OpenLink Virtuoso is the RDF database behind DBpedia and most of the bubbles in this LOD cloud.

Tim Berners-Lee published guidelines for publishing data to the WWW. Linked data uses structured markup to connect information. Linked data relies on Uniform Resource Identifiers (URIs) and HTTP. URI is a more general term than URL for referencing things on the web. The preceding web of data is described in a structured way using RDF (not microformats). In this scheme you should be able to dereference a URI over HTTP and uncover RDF representations.
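
Dereferencing can be sketched in a few lines of Python. This is only an illustration, not a reference client: the DBpedia URI and the `application/rdf+xml` media type are assumptions chosen for the example, and `build_rdf_request` and `fetch_rdf` are hypothetical helper names. The idea is that the client sends an ordinary HTTP GET and uses the Accept header to ask for an RDF representation instead of an HTML page.

```python
from urllib.request import Request, urlopen

# Illustrative URI only (an assumption for this sketch); any linked-data URI
# whose server supports content negotiation could be used instead.
EXAMPLE_URI = "http://dbpedia.org/resource/Berlin"


def build_rdf_request(uri: str, media_type: str = "application/rdf+xml") -> Request:
    """Build an HTTP GET request that asks for an RDF representation of `uri`."""
    return Request(uri, headers={"Accept": media_type})


def fetch_rdf(uri: str, timeout: float = 10.0) -> str:
    """Dereference `uri` and return the response body as text (needs network access)."""
    with urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds and inspects the request; call fetch_rdf() to go over the network.
    request = build_rdf_request(EXAMPLE_URI)
    print(request.get_method(), request.full_url)
    print("Accept:", request.get_header("Accept"))
```

Whether a given server actually returns RDF for this header depends on how it is configured, so the response should be checked rather than assumed.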

Using RDF allows applications and users to navigate through different nodes on the web and discover new relationships. Unlike with hyperlinks, you can describe specific sections of your data (or HTML document) and show that they are related to other specific RDF descriptions on the web. Using hyperlinks you can show only that one entire document is related to another entire document, which is very limiting. There are fair arguments taking place that call for the government, and others, to push their data out in whatever format they have now. It is not realistic to expect the government to convert all their data to RDF immediately.

If you want to query this linked-data cloud you can go to the SPARQL endpoint at Squin.
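
The difference between document-level hyperlinks and statement-level RDF links can be sketched with a tiny in-memory set of triples. Everything here is hypothetical: the `ex:` identifiers, the predicates, and the helper names `match` and `reachable` are made up for illustration, and a real application would use an RDF store and SPARQL rather than a Python list. Each triple names the specific thing being described and the kind of relationship, which is what makes pattern queries and link-following possible.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical data: the "ex:" names are invented for this sketch.
TRIPLES: List[Triple] = [
    ("ex:page1#product", "ex:manufacturedBy", "ex:AcmeCorp"),
    ("ex:AcmeCorp", "ex:locatedIn", "ex:Berlin"),
    ("ex:page2#review", "ex:reviews", "ex:page1#product"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def reachable(triples: List[Triple], start: str) -> Set[str]:
    """Follow subject-to-object links from `start` and collect every node reached."""
    seen: Set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for _, _, target in match(triples, subject=node):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen


if __name__ == "__main__":
    print(match(TRIPLES, predicate="ex:reviews"))
    print(sorted(reachable(TRIPLES, "ex:page2#review")))
```

Starting from the review fragment, the traversal reaches the product, its manufacturer, and the manufacturer's location, none of which a plain page-to-page hyperlink would have named.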

Currently only the innovators appear to be participating. Followers will come along once they see greater participation and are told they need to participate. The BBC, Best Buy, the NY Times, and the UK government recently announced plans to publish their data in the linked-data format. The “perfect storm” may be brewing with all these participants and the major search engines showing interest.

Kingsley Idehen (the originator of the SDQ concept and a leader in the linked-data movement) and I discussed the future of SEO in the linked-data movement, which he calls the Serendipitous Discovery Quotient (SDQ). SDQ has been described as precise, while SEO is inherently imprecise.

A few definitions are in order:

Serendipity: The effect by which one accidentally discovers something fortunate, especially while looking for something else entirely.

Quotient: The result of dividing one number by another.

A few questions come to mind:

1. What would we be dividing in such a data space?

2. In the linked open data cloud you are not really discovering things by “accident.” You can navigate to long-tail keywords through structured data connections. In a sense it is up to you how you navigate and uncover relationships. So you may “feel lucky” to easily find what you were looking for, and your search is limited only by your ability to uncover relationships. Discovery is a good term to use in the definition.

To grasp the concept a graph showing the x and y axes will help. This was originally proposed by Kingsley Idehen. The following graph shows a plot of link density versus relevance.


SDQ-graph

Image Provided by Kingsley Idehen

As a sanity check, you can see that the proportions make sense. The graph of a straight line through the origin is described as y = mx (the slope of the line is m = y/x). So if y is link density and x is relevance, then the slope of a line on this graph is Link Density/Relevance. If link density goes up while relevance stays the same, the line gets steeper and moves toward the upper left of the graph. SDQ would have to be the slope of a straight line if defined in this way.

How do we possibly measure either link density or relevance? After all, we have to divide numbers. How many “buy” transactions occur via serendipitous binding? How many such bindings coalesce around specific URIs relative to others? In due course a website (aka Data Space) owner will be able to analyze how many serendipitous bindings occur per second/minute/hour/day/month/year and also determine the units of revenue per such bindings.
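
As a rough sketch of what such an analysis might look like, the Python below counts bindings per day and per URI and computes average revenue per binding. All of it is assumed for illustration: the `Binding` record, its fields, and the function names are hypothetical, and how a binding would actually be detected and logged is left open.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class Binding:
    """One recorded serendipitous binding (a hypothetical log record)."""

    uri: str
    timestamp: datetime
    revenue: float  # 0.0 when the binding did not lead to a purchase


def bindings_per_day(bindings: Iterable[Binding]) -> Dict[str, int]:
    """Count bindings per calendar day, keyed by ISO date."""
    return dict(Counter(b.timestamp.date().isoformat() for b in bindings))


def bindings_per_uri(bindings: Iterable[Binding]) -> Dict[str, int]:
    """Count how many bindings coalesce around each URI."""
    return dict(Counter(b.uri for b in bindings))


def revenue_per_binding(bindings: Iterable[Binding]) -> float:
    """Average revenue per binding; 0.0 when there are no bindings."""
    items = list(bindings)
    if not items:
        return 0.0
    return sum(b.revenue for b in items) / len(items)
```

The same counting could be done per second, minute, hour, month, or year by changing how the timestamp is bucketed.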

Relevance would be measured as a conversion or task completion based on the URI associated with a shopping cart, which the GoodRelations ontology and others may provide in the future. If users are not completing tasks or buying what they are looking for (ecommerce), then a company is not linked in correctly via linked open data, and relevance is off.

You can also imagine relevance algorithms that use billions or trillions of nodes described by RDF triples. With this many edges, there is potentially enough structure to detect spam by comparing against a trusted set of initial data.

Based on these definitions and the graph, there will probably be adjustments to this calculation. This is to be expected because the concept was defined only recently. It is true, in this model, that if link density increases, relevance should go up (you are getting more linked into the open data cloud), and if relevance decreases this would imply that your link density is also low (assuming we have a good set of accurate data). A similar idea holds for PageRank: it scales as more data is added to the system.

My observation is that there is an optimum value for this measurement: 1, or some other constant. This would keep the line running right through the center of the graph. If your value is less than 1, you are low on link density but still high on relevance. However, by not being linked in, your website will not receive optimum traffic, so conversions will be low.

If your website’s link density is high and relevance, or conversions and task completion measurements, are low then you need to work on better descriptions of your data and optimize towards the constant SDQ value. This type of line also could indicate a datum that is being incorrectly described (i.e., spam or manipulation).
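
The two cases above can be written down as a small Python sketch. It assumes SDQ is the ratio Link Density/Relevance from the graph, that both inputs have already been normalized to comparable scales, and that the target constant is 1; the tolerance of 0.1 and the function names are arbitrary choices for illustration, since the concept does not yet define units or thresholds.

```python
def sdq(link_density: float, relevance: float) -> float:
    """Slope of the line on the graph: link density divided by relevance."""
    if relevance <= 0:
        raise ValueError("relevance must be positive to form a quotient")
    return link_density / relevance


def interpret_sdq(value: float, target: float = 1.0, tolerance: float = 0.1) -> str:
    """Compare an SDQ value to the target constant (assumed here to be 1)."""
    if abs(value - target) <= tolerance:
        return "balanced"
    if value < target:
        # Relevant but not linked in enough: traffic, and so conversions, stay low.
        return "under-linked"
    # Heavily linked but converting poorly: improve descriptions, or suspect spam.
    return "over-linked"


if __name__ == "__main__":
    for density, relevance in [(50, 100), (100, 100), (200, 100)]:
        value = sdq(density, relevance)
        print(density, relevance, value, interpret_sdq(value))
```

A site would then optimize toward "balanced" by adding links when it is under-linked and by improving its data descriptions when it is over-linked.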

If this concept proves to be viable then, in the future, SEOs will optimize for an industry-standard SDQ value that is measurable and understood by the optimizer and the customer. The optimization process will aim for a balance of binding and relevance. This is similar to SEO today but possibly more measurable.

Stay tuned for more details on ideas related to linked data, microformats, and RDF.

Categories: Linked Data