While attending the Semantic Technology Conference in San Jose last week I was fortunate to spend time with Jamie Taylor of Freebase.com, Kingsley Idehen (OpenLink Software founder and CEO), Martin Hepp (creator of the GoodRelations ontology), Mark Birbeck (practically one of the creators of RDFa), Aldo Bucchi, and Peter Mika from Yahoo!. Jamie Taylor has volunteered to edit for SEMJ.org and contribute a paper covering the fundamentals of linked data, RDFa, and microformats. The forthcoming paper will attempt to clarify some issues surrounding the different types of structured markup.
Left to right: Aldo Bucchi, Peter Mika (Yahoo!), Sean Golliher, Kingsley Uyi Idehen, Mark Birbeck (creator of RDFa), Juan Sequeda
The differences between microformats and RDF are fairly significant; more on their history and future later. (A separate discussion is required to explain why RDF makes more sense than standard database tables and descriptions of data.) A collision of sorts appears to be coming. Or some type of merger? It’s hard to tell at this point. Understanding who the players are and their motivations is key to understanding the decisions that are being made. Microformats are centralized, meaning developers have to get approval for their microformat from microformats.org. RDF, on the other hand, does not require approval from a central organization. RDF namespaces are based on XML namespacing. This split appears to be making markup confusing for developers and may slow progress for the developer community. Hopefully decisions are based on what is best for the future of the web. A search engine built on the linked-data concept would cost, perhaps, thousands of dollars rather than the millions or billions it would take to compete now. The DIP for building an SE was created a long time ago.
So far, in the SEM community, there has been only basic mention of microformats but no real debate or discussion on their meaning and what is driving them. There is much more to the story than appears on the surface.
What Google and Yahoo! are doing now to support this markup is great. But they miss the point about open linked data and what innovators in this space are pushing for. Showing snippets from RDFa and microformats in search results ultimately misses the concept behind the linked-data movement. The SEs are still linking blobs of unstructured text to other blobs of unstructured text through hyperlinks. This works well: statistically based search works. People are happy with it. As Thomas Tague said in his recent keynote, “Is this the answer to the question that no one is asking?” There were enough practical examples at this year’s conference to show that this research is producing real applications, and one can see the benefits when speaking with developers about their projects and viewing presentations like the one given by Martin Hepp.
These small efforts by the SEs to present structured markup in the SERPs show that the SEs are paying attention and beginning to invest serious resources to participate in these new breakthroughs. Google and Yahoo! had 6 to 8 researchers at the Semantic Technology Conference. Ask and Bing also brought researchers. Many were disappointed with the semantic search panel. It was apparent that defining semantic search was an issue. RDF may be a construct for helping to create semantic search, but it is not semantic search.
Watch the video by Tim Berners-Lee to gain understanding of the concepts behind linked data:
Notice how simple concepts about the web were hard to understand then. Similar movements are boiling up now, and we have to spend time thinking about these ideas before we can grasp what linked data is all about.
One can think of WWW research and web science like this: Internet researchers can fundamentally change the way the web works. It’s like manipulating atoms versus molecules. If you’re manipulating atoms you can change the potential for end products that get realized for consumers. You are changing the platform that people operate with. But the ideas associated with this type of thinking are hard to communicate.
Try predicting what the web will be like in 20 years. For example, Apple created the following video in 1987 (Google Beta was in 1998). It seems obvious now, but try to think of a concept that may be 20+ years out. What you think of is probably closely tied to current technology. The video was presented by Tom Gruber of www.siri.com. It shows great vision by Apple in 1987. (I also noticed they are predicting the future existence of Bill Nye the Science Guy.)
Following is a depiction of participation in the linked-data movement:
I recommend looking at Kingsley’s company. OpenLink Virtuoso is the RDF database behind DBpedia and most of the bubbles in this LOD cloud.
Tim Berners-Lee published guidelines for publishing data to the WWW. Linked data uses structured markup to connect information. Linked data relies on Uniform Resource Identifiers (URIs) and HTTP. URI is a more general term than URL for referencing things on the web. The preceding web of data is described in a structured way using RDF (not microformats). In this scheme you should be able to dereference a URI using HTTP and uncover RDF representations.
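To make the dereferencing step concrete, here is a minimal Python sketch of an HTTP request that asks for RDF instead of HTML. The URI is a made-up placeholder, the function names are my own, and the media types in the Accept header are an assumption about what a given linked-data server offers; a real server may support different RDF serializations.

```python
from urllib.request import Request, urlopen


def build_rdf_request(uri: str) -> Request:
    """Prepare an HTTP GET that asks the server for an RDF representation."""
    # The Accept header tells the server the client prefers RDF over HTML.
    # The media types listed here are an assumption, not a requirement.
    headers = {"Accept": "application/rdf+xml, text/turtle;q=0.9"}
    return Request(uri, headers=headers)


def dereference(uri: str) -> str:
    """Fetch whatever the server returns for the URI (needs network access)."""
    with urlopen(build_rdf_request(uri), timeout=10) as response:
        return response.read().decode("utf-8")


# Hypothetical URI, used only to show the shape of the request.
request = build_rdf_request("http://example.org/resource/Thing")
```

Calling dereference() on a real linked-data URI should return the RDF description of that thing, provided the server supports this kind of content negotiation.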
Using RDF allows applications and users to navigate through different nodes on the web and discover new relationships. Unlike hyperlinked relationships you can describe specific sections of your data (or HTML document) and show that it is related to other specific RDF descriptions on the web. Using hyperlinks you can show only that one entire document is related to another entire document, which is very limiting. There are fair arguments taking place that are calling for the government, and others, to push their data out in any format they have now. It is not realistic to expect the government to convert all their data to RDF immediately.
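A rough way to picture this navigation is to treat RDF statements as (subject, predicate, object) triples and follow them from node to node. The Python sketch below uses invented "ex:" identifiers and a plain in-memory list rather than a real RDF store, so it illustrates the idea only and not any particular toolkit.

```python
from collections import defaultdict

# Hypothetical triples: (subject, predicate, object).
TRIPLES = [
    ("ex:Store", "ex:sells", "ex:Camera"),
    ("ex:Camera", "ex:madeBy", "ex:Maker"),
    ("ex:Maker", "ex:locatedIn", "ex:Tokyo"),
]


def build_index(triples):
    """Group outgoing (predicate, object) pairs by subject."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def reachable(index, start):
    """Return every node that can be reached from start by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _predicate, obj in index.get(node, []):
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


index = build_index(TRIPLES)
discovered = reachable(index, "ex:Store")
```

Starting from the store, the traversal discovers the camera, its maker, and the maker's location, none of which the store's own description mentions beyond the first link.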
If you want to query this linked-data cloud you can go to the SPARQL endpoint at Squin.
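For readers who have not seen a SPARQL request, the following Python sketch builds the URL for a simple query sent over HTTP GET. The endpoint address is a placeholder and not the actual Squin address, and the query simply asks for any ten triples.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a GET URL that carries the SPARQL query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Ask for any ten subject/predicate/object triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

# Placeholder endpoint; substitute the address of a real SPARQL service.
url = sparql_get_url("http://example.org/sparql", QUERY)
```

Fetching that URL from a live endpoint would return a result set with one row per matching triple.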
Currently only the innovators appear to be participating. Followers will come along once they see greater participation and are told they need to participate. There were recent announcements by the BBC, Best Buy, NY Times, and the UK government to publish their data in the linked-data format. The “perfect storm” may be brewing with all these participants and the major search engines showing interest.
Kingsley Idehen (the originator of the SDQ concept and a leader in the linked-data movement) and I discussed SEO’s future, which he calls the Serendipitous Discovery Quotient (SDQ), for the linked-data movement. SDQ has been described as precise while SEO is inherently imprecise.
A few definitions are in order:
Serendipity: The effect by which one accidentally discovers something fortunate, especially while looking for something else entirely.
Quotient: Divide two numbers and you have a quotient.
A few questions come to mind:
1. What would we be dividing in such a data space?
2. In the linked open data cloud you are not really discovering things by “accident.” You can navigate to long-tail keywords through structured data connections. In a sense it is up to you how you navigate and uncover relationships. So you may “feel lucky” to easily find what you were looking for, and your search is limited only by your ability to uncover relationships. Discovery is a good term to use in the definition.
To grasp the concept a graph showing the x and y axes will help. This was originally proposed by Kingsley Idehen. The following graph shows a plot of link density versus relevance.
As a sanity check you can see that it makes sense proportionally. A straight line through the origin is described as y = mx (the slope of the line is m = y/x). So if y is link density and x is relevance, then the slope of a line on this graph is Link Density/Relevance. So if link density goes up, the line moves toward the upper left of the graph. SDQ would have to be the slope of a straight line if defined in this way.
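Read literally, that definition is a single division. The Python sketch below assumes link density and relevance have already been measured as positive numbers in comparable units, which is exactly the open question; the function name and the sample values are illustrative only.

```python
def sdq(link_density: float, relevance: float) -> float:
    """Slope of the line through the origin: link density divided by relevance."""
    if relevance <= 0:
        # A zero or negative relevance gives no meaningful slope.
        raise ValueError("relevance must be positive")
    return link_density / relevance


# Illustrative numbers, not measurements.
steeper = sdq(40.0, 20.0)   # more link density per unit of relevance
flatter = sdq(10.0, 20.0)   # less link density per unit of relevance
```

Raising link density while relevance stays fixed raises the slope, which matches the line moving toward the upper left of the graph.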
How do we possibly measure either link density or relevance? After all, we have to divide numbers. How many “buy” transactions occur via serendipitous binding? How many such bindings coalesce around specific URIs relative to others? In due course a website (aka Data Space) owner will be able to analyze how many serendipitous bindings occur per second/minute/hour/day/month/year and also determine the units of revenue per such bindings.
Relevance would be measured as a conversion or task completion based on the URI associated with a shopping cart, which GoodRelations Ontology and others provide in the future. If a user is not completing tasks or buying what they are looking for (ecommerce), then a company is not linked in correctly via linked open data, and relevance is off.
You also can imagine relevance algorithms that use billions or trillions of nodes describing triple data. With this many edges you can see that there are potentially nodes to detect spam against a trusted set of initial data.
Based on these definitions and the graph there probably will be adjustments to this calculation. This adjustment is expected because the concept was recently defined. It is true, in this model, that if link density increases, relevance should go up (getting more linked into the open data cloud), and if relevance decreases this would imply that your link density is also low (assuming we have a good set of accurate data). A similar concept is also true of PageRank: It scales as more data is added to the system.
My observation is that there is an optimum value for this measurement. The value = 1 or some constant. This would keep the line right through the center of this graph. If you are less than one you are low on link density but still high on relevance. However, by not being linked in, your website will not receive optimum traffic, so conversions will be low.
If your website’s link density is high and relevance, or conversions and task completion measurements, are low then you need to work on better descriptions of your data and optimize towards the constant SDQ value. This type of line also could indicate a datum that is being incorrectly described (i.e., spam or manipulation).
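The two cases above can be written as a small decision rule. This Python sketch assumes the optimum is SDQ = 1 and adds a tolerance band of my own choosing; the labels and the tolerance are hypothetical and not part of the SDQ proposal.

```python
def interpret_sdq(link_density, relevance, target=1.0, tolerance=0.1):
    """Classify a site by how its SDQ compares with the assumed optimum."""
    if relevance <= 0:
        raise ValueError("relevance must be positive")
    ratio = link_density / relevance
    if abs(ratio - target) <= tolerance:
        return "balanced"
    if ratio < target:
        # Relevant but not linked in enough to receive optimum traffic.
        return "under-linked"
    # Heavily linked but converting poorly: improve the data descriptions,
    # or treat the data as possible spam or manipulation.
    return "over-linked"


# Illustrative inputs, not measurements.
labels = [interpret_sdq(10, 10), interpret_sdq(5, 10), interpret_sdq(20, 10)]
```

With these inputs the three sites come out as balanced, under-linked, and over-linked respectively.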
If this concept proves to be viable then, in the future, SEOs will optimize for an industry-standard SDQ value that is measurable and understood by the optimizer and the customer. The optimization process will be towards a balance of binding and relevance. This is similar to SEO today but possibly more measurable.
Stay tuned for more details on ideas related to linked data, microformats, and RDF.