Loudon Stearns

Abstract
The next stage of the web, as outlined in the “Semantic Web Road map,” is to embed “meaning” into the web itself. This essay reflects on the present, near future, and far future of the hyperlink as it evolves from a simple connecting device into the connective tissue of a “devastatingly powerful” semantic web. Please view the web version of this document at http://loudonstearns.com/thelink.
Keywords: Semantic Web, link, hyperlink, meaning, URI, ontology, OWL
When a system of “meaningless” symbols has patterns in it that accurately track, or mirror, various phenomena in the world, then that tracking or mirroring imbues the symbols with some degree of meaning---indeed, such a tracking or mirroring is no less and no more than what meaning is. Depending on how complex and subtle and reliable the tracking is, different degrees of meaningfulness arise. (Douglas Hofstadter, Gödel, Escher, Bach, p. P-3)
Sufficient associations between a pattern of symbols and a reality form meaning. Look at a map: lines and dots on a page are a scaled representation of the world around us. A map has meaning because the symbols on the page mirror, or track, real-world spatial locations. We can understand the relationships between the map and the world and decide to follow a direction to water or food.
Though a bit more abstract, language forms meaning the same way. Words represent real-world things, and sentences connect those words in ways that represent the real-world connections between things. Semiotics approaches this topic by calling the dots and lines the signifier, the real-world things the signified, and the combination of the two the sign.
A single symbol is not enough to form meaning; it takes a pattern of symbols. If I place a single dot on a page, that dot can represent anything, so it represents nothing. As I add more dots, and connect those dots, meaning can arise. This simple fact is the basis for countless “connect the dots” children’s games. When a meaningless pile of dots is linked in a way that represents a real-world system, meaning forms, and children around the world delight in their understanding of even this simple formation of meaning.
“Understanding,” like “meaning,” is tricky. Most definitions of “understand” assume a human understander, but as we delve further into the future of information access we should at least consider a definition of understanding that is not reliant on biology. Let’s try this definition: understanding is the potential to act on meaning, and acting based on meaning provides proof of understanding.
It is common to program a computer to manipulate symbols, or signifiers. That is largely what current web technology does. If you type “http://loudonstearns.com/thelink” into your web browser’s URI bar, you will be presented with a short movie and this essay. As of April 27, 2014, your computer will present symbols including text, images, and a short movie, but will not understand their meaning. This action does, however, demonstrate one important building block of meaning already existing on the internet: the link. It is the goal of the future of the web to use links to establish the patterns necessary for meaning to arise, and for the web itself to understand and act upon that meaning.
Tim Berners-Lee coined the term “Semantic Web” to describe a future of the web where the web itself understands and acts on the meaning of the web. His 1998 document “Semantic Web Road map” describes “a sequence for the incremental introduction of technology to take us, step by step, from the Web of today to a Web in which machine reasoning will be ubiquitous and devastatingly powerful.”
Tim Berners-Lee and his World Wide Web Consortium (W3C) develop the web standards that allow the web to function. Their mission is to “lead the Web to its full potential,” and the general model they are using to develop “Machine-Understandable information: Semantic Web” is the Resource Description Framework (RDF):
RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link.
URI stands for “Uniform Resource Identifier,” and it is described by the W3C as the “Global Identification System” of the web. It is a name that we give something which usually exists on the web, like “http://loudonstearns.com/thelink”. But a URI is more general: it can reference something that is not information, and possibly not even human-readable.
A URI could reference a concept, like “movie director.” But then what should be returned? A list of famous movie directors? A plain-text dictionary-like definition of a movie director? Those are helpful for us humans to understand what a “movie director” is, because we have seen movies, understand human relationships, and can read language (we carry a mental model of the world that language maps onto). But for a computer to act upon the meaning of “movie director,” they are useless. What if, instead, the URI led to a database of relationships? A place where “director” is a property of “person,” which is in turn a “thing.” A place where “director” is used in the types “Movie” and “TVEpisode” (both of which are subtypes of “CreativeWork”). Referencing a space like this gives the computer a chance of understanding the data.
This space within the semantic web is an “ontology” or “schema,” and, in addition to RDF, it is a central component of the semantic web. To play with an ontology, check out freebase.org or schema.org. I say “play” consciously, as the best way to get to know these concepts is to explore the space and get a feel for them. See how the structure of the ontology mirrors or tracks the world around you.
If the Semantic Web is to act on the meaning of data, it must be able to represent simple real-world things. Regardless of whether it is true, what does a statement like “Loudon Stearns is a movie director” mean to us, how do humans parse it, and what can it mean to a computer?
From the syntax (the choice and organization of characters) we see the capital letters in “Loudon Stearns,” and we also recognize those as strange non-dictionary words, so they represent a specific person: they are a person’s name. “Is a” actually has a very vague and shifty meaning, but we also see “movie director,” which we relate to a profession (not a species of animal). So, to understand, we first collect some information about the individual parts of the statement.
We break it into parts:
[Loudon Stearns] [is a] [movie director]
Then add more connections to the parts:
[Loudon Stearns(person)] [is a(has the job title)] [movie director(profession)].
Notice how we needed to add information about the information in order to extract meaning. The stuff in parentheses is “metadata” (data referring to or explaining other data). The Semantic Web consists of metadata added to our information in a way that the computer can understand. In a sense, the entire Semantic Web is a collection of interrelated metadata stored in a rigid, formalized way: as collections of triples.
“Loudon Stearns is a movie director” is a very simple sentence, the simplest type, a “clausal sentence,” or sentence with a single clause. This is the type of sentence that structures the entire semantic web. Quite possibly it is the “quanta” of the semantic web, and it is known as a “triple.” We can see the three parts and give them names:
Subject: Loudon Stearns
Predicate: has the jobTitle
Object: movie director.
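A triple is easy to picture in code. Here is a minimal sketch in Python (the short names are illustrative stand-ins for the full URIs a real RDF store would use):

```python
# A triple is just an ordered (subject, predicate, object) group.
# These short names stand in for the full URIs real RDF data would use.
triple = ("Loudon Stearns", "hasJobTitle", "movie director")

subject, predicate, obj = triple
print(subject)    # the thing being described
print(predicate)  # the named relationship
print(obj)        # the value at the other end of the link
```

Everything that follows, from ontologies to inference, is built from collections of groups shaped exactly like this one.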
The relation to grammar is interesting to explore. In a way the RDF creates a grammar for the Semantic Web based on these simple clausal sentences. It is useful to compare this to our own language particularly when considering the definition of meaning we are working with: “Sufficient associations between a pattern of symbols and a reality form meaning.”
Language is a pattern of symbols used to represent reality. In the world around us we find things (nouns) and how those things act (predicates). The Semantic Web is an attempt to create a similar representation of reality within a computer network, so it is no surprise that we find similarities between language and the Semantic Web, and that we use similar language to describe both. This can lead to confusion, though, with terms like “predicate,” which in traditional grammar would include “has the job title movie director” but in this context is just “has the job title.”
The context we find ourselves in here is “graph theory,” a subset of math that describes things (vertices or nodes) and relationships between them (edges). More specifically, we are dealing with a subset of graph theory: the “directed labeled graph.”
It is useful to break this down a bit:
A simple graph has vertices and edges: +-------+
To make this graph(map) more useful we label the vertices: S------O
This could reference Sally’s house and Owen’s house.
To make this more useful we identify that it is a one way street: S------>O
This vertex-labeled directed graph is a representation of links in the non-semantic web: a link provides a one-way connection from Sally’s information to Owen’s information. All information about how Sally’s and Owen’s information relates resides in the human-readable content of the page and is not present in a computer-readable format.
Finally, we give this a street name, Princeton street connects the houses: S---P--->O
This directed graph, with labeled vertices and edges, is the basic structure of the semantic web: Subject---Predicate---->Object groups, coded in a computer-readable way, collected into databases and pages, and connected through shared ontologies (networks of related predicates).
We can expand our diagram a bit here: Loudon Stearns---has the jobTitle--->movie director
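The Sally-to-Owen street diagram can be sketched directly in code. A minimal Python sketch of a directed labeled graph, using the illustrative names from the diagram:

```python
# A directed labeled graph stored as a set of (source, edge label, destination) edges.
edges = {
    ("Sally", "Princeton", "Owen"),
}

# Direction matters: follow only the edges leaving a given node.
from_sally = {(label, dest) for (src, label, dest) in edges if src == "Sally"}
from_owen = {(label, dest) for (src, label, dest) in edges if src == "Owen"}

print(from_sally)  # Sally can reach Owen via Princeton street
print(from_owen)   # empty: the one-way street does not run back
```

Note that the edge tuples have exactly the same shape as the triples above; a triple store is a directed labeled graph.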
Understanding coded triples is key for a search engine to return information and display it in a usable way. Have you ever searched on Google for a business and found that the important information, like opening time and contact info, appears right in the Google results, without your even having to go to the website itself? That is made possible by the business’s web designer marking up the site code with computer-readable metadata. Let’s look at a simple example from a movie theatre website:
<div class="theatre-info" itemscope itemtype="http://schema.org/MovieTheater">
This line states that this section (div) of the website will contain the theatre’s information. “itemscope itemtype="http://schema.org/MovieTheater"” says that the MovieTheater type from the schema.org ontology will be used to mark up this information in a computer-readable way.
Later in the code of the page we find the place where the theatre’s phone number is displayed:
<p itemprop="telephone" class="value">555-555-5555</p>
This line states that the number here is a telephone number as described by the MovieTheater type at schema.org. If you go to http://schema.org/MovieTheater you will find that there is a telephone property listed there.
The Google search engine knows the schema.org ontology. In fact, Google, Bing, and Yahoo! teamed up to create schema.org because they all understand how vital a central ontology is for the web as a whole. When any of these search engines scour the web they specifically look for semantic web code like “itemscope” and “itemprop” so that they can enhance their results. This is one area of the semantic web that you have benefited from already, but it is just the beginning. Strictly speaking, this is a fairly limited application dubbed “microdata,” which uses a slightly different syntax than RDF, but the goals are the same: define which ontology will be used and label the data with terms defined in the ontology.
This is directly related to our earlier example of the sentence “Loudon Stearns is a movie director.” We broke it apart into its components and applied metadata to it. In a similar way, the webmaster breaks the page into components with standard HTML tags like <div>, <p>, and <a>, and uses the semantic web standards to apply metadata which the computer can understand.
On the theatre’s website all that is visible is the phone number, but because the metadata was added, the search engine understands this simple proposition: “This theatre has the phone number 555-555-5555.” I say “understands” even in this simple context because the computer is able to use the information to do something useful: display the theatre’s phone number even before displaying the entire page.
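Putting the two lines together, a minimal marked-up block might look like this (the theatre name and its surrounding markup are hypothetical; telephone and name are real schema.org properties):

```html
<div class="theatre-info" itemscope itemtype="http://schema.org/MovieTheater">
  <span itemprop="name">Example Cinema</span>
  <p itemprop="telephone" class="value">555-555-5555</p>
</div>
```

A human visitor sees an ordinary name and phone number; a crawler sees two labeled properties of a MovieTheater item.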
That simple application of microdata and RDF is pervasive and is being applied on many major data-heavy websites. For example, looking at the code of the page of a random actor on IMDB.com finds 56 “itemprop=” properties.
Let’s take this to a more general search. What if I were to look up a movie that was just released in theatres: Transcendence. The big, obvious results are all pulled from semantic web information. We see a sidebar that includes info from IMDB.com and Rottentomatoes.com, a description from wikipedia.org, and movie times from a local theatre. It is as if Google is showing off what it “knows.” Google has been programmed to give preference to the information it understands (information that has been marked up with RDF). For those of us in the arts this may be an important lesson: marking our information with semantic web standards raises its visibility in popular search engines. To dig into how Google finds this information, look at Freebase, Google’s own “knowledge base,” an ontology plus related content, which is constantly growing and publicly available.
But, we haven’t yet dug into the real power of the semantic web: OWL.
OWL stands for Web Ontology Language (yes, the letters are mixed up, an example of one of the many charming acronym games that programmers like to play), and it is the backbone of ontologies. This is the language that defines how the predicates relate. For an example of how this could be used, let’s examine a near-future possibility: I might be thinking “I know I want to see that Johnny Depp movie…” What if I search for “movie times for a Johnny Depp movie”? Right now I just get a list of news articles, most of which have “Transcendence” in the title (not a semantic result). That’ll get me there with a few more clicks, but if Google employed a bit more semantic web logic I would get just what I wanted: the movie times for “Transcendence” at the theatre nearest me.
This would require some understanding of the meaning of my search terms, and also connecting the meaning of many different databases. Most importantly, it would require using “predicate logic” to coordinate multiple sets of information. Google might search the IMDB database to find the list of movies Johnny Depp was in, compare that list to movies playing in theatres to see which movie I might be talking about, notice that Transcendence is playing at many theatres, find the theatre nearest me, and present me with the times at my local theatre. There are many difficulties in following this path, though. The IMDB database and the movie theatre database might utilize the same ontology to mark up their data, but they might not: IMDB might use the dublincore.org ontology while the theatres use the schema.org ontology. This is where OWL and the concept of “Linked Data” come in.
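That chain of lookups can be sketched in Python. The two mini-databases, their predicate names, and the theatre are all invented for illustration; the point is the join across two independently maintained sets of triples:

```python
# Two hypothetical databases, each a set of (subject, predicate, object) triples.
imdb = {
    ("Johnny Depp", "actedIn", "Transcendence"),
    ("Johnny Depp", "actedIn", "Ed Wood"),
}
theatres = {
    ("Transcendence", "playingAt", "Example Cinema"),
}

# Step 1: which movies was Johnny Depp in?
depp_movies = {o for (s, p, o) in imdb if s == "Johnny Depp" and p == "actedIn"}

# Step 2: which of those movies is actually playing, and where?
showings = sorted((m, t) for (m, p, t) in theatres
                  if p == "playingAt" and m in depp_movies)
print(showings)
```

This only works because both datasets happen to use the same names; the next step, linking different ontologies, is what makes the join possible when they do not.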
For us to have a useful “web of meaning,” as is the dream of the Semantic Web, all our information must be able to be compared, compiled, and calculated on. By building ontologies with a standard tool (OWL), they can be connected and linked. A set of simple statements like “title in the Dublin Core is equivalent to name in schema.org” allows ontologies, and the information they describe, to be connected, linked, and compared.
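Here is a sketch of what such an equivalence statement buys us, in Python. The dc:title / schema:name pairing mirrors the sentence above; the normalize helper and the example subject are purely illustrative:

```python
# A hypothetical OWL-style declaration: dc:title is equivalent to schema:name.
equivalent = {"dc:title": "schema:name"}

def normalize(triple):
    """Rewrite a triple's predicate to its canonical equivalent, if one exists."""
    s, p, o = triple
    return (s, equivalent.get(p, p), o)

# The same fact, described with two different ontologies:
a = ("ex:Transcendence", "dc:title", "Transcendence")
b = ("ex:Transcendence", "schema:name", "Transcendence")

print(normalize(a) == normalize(b))  # the two datasets can now be compared
```

One declared equivalence, and every triple written against either ontology becomes comparable with the other.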
This is largely how Freebase.org has grown so fast: it connects to many other ontologies and knowledge bases, like dbpedia.org, an RDF-tagged version of Wikipedia’s information. Linked data has been one of the most important goals for the web since it was first considered in Vannevar Bush’s seminal paper “As We May Think,” and it is a major rallying point for Tim Berners-Lee, probably the most important person in developing the web as we know it and the web as it will be. Please explore linkeddata.org to see how linked data is already working, and see Tim Berners-Lee’s TED talk “The Next Web” to hear a bit about his vision of the near future of the web. It is important to note that the W3C, led by Tim Berners-Lee, develops the standards by which the web works, and most of what is being referred to here, like the RDF and OWL languages, was developed by this group. Also, much of the information included in this document was learned through researching the W3C website.
I believe the reasons to connect data are obvious, particularly if one considers scientific research, medical/biological research, and finance, which are areas where the semantic web is already very active. But I have not been focusing on those in this essay. Instead, I have been talking about seemingly trivial things: movies and phone numbers. Scientists and tax accountants are a special class of people; their profession is finding the best information, so they will surely create the best ways to access and parse it. Most of us are not scientists or tax accountants. We don’t take the time to find the best information; instead we find the easiest information. This is a proven, and troubling, fact, and one that the semantic web will need to come to terms with.
A new field of science has been developing and growing, largely out of library and information science, called “Information Behaviour,” conceptualized as how people need, seek, manage, give, and use information in different contexts (Theories of Information Behavior, xix). It seems that this new science has not yet provided us with many absolutes; instead it is at the point of providing theories, metatheories, and models that are being tested and built upon. There is one strong result in the field of Information Behaviour, though: the Principle of Least Effort, which states that “people invest little in seeking information, preferring easy-to-use accessible sources to sources of known high quality that are less easy to use and/or less accessible” (Theories of Information Behavior, 4). The Principle of Least Effort must be considered when developing a future of information access. It is easy to become enamored with Tim Berners-Lee’s information-utopian future where all information is stored and searchable online, but it is a much different challenge if we understand that people, even if they know how to find high-quality information, will use lesser-quality, easily accessible information. With this in mind I would suggest that we establish a new goal for the semantic web: to make the highest-quality information the most easily accessible information.
This presents two parallel design challenges: designing a system that can identify high-quality information, and designing an interface that makes getting to it easy. The first challenge, identifying high-quality information, is quite a tricky one. The quality of any piece of information is very difficult to calculate and is based on many parameters. Because of this, Information Quality (“IQ”) is a science and study all to itself. This is seen in many companies now having a Chief Data Officer (CDO) with the job of collecting and establishing metrics for information quality, both for their internal data and for the so-called “Big Data” that comprises the larger linked databases forming the semantic web.
Information quality is not a fixed or absolute metric tied to a piece of information; it must be calculated at the moment of search and be related to who the searcher is and the context of the search. We should strive for a search to be a re-organization of all the world’s information, based on the needs of the user, which places the highest-quality information in the easiest-to-reach location.
The method of calculating “information quality” has become the field of competition between search engines, and it is the way in which they differentiate themselves, even though they are working from the same pool of linked data. We are already seeing search engines attempting to place the best information first, but it seems to be more “showing off” than actually attempting to calculate an “information quality” quotient. This will be an area of healthy growth in the coming years. OWL and RDF will be major players in this development, for within the OWL and RDF specifications is the ability to do internal checks: there is a logic layer that spots inaccuracies or conflicts among the data, and OWL defines how the data can be related.
We have data all over the place, and it is now linked, but is it representing reality? Not until we carefully define how the metadata relates; in other words: what are the rules of the data relationships?
A simple example:
Loudon is the biological child of Mark.
A simple triple that could be codified in RDF and connected to an ontology.
Loudon is the biological child of Madeleine.
Another simple triple that could be codified in RDF and connected to an ontology.
Loudon is the biological child of Stephen.
Another simple triple that could be codified in RDF and connected to an ontology.
Wait a minute, these can’t all be true! We know that because we understand the meaning of “biological child,” but the computer only knows that these are mutually exclusive if it is coded in the ontology using the OWL language. In this case, in designing the ontology it would need to be explicitly stated that a person can have only two “biological child of” properties. This could be taken further in the OWL to state that one of the two parents must be male and the other female. With this information the computer could recognize that either Mark or Stephen is false information. In this way, carefully coded ontologies can be made to check themselves and identify problems, and this type of logic is even more important when combining larger, more complex datasets. As ontologies become more and more sophisticated they represent reality more and more closely, but even that isn’t the most amazing part of the Semantic Web: Inference.
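The cardinality check itself is simple to sketch in Python. The predicate name and the two-parent rule here are stand-ins for what an OWL ontology would declare formally:

```python
triples = [
    ("Loudon", "biologicalChildOf", "Mark"),
    ("Loudon", "biologicalChildOf", "Madeleine"),
    ("Loudon", "biologicalChildOf", "Stephen"),
]

# Hypothetical ontology rule: at most two biologicalChildOf links per person.
MAX_PARENTS = 2
parents = [o for (s, p, o) in triples
           if s == "Loudon" and p == "biologicalChildOf"]
consistent = len(parents) <= MAX_PARENTS
print(consistent)  # the data conflicts with the ontology's rule
```

The data is flagged as inconsistent without the computer knowing anything about biology; the rule in the ontology did the work.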
Broadly speaking, inference on the Semantic Web can be characterized by discovering new relationships. On the Semantic Web, data is modeled as a set of (named) relationships between resources. “Inference” means that automatic procedures can generate new relationships based on the data and based on some additional information in the form of a vocabulary, e.g., a set of rules. (http://www.w3.org/standards/semanticweb/inference#examples)
Once the computers understand the information, then through automatic procedures they can “discover” new relationships. This ability to infer is the computer actually working with the meaning of the symbols and not just working with the symbols themselves. Inference is proof of understanding, and in turn proof of meaning.
Another very simple example:
Loudon is the child of Mark. Mark is male.
These are triples related to an ontology.
The ontology contains this rule:
If X is the child of Y, and Y is male, then Y is the father of X.
With this, the computer would infer that Mark is the father of Loudon.
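That rule can be sketched as a one-pass inference in Python. The predicate names are illustrative; a real reasoner would apply many such rules repeatedly until no new facts appear:

```python
facts = {
    ("Loudon", "childOf", "Mark"),
    ("Mark", "isMale", "true"),
}

# Rule: if X childOf Y, and Y is male, then Y fatherOf X.
inferred = {
    (y, "fatherOf", x)
    for (x, p, y) in facts
    if p == "childOf" and (y, "isMale", "true") in facts
}

print(inferred)  # a new triple the original data never stated
```

The fatherhood triple was never written down anywhere; it was generated from the data plus the rule, which is exactly what the W3C description of inference above promises.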
That is a very simple example, but it marks only the beginning. Built into the OWL language is a sophisticated set of semantics that can be applied to an ontology. As ontologies grow in sophistication they will better and better represent reality, just as language grew in complexity and now tracks reality quite closely. This ability to make inferred data explicit is perhaps the most powerful and largely untapped capability of the semantic web.
By checking data and making inferences, the semantic web has the potential to calculate, to a much higher degree of accuracy, “information quality,” allowing the search engine to place the best information in the easiest to find location. This is of great importance to all of us beyond finding movie times for the latest Johnny Depp movie, for links are a currency on the web, and right now that is a currency with an unreliable value:
Being highly ranked is the end result of a complex algorithm that is often taken as a proxy for social importance. (Seth Finkelstein, The Hyperlinked Society)
A search engine’s page rank has a major impact on the searcher, with the higher-ranked pages assumed by the public to be more important or of higher “social importance.” Early page-rank algorithms relied on the number of links to and from a site to calculate how high the page would show in searches (slightly resembling a popularity contest). In fact, this issue led to one of the first applications of “labeling the edge” (graph theory), the kind of markup that is the foundation of the semantic web: the nofollow attribute.
Early HTML standards included a “rel” attribute for links. The rel attribute allows the document creator to describe what is at the “other end” of the link. With respect to search engines, the “nofollow” value is specifically important, as it allows one to link to a page that one doesn’t want to endorse.
Say I am writing a paper on racism; it may be useful to place a link to a hateful page, but I wouldn’t want that additional link to increase the page rank of the page I am linking to. By adding the rel="nofollow" attribute to the link, I let the search engines know that I am not endorsing the page. Link counts and syntactic analysis (examining the symbols of the page, such as word counts) have a use in search, in that they do collect a wide range of related pages, but they don’t put the highest-quality information in the easiest-to-use location. These methods have also proven to be susceptible to manipulation by crafty web designers attempting to inflate their page rank (Seth Finkelstein, The Hyperlinked Society). The fact that page rank has been manipulated is a testament to its social value, and when we remember the Principle of Least Effort it is plain to see the importance of creating page-rank algorithms that more accurately represent information quality.
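In HTML, that endorsement-withholding is a single attribute on the link (the URL here is a placeholder):

```html
<!-- The nofollow value tells search engines not to pass ranking
     credit along this edge of the graph. -->
<a href="http://example.com/hateful-page" rel="nofollow">a hateful page</a>
```

The human reader sees an ordinary link; only the crawler sees the label on the edge.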
The semantic web provides a solution to this issue, but may raise other issues. By examining the meaning of the page instead of the syntax and link count, the search engine can rank more highly the information it has fact-checked against many databases, and possibly against what it “knows” about you and your past habits (incorporating the Facebook “graph search” will tell quite a bit about you). This brings up many issues regarding privacy and tracking. Luckily, the W3C is working hard on incorporating user tracking preferences into the basic HyperText Transfer Protocol (HTTP).
The first issue is a continuing narrowing of what information we are exposed to (the “fragmentation” and “polarization” described by James G. Webster in “Structuring a Marketplace of Attention,” The Hyperlinked Society). The more a search engine calculates an information quality metric based on our past habits, the more we will only see what we “want” to see. This may leave us isolated, hearing only what we want to hear and not being exposed to contrary viewpoints unless we seek them out ourselves.
The second issue this may raise is a kind of class structure, which we are already seeing (Matthew Hindman, “What Is the Public Sphere Good For?,” The Hyperlinked Society). Those people and institutions that have the knowledge and means to tag their information with this metadata will appear to be of higher value. In most cases this will likely be a good thing (I get my movie times quickly), but it does leave the poor and unknowing person without a voice in this new public sphere, and without the voice of the poor, is this really a public sphere?
Related to the expertise of the content creator is the expertise of the content consumer. Current users of web technology have a poor understanding of what links mean, and through that misunderstanding they can be manipulated (Eszter Hargittai, The Hyperlinked Society). As the current generation grows up, will they develop sophistication with this technology, or will they continue to use it without considering its impact? There will always be a portion of the population that doesn’t really understand what they are looking at. As we move forward and develop the semantic web, we must recognize that fact and build the interface in a way that teaches people about what they are seeing. This seems to be a much less researched and examined portion of the issue. The current focus seems to be on developing the data structures (the architecture), but there are far fewer indications of what the semantic web will look like to the consumer, and that might just be the most important component. Luckily, the structure is there. With computer-readable semantic information comes the ability to apply styling to the content to aid the viewer in understanding what they are seeing.
Without semantics, a link is a singular thing with a singular look that we all understand: blue underlined text. A mainstay of web creation is CSS, or Cascading Style Sheets, which allows a designer or consumer to separate style from structure. The HTML includes descriptive tags, and the CSS describes how content with specific tags should look (color, font, size, etc.). CSS can also be applied to content tagged with RDF information, allowing designers to style links based on their meaning. How this will look and how it will impact the content consumer is an open question waiting for the next generation of designers and researchers.
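A speculative sketch of such meaning-based styling, in CSS. Attribute selectors are standard CSS; the specific styling choices are invented here:

```css
/* Links the author does not endorse could be visually de-emphasized... */
a[rel="nofollow"] {
  color: gray;
}

/* ...while semantically tagged data, like a marked-up phone number,
   could be called out for the reader. */
[itemprop="telephone"] {
  font-weight: bold;
}
```

Because the selectors key off the same metadata the search engines read, the meaning of the content could shape its look with no extra work from the author.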
The data may even move beyond a text-based presentation, and in some cases this will be necessary to understand the underlying data. Data organized in a graph, like the entire semantic web, is particularly suited to graphical representation. Already there are services designed to visualize data within the semantic web, but these are imposing products that require a highly specialized skillset. For the semantic web to be a tool of the populace and not a tool of the elite, these visualization and searching tools must be designed to be approachable. Likely we will need further descriptions, written into RDF, that describe how specific classes of data are best visualized.
The fundamental aspect of the semantic web is the addition of metadata to our data. With metadata, those who are designing the semantic web will have content to work with, and as we have seen, we are already on the path. I have faith that the population will catch up to the technology, and that faith comes from an unlikely source: the hashtag craze.
Though the semantic web has been considered and planned for a long time now, it is only recently that the general population has caught on to the need for metadata, for that is what hashtags in tweets and Facebook messages are. Hashtags are ways to organize and relate our information. This craze was not a top-down design feature given to Twitter users; it was developed by the users themselves (as was the @ as it is used on Twitter) and is so popular that the population has pushed other service providers, like Facebook, to add the feature. Through hashtags the general population is discovering the importance of linked data.
IBM’s Watson computer presents us with a final example of the power of semantic computing. Watson, a computer, recently beat the best Jeopardy! players in the world, and it did so using many of the semantic technologies described here. The computer was not connected to the internet while competing on the show, but it did contain the complete contents of Wikipedia, including dbpedia, the semantic database built from Wikipedia. To understand the language of the Jeopardy! clues, Watson relied heavily on Princeton’s WordNet, an application of these techniques to the English language. This technology is being explored in health care settings with good results:
According to Samuel Nessbaum of Wellpoint, Watson’s diagnostic accuracy rate for lung cancer is 90%. In comparison, the average diagnostic accuracy rate for lung cancer for human physicians is only 50%. (Qmed.com)
From movie times to cancer diagnoses, the semantic web will be a significant portion of our future. It is my conclusion that the data, standards, and structure this will require are well understood and are actively being developed, but the human interface and social effect of this technology are still open questions that need to be addressed by researchers and designers.
So, what is this “Web in which machine reasoning will be ubiquitous and devastatingly powerful” that Tim Berners-Lee imagines? Well, I imagine a future where scholarly proof is a navigable space representing an entire issue seen through multiple perspectives. This space will be built on the graph-based architecture conceived by Tim Berners-Lee, with search technologies and interfaces designed according to natural human tendencies like the Principle of Least Effort. I imagine being able to achieve a comprehensive understanding of an issue in a space where there is so much linked data that perspective becomes a dimension along which I can travel. A space where the signs and signifiers of all connected systems can mingle freely and the highest-quality information and experiences are the easiest ones to reach.
References (numerous additional resources are linked within the document):
Hofstadter, Douglas R. Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, 1999 (originally published 1979).
Turow, Joseph, and Lokman Tsui, eds. The Hyperlinked Society: Questioning Connections in the Digital Age (The New Media World). University of Michigan Press, 2008.
Fisher, Karen E., Sanda Erdelez, and Lynne McKechnie, eds. Theories of Information Behavior. Information Today, Inc., 2005.