The Committee of Public Safety

Losing Our Heads Since 1793

Discovery


Yaarrr!

Every Internet success story can be summarized as:

  1. Collect information about stuff.
  2. Store stuff in one place.
  3. Let people look through the stuff.
  4. Enable interaction with the stuff they find.

Take Twitter. Please. The lure of Twitter is not messaging. If you want messaging just like Twitter, you could use a competing service like Identi.ca or you could deploy your own Twitter-like system using Laconi.ca. You can escape the need to scale the impossible scale.

Imagine. No more ugly Twitter timeouts. No more flaky service. Unfortunately, this would forgo Twitter’s main appeal: discovering other Twitter users. They’ve enabled the creation of a managed, centralized, searchable, API-accessible database of people that other people want to connect with, 140 characters at a time. No other 140-character communications platform offers the same advantages.

Map

Similarly, a social networking platform like Facebook or MySpace provides a single place to find people. It’s possible that every person on Facebook could create their own web presence by making a home page or starting a blog. However, while that would provide a place where, serendipitously, long lost friends could find them, the chances of finding those long lost friends are improved if you take the basic idea of the web and distill it down to being just about searching, connecting, and communicating with people you know or knew.

Google is even more primal. Google is about finding stuff on the Web. You type in a single search phrase, click on search, and behold the power of PageRank. Google is like an army of people who went out and spent millions of man-hours finding stuff on the Web and linking to it. PageRank takes this mass of links and pools it in a single massive database. Things that a greater number of people found and linked to rise in the search ranks while things that few people pay attention to languish in the dark recesses of the Web. Google provides a single text field that can take you to the highest reach of human achievement and the utter depths of human depravity.
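The ranking idea can be sketched in a few lines. This is a simplified power-iteration version of PageRank, not Google's production system; the tiny link graph, the page names, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page keeps a small baseline rank regardless of who links to it.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most inbound links ends up on top, while the two pages nobody links to share the lowest rank.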

Discovery is the killer application of the Internet. The Internet is a mesh network. Unlike a star, ring, or bus topology, a mesh has no single path or hierarchy to guide discovery. Each node must discover other nodes because each node has at least the theoretical capacity to be a peer or a gateway for other nodes. If this discovery does not take place, there is no network.

Every node on the Internet asks for the location of other nodes. If the target node is not on the same network as the requesting node, the requester asks its default gateway to find the location. This gateway usually knows about routes to other networks already. If it does not, it asks its peers to give it the required information. The request bounces back and forth through the Internet until the desired node is found.
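The first decision in that process, "is the target on my network or do I hand this to the gateway?", can be sketched with Python's standard ipaddress module. The addresses, the /24 prefix, and the next_hop function name are assumptions made up for this example, and real routing tables hold many more routes than a single default gateway.

```python
import ipaddress


def next_hop(source_interface, destination, default_gateway):
    """Return the address a packet is handed to first.

    source_interface is the sender's address with its prefix length,
    e.g. "192.168.1.20/24" (a hypothetical home network).
    """
    local_network = ipaddress.ip_interface(source_interface).network
    if ipaddress.ip_address(destination) in local_network:
        # Same network: the node can be reached directly.
        return destination
    # Different network: let the default gateway work out the route.
    return default_gateway


same_net = next_hop("192.168.1.20/24", "192.168.1.77", "192.168.1.1")
other_net = next_hop("192.168.1.20/24", "10.10.10.10", "192.168.1.1")
```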

This may work for networks, but how do humans find other nodes to interact with? In the early days of TCP/IP it was easy. You downloaded a file from a central location and that told you where all the other nodes of interest were. This file was simply a mapping of IP addresses onto human-readable names, e.g.:

10.10.10.10 foo
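Looking a name up in such a file takes only a few lines. The sketch below assumes the common hosts-file layout (an address followed by one or more names, with # starting a comment); the second entry and the parse_hosts name are hypothetical additions for the example.

```python
HOSTS_TEXT = """\
# address      name(s)
10.10.10.10    foo
10.10.10.11    bar baz   # hypothetical second entry with an alias
"""


def parse_hosts(text):
    """Build a name -> address table from hosts-file style text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table[name] = address
    return table


lookup = parse_hosts(HOSTS_TEXT)
foo_address = lookup["foo"]
```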

This worked until the Internet grew so large that maintaining and downloading a giant file was impractical. The solution was DNS, a hierarchical system where many nodes maintained a dynamic, distributed database of human-readable name to IP address mappings that any client could query the same way. Human-readable names were split into hierarchical namespaces. At the very top were top-level domains (TLDs) like .com, .org, and .net. Next came second-level domains like google or twitter. These were followed by hostnames like www. The order proceeds from right to left with the TLD at the right. Following traditional filesystem hierarchy, it would run left to right, e.g. com.google.www. However, things are backward on the Internet so it goes www.google.com. Whichever way the domain name goes, it’s easier for someone to use it (and remember it) than an IP address. In effect, DNS was the first Internet search engine, focused on finding computers on the Internet.
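The right-to-left reading can be made concrete by listing the levels of the hierarchy that a name passes through, starting at the TLD. This is only a string-handling sketch under that assumption: it does no network lookups, and delegation_chain is a made-up helper name.

```python
def delegation_chain(name):
    """List the levels of a domain name from the TLD down to the full hostname."""
    labels = name.rstrip(".").split(".")
    # Walk from the rightmost label leftwards, adding one label at a time.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


chain = delegation_chain("www.google.com")
# chain is ["com", "google.com", "www.google.com"]
```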

Most people encounter Internet hostnames in a Uniform Resource Locator (URL). This standard provides a uniform way to link to an Internet name. First comes a scheme like http:// or ftp:// which tells the computer which Internet protocol to use. Then comes the address or hostname, 10.10.10.10 or google.com. This can be followed by a path like /home/index.html and query strings or fragment identifiers. Tim Berners-Lee, the inventor of the URL (and the World Wide Web), wishes it had a simpler format like http://com/google/www/home/index.html, which is more logical, but in a universe of HTML tag soup, Tim Berners-Lee regrets a lot of things.
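Python's standard urllib.parse module splits a URL into exactly these pieces. In the sketch below the query string and fragment are invented for the example, and hierarchical_form is a hypothetical helper that rewrites a URL into the left-to-right layout described above; it is not part of any standard.

```python
from urllib.parse import urlsplit

url = "http://www.google.com/home/index.html?q=discovery#top"
parts = urlsplit(url)

scheme = parts.scheme      # "http"
hostname = parts.hostname  # "www.google.com"
path = parts.path          # "/home/index.html"
query = parts.query        # "q=discovery"
fragment = parts.fragment  # "top"


def hierarchical_form(url):
    """Rewrite a URL so the hostname reads from the TLD down, like a file path."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return parts.scheme + "://" + "/".join(reversed(labels)) + parts.path


reordered = hierarchical_form(url)
```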

Indeed, Tim Berners-Lee further addressed the problem of discoverability on the Web by leading an ongoing but so far futile exercise in ontology that seeks to create a Semantic Web. The Semantic Web seeks to provide standards to attach computer-readable descriptions to Web resources. It’s basically the World Wide Web for your machine on the off chance it wants to browse the Web. Central to this idea is the notion of a triple. If you wanted to describe the author of http://www.w3.org/People/Berners-Lee/, Berners-Lee’s home page, the triple would be:

  1. Subject: http://www.w3.org/People/Berners-Lee/
  2. Predicate: Author
  3. Object: Tim Berners-Lee
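In code, a triple is just a three-field record, and a pile of triples can be queried by matching on fields. The sketch below uses plain strings and a hand-rolled Triple type as a simplifying assumption; real RDF identifies predicates with URIs and is normally handled by a dedicated library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # "object" is avoided because it shadows a Python builtin


triples = [
    Triple("http://www.w3.org/People/Berners-Lee/", "Author", "Tim Berners-Lee"),
]


def objects(store, subject, predicate):
    """Return every object recorded for a given subject and predicate."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


authors = objects(triples, "http://www.w3.org/People/Berners-Lee/", "Author")
```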

Unfortunately, most explanations of Semantic Web technologies like RDF aren’t this concise. The best candidate for a Web-wide rollout is RDFa, which allows the RDF format to be embedded in HTML. The other barrier is that as soon as the Semantic Web was rolled out, someone would find a way to spam it. This is the problem of metacrap.

The fragility of metadata is an important concern because much planning for improving the web (such as the semantic web) is predicated upon certain flavors of metadata becoming widely adopted and used with care, something which, according to [Cory Doctorow], will not and cannot happen. Doctorow’s seven insurmountable obstacles to reliable metadata are:

  1. People lie
  2. People are lazy
  3. People are stupid
  4. Mission Impossible: know thyself
  5. Schemas aren’t neutral
  6. Metrics influence results
  7. There’s more than one way to describe something

Other reasons that result in metadata becoming obsolete (crap) are:

  1. Data may become irrelevant in time
  2. Data may not be updated with new insights

This means search results will return outdated and incorrect data.

This already happened with HTML meta tags, which is one reason why search engines before Google were filled with all sorts of terrible results. In the beginning, every new medium is touted as a way to finally bring high culture to the masses but, ultimately, it becomes just another way to find porn.

However, most people will never use RDF or Semantic Web technologies. Many people never even type URLs into their browser’s address bar. They use a search engine to find what they’re looking for. Most people don’t surf the Web on the raw technology of the Internet: they use an aggregator that presents a selected interpretation of the Internet and exposes selected resources for the user to interact with. While the efforts of the teeming millions produce those resources, it is the aggregator who gathers the resources into something coherent. They enable discovery for the Internet user.

However, that’s not the ultimate goal. The “information distiller” who boils out superfluous Internet noise to arrive at the truly valuable data is the ultimate in discovery technology. In a recent piece, Adam Elkus provides an example of what sort of man-machine fusion an information distiller might be based on: John P. Sullivan’s Transaction Analysis Cycle (TAC):

“The Transaction Analysis Cycle is a pattern generator…centered on Analysis/Synthesis. Utilizing this framework, analysts can observe activities or transactions conducted by a range of actors looking for indicators or precursors…Individual transactions…have signatures that identify [them]…These transactions and signatures (T/S) can then be observed and matched with patterns of activity that can be expressed as trends and potentials (T/P), which can ultimately be assessed in terms of a specific actor’s capabilities and intentions (C/I). At any point, the analytical team can posit a hypothesis on the pattern of activity and then develop a collection plan to seek specific transaction and signatures that confirm or disprove its hypothesis.”

TAC may provide a model for better information distillation:

The essential element of TAC is the structured process by which the network develops information collection priorities. Truly crowdsourced TAC would mean more than just aggregation—TAC would help build greater qualitative understanding through analysis and synthesis. The network would actively synthesize information from the cloud, setting priorities about the kinds of “signatures” that must be observed, matched with patterns of activity into trends and potentials, and built into a collection plan that could prove or disprove the hypothesis created. Like Wikipedia, the model would marry the expertise and dedication of an administrative core with a mass of casual users. Collection, visualizations, and aggregation systems would be the processing tools for these networks. To be very clear, the purpose of visualization and aggregation systems would be as means rather than ends—tools to implement command concepts rather than conceive them.

Written by josephfouche

July 10, 2009 at 9:36 pm
