Way back in 2000, Wired.com profiled a new – at least it was new at the time – “search engine” called Autonomy. Instead of relying on keyword matching to identify relevant documents, Autonomy uses a Bayesian algorithm: a chain of probabilities that constantly updates itself to provide ever-more helpful suggestions to the user. Since that article was published, Hewlett-Packard has acquired Autonomy and repurposed the program for use with Big Data. While I don’t believe this application of Autonomy is by any means its most interesting feature, it’s worth explaining.
Big Data is a dramatic, initial-caps term for datasets that provide unique analytical possibilities when organized as a whole, but are difficult to manage with traditional data management applications without breaking the dataset down into smaller parts. Big Data comes in many forms, such as meteorological, physical, or medical data. It seems that HP will likely apply Autonomy to the vast reservoirs of what some call “unstructured” text data: natural language text written with no regard for whether or not a conventional search engine would find it navigable. This text is untagged and, more importantly, idiosyncratic. Autonomy’s strength lies in its application of Bayesian concepts to “understand” this idiosyncratic natural language text in ways that have, hitherto, only been accessible to human beings.
So what is idiosyncratic natural language? It’s the playful, wordplay/euphemism/sarcasm-laden manner of speaking that we so often use in everyday conversation and, more pertinently to the current topic, on the web. It’s the language found in memes like “ERMAHGERD A TERNUS BERL!” and Condescending Willy Wonka. It’s the type of language that traditional search engines have so much trouble with. We’ve encountered this already in class in our discussion of search engines. Think about the problems encountered by a web search that would not be encountered by a legal database search; as pointed out in our textbook, the technical (and thoroughly proofread) language used in the legal database is much more amenable to simple keyword searches than the often jumbled, sloppy, or creative English employed by amateur bloggers.
Instead of solving this problem with PageRank or hub-and-authority computations that simply augment the keyword search, Autonomy solves it by discovering meaningful patterns in texts and “learning” them. These patterns are described probabilistically and added to the chain of probabilities already in use. The more a person uses Autonomy, the more accurately it “understands” the user and the better the search results get.
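To make the idea of probabilistic “learning” from user feedback a bit more concrete, here is a minimal sketch in Python. This is not Autonomy’s actual algorithm (which is proprietary); it is just a toy naive Bayes relevance model that updates word probabilities each time the user marks a result relevant or irrelevant, so later results can be scored against what it has learned:

```python
from collections import Counter
from math import log

class BayesianRelevance:
    """Toy Bayesian relevance model: scores text by the log-odds of
    P(relevant | words), learned from the user's past feedback."""

    def __init__(self):
        self.word_counts = {"relevant": Counter(), "irrelevant": Counter()}
        self.doc_counts = {"relevant": 0, "irrelevant": 0}

    def learn(self, text, relevant):
        """Update the chain of probabilities from one piece of feedback."""
        label = "relevant" if relevant else "irrelevant"
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def score(self, text):
        """Log-odds that this text is relevant, given its words."""
        if 0 in self.doc_counts.values():
            return 0.0  # need at least one example of each class first
        # Prior log-odds from how often the user marks things relevant.
        odds = log(self.doc_counts["relevant"] / self.doc_counts["irrelevant"])
        rel_total = sum(self.word_counts["relevant"].values())
        irr_total = sum(self.word_counts["irrelevant"].values())
        for word in text.lower().split():
            # Simple add-one smoothing so unseen words don't zero things out.
            p_rel = (self.word_counts["relevant"][word] + 1) / (rel_total + 1)
            p_irr = (self.word_counts["irrelevant"][word] + 1) / (irr_total + 1)
            odds += log(p_rel / p_irr)
        return odds
```

With each call to `learn`, the model’s probabilities shift, so the same query gets re-scored differently as feedback accumulates – a crude version of the “ever-more helpful suggestions” idea:

```python
model = BayesianRelevance()
model.learn("bayesian probability search", relevant=True)
model.learn("cat videos funny memes", relevant=False)
# The first text now scores higher than the second.
model.score("bayesian search engine") > model.score("funny cat videos")
```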
It seems to me that this software should have applications far more interesting than the management of Big Data (though that has its own exciting potential). As the founder of Autonomy, Michael Lynch, describes it, the power of Bayesian computing lies in its ability to navigate uncertainty. This allows it to interact with the world in ways that linear computing never could and, more interestingly, in ways that we once thought uniquely human. Bayesian computing could help us understand our own mental processes more clearly: maybe it could shed more light on the Gestalt processes that make visual perception possible. Maybe it could help us understand how the human mind understands erratic, imprecise spoken language laden with “ums”, “uhs”, and “likes”.
It seems to me that any computational technology that interfaces with the real world in ways at all similar to our own presents us with opportunities to better understand how we ourselves interface with the real world.