Tuesday, June 30, 2009

Open Database License is Released and My Brow is Sweating

Engineers and technologists generally resent needing to know anything about the law, because most often the lawyers are telling them they can't do something for some inane reason. For their part, many lawyers are surprisingly interested in technical matters, but even the most technically informed lawyers resent having to acknowledge that technology often trumps the law into irrelevance.

Today's announcement by the Open Knowledge Foundation of the release of version 1.0 of the Open Database License (ODbL) will create resentment in both professions- information technologists need to understand some of its complications, and lawyers will need to understand some technological limits of the license. In this post, I will try to articulate what some of the hard bits are.

The goal of the ODbL is to provide a means by which databases can be made widely available on a share-alike basis. Suppose, for example, that you spent a lot of time assembling a cooperative of volunteers to compile a database of conferences and their hashtags. If you then made it available under the ODC Public Domain Dedication and License (PDDL), a commercial company could copy the database and begin competing with your cooperative without being obliged to contribute their additions and corrections to your effort. Under a share-alike arrangement, they would be obligated to make their derivative work available under the same terms as the original work. So-called "copyleft" licenses with share-alike provisions have proven to be very useful in software as the legal basis for Open Source development projects.

The difficulty with applying copyleft licenses to databases is that the open source licenses that implement them are fundamentally rooted in copyright, which cannot easily be applied to databases; hence the need for the work of the Open Knowledge Foundation. Usually, databases are collections of facts, and you can't use copyright to protect facts. It gets more complicated than that, however: in the US it's not possible to copyright collections of facts at all, while in Europe such collections can be copyrighted under the "Sweat of the Brow" doctrine.

So copyright protection (and thus licenses including GPL and Creative Commons) can be asserted on entire databases, but that protection is invalid in the United States. What the ODbL does to address that issue is to invoke contract law to paper over the gaps created by international non-uniformity of copyright for databases. The catch is that contract law, and thus the share-alike provisions it carries in the ODbL, can only be enforced if there is agreement to the license by the licensee. That's the thing that causes such user-experience monstrosities as click-through licenses and the like. So pay attention, engineers and technologists (Linked Data people, I'm talking to you!): if the provisions of the ODbL are what you want, you'll also need to implement some sort of equivalent to the click-through license. If you expect to involve machines in the distribution of data, you'll need to figure out how to ensure that a human is somewhere in the chain so they can consent to a license, or at least you'll need to socialize the expectation of a license.
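For the engineers in the audience, here's one way such a consent gate might look in practice. This is purely my own sketch - the class, method names, and flow are invented for illustration, not anything specified by the ODbL:

```python
# Hypothetical sketch: a data service that refuses bulk export until the
# requesting party has recorded acceptance of the license terms, and that
# tags every export with the license so downstream recipients see it too.
class LicenseGate:
    def __init__(self, license_url):
        self.license_url = license_url
        self.accepted = set()  # ids of users who have clicked through

    def accept(self, user_id):
        """Record that a human agreed to the license (the click-through)."""
        self.accepted.add(user_id)

    def export(self, user_id, records):
        if user_id not in self.accepted:
            raise PermissionError(
                f"Bulk access requires agreement to {self.license_url}")
        return {"license": self.license_url, "records": records}

gate = LicenseGate("http://opendatacommons.org/licenses/odbl/1.0/")
gate.accept("alice")
dump = gate.export("alice", [{"conference": "SemTech", "hashtag": "#semtech"}])
```

The point is simply that the consent step has to exist somewhere in the distribution chain, and that it's cheap to implement if you think about it up front.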

Pay attention too, you legal eagles. Be aware that mechanisms of the click-through ilk can be prohibitively expensive if implemented without thought about the full system design. The most valuable databases may have hundreds of millions of records and can be sliced and diced all sorts of ways, so you want to avoid doing much on a record-by-record basis. Also be aware that databases can be fluid. Legitimate uses will mix and link multiple databases together, and the interlinks will be a fusion that should not be judged a derived work of either of the source databases. Records will get sent all over and recombined without anyone being able to tell that they came from a database covered by ODbL.

Any organization considering the use of ODbL should study the criticisms of ODbL posted by the Science Commons people. My own view is that there are lots of different types of databases with different characteristics of size, application, and maintenance effort. ODbL provides an important new option for those situations where neither the PDDL nor a conventional proprietary license will maximize benefits to the stakeholders in the database. But most of all, technologists and engineers need to consider the requirements needed for successful open licensing early on in the development of database distribution infrastructure.

More on Linked Data Business models to come.

Friday, June 26, 2009

Why the Times took 8 days to Announce its Linked Data Announcement

It took 8 full days for the New York Times to make the same announcement on its "Open" blog that it made last week at the Semantic Technology Conference. Being that it's Friday afternoon, I present here my purely hypothetical speculations on what took so long, based on reading of tea leaves and semantic hyperparsing of the subtle, almost hidden differences between today's text and a transcription of the announcement of last Wednesday.
  1. A pitched battle between entrenched factions within the New York Times has raged over the past week, pitting a radical cabal of openists versus the incumbent "we've always done it that way" faction. The openists slipped the announcement of the announcement into their blog while the traditionalists were occupied with the battle over the type size of the headline for the Michael Jackson story today.
  2. The TimesOpen team missed last week's deadline for the "Sunday Styles" Announcements section.
  3. Normally, announcements like these take two weeks to process, but the business section was starting to get worried that USA Today was going to scoop them with a front pager on Monday.
  4. The written announcement was held up because a patent lawyer feared that the admission that the Thesaurus was "almost 100 years" old could hurt the Times' efforts to obtain a patent on the semantic web.
  5. The announcement was actually made last Thursday, but the printf() command in the Blog's subtitles crashed some key RSS syndication agents.
  6. The fact checker was on vacation.
If you have ever worked in an organization of even moderate size, you know that the real reason is almost certainly banal and boring.

On a more serious note, I think it's important to understand how organizations (not just the New York Times) adapt their internal processes to enable semantic technology in general. Over the past 15 years, the necessity to produce a web site has required many organizations to overhaul many of their internal processes, resulting in new efficiencies and capabilities that go well beyond the production of a website. At last week's Semantic Technology Conference, there were a number of presentations that solved problem X using semantic technologies, raising immediate questions about what was so wrong with solving problem X the conventional way. Implicit in the presentations was an assumption that by approaching problems using semantic techniques, one could achieve a level of interoperability and software reuse that is not being achieved with current approaches. That's a sales pitch that's been made for many other technologies. What is certainly true is that many problems that are causing pain these days can only be solved by re-engineering of corporate processes; maybe semantic technologies will be a catalyst for this re-engineering, at least in the publishing industry.

A very thoughtful review of last week's conference has been posted by Kurt Cagle. I leave you with this quote from Kurt:
There comes a point in most programmers' careers where they make a startling realization. Computer programming has nothing to do with mathematics, and everything to do, ultimately, with language. It’s a sobering thought.
A reassuring thought as well.

Thursday, June 25, 2009

The Bilbo Baggins of the Semantic Web

It's not at every conference that you encounter a Tolkien character. So when it happened to me at last week's Semantic Technology Conference I knew that the conference was something special.

On Wednesday, at the end of the plenary, I noticed that the guy sitting in front of me had circled an abstract that had intrigued me. When I had read the abstract in the morning, it occurred to me that the talk could be really good or really bad. When I asked if he knew what it was about, he said "not really", and we agreed that it might be prudent to sit in the back in case it turned out to be dreadfully dull. Here is the abstract:
The "A-tree" - A Conceptual Bridge between Conventional Databases and Semantic Technology
Harry Ellis
Babel-Ease Ltd.

In this talk Harry will introduce and demonstrate an embryonic data storage and distribution product that has the potential to form a universal bridge between semantic applications and all types of stored data wherever located. The key innovation is that all types of both application and metadata are expressed as independent assertions (enhanced RDF triples) where the subject is regarded as parent within a single tree structure (A-tree). Lineage through this parental path carries all inheritance and other mandatory relationships as determined by the predicate which is also defined within the same tree.
After coffee, I made my way to the appointed room, passing a woman who was studying the same abstract on the conference schedule board. I told her it looked interesting to me, and she thought my back-row plan sounded sensible. Which plan soon revealed itself to be flawed, as we found that the room was almost standing-room only. I squeezed myself into one of the last remaining seats, and resigned myself to possibly being both bored and squashed for the duration.

I was not bored. Harry Ellis turned out to be a semi-retired veteran of the entity modeling wars. He worked for the British Army for 12 years in the field of "battlespace information management" and developed a language used for the semantic modeling of "dynamic information across a complex enterprise". I seriously do NOT want to know what that's a euphemism for, but I assume it has something to do with the forges of Mt. Doom. Harry has been working 10 hours a week on this "self-funded research" from his home, "Little Twitchen" in Hobbiton, Devon, for the last 5 years.

Harry's talk was oozing in soundness. I found myself agreeing with just about everything that he said, and based on the audience reaction, I was not alone. You can go visit his website (don't neglect the sitemap) and get a pretty good feel for what he's doing, but the website's not been updated in a while, and he says a new website, with software implementing his vision, will be available soon at http://a-tree.info. An extra session was scheduled at SemTech 2009 for Harry to show his creation in action, but alas, I was not able to attend.

Harry has looked at the existing semantic web infrastructure and has found a number of problems. They are (my summary, don't blame Harry):
  • a profusion of ontologies that don't talk to each other
  • difficulty resolving an entity when there may be many duplicates
  • difficulty in handling information that changes rapidly
  • difficulty in tracking the provenance of information
His solutions to these problems are very well thought out.
  • he suggests that all semantic web entity classes should build on a single global class, just as all Java Objects inherit from java.lang.Object, and that there should be clearly defined mechanisms based on properties to derive new classes from previously defined classes. This is what he calls the "A-tree". (When you hear about making entities from trees, I challenge you not to think of Treebeard.)
  • he suggests that the units of information should be self contained "assertions" which include provenance and context information, rather than RDF triples or graphs. They should be "indivisible and immutable versions which are semantically complete and have their own provenance".
  • he proposes a publish and subscribe mechanism to make sure that current information is distributed to agents that need it.
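To make the "assertion" idea concrete, here's a rough sketch of what such a self-contained, immutable unit might look like. The field names here are my own guesses at a minimal design, not Harry's actual one:

```python
from dataclasses import dataclass, field
import time

# Sketch of a self-contained "assertion" in the spirit of the A-tree talk:
# an immutable unit carrying its own parent link (lineage in the single
# tree) and provenance, rather than a bare RDF triple.
@dataclass(frozen=True)  # frozen = immutable once created
class Assertion:
    subject: str
    predicate: str
    obj: str
    parent: str          # parent node in the single global tree
    source: str          # who asserted this (provenance)
    asserted_at: float = field(default_factory=time.time)

a = Assertion("nyt:person/123", "ex:diedOn", "2009-06-25",
              parent="atree:Person",
              source="http://example.org/article/456")
```

Because the object is frozen and carries its source with it, the provenance can't silently fall away when the assertion is passed around, which is exactly the property Harry is after.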

The aspect of the A-tree proposal that will be received with most skepticism will be its one-ontology-to-rule-them-all orientation. People working on the Semantic Web are fond of their ontologies, and the thought of possibly needing to revise them all to join with the A-tree is hard to swallow. I don't know for sure whether that is in fact the case, but I can imagine ways that that necessity might be avoided.

What strikes me is that everything about the A-tree proposal seems so well worked out and sensible. I was also impressed at how Harry has tried to imagine an ecosystem in which the creation of ontologies could be accessible outside the priesthood of ontologists. As I've argued here, I believe that the Semantic Web must be defined as the establishment of a social construct and practice, and that the tools for participation in the Semantic Web need to be accessible to a wide range of users.

My guess is that Harry's work will be most useful to organizations that want to participate in the Semantic Web, but who are serious about ensuring that the information that they receive and supply is always current and can be traced back to its source, either to judge its reliability, to build website traffic, or just to satisfy lawyers that disclaimers of liability can be reliably attached to the "grains of information" that they hope will find fertile soil to grow new understanding. If I were working to help the New York Times enter the data cloud, this is the sort of infrastructure I would want.

Now if only I can figure out which conferences Gandalf attends...

Monday, June 22, 2009

The New York Times and the Infrastructure of Meaning

The big announcement at last week's Semantic Technology Conference came from the New York Times. Rob Larson and Evan Sandhaus announced that the New York Times would be releasing its entire thesaurus as Linked Data sometime soon (maybe this year). I've been very interested in looking at business models that might support the creation and maintenance of Linked Data, so I've spent some time thinking about what the New York Times is planning to do. Rob's presentation spent a lot of time evoking the history of the New York Times, and tried to make the case that Adolph Ochs' decision in 1913 to publish an index to the New York Times played a large part in the paper's rise to national prominence as one of the nation's "Newspapers of Record". The decade of that decision was marked by an extremely competitive environment for New York newspapers- the NYT competed with a large number and variety of other newspapers. I don't know enough about the period to know if that's a stretch or not, but I rather suspect that the publication of the index was a consequence of a market strategy that proved to be successful rather than the driver of that strategy. The presentation suggested a correspondence between the decade of the 1910's and our current era of mortal challenges to the newspaper business. The announcement about linked data was thus couched as a potentially pivotal moment in the paper's history- by moving decisively to open its data to the semantic web, the New York Times would be sealing its destiny as a cultural institution integral to our society's infrastructure of meaning.

The actual announcement, on the other hand, was surprisingly vague and quite cautious. It seems that the Times has not decided on the format or the license to be used for the data, and it's not clear exactly what data they are planning to release. Rob Larson talks about releasing the "thesaurus", and about releasing "tags". These are not the terms that would be used in the semantic web community or in the library community. A look at the "TimesTags API" documentation gives a much clearer picture of what Rob means. Currently, this API gives access to the 27,000 or so tags that power the "Times Topics" pages. Included as "tags" in this set are
  • 3,000 description terms
  • 1,500 geographic name terms
  • 7,500 organization name terms
  • 15,000 person name terms
The Times will release as linked data "hundreds of thousands" of tags dating back to 1980, then in a second stage will release hundreds of thousands more tags that go back to 1851. They want the community to help normalize their tags, and connect them to other taxonomies. According to Larson, "the results of this effort, will in time, take the shape of the Times entering (the linked) data cloud." I presume this means that the Times will create identifiers for entities such as persons, places, organizations, and subjects, and make these entities available for others to use.

I've found that it's extremely useful to think of "business models" in terms of the simple question "who is going to write the checks?" The traditional business model for newspapers has been for local advertisers and subscribers to write the checks. Advertisers want to write checks because newspapers deliver localized aggregates of readers attracted by convenient presentations of local and national news together with features such as comics, puzzles, columns and gossip. Subscribers write checks because the paper is a physical object that provides benefits of access and convenience to the purchaser. Both income streams are driven by a readership that finds reading the newspaper to be an important prerequisite to full participation in society. What Adolph Ochs recognized when he bought control of the Times in 1896 was that there was an educated readership that could be attracted and retained by a newspaper that tried to live up to the motto "All the news that's fit to print". What Ochs didn't try to do was to change the business model.

The trials of the newspaper industry are well known, and the business model of the New York Times is being attacked on all fronts. Newspapers have lost their classified advertising business because Craigslist and the like serve that need better and cheaper. Real estate advertising has been lost to Zillow and the online Multiple Listing Service. The New York Times has done a great job of building up its digital revenue, but the bottom line is that hard news reporting is not as effective an advertising venue as other services such as search engines. Subscribers, on the other side, are justifiably unwilling to pay money for the digital product, because the erection of toll barriers makes the product less convenient rather than more convenient. Nonetheless, the digital version of the New York Times retains the power to inform its readership, a power that advertisers will continue to be willing to pay for. It's also plausible that the New York Times will be able to provide digital services that some subscribers will be willing to pay for. So, assuming they don't go bankrupt, the business model for the future New York Times does not look qualitatively different from the current model (at least to me), even if the numbers are shifting perilously in the near future.

So let's examine the stated rationales for the New York Times to join the Linked Data community, and how they might help to get someone to send them some checks. The first and safest stated rationale is that by entering the linked data cloud, traffic to the New York Times website will increase, thus making the New York Times more attractive to advertisers. So here's what puzzles me. What Rob Larson said was that they were going to release the thesaurus. What he didn't say was that they were also going to release the index, i.e. the occurrences of the tags in the articles. Releasing the index together with the thesaurus could have a huge beneficial impact on traffic, but releasing the thesaurus by itself will leave a significant bottleneck on the traffic increase, because developers would still have to use an API to get access to the actual article URIs. More likely, most developers who want to access article links would try to use a more generic API such as those you'd get from Google. Why? If you're a developer, not so many people will write you checks for code that only works with one newspaper.

I would think that publication of occurrence coding would be a big win for the NYT. If you have articles that refer to a hundred thousand different people, and you want people interested in any of those people to visit your website, it's a lot more efficient for everyone involved (and a lot less risk of "giving away the store") for you to publish occurrence coding for all of these people than it would be for everyone who might want to make a link to that article to try to do indexing of the articles. The technology behind Linked Data, with its emphasis on dereferenceable URIs, is an excellent match to business models that want to drive traffic via publication of occurrence coding.
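To illustrate what I mean by publishing occurrence coding, the output could be as simple as a file of N-Triples linking each tag to the articles it occurs in. The predicate and the tag URIs below are invented for the example; only the article URL is a real one:

```python
# Minimal sketch: serialize occurrence coding (tag -> articles it occurs in)
# as N-Triples, the simplest Linked Data serialization. Anyone consuming
# this file can link straight to the article, driving traffic back.
def occurrence_triples(occurrences):
    lines = []
    for tag_uri, article_uris in occurrences.items():
        for article in article_uris:
            lines.append(
                f"<{tag_uri}> <http://example.org/vocab/occursIn> <{article}> .")
    return "\n".join(lines)

nt = occurrence_triples({
    "http://example.org/tag/michael_jackson": [
        "http://www.nytimes.com/2009/06/26/arts/music/26jackson.html",
    ],
})
```

One line per occurrence, hundreds of millions of lines for a full archive: which is why, as noted above, you want to avoid doing anything expensive on a record-by-record basis.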

Let's look at the potential costs of releasing the index. Given that the Times needs to produce all of the occurrence data for its website, the extra cost of releasing the linked data for the index should be insignificant. The main costs of publishing occurrence data as Linked Data are the risks to the Times' business model. By publishing the data for free, the Times would cannibalize revenue or prevent itself from being able to sell services (such as the index) that can be derived from the data, and in this day and age, the Times needs to hold onto every revenue stream that it can. However, I think that trying to shift the Times business model towards data services (i.e. selling access to the index) would be a huge risk and unlikely to generate enough revenue to sustain the entire operation. Another serious risk is that a competitor might be able to make use of the occurrence data to provide an alternate presentation of the Times that would prove to be more compelling than what the Times is doing. My feeling is that this is already happening to a great extent- I personally access Times articles most frequently from my My Yahoo page.

The other implied rationale for releasing data is that by having its taxonomy become part of Linked Data infrastructure, the New York Times will become the information "provider of record" in the digital world the way the index helped it become one of the nation's "newspapers of record". The likelihood of this happening seems a bit more mixed to me. Having a Times-blessed set of entities for people, places and organizations seems useful, but in these areas, the Times would be competing with more open, and thus more useful, sets of entities such as those from DBpedia. For the Times to leverage its authority to drive adoption of its entities, it would have to link authoritative facts to its entities. However, deficiencies in the technology underlying linked data make it difficult for asserted facts to retain the authority of the entities that assert them. Consider a news article that reports the death of a figure of note. The Times could include in the coding for that article an assertion of a death date property for the entity corresponding to that person. It's complicated (i.e. it requires reification) to ensure that a link back to the article stays attached to the assertion of death date. More likely, the asserted death date will evaporate into the Linked Data cloud, forgetting where it came from.
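For the curious, the reification pattern I'm alluding to looks something like this (the subject, object, and statement URIs are invented for illustration; the rdf: vocabulary terms are the standard ones). The statement itself becomes a resource, so a source link can stay attached to it:

```python
# The RDF reification pattern: instead of asserting the bare triple
# (person, deathDate, date), we mint a statement resource that describes
# the triple and can carry a provenance link back to the article.
def reify(stmt_uri, s, p, o, source):
    RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    return [
        (stmt_uri, RDF + "type", RDF + "Statement"),
        (stmt_uri, RDF + "subject", s),
        (stmt_uri, RDF + "predicate", p),
        (stmt_uri, RDF + "object", o),
        # the extra triple that keeps the source attached
        (stmt_uri, "http://purl.org/dc/terms/source", source),
    ]

triples = reify("ex:stmt1",
                "ex:person/123", "ex:deathDate", "2009-06-25",
                "ex:article/456")
```

Five triples to say what one triple used to say, which is exactly why implementers tend to skip it and why the provenance evaporates.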

It will be interesting to see how skillful the Times will be in exploiting participation in linked data to bolster its business model. I'll certainly be reading the Times' "Open" blog, and I hope, for the Times' sake, that they go ahead and release occurrence data along with the thesaurus. The caution of Rob Larson's announcement suggests to me that the Times is a bit fearful of what may happen. Still, it's one small step for a gray lady. One giant leap for grayladykind?

Thursday, June 18, 2009

Triple Stores Aren't

Once a thing has acquired a name, it's rare that it can escape that name even if the underlying concept has changed so as to make the name inaccurate. Only when the name causes misunderstandings will people start to adopt a new, more accurate name. I am trained as an engineer, but I know very little about engines; so far that has never caused any problems for me. It's sometimes funny when someone worries about getting lead poisoning from a pencil lead, but it doesn't cause great harm. It's no big deal that there's hardly any nickel in nickels. Columbus "discovered" the "Indians" in 1492; we've known that these people were not in India for a long time, but it's only recently that we've started using the more respectful and more accurate term "Native Americans".

I'm going to see some old friends this evening, and I'm sure they'll be pretty much how I remember them, but I'll really notice how the kids have grown. That's what this week has been like for me at the Semantic Technology Conference. I've not really worked in the semantic technology area for at least 7 years (though I've been making good use of its ideas), but a lot of the issues and technologies were like old friends, wiser and more complex. But being away for a while makes me very aware of how things have changed- things that people who have been in the field for the duration might not have been conscious of, because the change has occurred gradually. One of the things I've noticed also involves a name that's no longer accurate. It might confuse newcomers to the field, and may even cause harm by lulling people into thinking they know something that isn't true. It's the fact that triple stores are no longer triple stores.

RDF (subject,predicate,object) triples are the "atom" of knowledge in a semantic-technology information store. One of the foundational insights of semantic technology is that there is great flexibility and development efficiency to be gained by moving data models out of relational database table designs and into semantic models. Once you've done that, you can use very simple 3-column tables to store the three pieces of the triples. You need to do much more sophisticated indexing, but it's the same indexing for any data model. Thus, the triple store.
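As a toy illustration (my own sketch, not any particular product's design), a triple store really can be this simple: one three-column "table" plus generic indexes that work identically for any data model, which is where the development efficiency comes from:

```python
# Toy triple store: the classic 3-column design. The same three indexes
# serve every data model, unlike relational tables designed per-schema.
class TripleStore:
    def __init__(self):
        self.triples = set()   # the 3-column "table"
        self.by_s, self.by_p, self.by_o = {}, {}, {}

    def add(self, s, p, o):
        self.triples.add((s, p, o))
        self.by_s.setdefault(s, set()).add((s, p, o))
        self.by_p.setdefault(p, set()).add((s, p, o))
        self.by_o.setdefault(o, set()).add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Pattern query: None is a wildcard. Uses one index, then filters."""
        candidates = (self.by_s.get(s) if s else
                      self.by_p.get(p) if p else
                      self.by_o.get(o) if o else self.triples) or set()
        return [(ts, tp, to) for ts, tp, to in candidates
                if s in (None, ts) and p in (None, tp) and o in (None, to)]

store = TripleStore()
store.add("ex:nyt", "ex:publishes", "ex:index")
```

Real products use far more sophisticated indexing, of course, but the point stands: the indexing strategy is fixed once, independent of the knowledge model.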

As I discussed in my "snotty" rants on reification, trying to rely on just the triples keeps you from doing many things that you need to do in many types of problems. It's much more natural to treat the triple as a first-class object, either by reification or by objectification (letting the triple have its own identifier). What I've learned at this conference is that all the triple stores in serious use today use more than 3 columns to store the triples. Instead of triples, RDF atoms are now stored as 4-tuples, 5-tuples, 6-tuples or 7-tuples.

Essentially all the semantic technology information stores use at least an extra column for a graph id (used to identify the graph that a particular triple is part of). At the conference, I was told that this is needed in order to implement the contextual part of SPARQL. (FROM NAMED, I assume. Note to self: study SPARQL on the plane going home!) In addition, some of the data stores have a triple id column. In a post on the Freebase Blog, Scott Meyer reported that Freebase uses tuples which have IDs, 6 primitives, "and a few odds and ends" to store an RDF "triple" (the pieces which store the triple are called left, right, type and value). Freebase is an append-only data store, so it needs to keep track of revisions, and it also tracks the creator of the tuple.
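Here's a minimal sketch of the "more than three columns" point (the column names are my own, not Freebase's or any vendor's): give each atom its own id and a graph column, and the provenance has somewhere to live:

```python
import itertools

# Toy 5-tuple store: (id, subject, predicate, object, graph). The id makes
# each atom a first-class object; the graph column records where it belongs,
# which is what SPARQL's named-graph machinery builds on.
class QuadStore:
    _ids = itertools.count(1)  # shared id sequence, like a database serial

    def __init__(self):
        self.rows = []  # each row: (id, s, p, o, graph)

    def add(self, s, p, o, graph):
        row = (next(self._ids), s, p, o, graph)
        self.rows.append(row)
        return row[0]  # the tuple id, available for provenance tracking

    def triples_in_graph(self, graph):
        return [(s, p, o) for _, s, p, o, g in self.rows if g == graph]

qs = QuadStore()
qs.add("ex:a", "ex:knows", "ex:b", graph="http://example.org/source-dataset")
```

Strip the first and last columns off these rows and you get back bare triples - which is exactly the lossy step discussed below.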

Is the misnomer "triple" harmful enough that the community should try its best to start talking about "tuples" instead? I think it is. Linked Data is the best example of how a focus on the three-ness of triples can fool people into sub-optimal implementations. I heard this fear expressed several times during the conference, although not in those words. More than once, people expressed concern that once data had been extracted via SPARQL and gone into the Linked Data cloud, there was no way to determine where the data had come from, what its provenance was, or whether it could be trusted. They were absolutely correct- if the implementation was such that the raw triple was allowed to separate from its source. If there were a greater understanding of the un-three-ness of real RDF tuple stores, then implementers of linked data would be more careful not to obliterate the id information that could enable trust and provenance. I come away from the conference both excited by Linked Data and worried that the Linked Data promoters seemed to brush off this concern.

I'll write some more thoughts from the conference tomorrow, after I've googled a few things with Bing.

Wednesday, June 17, 2009

Siri and the Big Head Cases

A year and a half ago, I had some business in London. I had recently acquired an iPhone, and I enjoyed showing it to people because it was not yet available in England. On my day off, I got a map from the hotel and set out to wander the city. What I discovered was that when I needed to do some navigation, I reflexively pulled out the iPhone to look at the Google maps app, despite the fact that the paper map was in my other pocket, it was faster and more convenient to use the paper map, and for the roaming data charges I was running up on the iPhone, I could easily have bought a whole guidebook. Once I had a "workflow" that had been previously rewarding, I would repeatedly reuse that workflow, even when it was inappropriate.

The most exciting thing at the Semantic Technology Conference this week was, for me, the keynote address by Tom Gruber, Founder and CTO of Siri. He demonstrated the "Virtual Personal Assistant" that they have been working on. It will be released as a free service complete with iPhone app this summer, and I can't wait to try it out. (To get your name on the beta list, go to http://www.siri.com/registration). Gruber articulated a theme that was repeated several times during the conference, that the web world has resulted in a fragmented user experience. Those Apple commercials that pound home the message "There's an app for that" have now taken on a new meaning for me. It seems that while I can go to my iPhone for just about anything I want to do when I'm mobile, the proliferation of apps has made my user experience much more complex- I have to choose the app I want to use for a particular task, and I sometimes need to orchestrate several apps to accomplish a task. Siri's virtual personal assistant is designed to be the one button I use to accomplish a variety of tasks, and I'm guessing that I will be reflexively using Siri even when I shouldn't.

The other interesting perspective I got from Gruber's talk was the concept he awkwardly called "The Big Head Cases". Siri has tried to focus on the few things that users do most often (get directions, find a restaurant, communicate with friends) and handle these things really well using speech recognition, natural language processing, contextual awareness, service orchestration, etc. This is the opposite of the stereotypical internet focus on the "long tail". "What's the opposite of the long tail?" Gruber asked himself. "I guess it's the 'big head'." In other contexts, you might call this the 10/90 rule. Siri will be addressing the "long tail" tasks by opening up an API that will let 3rd parties expose their services through the virtual digital assistant. The library world should sit up and take note- there will need to be a good way for libraries and other information services to offer their location-specific services in this way.

Is Semantic Web Technology Scalable?

"Scalable" is a politician of a word. It has enough appeal to win solid backing from diverse factions- it has something to offer both the engineericans and the marketerists. At the same time it has the dexterity to mean different things to different people, so that the sales team can always argue that the competition's product lacks "scalability". The word even supports multiple mental images- you can think of soldiers scaling a wall or climbers scaling a mountain; a more correct image is that of scaling a picture to make it bigger. Even technology cynics can get behind the word "scalable": if a technology is scalable, they would argue, that means it hasn't been scaled.

The fact is that scalability is a complex attribute, more easily achieved in the abstract than in the concrete. I've long been a cynic about scalability. A significant fraction of engineers who worry too much about scalability end up with solutions that are too expensive or too late to meet the customer problems at hand, or else they build systems that scale poorly along an axis of unexpected growth. Another fraction of engineers who worry too little about scalability get lucky and avoid problems by the grace of Moore's Law and its analogs in memory and storage density, processor power and bandwidth. On the other hand, ignoring scalability in the early phases of a design can have catastrophic effects if a system or service stops working once it grows beyond a certain size.

Before considering the scalability of the Semantic Technology, let's define terms a bit. The overarching definition of scalability in information systems is that the resources needed to solve a problem should not grow much faster than the size of the problem. From the business point of view, it's a requirement that 100 customers should cost less to serve than 100 times what it would cost to serve one customer (the scaling should be less than linear). If you are trying to build a Facebook, for example, you can tolerate linear scaling in number of processors needed per million customers if you have sublinear costs for other parts of the technology or significantly superlinear revenue per customer. Anything superlinear will eventually kill you. If there are any bits of your technology which scale quadratically or even exponentially, then you will very quickly "run into a brick wall".
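
To make the difference between these scaling regimes concrete, here's a toy comparison in Python (the cost functions and constants are invented purely for illustration, not measurements of any real system):

```python
import math

# Hypothetical cost models: server-hours needed to serve n customers
# under three different scaling behaviors. The constants are made up.
def sublinear(n):   # e.g. shared caches amortize work across customers
    return 50 * math.sqrt(n)

def linear(n):      # one unit of work per customer
    return 5 * n

def quadratic(n):   # e.g. comparing every customer against every other
    return 0.001 * n * n

for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: sublinear={sublinear(n):>12,.0f}  "
          f"linear={linear(n):>12,.0f}  quadratic={quadratic(n):>16,.0f}")
```

Per-customer cost falls as you grow in the sublinear case, stays flat in the linear case, and grows without bound in the quadratic case. That last one is the brick wall.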

In my post on curated datasets, I touched on an example where a poorly designed knowledge model could "explode" a semantic database. This is one example of how the Semantic Web might fail the scalability criterion. My understanding of the model scaling issue is that it's something that can be addressed, and is in fact addressed in the best semantic technology databases. The semantic analysis component of semantic technology can quite easily be parallelized, so that appears to pose no fundamental problems. What I'd like to address here is whether there are scalability issues in the semantic databases and inference engines that are at the core of Semantic Web technology.

Enterprise-quality semantic databases (using triple-stores) are designed to scale well in the sense that the number of RDF triples they can hold and process scales linearly with the amount of memory available to the CPU. So if you have a knowledge model that has 1 billion triples, you just need to get yourself a box with 8GB of RAM. This type of scaling is called "vertical scaling". Unfortunately, if you wanted to build a Semantic Google or a Semantic Facebook, you would probably need a knowledge model with trillions of triples. You would have a very hard time doing that with a reasoning triple store, because you can't buy a CPU with that much RAM attached. The kind of scaling you would need to solve bigger problems is called "horizontal scaling". Horizontal scaling distributes a problem across a farm of servers, and the scaling imperative is that the number of servers required should scale with the size of the problem. At this time, there is NO well-developed capability for semantic databases with inference engines to distribute problems across multiple servers. (Mere storage is not a problem.)

I'll do my best to explain the difficulties of horizontal scaling in semantic databases. If you're an expert in this, please forgive my simplifications (and please comment if I've gotten anything horribly wrong). Horizontal scaling in typical web applications uses partitioning. Partitioning of a relational database typically takes advantage of the structure of the data in the application. So for example, if you're building a Facebook, you might choose to partition your data by user. The data for any particular user would be stored on one or two of a hundred machines. Any request for your information is routed to the particular machine that holds your data. That machine can make processing decisions very quickly if all your data is stored on the same machine. So instead of sharing one huge Facebook web application with 100 million other Facebook users, you might be sharing one of a hundred identical Facebook application servers with "only" a million other users. This works well if the memory size needed for 1 million users is a good match to that available on a cheap machine.
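
A minimal sketch of that kind of partitioning might look like this (the names are mine, not any particular framework's):

```python
# Hash-based horizontal partitioning: route each user's data to one
# of NUM_SHARDS servers. A stable hash guarantees the same user
# always lands on the same server.
import hashlib

NUM_SHARDS = 100

def shard_for(user_id: str) -> int:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All of one user's requests hit a single shard, so that shard can
# answer from local data without talking to its 99 siblings.
print(shard_for("alice"), shard_for("bob"))
```

Because the routing decision depends only on the user id, the front end can dispatch a request without consulting any central directory.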

In a semantic (triplestore) database, information is chopped up into small pieces (triples), with the result that much of the information will be dispersed into multiple pieces. A partitioned semantic database would need to intelligently distribute the information across machines so that closely related information resides on the same machine. Communication between machines is typically 100 times slower than communication within the same machine, so the consequences of doing a bad job of distributing information can be disastrous. Figuring out how to build partitioning into a semantic database is not impossible, but it's not easy.

I'm getting ahead of myself a bit, because a billion triples is nothing to sneeze at. Semantic database technology is exciting today in applications where you can put everything on one machine. But if you read my last post, you may remember my argument that the Semantic Web is NOT loading information into a big database of facts. It's a social construct for connections of meaning between machines. Current semantic database technology is designed for reasoning on facts loaded onto a single machine; it's capable of building semantic spaces up to a rather large size; but it's not capable of building a semantic Google, for example.

I've learned a lot at the Semantic Technology Conference around this analysis. What I see is that there is a divergence in the technologies being developed. One thread is to focus on the problems that can be addressed on single machines. In practice, that technology has advanced so that the vast majority of problems, particularly the enterprise problems, can be addressed by vertically scaled systems. This is a great achievement, and is one reason for the excitement around semantic technologies. The other thread is to achieve horizontal scaling by layering the semantic technologies on top of horizontally scaled conventional database technologies.

I've been going around the conference provoking people into interesting conversations by asserting that there is no such thing (today) as Semantic Web Technology- there are only Semantic Technology and Web Technology, and combinations thereof. The answer to the question in the title is then that if there were such a thing as Semantic Web technology, then it would be scalable.

Saturday, June 13, 2009

Semantic Web Asteism

It seems only fair that if I'm taking next week to go to the Semantic Technology conference in California, I should be able to explain to my wife why it is that I'm going. The problem I have is that she asks technology people razor-sharp questions for a living, so it doesn't work to try to fudge the explanations. Although I've been writing here about the "Semantic Web" for a while now, I've been fudging about what it is I'm writing about and why I think it's important.

There are two words to talk about. The first is "semantic". Hidden within the word semantic is an asteism (word of the day: an asteism is a polite irony; a backhanded compliment). It's the same asteism inherent in the whole field of "artificial intelligence". To use the term "artificial intelligence" is to imply that machines are stupid. "Semantics" is the study of meaning, and to use the term "semantic web" seems to imply that the plain ol' web is devoid of meaning. What is really meant is that the plain ol' web is meaningless to those stupid machines. So the core idea of the Semantic Web is not just that meaning can and should be published, consumed and reused, it's that meaning can and should be published, consumed and reused... by machines.

The second word to talk about is "web". Note that the word is not "space" or "universe" or "world", each of which would have meant something useful and interesting. It's "web", a set of points connected by threads, which can be traversed and which can catch things. It's another asteism about the monolithicity and lack of connectedness of the semantic technology of today. Distribution is inherent in the word "web"; loading information into a big database of facts is not a web. But the implicit question is what are the points that are being connected? The implicit answer again is "machines", but that, I think, is missing the point. Machines are not interested in meaning; they act as proxies for entities that really are interested in meaning- people, organizations, businesses, governments, schools.

So here's my answer. The Semantic Web is to be a social construct for the automated, distributed publication, consumption and reuse of meaning.

Note that I say "is to be" rather than "is". It is not clear to me whether or not the Semantic Web exists in a working or even incipient form today. It is clear to me, however, that the Semantic Web must first and foremost be a social construct.

Whenever you see a discussion of the semantic web, there tends to be a lot of discussion about technology- RDF, OWL, tuples, microformats, and things like that. I've come to realize that equating the Semantic Web with those things is like equating the Roman Empire with legions, triremes, siege engines, arches, and roads. The Roman Empire used these instruments to exert power, of course, but that's not how it spread throughout the western world. The Roman Empire was a social construct. In exchange for accepting Roman dominion and Roman Law, societies obtained the benefits of culture, communication and commerce. (Of course, if they chose not to, they were enslaved, but let's not take the analogy too far!)

The Internet has achieved global dominion using a construct analogous to that of the early Roman Empire. Participation in the Internet Empire requires acceptance of some basic rules (articulated by the IETF rather than the Roman Senate), and the benefits of participation clearly outweigh the costs. Instead of Latin, we have HTML and HTTP. The benefits received (culture, communication and commerce) are exactly the same.

The Semantic Web, by contrast, is still searching for a way to make its social construct an obvious benefit to all of its participants. If meaning and knowledge are valuable, then people will not be motivated to participate in a construct that only enables the distribution of that value. In particular, entities that expend effort to build and maintain the largest and truest stores of meaning and knowledge will have little incentive to participate.

In my last post, I talked about Google Fusion Tables and highlighted the attention that the designers paid to collaboration, control, and attribution. The reason I was excited is that Google Fusion Tables helped me imagine a world in which the Semantic Web social construct could deliver clear value to its participants and become a pervasive benefit to everyone.

Friday, June 12, 2009

Linked Data vs. Google Fusion Tables

Wouldn't it be cool if you had an idea for a collection of data and there was a way you could set up the database and then invite people to contribute data to the collection, visualize the collection, re-use the collection, and link it to other collections of data?

I think it's really cool, and it's also the idea behind Linked Open Data, an initiative of the W3C's semantic web activity. The Linked Data people have amassed an impressive array of datasets available in the RDF/XML format, and by using a foundation of URIs as global identifiers, they've enabled these datasets to be linked together. I've been reading a really good explanation of how to publish linked data. But you know what? I have never actually made any data collections available via linked open data. It seems cool, but I'm not sure I would ever be able to get anyone to help me build a data set using Linked Open Data. Linked Open Data seems designed for machines, and there seems to be very little infrastructure that could help me collaborate with other people with common interests in building data sets.

This morning, I set up an online database for twitter conference hashtags, using the data I collected for my last posting about conference hashtags. I used a "pre-alpha" service from Google Labs called "Google Fusion Tables". If you have a gmail account, you can view the table yourself, export the data, and visualize it in various ways. If you email me, I'll authorize you to add records yourself. It would be nice if I could make the table visible to people without gmail accounts, but I assume that's what they mean by "pre-alpha".

Pre-alpha is a good description. I found two bugs in a half hour of working with Google Fusion Tables (for example, some cells had problems getting edits saved), but I got e-mail from the developers acknowledging the problems within an hour of reporting them, so I wouldn't be surprised if the bugs get squashed very rapidly. From the Linked Open Data point of view, Fusion Tables is very disappointing, as it doesn't seem (from the outside) to be aware of semantic web technologies.

My experience has been that technology is never the real problem, and that building social practice around the technology is always the key to making a technology successful. The Fusion Tables team appears to have looked at the social practice aspect very carefully. Every row in the database, and every cell in every row can be annotated with a conversation and attached to an authorship. These features seem to me to be fundamental requirements for building a collaborative database, and they're annoyingly hard to do using so-called semantic technologies.

The visualizations available hint at some possibilities for Fusion Tables. For example, records can be visualized on a map using geographical location. It's easy to imagine how visualizations and data-typing could be the carrot that gets data set creators to adopt globally known predicates. An ISBN data type could trigger joins to book-related data, for example. Or perhaps a zoology-oriented dataset could be joined via genus and species to organism-oriented visualizations. Fusion Tables doesn't expose any URI identifiers, but it's hard to say what's going on inside it. In stark contrast, Linked Data sites tend to really hit you in the face with URIs, and it's really hard to explain to people why URIs belong in their databases.

A lot of the posts I've seen on Fusion Tables seem to miss its focus on collaboration, and the types of social practice that it might be able to support. Just as Wikipedia has created a new social practice around encyclopedia development and maintenance, there is a possibility that Fusion Tables may be able to engender a new and powerful social practice around the collaborative maintenance of record-oriented databases. If that happens, companies in the database development business could find themselves going the way of Encyclopedia Britannica and World Book in the not so far-off future.

Wednesday, June 10, 2009

Conference Hashtags Don't Evolve

Leigh Dodds suggested yesterday that someone should do an "evolutionary study" on successful and failed conference hashtags. Since I've been interested in the way that vocabulary propagates on networks, I decided to take up the idea and do a bit of data collection.

If you're not on Twitter, or haven't been to a conference recently, you may not have encountered the practice of hashtagging a conference. The hashtag is just a string that allows search engines to group together posts on a single topic. It's become popular to use Twitter as a back-channel to discuss conference presentations while they happen (does anyone still use IRC for this?), or to report on things being discussed at a meeting. The hashtags can also be used on services such as Flickr. I'll be attending the Semantic Technology conference in San Jose next week, and there was a bit of back and forth about what the right hashtag should be, semtech09 or semtech2009. Someone connected to the conference asserted that the longer version was preferred, sparking Leigh's remark.

To collect data, I searched Twitter for "conference" AND "hashtag" and compiled all the results from 7 and 8 days ago. That gave me a list of 30 conferences, and I then searched for all tweets which included the hashtags for these conferences. "MediaBistro Circus" (#mbcircus) was the most referenced of any of these, with 1500 tweets in all.
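
The tallying step amounts to counting hashtag variants; a sketch with made-up tweets (not my actual collection script) looks like this:

```python
# Count how often each hashtag variant appears in a batch of tweets.
from collections import Counter
import re

tweets = [
    "Great keynote at #semtech09!",
    "Heading to San Jose for #semtech2009",
    "Slides from my #semtech09 talk are up",
]
counts = Counter(tag.lower() for t in tweets
                 for tag in re.findall(r"#(\w+)", t))
print(counts.most_common())  # → [('semtech09', 2), ('semtech2009', 1)]
```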

There was no evidence at all of anything resembling evolution. Almost all the hashtags appeared spontaneously and without controversy, each one apparently via the agency of an intelligent designer. I did not find a single example of multiple hashtags competing with each other in a "survival of the fittest" sort of way. I found two instances of dead-on-arrival hashtags, which were proposed once and never repeated. In only one case was there significant usage of an alternate hashtag- the America's Future Now conference appeared in 22 tweets as #afn09, compared to 949 tweets for #afn. Even in this case, there appeared to be no competition, as the #afn09 tweeters stuck with their hashtag.

One question posed by the initial query was whether "09" or "2009" was the correct convention. Although "09" was 2 to 3 times more popular than "2009" in hashtags, having no year indicator at all was twice as frequent as having a year. What was very clear, however, was that hashtag selectors strongly prefer to avoid collisions with hashtags already in use for unrelated events. The most interesting examples of this were the "#cw2009" and "#cw09" hashtags, used for the ComplianceWeek conference and the CodeWorks conference, respectively.

None of this is surprising in retrospect. It's quite easy to see if a hashtag is the right one- you just enter it in a search to see what comes up before you put it in your tweet. If you can't find one that works as you want it to, most likely you will not use a hashtag at all. Conferences tend to be meetings of people who are connected to each other via common interests, and a small number of people tend to update frequently and be followed by large numbers of people with common interests. Conference goers also seem to be very motivated to propagate and adopt hashtags- hashtag announcements for conferences are quite frequently retweeted.

The behavior of people selecting hashtags is quite uniform. However, I've previously noted another meeting of semantic web folk that had trouble getting their hashtag straight. Perhaps people who develop taxonomies for a living are less likely to adopt other peoples' suggested vocabulary based on a feeling that their way is the best way, and thus are more likely to silo themselves. I've seen this phenomenon before- the phones at Bell Labs never seemed to work very well, and my wife's experience with computers at IBM was not exactly problem-free. The Librarian Conferences I've been to seem to have horrible classification systems. Perhaps one way to improve vocabulary propagation on the semantic web is to get rid of the ontologists!

Monday, June 8, 2009

Magic and Truth Maintenance

The human brain has an amazing ability to construct mental models of physical reality based on the information given to it by the senses. By hacking the machinery that does this model construction, magicians are able to trick the brain into building mental models of things that are not really possible, and the result, magic, can be very enjoyable. My favorite example of this is illustrated by Penn & Teller in this video:

You know at some level that Teller has not been cut into three pieces, but your brain can't figure out how else to accept the visual facts it is presented with. Even when you are allowed to see how the trick is done in the second part of the video, your brain has trouble revising and maintaining the truth of the mental model it has constructed.

A recent article in Wired describes Teller's foray into neuroscience and is well worth a read and a watch. The cup and ball trick shown there in the video from "The View" is another illustration of your brain having trouble processing the additional facts of how the trick is actually done. It's much easier for the brain to cheat and suppose that balls can magically appear and disappear. Teller's work has some lessons to teach the semantic web as well, because building models and maintaining their truth in the face of new information is central to the functioning of semantic web reasoning engines, and is also really hard to do well. This is one of the things I learned in my discussion with Atanas Kiryakov of Ontotext.

Semantic software starts with a knowledge model, then adds information triples to build out its base of knowledge. More sophisticated software might extend its knowledge model depending on the new information to be added. For each triple, or fact, semantic software will use an inference engine to add more facts to its collection, thus building up its knowledge base. Maybe you want to add the fact that coconuts are brown. Depending on your knowledge model, your inference engine may already know that coconuts are fruits, and will want to add coconuts to the group of brown fruits. This layer of inference extends your ability to make statements about coconuts, but it adds complexity to the process of adding facts to your knowledge collection. This is why ingest speed is a relevant figure of merit for semantic databases.
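
Here's a toy sketch of that ingest-plus-inference loop, with a single made-up rule standing in for a real knowledge model (no production triple store works quite this naively):

```python
# Toy forward-chaining ingest: adding a triple may trigger inferred
# triples under the rule "if X is-a fruit and X has-color C, then
# X is a member of C-fruits".
triples = set()

def infer(triple):
    s, p, o = triple
    inferred = set()
    if p == "has-color" and (s, "is-a", "fruit") in triples:
        inferred.add((s, "member-of", o + "-fruits"))
    if p == "is-a" and o == "fruit":
        for (s2, p2, o2) in list(triples):
            if s2 == s and p2 == "has-color":
                inferred.add((s, "member-of", o2 + "-fruits"))
    return inferred

def ingest(triple):
    if triple in triples:
        return
    triples.add(triple)
    for extra in infer(triple):
        ingest(extra)  # inferred facts may trigger further rules

ingest(("coconut", "is-a", "fruit"))
ingest(("coconut", "has-color", "brown"))
print(("coconut", "member-of", "brown-fruits") in triples)  # → True
```

Every asserted triple pays the inference toll on the way in, which is exactly why ingest speed matters.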

Things get complicated if information that has been ingested needs to be changed or deleted. If that happens, the inference engine has to re-infer all the facts that it inferred when the original fact was entered, so that it can change or delete all the inferred facts. For example, suppose you remove the fact that the coconut is a fruit because a new classification calls it a nut. Then you have to remove the brown fruit fact as you add a brown nut fact. This can be quite difficult and may involve many queries on your entire knowledge collection. For some knowledge models it may even be impossible to undo inferences, particularly if the database has not kept a record of the ingest process. Whereas the human brain is reasonably good at developing and revising models of reality, software inference engines know nothing of reality and can only maintain the consistency of their collections of knowledge. Maintaining consistency in the face of changing data can be computationally impractical. This is one of the main reasons that you don't want to use a semantic database for applications that have lots of transactions.

If you are considering the use of semantic web technology for that new project of yours, you'll want to understand these points, because for all the deductive power you can gain by using a knowledge model, there's also a danger that your application may get bogged down by the burden of maintaining consistency. Machines don't know how to cheat the way your brain does. Magic is not an option.

Friday, June 5, 2009

When are you collecting too much data?

Sometimes it can be useful to be ignorant. When I first started a company, more than 11 years ago, I decided that one thing the world needed was a database, or knowledgebase, of how to link to every e-journal in the world, and I set out to do just that. For a brief time, I had convinced an information industry veteran to join me in the new company. One day, as we were walking to a meeting in Manhattan, he turned to me and asked "Eric, are you sure you understand how difficult it is to build and maintain a big database like that?" I thought to myself, how hard could it be? I figured there were about 10,000 e-journals total, and we were up to about 5000 already. I figured that 10,000 records was a tiny database- I could easily do 100,000 records even on my Powerbook 5400. I thought that a team of two or three software developers could do a much better job sucking up and cleaning up data than the so-called "database specialists" typically used by the information industry giants. So I told him "It shouldn't be too hard." But really, I knew it would be hard, I just didn't know WHAT would be hard.

The widespread enthusiasm for Linked Data has reminded me of those initial forays into database building. Some important things have changed since then. Nowadays, a big database has at least 100 million records. Semantic Web software was in its infancy back then; my attempts to use RDF in my database 11 years ago quickly ran into hard bits in the programming, and I ended up abandoning RDF while stealing some of its most useful ideas. One thing that hasn't changed is something I was ignorant of 11 years ago- maintaining a big database is a big, difficult job. And as a recently exed ex-physicist, I should have known better.

The fundamental problem of maintaining a large knowledgebase is known in physics as the second law of thermodynamics, which states that the entropy of the universe always increases. An equivalent formulation is that perpetual motion machines are impossible. In terms that non-ex-physicist librarians and semantic websperts can understand, the second law of thermodynamics says that errors in databases accumulate unless you put a lot of work into them.

This past week, I decided to brush off my calculus and write down some formulas for knowledgebase error accumulation so that I wouldn't forget the lessons I've learned, and so I could make some neat graphs.

Let's imagine that we're making a knowledgebase to cover something with N possible entities. For example, suppose we're making a knowledgebase of books, and we know there are at most one billion possible books to cover. (Make that two billion, a new prefix is now being deployed!) Let's assume that we're collecting n records at random, so each record has a 1/N chance of covering any specific entity. At some point, we'll start to get records that duplicate information we've already put into the database. How many records, n, will we need to collect to get 99% coverage? It turns out this is an easy calculus problem, one that even a brain that has spent 3 years as a middle manager can do. The answer is:
Coverage fraction, f = 1 - exp(-n/N)
So to get 99% of a billion records, you'd need to acquire about 4.6 billion records. Of course there are some simplifications in this analysis, but the formula gives you a reasonable feel for the task.
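
If you'd like to check the formula yourself, a quick simulation (with a deliberately small N, since the result depends only on the ratio n/N) agrees with it:

```python
# Monte Carlo check of the coverage formula f = 1 - exp(-n/N).
import math
import random

random.seed(42)
N = 10_000                 # possible entities
n = int(4.6 * N)           # records drawn uniformly at random
covered = {random.randrange(N) for _ in range(n)}

simulated = len(covered) / N
predicted = 1 - math.exp(-n / N)
print(f"simulated {simulated:.3f} vs predicted {predicted:.3f}")
```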

I haven't gotten to the hard part yet. Suppose there are errors in the data records you pull in. Let's call the fraction of records with errors in them epsilon, or ε. Then we get a new formula for the errorless coverage fraction, F:
Errorless coverage fraction, F = exp(-εn/N) - exp(-n/N)
This formula behaves very differently from the previous one. Instead of rising asymptotically to one for large n, it rises to a peak, and then drops exponentially to zero for large n. That's right, in the presence of even a small error rate, the more data you pull in, the worse your data gets. There's also a sort of magnification effect on errors- a 1% error rate limits you to about 95% errorless coverage at best; a 0.1% error rate limits you to about 99% coverage at best.
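
These limits are easy to verify numerically; setting dF/dx = 0 with x = n/N puts the peak at x = ln(1/ε)/(1-ε):

```python
# Peak of the errorless coverage fraction F = exp(-eps*x) - exp(-x),
# where x = n/N and eps is the per-record error rate.
import math

def errorless_coverage(x, eps):
    return math.exp(-eps * x) - math.exp(-x)

for eps in (0.01, 0.001):
    x_peak = math.log(1 / eps) / (1 - eps)
    print(f"error rate {eps}: best errorless coverage "
          f"{errorless_coverage(x_peak, eps):.3f} at n/N = {x_peak:.2f}")
```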

I can think of three strategies to avoid the complete dissipation of knowledgebase value caused by accumulation of errors.
  1. stop collecting data once you're close to reaching the maximum. This is the strategy of choice for collections of information that are static, or don't change with time.
  2. spend enough effort detecting and resolving errors to counteract the addition of errors into the collection.
  3. find ways to eliminate errors in new records.
A strategy I would avoid would be to pretend that perpetual motion machines are possible. There ain't no such thing as a free lunch.

Tuesday, June 2, 2009

Who's the Boss, Steinbrenner or Springsteen?

Having played with Mathematica since it first came out (too late for me to use it for the yucky path integrals in my dissertation), I just had to try Wolfram|Alpha. The vanity search didn't work; assuming that's what most people try first, it's probably the death knell for W|A as a search engine. Starting with something more appropriately nerdy, I asked W|A about "Star Trek"; it responded with facts about the new movie, and suggested some other movies I might mean, apparently unaware that there was a television show that preceded it. Looking for some subtlety with a deliberately ambiguous query, I asked about "House" and it responded "Assuming "House" is a unit | Use as a surname or a character or a book or a movie instead". My whole family is a big fan of Hugh Laurie, so I clicked on "character" and was very amused to see that to Wolfram|Alpha, the character "House" is Unicode character x2302, "⌂". Finally, not really expecting very much, I asked it about the Boss.

In New Jersey, where I live, there's only one person who is "The Boss", and that's Bruce Springsteen. If you leave off the "The", and you're also a Yankees fan, then maybe George Steinbrenner could be considered a possible answer, and Wolfram|Alpha gets it exactly right. Which is impressive, considering that somewhere inside Wolfram|Alpha is Mathematica crunching data. The hype around Wolfram|Alpha is that it runs on a huge set of "curated data", so this got me wondering what sort of curated dataset knows who "The Boss" really is. To me, "curated" implies that someone has studied and evaluated each component item, and somehow I doubt that anyone at Wolfram has thought about the boss question.

The Semantic Web community has been justifiably gushing about "Linked Data", and the linked datasets available are getting to be sizable. One of the biggest datasets is "DBpedia". DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. According to its "about" page, the dataset describes 2.6 million "things", and currently comprises 274 million RDF triples. It may well be that Wolfram|Alpha has consumed this dataset and entered facts about Bruce Springsteen into its "curated data" set. (The Wikimedia Foundation is listed as a reference on its "the boss" page.) If you look at Bruce's Wikipedia page, you'll see that "The Boss" is included as the "Alias" entry in the structured information block that you see if you pull up the "edit this page" tab, so the scenario seems plausible.

Still, you have to wonder how any machine can consume lots of data and make good judgments about who is "The Boss". Wikipedia's "Boss" disambiguation page lists 74 different interpretations of "The Boss". Open Data's Uriburner has 1327 records for "the boss" (1676 triples, 1660 properties), but I can't find the Alias relationship to Bruce Springsteen. How can Wolfram|Alpha, or indeed any agent trying to make sense of the web of Linked Data, deal with this ever-increasing flood of data?

Two weeks ago, I had the fortune to spend some time with Atanas Kiryakov, the CEO of Ontotext, a Bulgarian company that is a leading developer of core semantic technology. Their product OWLIM is claimed to be "the fastest and most scalable RDF database with OWL inference", and I don't doubt it, considering the depth of understanding that Mr. Kiryakov displayed. I'll write more about what I learned from him, but for the moment I'll just focus on a few bits I learned about how semantic databases work. The core of any RDF-based database is a triple-store; this might be implemented as a single huge 3-column data table in conventional database management software; I'm not sure exactly what OWLIM does, but it can handle a billion triples without much fuss. When a new triple is added to the triple store, the semantic database also does "inference". In other words, it looks at all the data schemas related to the new triple, and from them, it tries to infer all the additional triples implied by the new triple. So if you were to add the triples ("I", "am a fan of", "my dog") and ("my dog", "is also known as", "the Boss"), then a semantic database will add these triples, and depending on the knowledge model used, it might also add a triple for ("I", "am a fan of", "the Boss"). If the database has also consumed "is a fan of" data for millions of other people, then it might be able to figure out with a single query that Bruce Springsteen, with a million fans, is a better answer to the question "Who is known as 'the Boss'?" than your dog, who, though very friendly, has only one fan.
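
Here's a made-up miniature of that disambiguation idea (the names and predicates are invented for illustration, not drawn from OWLIM or any real dataset):

```python
# Among subjects aliased to "the Boss", prefer the one with the most
# incoming fan assertions.
from collections import Counter

triples = [
    ("bruce_springsteen", "also_known_as", "the Boss"),
    ("my_dog", "also_known_as", "the Boss"),
    ("me", "is_a_fan_of", "my_dog"),
]
# Pretend ten thousand people have asserted fandom for Bruce:
triples += [(f"fan_{i}", "is_a_fan_of", "bruce_springsteen")
            for i in range(10_000)]

candidates = [s for (s, p, o) in triples
              if p == "also_known_as" and o == "the Boss"]
fan_counts = Counter(o for (s, p, o) in triples if p == "is_a_fan_of")
best = max(candidates, key=lambda c: fan_counts[c])
print(best)  # → bruce_springsteen
```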

As you can imagine, a poorly designed data schema can result in explosions of data triples. For example, you would not want your knowledge model to support a property such as "likes the same music" because then the semantic database would have to add a triple for every pair of persons that like the same music- if a million people liked Bruce Springsteen's music, you would need a trillion triples to support the "likes the same music" property. So part of the answer to my question about how software agents can make sense of linked data floods is that they need to have well thought-out knowledge models. Perhaps that's what Wolfram means when they talk about "curated datasets".