
Wednesday, July 27, 2011

Liking Library Data

If you had told me ten years ago that teenagers would be spending free time "curating their social graphs", I would have looked at you kinda funny. Of course, ten years ago, they were learning about metadata from Pokemon cards, so maybe I should have seen it coming.

Social networking websites have made us all aware of the value of modeling aspects of our daily lives in graph databases, even if we don't realize that's what we're doing. Since the "semantic web" is predicated on the idea that ALL knowledge can be usefully represented as a giant, global graph, it's perhaps not so surprising that the most familiar, and most widely implemented application of semantic web technologies has been Facebook's "Like" button.

When you click a Like button, an arc is added to Facebook's representation of your social graph. The arc links a node that represents you and another node that represents the thing you liked. As you interact with your social graph via Facebook, the added Like arc may introduce new interactions.

Google must think this is really important. They want you to start clicking "+1" buttons, which presumably will help them deliver better search. (You can try following me+, but I'm not sure what I'll do with it.)

The technology that Facebook has favored for building new objects to put in the social graph is derived from RDFa, which adds structured data to ordinary web pages. It's quite similar to "microdata", a competing technology that was recently endorsed by Google, Microsoft, and Yahoo. Facebook's vocabulary for the things it's interested in is called Open Graph Protocol (OGP), which could be considered a competitor to Schema.org.

My previous post described how a library might use microdata to help users of search engines find things in the library. While I think that eventually this will be a necessity for every library offering digital services, there are a bunch of caveats that limit the short-term utility of doing so. Some of these were neatly described in a post by Ed Chamberlain:
  • the library website needs to implement a site-map that search engines' crawlers can use to find all the items in the library's catalog
  • the library's catalog needs to be efficient enough to not be burdened by the crawlers. Many library catalog systems are disgracefully inefficient.
  • the library's catalog needs to support persistent URLs. (Most systems do this, but it was only ten years ago that I caused Harvard's catalog to crash by trying to get it to persist links. Sorry.)
But the clincher is that web search engines are still suspicious of metadata. Spammers are constantly trying to deceive search engines. So search engines have white-lists, and unless your website is on the white-list, the search engines won't trust your structured metadata. The data might be of great use to a specialized crawler designed to aggregate metadata from libraries, but there's a chicken and egg problem: these crawlers won't be built before libraries start publishing their data.

Facebook's OGP may have more immediate benefits. Libraries are inextricably linked to their communities; what is a community if not a web of relationships? Libraries are uniquely positioned to insert books into real world social networks. A phrase I heard at ALA was "Libraries are about connections, not collections".

Libraries don't need to implement OGP to put a like button on a web page, but without OGP Facebook would understand the "Like" to be about the web page, rather than about the book or other library item.

To show what OGP might look like on a library catalog page, I'll use the same example from my post on "spoonfeeding library data to search engines":
<html> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body>
</html>

Open Graph Protocol wants the web page to be the digital surrogate for the thing to be inserted into the social graph, and so it wants to see metadata about the thing in the web page's meta tags. Most library catalog systems already put metadata in meta tags, so this part shouldn't be horribly impossible.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" content="book"/>
<meta property="og:isbn" content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" content="Example Library"/>
<meta property="fb:admins" content="USER_ID"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body>
</html>

The first thing OGP does is call out XML namespaces: one for XHTML, a second for Open Graph Protocol, and a third for some Facebook-specific properties. A brief look at OGP reveals that it's even more bare-bones than schema.org; you can't even express the fact that "Paul Bryers" is the author of "Avatar".

This is less of an issue than you might imagine, because OGP uses a syntax that's a subset of RDFa, so you can add namespaces and structured data to your heart's desire, though Facebook will probably ignore it.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://ogp.me/ns#"
      xmlns:fb="http://www.facebook.com/2008/fbml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"> 
<head> 
<title>Avatar (Mysteries of Septagram, #2)</title>
<meta property="og:title" 
      content="Avatar - Mysteries of Septagram #2"/>
<meta property="og:type" 
      content="book"/>
<meta property="og:isbn" 
      content="9780340930762"/>
<meta property="og:url"   
      content="http://library.example.edu/isbn/9780340930762"/>
<meta property="og:image" 
      content="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"/>
<meta property="og:site_name" 
      content="Example Library"/>
<meta property="fb:app_id" 
      content="183518461711560"/>
</head> 
<body> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span rel="dc:creator">Author: 
    <span typeof="foaf:Person" 
        property="foaf:name">Paul Bryers
    </span> (born 1945)
 </span>
 <span rel="dc:subject">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</body>
</html>

The next step is to add the actual Like button by embedding a JavaScript snippet from Facebook:
<div id="fb-root"></div>
<script   src="http://connect.facebook.net/en_US/all.js#appId=183518461711560&xfbml=1"></script>
<fb:like href="http://library.example.edu/isbn/9780340930762/" 
       send="false" width="450" show_faces="false" font=""></fb:like>

The "og:url" property tells facebook the "canonical" url for this page- the url that Facebook should scrape the metadata from.

Now here's a big problem. Once you put the Like button JavaScript on a web page, Facebook can track all the users that visit that page. This goes against the traditional privacy expectations that users have of libraries. In some jurisdictions, it may even be against the law for a public library to allow a third party to track users in this way. I expect it shouldn't be hard to modify the implementation so that the script is executed only if the user clicks the "Like" button, but I've not been able to find a case where anyone has done this.
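Here's a rough sketch of the kind of deferred loading I have in mind; the placeholder button and the loadFacebookLike function are my own illustrative inventions rather than anything Facebook documents, and the app id is just the one from the example above:

<div id="fb-root"></div>
<!-- nothing is fetched from Facebook on an ordinary page view -->
<button onclick="loadFacebookLike(this)">Show the Like button</button>
<fb:like href="http://library.example.edu/isbn/9780340930762/" 
       send="false" width="450" show_faces="false" font=""></fb:like>
<script>
// inject Facebook's script only after an explicit click, so no
// tracking request is made until the user opts in
function loadFacebookLike(btn) {
  var js = document.createElement("script");
  js.src = "http://connect.facebook.net/en_US/all.js#appId=183518461711560&xfbml=1";
  document.getElementById("fb-root").appendChild(js);
  btn.style.display = "none"; // hide the placeholder once loading starts
}
</script>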

It seems to me that injecting library resources into social networks is important. The libraries and the social networks that figure out how to do that will enrich our communities and the great global graph that is humanity.

Wednesday, June 22, 2011

EPUB 3 Beefs Up Metadata, but Omits Semantic Enrichment

Ironic amusement fills me when I hear book industry people say things like "metadata has become cool", or "context is everything". Welcome to the 20th century and all that. Meanwhile, in the library industry, metadata has been cool long enough to coat everything with a thick rind of freezer burn.

There's good news and not-so-good news for ebook metadata. The revision to the EPUB standard, published just a month ago, includes metadata tools that could eventually lead to a new era of metadata cooperation between publishers and the entire book supply chain, including libraries. At the same time, the revision fails to take advantage of ready-made vehicles for semantic enrichment of content, a move that could still provide new types of revenue for publishers while giving libraries new opportunities to remain relevant as books become digital.

Since I'm incurably optimistic, I'll start with the half-full glass: Publication-level metadata. EPUB 3 includes a whole bunch of ways to include publication-level metadata in an EPUB container. As an example, imagine an EPUB3 for "Emma" with this mark-up in its package document (essentially the navigation directory for the book):
<metadata>
...
<meta property="dcterms:identifier"
id="pub-id">urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809</meta>
<link rel="marc21xml-record" href="http://www.archive.org/download/cihm_29722/cihm_29722_marc.xml" />
<link rel="marc21xml-record"
href="/cihm_29722_marc.xml" />
<link rel="foaf:homepage" href="http://openlibrary.org/books/OL24234129M/Emma" />
...
</metadata>

In this example, the first link element points to a MARC 21 XML record (MARC 21 is a blattarian standard for library metadata; look it up) at the Internet Archive. The second link element points to the same record included in the EPUB container itself. There is also a built-in vocabulary that allows the link element to point to ONIX, MODS, and XMP metadata records.

The example also shows that other vocabularies (such as FOAF) can be added for use in metadata elements. So, if you're a believer in RDA, you can put that in an EPUB file as well.

The meta element can also be used in the EPUB package document's metadata block. It's defined quite differently from HTML5's empty meta element, with an about attribute and allowed text content. In principle, it can be used to encode arbitrary RDF triples, thanks to a prefix extension mechanism borrowed from RDFa which allows EPUB authors to add vocabularies to their documents.
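As a rough sketch of how the prefix mechanism might be used, here's an invented fragment of a package document; the ex: prefix, its vocabulary, and the readingLevel property are hypothetical, and the about attribute follows the description above rather than any settled final syntax:

<package xmlns="http://www.idpf.org/2007/opf" version="3.0"
      unique-identifier="pub-id"
      prefix="ex: http://example.org/book-terms#">
<metadata>
<!-- the identifier the other assertions are about -->
<meta property="dcterms:identifier"
      id="pub-id">urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809</meta>
<!-- an arbitrary triple about the publication, using the added vocabulary -->
<meta about="#pub-id" property="ex:readingLevel">young adult</meta>
</metadata>
</package>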

These capabilities, on their own, could support major changes in the way that books are produced, delivered and accessed. In a publisher workflow, the EPUB file could serve as the carrier for all the components and versions of a book, even bits that today might be left out or lost in the caverns of so-called "content management systems". A distributor would no longer need to match up content files with records in a separate metadata feed. EPUB books for libraries could be preloaded with cataloging and enrichment data, greatly simplifying the process of making the ebooks accessible in libraries.

Given the great advances for "package-level" metadata, it's a bit disappointing that semantic mark-up of content documents missed the EPUB 3 boat. The story is a bit complicated, and it's far from over. Imagine that you want to add mark-up to a book's citations- perhaps you want to embed identifiers to support library linking systems. Or perhaps you're a medical publisher and you want to embed machine readable statements about drugs and diseases in a pharmaceutical textbook. Or perhaps you want to publish a travel guide and you want search engines to pick out the places you're describing. These applications are not really supported by the current version of EPUB 3.

EPUB content documents have a feature that you might think would do the trick, but doesn't really. The epub:type attribute supports "semantic inflection" of elements. This attribute can be used to mark a paragraph as a bibliographic citation, for example, and supports many of the requirements imposed by conversion of content from legacy or specialized formats into the HTML5 dialect used by EPUB. It's an important feature, but not enough to support semantic enrichment.
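For instance, a citation in a content document might be inflected roughly like this (a sketch only; I've taken the bibliography and biblioentry terms from my reading of the structural semantics vocabulary, so treat the exact values as illustrative):

<section xmlns:epub="http://www.idpf.org/2007/ops"
      epub:type="bibliography">
<p epub:type="biblioentry">
D. C. Tsui, H. L. Störmer and A. C. Gossard,
Phys. Rev. Lett. 48, 1559 (1982).
</p>
</section>

The epub:type attribute tells a reading system or conversion tool what the paragraph is, but it says nothing machine-readable about what the citation points to; that's the gap semantic enrichment would fill.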

Part of the problem is EPUB 3's dependence on HTML5, which is not yet a stable spec and is enmeshed in some surprisingly raw W3C politics. W3C has been the home of HTML standards development since the very early stages of the web, and has also been the home of semantic web standards development. HTML5 started outside of W3C in the WHATWG, an initiative to develop HTML in a way that would be backwards compatible with good-old fashioned non-XML HTML. W3C was convinced to fold WHATWG into its development efforts because of WHATWG's corporate backing. Even so, the WHATWG version of the HTML5 spec drips with sarcasm towards W3C HTML Working Group decisions.

During part of the development of EPUB 3, the HTML5 draft included "Microdata", a method of embedding semantic mark-up in HTML. RDFa, a standard that competes with Microdata, was developed through W3C channels, and within W3C it was decided in February of 2010 to move Microdata out of the HTML spec so as to give it equal footing with RDFa. Some participants in the EPUB working group wanted to include RDFa in the standard; others thought this would impose too much of a complexity burden on publisher-implementers. The EPUB draft ended up being released without either RDFa or Microdata.

The recent endorsement of Microdata by the Google-Yahoo-Bing cooperation has changed the competitive landscape for embedded semantics. It's now apparent that Microdata will get priority implementation in HTML development tools, leaving RDFa as a niche technology. For most use cases of EPUB semantic markup, the differences between RDFa and Microdata are small compared to the advantages of piggybacking on the technology investment supporting website creation.

According to members of the EPUB working group, it is expected that a dot release will follow relatively quickly behind EPUB 3.0. It seems to me that picking a semantic markup technology for content documents should now not be so hard. If you work for a publishing company that has ever mentioned semantic markup in a product plan, you should probably be making sure that the EPUB working group is aware of your needs. If you are a librarian who can imagine the possibilities of a semantically enriched EPUB collection, you should similarly be making your concerns known.

Although the EPUB working group includes representatives from tools vendors that might conceivably benefit from the adoption of EPUB-only constructs, the group's track record for adopting wider web standards has been very encouraging. By adopting HTML5 as a stack component, the group has ensured that cheap or free tools to produce and author EPUB 3 content will be readily available.

Once semantic enrichment of ebooks becomes routine, libraries will play a vital role in their use. Libraries provide a copyright-friendly DRM-free community commons in which users can access and build on the information contained in licensed content. (Of course, I see "unglued" books as playing an equally important role in the library commons.)

The EPUB metadata glass is half full, and there's more wine in the bottle!

Note: This is one thing I'll be talking about on Saturday at the American Library Association meeting in New Orleans. (The printed program is somewhat inaccurate; the session will end at 10:30 AM at the latest.) Ross Singer from Talis will lead off with an overview of semantic web technologies in libraries; I'll follow with discussions of RDFa, the Facebook "Like" button and, of course, EPUB.

Wednesday, June 8, 2011

Our Metadata Overlords and That Microdata Thingy

On June 2, our Metadata Overlords spoke. They told us that they'll only listen when we tell them things using a specialized vocabulary they've now given us at the schema.org website. Although we can still use our stone tablets if that's what we're using now, we're expected to migrate to a new Microdata Thingy, assuming that we really want them to pay attention to our website metadata supplications.

There are among us believers, who, led by druids enraptured by the power of stone tablets to carry truth, will shun the new thingy, but most of us will meekly comply with the edicts of the overlords. We're not able to distinguish the druidic language of the tablets from the new liturgy of the state church. Many things are difficult to articulate in the new vocabulary, but gosh, those tablets were heavy to carry around. And the new thingy doesn't seem so awful, although it's difficult to tell with the mumbled sermons and hymn singing and all.

I hope the overlords don't try to take our pagan rituals of Friending and Liking away from us, though. The incantations used to invoke and bless the Like ritual also use the druidic language, and the help scrolls tell us we might confuse the overlords if we use more than one language in our prayers.

My soul remains troubled, however, at the thought that the Overlords care not for truth and for justice. Sometimes it seems as though the overlords want only for our offerings of attention and seek only to feed our lust for food, drink, entertainment, debauchery and money. Yes, there are new words for our books and learning, but we can say so little about these in schema.org language that our wizards and mages will be mute if they ever choose to enter that realm.

I myself was present at a conclave of such mages and wizards dedicated to the entwinement of data from libraries, museums and archives in full openness. When tweet of the new order came, we endeavored to learn more of schema.org and its thingy. We questioned whether the thingy was an abomination against openness, or whether we might exploit its Overlord endorsement to make our own spells more powerful. We agreed to teach each other our new thingy spells, even as our colleagues elsewhere figured out how to chisel the new vocabulary into stone. Word came from other lands that the new vessel would founder trying to cross the seas.

We then visited the temple of the archive and found the servers cool to the touch. We heard words from a past oracle, ate as they never ate in Rome, drank cool drafts, and returned home emboldened with an enlarged appreciation of intermingled bits.

So it was said, so shall we do.

Notes:
  1. Google's blog post on adopting microdata was signed by R. V. Guha, who had a bit to do with the creation of RDF.
  2. It's not really a surprise that Google doesn't care about RDFa. In my article on RDFa from 2009, I pointed to mistakes that Google made in their RDFa documentation. They never fixed it.
  3. Schema.org can't even list all of its schemata: the web page, chock full of non-breaking spaces, is truncated!
  4. The current microdata spec is in an odd state where it's confused about how to define an itemtype. In fact, the mechanism for defining new itemtypes is gone! Here's what it says:
    The item type must be a type defined in an applicable specification.

    Except if otherwise specified by that specification, the URL given as the item type should not be automatically dereferenced.

    A specification could define that its item type can be dereferenced to provide the user with help information, for example. In fact, vocabulary authors are encouraged to provide useful information at the given URL.
    Apparently, stuff was removed for some sort of political reason; it's there in the WHATWG version. Note that Google links to the W3C version, which is not fully baked.
  5. The Schema.org terms of service are creepy when you get to the part about patents.
  6. The big selling point for RDFa was that Google, Yahoo and Bing supported it for Rich Snippets and the like. But Microdata's inability to easily support complex markup turned out to be a key feature for the search engines. The moral of the story for standards developers: your best customers are always righter than the others.
  7. In the video, Brewster Kahle reads from the last page of A Manual on Methods of Reproducing Research Material by Robert C. Binkley (1936). OCLC Number 14753642. Peter Binkley, a meeting participant, donated a copy of his grandfather's book to the Internet Archive, along with permission to make it free to the public.
  8. Henri Sivonen has written a very readable and informed discussion about Microdata, RDFa, Schema.org and the process of making standards that you should read if you are interested in why things are the way they are in HTML5.

Saturday, January 8, 2011

Inside the Dataculture Industry

wild blueberries
I don't really know how all the food gets to my table. Sure, I've gathered berries, baled hay, picked peas, baked bread and smoked fish, but I've never slaughtered a pig, (successfully) milked a cow or roasted coffee beans. In my grandparents' generation, I would have seemed rather ignorant and useless. Agriculture has become an industry as specialized as any other modern industry, increasingly inaccessible to the layperson or small business.

I do know a bit about how data gets to my browser. It gets harvested by data farmers and data miners, it gets spun into databases, and then gets woven into your everyday information diet. Although you've probably heard of the "web of data", you're probably not even aware of being surrounded by data cloth.

The dataculture industry is very diverse, reflecting the diversity of human curiosity and knowledge. Common to all corners of the industry is the structural alchemy that transmutes formless bits into precious nuggets of information.

In many cases, this structuring of information is layered on top of conventional publishing. My favorite example of this is that the publishers of "Entertainment Weekly" extract facts out of their stories and structure them with an extensive ontology. Their ontologists (yes, EW has ontologists!) have defined an attribute "wasInRehabWith" so that they can generate a starlet's biography and report to you that she attended a drug rehabilitation clinic at the same time as the co-star of her current movie. Inquiring minds want to know!

If you look at location based services such as Facebook's "places", Foursquare, Yelp, Google Maps, etc, they will often present you with information pulled from other services. Often, a description comes from Wikipedia and reviews come from Yelp or Tripadvisor and photos come from Panoramio or Flickr. These services connect users to data using a common metadata backbone of Geotags. Data sets are pulled from source sites in various ways.

Some datasets are produced in data factories. I had a chance to see one of these "factories" on my trip to India last month. Rooms full of data technicians (women do the morning shift, men the evening) sit at internet-connected computers and supervise the structuring of data from the internet. Most of the work is semi-automated: software does the bulk of the data extraction. The technicians act as supervisors who step in when the software is too stupid to know it's mangling things and when human input is really needed.

There's been a lot of discussion lately about how spammers are using data scraped from other websites and ruining the usefulness of Google's search results. There are plenty of companies that offer data scraping services to fuel this trend. Data scraping is the use of software that mimics human web browsing to visit thousands of web pages and capture the data that's on them. This works because large websites are generated dynamically out of databases; when machines assemble web pages, machines can disassemble them.

A look at the variety of data scraping companies reveals a broad spectrum. Scraping is an essential technology for dataculture; as with any technology, it can be used to many ends. One company boasts of their "massive network of stealth scrapers capable of downloading massive amounts of data without ever getting blocked." Some companies, such as Mozenda, offer software to license. Others, such as Xtractly and Addtoit, are strictly service offerings.

I spoke to Addtoit's President, Bill Brown, about his industry. Addtoit got its start doing projects for Reuters and other firms in the financial industry; their client base has since become more "balanced". Companies such as Bloomberg, Reuters and D&B get paid premiums by customers wanting a leg up on competitors to provide environments rich in structured data. Brown's view is that the industry will move away from labor-intensive operations to being completely automated, and Addtoit has developed accordingly.

A small number of companies, notably Best Buy, have realized that making their data easily available can benefit them by promoting commerce and competition. They have begun to use technologies such as RDFa to make it easy for machines to read data on their web sites; scraping becomes superfluous. RDFa is a method of embedding RDF metadata in HTML web pages; RDF is the general data model standardized by the W3C for use on the semantic web, which has been discussed much on this blog.

This doesn't work for many types of data. Brown sees very slow adoption of RDFa and similar technologies but thinks website data will gradually become easier to get at. Most websites are very simple, and their owners see little need or benefit in investing in newer website technologies. If people who really want the data can hire firms like Addtoit to obtain the data, most of the potential benefits to website owners of making their data available accrue without needing technology shifts.

The library industry is slowly freeing itself from the strictures of "library data" and is broadening its data horizons. For example, many libraries have found that genealogical databases are very popular with patrons. But there is a huge world of data out there waiting to be structured and made useful. One of the most interesting dataculture companies to emerge over the last year is ShipIndex. As you'd expect from the name, ShipIndex is a vast directory of information relating to ships. Just as place information is tied together with geoposition data, ShipIndex ties together the world of information by identifying ships and their occurrence in the world's literature. The URIs in ShipIndex are very suitable for linking from other resources.

The Götheborg
ShipIndex is proof that a "family farm" can still deliver value in the dataculture industry, as the process used to build it shows. Nonetheless, in coming years you should expect that technologies developed for the financial industry will see broader application and will lead to the creation of data products that you can scarcely imagine.

The business model for ShipIndex includes free access plus a fee-for-premium-access model. One question I have is how effectively libraries will be able to leverage the premium data provided with this model. Imagine, for example, the value you might get from a connection between ShipIndex and a genealogical database bound by passenger manifests. I would be able to discover the famous people who rode the same ship that my parents took between the US and Sweden (my mom rode the Stockholm on the crossing before it collided with the Andrea Doria). For now though, libraries struggle to leverage the data they have; better data licensing models are way down on the list of priorities for most libraries.

Peter McCracken
ShipIndex was started by Peter and Mike McCracken, who I've known since 2000. Their previous company (SerialsSolutions) and my previous company (Openly Informatics) both had exhibit tables in the "Small Press" section of the American Library Association exhibit hall, where you'll often find the next generation of innovative companies serving the library industry. They'll be back in the Small Press Section at this weekend's ALA Midwinter meeting. Peter has promised to sing a "shanty" (or was that a scupper?) for anyone who signs up for a free trial. You could probably get Mike to do a break dance if you prefer.

I'll be floating around the meeting too. If you find me and say hello, I promise not to sing anything.

Sunday, June 27, 2010

Global Warming of Linked Data in Libraries

Libraries are unusual social institutions in many respects; perhaps the most bizarre is their reverence for metadata and its evangelism. What other institution considers the production, protection and promulgation of metadata to be part of its public purpose?

The W3C's Linked Data activity shares this unusual mission. For the past decade, W3C has been developing a technology stack and methodology designed to support the publication and reuse of metadata; adoption of these technologies has been slow and steady, but the impact of this work has fallen short of its stated ambitions.

I've been at the American Library Association's Annual Meeting this weekend. Given the common purpose of libraries and Linked Data, you would think that Linked Data would be a hot topic of discussion. The weather here has been much hotter than Linked Data, which I would describe as "globally warming". I've attended two sessions covering Linked Data, each drawing between 50 and 100 delegates. These followed a day-long, sold-out preconference. John Phipps, one of the leaders in the effort to make library metadata compatible with the semantic web, remarked to me that these meetings would not have been possible even a year ago. Still, this attendance reflects only a tiny fraction of the metadata workers at the conference; Linked Data has quite a ways to go. It's only a few months ago that the W3C formed a Library Linked Data Incubator Group.

On Friday morning, there was an "un-conference" organized by Corey Harper from NYU and Karen Coyle, a well-known consultant. I participated in a subgroup looking at use cases for library Linked Data. It took a while for us to get around to use cases though, as participants reported that usage was occurring but they weren't sure what for. Reports from OCLC (VIAF) and the Library of Congress (id.loc.gov) both indicated significant usage but little feedback. The VIVO project was described as one with a solid use case (giving faculty members a public web presence), but no one from VIVO was in attendance.

On Sunday morning, at a meeting of the Association for Library Collections and Technical Services (ALCTS), Rebecca Guenther of the Library of Congress discussed id.loc.gov, a service that enables both humans and machines to programmatically access authority data at the Library of Congress. Perhaps the most significant thing about id.loc.gov is not what it does but who is doing it. The Library of Congress provides leadership for the world of library cataloguing; what LC does is often slavishly imitated in libraries throughout the US and the rest of the world. id.loc.gov started out as a research project but is now officially supported.

Sara Russell-Gonzalez of the University of Florida then presented VIVO, which has won a big chunk of funding from the National Center for Research Resources, a branch of NIH. The goal of VIVO is to build an "interdisciplinary national network enabling collaboration and discovery between scientists across all disciplines." VIVO started at Cornell and has garnered strong institutional support there, as evidenced by an impressive web site. If VIVO is able to gain similar support nationally and internationally, it could become an important component of an international research infrastructure. This is a big "if". I asked if VIVO had figured out how to handle cases where researchers change institutional affiliations; the answer was "No". My question was intentionally difficult; Ian Davis has written cogently about the difficulties RDF has in treating time-dependent relationships. It turns out that there are political issues as well. Cornell has had to deal with a case where an academic department wanted to expunge affiliation data for a researcher who left under cloudy circumstances.

At the un-conference, I urged my breakout group to consider linked data as a way to expose library resources outside of the library world as well as a model for use inside libraries. It's striking to me that libraries seem so focused on efforts such as RDA, which aim to move library data models into Semantic Web-compatible formats. What they aren't doing is making library data easily available in models understandable outside the library.

The two most significant applications of Linked Data technologies so far are Google's Rich Snippets and Facebook's Open Graph Protocol (whose user interface, the "Like" button, is perhaps the semantic web's most elegant and intuitive). Why aren't libraries paying more attention to making their OPAC results compatible with these applications by embedding RDFa annotations in their web-facing systems? It seems to me that the entire point of metadata in libraries is to make collections accessible. How better to do this than to weave this metadata into people's lives via Facebook and Google? Doing this will require the dumbing-down of library metadata and some hard swallowing, but it's access, not metadata quality, that's core to the reason that libraries exist.




Wednesday, April 28, 2010

Pick this Nit: Null Path URIs and the Pedantic Web

There is no surer way to flush out software bugs and configuration errors than to do a sales demo. The process not only exposes the problem, but also sears into the psyche of the demonstrator an irrational desire to see the problem eradicated from the face of the earth, no matter the cost or consequences.
Here's a configuration problem I once found while demonstrating software to a potential customer:
Many library information services can be configured with the base URL for the institution's OpenURL server. The information service then constructs links by appending "?" and a query string onto the base URL. So for example, if the base URL is
http://example.edu/links
and the query string is
isbn=9780393072235&title=The+Big+Short ,
the constructed URL is
http://example.edu/links?isbn=9780393072235&title=The+Big+Short.
For the demo, we had configured the base URL to be very short: http://example.edu, so the constructed URL would have been http://example.edu?isbn=9780393072235&title=The+Big+Short. Everything worked fine when we tested beforehand. For the customer demo, however, we used the customer's computer, which was running some Windows version of Internet Explorer that we hadn't tested, and none of the links worked. Internet Explorer had this wonderful error page that made it seem as if our software had broken the entire web. Luckily, breaking the entire web was not uncommon at the time, and I was able to navigate to a different demo site and make it appear as if I had fixed the entire web, so we managed to make the sale anyway.
It turns out that http URLs with null paths aren't allowed to have query strings. You wouldn't know it if you looked at the W3C documentation for URIs, which is WRONG, but you will see it if you look at the IETF specs, which have jurisdiction (see RFC 1738 and RFC 2616).
Internet Explorer was just implementing the spec, ignoring the possibility that someone might ignore or misinterpret it. The fact that Netscape worked where IE failed could be considered a bug or a feature, but most users probably considered Netscape's acceptance of illegal URLs to be a feature.
I still feel a remnant of pain every time I see a pathless URL with a query string. Most recently, I saw a whole bunch of them on the thing-described-by site and sent a nit-picky e-mail to the site's developer, and was extremely pleased when he fixed them. (Expeditious error fixing will be richly rewarded in the hereafter.) I've come to recognize, however, that a vast majority of these errors will never be fixed or even noticed, and maybe that's even a good thing.
Nit picking appears to have been a highlight of the Linked Data on the Web Meeting in Raleigh, NC yesterday, which I've followed via Twitter. If you enjoy tales of nerdy data disasters or wonky metadata mischief, you simply must peruse the slides from Andreas Harth's talk (1.8M, pdf) on "Weaving the Pedantic Web". If you're serious about understanding real-world challenges for the Semantic Web, once you've stopped laughing or crying at the slides you should also read the corresponding paper (pdf, 415K ). Harth's co-authors are Aidan Hogan, Alexandre Passant, Stefan Decker, and Axel Polleres from DERI.
The DERI team has studied the incidence of various errors made by publishers of Linked Data "in the wild". Not so surprisingly, they find a lot of problems. For example, they find that 14.3% of triples in the wild use an undeclared property and 8.1% of the triples use an undeclared class. Imagine if a quarter of all sentences published on the web used words that weren't in the dictionary, and you'd have a sense of what that means. 4.7% of typed literals were "ill-typed". If 5% of the numbers in the phone book had the wrong number of digits, you'd probably look for another phone book.
They've even found ways that seemingly innocuous statements can have serious repercussions. It turns out that it's possible to "hijack" a metadata schema, and induce a trillion bad triples with a single Web Ontology Language (OWL) assertion.
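To give a flavor of how one assertion can do so much damage, here is my own illustrative example, not the one from the paper: an RDFa fragment that declares a widely used property to be an owl:InverseFunctionalProperty, which invites a reasoner to conclude owl:sameAs for every pair of subjects that share a value for that property.

<div xmlns:owl="http://www.w3.org/2002/07/owl#"
     about="http://xmlns.com/foaf/0.1/knows"
     typeof="owl:InverseFunctionalProperty">
  <!-- a single triple; a reasoner that trusts it will start smushing
       together everyone who "knows" the same person -->
</div>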
To do battle with the enemy of badly published Linked Data, the DERI team urges community involvement in a support group that has been formed to help publishers fix their data. The "Pedantic Web" has 137 members already. This is a very positive and necessary effort. But they should realize that the correct data cause is a hopeless one. The vast majority of potential data publishers really don't care about correctness, especially when some of the mistakes can be so subtle. What they care about is accomplishing specific goals. The users of my linking software only cared that the links worked. HTML authors mostly care only that the web page looks right. Users of Facebook or Google RDFa will only care that the Like buttons or Rich Snippets work, and the fact that the schemas for these things either don't exist in machine readable form or are wildly inconsistent with the documentation is a Big Whoop.
Until of course, somebody does a sales demo, and the entire web crashes.
(nit and head louse photos from Wikimedia Commons)

Thursday, April 22, 2010

Facebook vs. Twitter: To Like or To Annotate?

Facebook and Twitter each held developer conferences recently, and the conference names speak worlds about the competing worldviews. Twitter's conference was called "Chirp", while Facebook's conference was labeled "f8" (pronounced "FATE"). Interestingly, both companies used their developer conferences to announce new capability to integrate meaning into their networks.

Facebook's announcement centered on something it's calling the "Open Graph protocol". Facebook showed its market power by rolling it out immediately with 30 large partner sites that are allowing users to "Like" them on Facebook. Facebook's vision is that web pages representing "real-world things" such as movies, sports teams, products, etc. should be integrated into Facebook's social graph. If you look at the right-hand column of this blog, you'll see an opportunity to "Like" the blog on Facebook. That action has the effect of adding a connection between a node that represents you on Facebook and a node that represents the blog on Facebook. The Open Graph API extends that capability by allowing the inclusion of web-page nodes from outside Facebook in the Facebook "graph". A webpage just needs to add a bit of metadata into its HTML to tell Facebook what kind of thing it represents.

I've written previously about RDFa, the technology that Facebook chose to use for Open Graph. It's a well designed method for adding machine-readable metadata into HTML code. It's not the answer to all the world's problems, but it can't hurt. When Google announced it was starting to support RDFa last summer, it seemed to be hedging its bets a bit. Not Facebook.

The effect of using RDFa as an interface is to shift the arena of competition. Instead of forcing developers to choose which APIs to support in code, using RDFa asks developers to choose among metadata vocabularies to support their data model. Like Google, Facebook has created its own vocabularies rather than use someone else's. Also, like Google last summer, the documentation for the metadata schemas seems not to have been a priority. Although Facebook has put up a website for Open Graph protocol at http://opengraphprotocol.org/ and a google group at http://groups.google.com/group/open-graph-protocol, there are as yet no topics approved for discussion in the group. [Update- the group is suddenly active, though tightly moderated.]

Nonetheless, websites that support Facebook's metadata will also be making that metadata available to everyone, including Google, putting increased pressure on websites to make machine-readable metadata available as the ticket price for being included in Facebook's (or anyone's) social graph. A look at Facebook's list of object types shows their business model very clearly. Here's their "Product and Entertainment" category:
  • album
  • book
  • drink
  • food
  • game
  • movie
  • product
  • song
  • tv_show
Whether you "Like" it or not, Facebook is creating a new playing field for advertising by accepting product pages into their social graph.

Facebook clearly believes that fate follows its intelligent design. Twitter, by contrast, believes its destiny will emerge by evolution from a primordial ooze.

At Twitter's "Chirp" conference, Twitter announced that it will add "Annotations" to the core Twitter platform. The description of Twitter annotations is characteristically fuzzy and undetermined. There will be some sort of triple structure, the annotations will be fixed at a tweet's creation, and annotations will have either 512 bytes or maybe 1K. What will it be used for? Who knows?

Last week, I had a chance to talk to Twitter's Chairman and co-Founder Jack Dorsey at another great "Publishing Point" meeting. He boasted about how Twitter users invented hashtags, retweets and "@" references, and Twitter just followed along. Now, Twitter hopes to do the same thing with annotations. Presumably, the Twitter ecosystem will find a use for Tweet annotations and Twitter can then standardize them. Or not. You could conceivably load the Tweet with Open Graph metadata and produce a Facebook "Like" tweet.

Many possibilities for Tweet annotations, underspecified as they are, spring to mind. For example, the Code4Lib list was buzzing yesterday about the possibility that OpenURL references (the kind used in libraries to link to journal articles and books) could be loaded into an annotated tweet. It seems more likely to me that a standard mechanism to point to external metadata, probably expressed as Linked Data, will emerge. A Tweet could use an annotation to point to a web page loaded with RDFa metadata, or perhaps to a repository of item descriptions such as I mentioned in my post on Linked Descriptions. Clearly, it will be possible in some way or other to put real, actionable literature references into a tweet. Whether it will happen, it's hard to say, but I wouldn't hold my breath for Facebook to start adding scientific articles into its social graph.

Although there's a lot of common capability possible between Facebook's Open Graph and Twitter's Annotations, the worldviews are completely different. Twitter clearly sees itself as a communications media and the Annotations as adjuncts to that communication. In the twitterverse, people are entities that tweet about things. Facebook sees its social graph as its core asset and thinks of the graph as being a world-wide web in and of itself. People and things are nodes on a graph.

While Facebook seems to offer a lot more to developers than Twitter, I'm not so sure that I like its worldview as much. I'm much more than a node on Facebook's graph.

Sunday, April 18, 2010

When Shall We Link?

When I was in grad school, my housemates and I would sit around the dinner table and have endless debates about obscure facts like "there's no such thing as brown light". That doesn't happen so much in my current life. Instead, my family starts making fun of me for "whipping out my iPhone" to retrieve some obscure fact from Wikipedia to end a discussion about a questionable fact. This phenomenon of having access to huge amounts of information has also changed the imperatives of education: students no longer need to learn "just in case", but they need to learn how to get information "just in time".

In thinking about how to bring semantic technologies to bear on OpenURL and reference linking, it occurred to me that "just in time" and "just in case" are useful concepts for thinking about linking technologies. Semantic technologies in general, and Linked Data in particular, seem to have focused on just-in-case, identifier-oriented linking. Library linking systems based on OpenURL, in contrast, have focused on just-in-time, description-oriented linking. Of course, this distinction is an oversimplification, but let me explain a bit what I mean.

Let's first step back and take a look at how links are made. Links are directional; they have a start and an end (a target). The start of a link always has an intention or purpose; the target is the completion of that purpose. For example, look at the link I have put on the word "grad school" above. My intention there was to let you, the reader, know something about my graduate school career, without needing to insert that digressive information in the narrative. (Actually my purpose was to illustrate the previous sentence, but let's call that a meta-purpose.) My choice of URL was "http://ee.stanford.edu/", but I might have chosen some very different URL. When I choose a specific URL, I "bind" that URL to my intention.

In the second paragraph, I have added a link for "OpenURL". In that case, I used the "Zemanta" plug-in to help me. Zemanta scans the text of my article for words and concepts that it has links for, and offers them to me as choices to apply to my article. Zemanta has done the work of finding links for a huge number of words and concepts, just in case a user comes along with a linking intention to match. In this case, the link suggested by Zemanta matches my intention (to provide background for readers unfamiliar with OpenURL). The URL becomes bound to the word during the article posting process.

At the end of this article, there's a list of related articles, along with a link that says "more fresh articles". I don't know what URLs Zemanta will supply when you click on it, but it's an example of a just-in-time link. A computer scientist would call this "late binding". My intention is abstract: I want you to be able to find articles like this one.

Similar facilities are in operation in scholarly publishing, but the processes have a lot more moving parts.

Consider the citation list of a scientific publication. The links expressed by these lists are expressions of the author's intent- perhaps to support an assertion in the article, to acknowledge previous work, or to provide clarification or background. The cited item is described by metadata formatted so that humans can read and understand the description and go to a library to find the item. Here's an example:
D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982).
With the movement of articles on-line, the citations are typically turned into links in the publication process by parsing the citation into a computer-readable description. If the publisher is a member of CrossRef, the description can then be matched against CrossRef's huge database of article descriptions. If a match is found, the cited item description is bound to an article identifier, the DOI. For my example article, the DOI is 10.1103/PhysRevLett.48.1559. The DOI provides a layer of indirection that's not found in Zemanta linking. While CrossRef binds the citation to an identifier, the identifier link, http://dx.doi.org/10.1103/PhysRevLett.48.1559, is not bound to the target URL, http://prl.aps.org/abstract/PRL/v48/i22/p1559_1, until the user clicks the link. This scheme holds out hope that should the article move to a different URL, the connection to the citation can be maintained and the link will still work.

If the user is associated with a library using an OpenURL link server, another type of match can be made. OpenURL link servers use knowledgebases which describe the set of electronic resources made available by the library. When the user clicks on an OpenURL link, the description contained in the link is matched against the knowledgebase, and the user is sent to the best-matching library resource. It's only at the very last moment that the intent of the link is bound to a target.
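For the Tsui citation above, such an OpenURL link might look roughly like this; the resolver hostname is invented, and the key-value fields are only a sketch of the OpenURL 1.0 journal format, not a complete context object:

http://resolver.example.edu/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=Phys.+Rev.+Lett.&rft.volume=48&rft.spage=1559&rft.date=1982&rft.aulast=Tsui&rft_id=info:doi/10.1103/PhysRevLett.48.1559

The link server unpacks those key-value pairs into a description, matches it against the knowledgebase, and only then decides where to send the user.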

While the combination of OpenURL and CrossRef has made it possible to link citations to their intended target articles in libraries with good success, there has been little leveraging of this success outside the domain of scholarly articles and books. The NISO standardization process for OpenURL spent a great deal of time in making the framework extensible, but the extension mechanisms have not seen the use that was hoped for.

The level of abstraction of NISO OpenURL is often cited as a reason it has not been adopted outside its original application domain. It should also be clear that many applications that might have used OpenURL have instead turned to Semantic Web and Linked Data technologies (Zemanta is an example of a linking application built with semantic technologies). If OpenURL and CrossRef could be made friendly to these technologies, the investments made in these systems might also find application in more general circumstances.

I began looking at the possibilities for OpenURL Linked Data last summer when, at the Semantic Technologies 2009 conference, Google engineers expressed great interest in consuming OpenURL data exposed via RDFa in HTML; RDFa had just been finalized as a W3C Recommendation. I excitedly began to work out what was needed (Tony Hammond, another member of the NISO standardization committee, had taken a crack at the same thing).

My interest flagged, however, as I began to understand the nagging difficulties of mapping OpenURL into an RDF model. OpenURL mapped into RDF was...ugly. I imagined trying to advocate use of OpenURL-RDF over BIBO, an ontology for bibliographic data developed by Bruce D'Arcus and Frédérick Giasson, and decided it would not be fun. There's nothing terribly wrong with BIBO.

One of the nagging difficulties was that OpenURL-RDF required the use of "blank nodes", because of its philosophy of transporting descriptions of items which might not have URIs to identify them. When I recently described this difficulty on the OpenURL Listserv, Herbert van de Sompel, the "irresistible force" behind OpenURL a decade ago, responded with very interesting notes about "thing-described-by.org", how it resembled "by-reference" OpenURL, and how this could be used in a Linked Data-friendly link resolver. Thing-described-by is a little service that makes it easy to mint a URI, attach an RDF description to it, and make it available for harvest as Linked Data.

In the broadest picture, linking is a process of matching the intent of a link with a target. To accomplish that, we can't get around the fact that we're matching one description with another. A link resolver needs to accomplish this match in less than a second using a description squeezed into a URL, so it must rely on heuristics, pre-matched identifiers, and restricted content domains. If link descriptions were pre-published as Linked Data as in thing-described-by.org, linking providers would have time to increase accuracy by consulting more types of information and provide broader coverage. By avoiding the necessity of converting and squeezing the description into a URL, link publishers could conceivably reduce costs while providing for richer links. Let's call it "Linked Description Data".

Descriptions of targets could also be published as Linked Description Data. Target knowledgebase development and maintenance is a significant expense for link server vendors. However, target publishers have come to understand the importance (see KBART) of providing more timely, accurate and granular target descriptions. If they ever start to view the knowledgebase vendors as bottlenecks, the Linked Description Data approach may prove appealing.

Computers don't learn "just-in-time" or "just-in-case" the way humans do. But the matching at the core of making links can be an expensive process, taking time proportional to the square of the number of items (N²). Identifiers make the process vastly more efficient (N·log N). This expense can be front-loaded (just-in-case) or saved till the last moment (just-in-time), but opening the descriptions being matched for "when-there's-time" processing could result in dramatic advances in linking systems as a whole.

Wednesday, May 20, 2009

Reif#&cation Part 2: The Future of RDF, RDFa, and the Semantic Web is Behind Us

In Reif#&cation Part 1, I introduced the concept of reification and its role in RDF and the Semantic Web. In Part 3, I'll discuss the pros and cons of reification. Today, I'll show some RDFa examples.

I've spent the last couple of days catching up on lots of things that have happened over the last few years while the semantic web part of my brain was on vacation. I was hoping to be able to give some examples of reification in RDFa using the vocabulary that Google announced it was supporting, but I'm not going to be able to do that, because the Google vocabulary is structured so that you can't do anything useful with reification. There are some useful lessons to draw from this little fact. First of all, you can usually avoid reification by designing your domain model to avoid it. You should probably avoid it too if you can. In the Google vocabulary, a Review is a first-class object with a reviewer property. The assertion that a product has rating 3 stars is not made directly by a reviewer, but indirectly by a review created by a reviewer.

Let's take a look at the HTML snippet presented by Google on their help page for RDFa (it's permissible to skip past the code if you like):


<div xmlns:v="http://rdf.data-vocabulary.org/#"
typeof="v:Review">
<p><strong><span property="v:itemReviewed">
Blast 'Em Up</span>
Review</strong></p>
<p>by <span rel="v:reviewer">
<span typeof="v:Person">
<span property="v:name">Bob Smith</span>,
<span property="v:title">Senior
Editor</span> at ACME Reviews
</span>
</span></p>
<p><span property="v:description">This is a great
game. I enjoyed it from the opening battle to the final
showdown with the evil aliens.</span></p>

</div>

(Note that I've corrected a bunch of Google's sloppy mistakes here: the help page erroneously had "v:person", "v:itemreviewed" and "v:review" where "v:Person", "v:itemReviewed" and "v:Review" would have been correct according to their published documentation. I've also removed an affiliation assertion that is hard to fix for reasons that are not relevant to this discussion, and I've fixed the non-well-formedness of the Google example.)

The six RDF triples embedded here are:

subject: this block of html (call it "ThisReview")
predicate: is of type
object: google-blessed-type "Review"

subject: ThisReview
predicate: is reviewing the item
object: "Blast 'Em Up"

subject: ThisReview
predicate: has reviewer
object: a google-blessed-type "Person"

subject: a thing of google-blessed-type "Person"
(call it BobSmith)
predicate: is named
object: "Bob Smith"

subject: BobSmith
predicate: has title
object: "Senior Editor"

subject: ThisReview
predicate: gives description
object: "This is a great game. I enjoyed it from the
opening battle to the final showdown with the evil
aliens."

Notice that in Google's favored vocabulary, Person and Review are first-class objects and the item being reviewed is not (though they defined a class that might be appropriate). An alternate design would be to make the item a first class object and the review a predicate that could be applied to RDF statements. The seven triples for that would be

subject: a thing of google-blessed-type "Product"
(call it BlastEmUp)
predicate: is named
object: "Blast 'Em Up"

subject: BobSmith
predicate: is named
object: "Bob Smith"

subject: BobSmith
predicate: has title
object: "Senior Editor"

subject: an RDF statement (call it TheReview)
predicate: has creator
object: BobSmith

subject: TheReview
predicate: has subject
object: BlastEmUp

subject: TheReview
predicate: has predicate
object: gives description

subject: TheReview
predicate: has object
object: "This is a great game. I enjoyed it from the
opening battle to the final showdown with the evil
aliens."
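If it helps to see those seven triples as actual RDF rather than prose, here is a rough sketch using Python's rdflib library. The blank-node labels are my own, I've stood in for Google's vocabulary with its namespace URI, and (as below) Dublin Core's "creator" plays the role of "has creator":

from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

V = Namespace("http://rdf.data-vocabulary.org/#")
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
blast_em_up = BNode()   # the Product, "BlastEmUp"
bob_smith = BNode()     # the Person, "BobSmith"
the_review = BNode()    # the reified statement, "TheReview"

g.add((blast_em_up, RDF.type, V.Product))
g.add((blast_em_up, V.name, Literal("Blast 'Em Up")))
g.add((bob_smith, V.name, Literal("Bob Smith")))
g.add((bob_smith, V.title, Literal("Senior Editor")))

# The review is itself an RDF statement about the product...
g.add((the_review, RDF.type, RDF.Statement))
g.add((the_review, RDF.subject, blast_em_up))
g.add((the_review, RDF.predicate, V.description))
g.add((the_review, RDF["object"], Literal("This is a great game. ...")))
# ...and the statement has a creator of its own.
g.add((the_review, DC.creator, bob_smith))

print(g.serialize(format="turtle"))   # prints the graph as Turtle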

To put those triples in the same HTML, I do this:


<div xmlns:v="http://rdf.data-vocabulary.org/#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
typeof="rdf:Statement"
rel="dc:creator"
href="#BobSmith">
<p><strong>
<span property="rdf:subject">
<span typeof="v:Product">
<span property="v:name">Blast 'Em Up</span>
</span>
</span> Review</strong></p>
<p>by <span typeof="v:Person" about="#BobSmith" id="BobSmith">
<span property="v:name">Bob Smith</span>,
<span property="v:title">Senior Editor</span>
at ACME Reviews
</span></p>
<p><span property="rdf:predicate"
resource="v:description"/>
<span property="rdf:object">This is a great
game. I enjoyed it from the opening battle
to the final showdown with the evil
aliens.</span></p>
</div>

I've drawn one extra term, "dc:creator", from the venerable Dublin Core vocabulary to do this.

Some observations:
  1. Reification requires a bit of gymnastics even for something simple; if I wanted to reify more than one triple, it would start to look really ugly.
  2. By using a well-thought-out knowledge model, I can avoid the need for reification.
  3. The knowledge model has a huge impact on the way I embed the information.

This last point is worth thinking about further. It means that for you and me to exchange knowledge using RDFa or RDF, we need to share more than a vocabulary; we need to share a knowledge model. It reminds me of another story I heard on NPR, about the Aymara people of the Andean highlands, whose language expresses the future as being behind them, whereas in English and other Western languages the future is thought of as being in front of us. We can learn the Aymara words for front and back, but because we don't share the same knowledge model, we wouldn't be able to speak successfully with an Aymara speaker about the past and the future.

Wednesday, May 13, 2009

My Pathetic Life (#mpl) and Collaborative Intelligence

I'm still pretty new to Twitter, but it feels pretty familiar to me. Part of the reason for that is that some of my Facebook friends have been posting their status in parallel on Facebook and on Twitter. This mostly annoyed me, because Twitterers update their statuses more often than Facebookers, and a lot of the Twitter vernacular is totally inexplicable when viewed on Facebook. On Twitter, by contrast, I find that I'm annoyed when people that I follow, but don't really know, mix details from their personal lives into their otherwise interesting Twitter streams. For example, I follow dchud because I know that he will throw out some very interesting ideas. (He was the very first person to follow me on Twitter; I was the very first person to comment on his blog way back when.) But I'm not really interested in his reports on the Washington Capitals. Somehow I find that Facebook is a much better place to get to know details like that- if you friend me there, you'll find that I'm a rabid fan of the Philadelphia Phillies, and I won't mind it if you update me on the triumphs of your Columbus Blue Jackets. We both knew they would lose eventually.

The thing that intrigues me about Twitter is that it does so little so well that it's much easier to fix the problems it does have. The past two days I've been writing about the challenges of propagating vocabulary (and grammar for that matter!) for use in the semantic web. Yesterday, Google demonstrated one way of propagating vocabulary- be big and powerful and just tell the world what vocabulary to use. Ian Davis called Google's approach to implementing RDFa "a damp squib," which is what Americans would less colorfully call a "dud" or a wet firecracker. He lamented that Google had chosen to use their own limited vocabulary rather than adopt vocabulary already in use. R.V. Guha, whom I mentioned in yesterday's post, commented on Davis' blog that we shouldn't judge too soon what Google is doing. A lot of us are hoping that "igniter fuse" will turn out to be an apter pyrotechnical analogy.

The other strategy for vocabulary propagation is based on community-based collaboration. In my post yesterday, I complained that it was hard to find vocabulary that I might use to attach an ISBN to a resource. In contrast, Twitter, together with the accessories that are built around it, seems to enable rapid propagation of vocabulary and grammar. So back to my complaint about how Twitter streams seem to annoyingly mix tweets of varying interest. One way that people deal with this is to use multiple accounts to organize their tweets into different genres, the same way you might want to have a business email and a personal email. Perhaps a better, more flexible way to address this would be to adopt a special hashtag to signal that a tweet is not a product of one's brilliant intellect, but rather just a status message about "my personal life" (#mpl). That way, your mom can easily filter out your irrelevant work stuff and your boss can filter out your irrelevant personal stuff. Anyway, if you think this is a good idea, see if you can help propagate the use of #mpl (let's call the idea #mplIdea). If you don't think it's a good idea, or if there's some better way to fix this Twitter deficiency, leave a comment. Let's see if we can demonstrate the power of the collaborative approach to vocabulary propagation.
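As a toy illustration of the kind of filtering a shared tag would make possible (the tweets below are made up), a client could split a stream on #mpl like this:

def split_stream(tweets, tag="#mpl"):
    # Anything carrying the tag is "my personal life"; everything else is work.
    personal = [t for t in tweets if tag.lower() in t.lower()]
    work = [t for t in tweets if tag.lower() not in t.lower()]
    return personal, work

personal, work = split_stream([
    "Reification is the RDF feature everyone loves to hate",
    "Phillies walk it off in the ninth! #mpl",
])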

The adoption of RDFa by Google and their centralized approach to vocabulary may be a turning point in the first stage of the semantic web- that of using the web to aggregate data. I think this approach is not going to take us very far. We need to start building the second stage of the semantic web. We should be thinking about collaborative intelligence rather than about accumulating distributed sets of data. I don't expect that machines will be able to come up with ideas like #mplIdea, but I do think it's reasonable for machines to be able to help us judge whether ideas like #mplIdea are inspired (and should be propagated) or whether they're just stupid.

Tuesday, May 12, 2009

Google, RDFa, and Reusing Vocabularies

Yesterday, I wrote about one difficulty of having machines talk to other machines- propagation and re-use of vocabularies is not something that machines being used today know how to do on their own. I thought it would be instructive to work out a real example of how I might find and reuse vocabulary to express that a work has a certain ISBN (international standard book number). What I found (not to my great surprise) was that it wasn't that easy for me, a moderately intelligent human with some experience at RDF development, to find RDF terminology to use. I tried Knoodl, Google, and SchemaWeb to help me.

Before I complete that thought, I should mention that today Google announced that they've begun supporting RDFa and microformats in what they call "rich snippets". RDFa is a mechanism for embedding RDF in static HTML web pages, while microformats are a simpler and less formalized way to embed metadata in web pages. Using either mechanism, web page authors can hide information, in structures meant to be read by machines, inside the same web pages that humans read.

Concentrating on just the RDFa mechanism, it's interesting to see how Google expects that vocabulary will be propagated to agents that want to contribute to the semantic web: Google will announce the vocabulary that it understands, and everyone else will use that vocabulary. Resistance is futile. Not only does Google have the market power to set a de facto standard, but it has the intellectual power to do a good job of it- one of the engineers on the Google team working on "rich snippets" is Ramanathan V. Guha, who happens to be one of the inventors of RDF.

You would think that it would be easy to find an RDF property that has been declared for use in assertions like "the ISBN of 'Digital Copyright' is 1-57392-889-5". No such luck. Dublin Core, a schema developed in part by the library community, has an "identifier" element which can be qualified to indicate that it contains an ISBN, but no ISBN property. Maybe I just couldn't find it. Similarly, MODS, which is closely related to library standards, has an identifier element that can contain an ISBN, but you have to add type="isbn" to the element to make it one. Documentation for RDFa wants you to use the ISBN to make a URN and to make that URN the subject of your assertion, not an attribute of it (ignoring the fact that an ISBN identifies the thing you sell in a bookstore, for example the paperback version of a book, rather than what most humans think of as a book). I also found entries for ISBN in schemes like The Agricultural Metadata Element Set v.1.1 and a mention in the IMS Learning Resource Meta-Data XML Binding. Finally, I should note that while OpenURL (a standard that I worked on) provides an XML format which includes an ISBN element, it's defined in such a way that it can't be used in other schemas.
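For what it's worth, here is a rough sketch (in Python, using the rdflib library) of the two workarounds I did find: treating the URN built from the ISBN as the subject of the description, as the RDFa documentation suggests, or stuffing the number into Dublin Core's generic identifier property. Neither one gives you a property that simply means "has ISBN".

from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")
g = Graph()

# Workaround 1: let the ISBN-derived URN *be* the resource being described.
book = URIRef("urn:isbn:1-57392-889-5")
g.add((book, DC.title, Literal("Digital Copyright")))

# Workaround 2: attach the ISBN as a plain Dublin Core identifier literal,
# since Dublin Core has no ISBN-specific property.
g.add((book, DC.identifier, Literal("1-57392-889-5")))

print(g.serialize(format="turtle"))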

The case of ISBN illustrates some of the barriers to vocabulary reuse, and although there are those who criticize Google for not reusing vocabulary, you can see why Google thinks things will work better if it just defines the vocabulary by fiat.