Thoughts on Citability

Over this weekend (9–11 April) I watched the Citability CODEATHON on Ustream. I already knew about Citability.org from Silona Bonewald (@Silona on Twitter), but the codeathon was very interesting to follow as a spectator, for both its discussions and its prototypes.

What is citability.org?
Citability supports making public government documents and data available online and citable such that they can be easily referenced for public debate, commentary and analysis. This requires that archived versions of documents be stored and linkable so that changes can be easily spotted and reference links remain intact.
(from http://dccodeathon.pbworks.com)

I was thinking about changes. A live document, as opposed to its archived snapshot, may:

  • change its location
  • change its presentation, possibly breaking intra-document addressing
  • undergo minor changes (let’s call them lex­ical ones) such as spellcheck/​grammar/​punctuation changes
  • undergo major changes, semantic ones (as opposed to the above lex­ical ones)

In 2000, Thomas A. Phelps and Robert Wilensky wrote “Robust Hyperlinks and Locations,” where they describe how lexical signatures can be used to find a moved document (or copies of the same document). They also address the issue of “Robust Intra-document Locations.”

Basically, a lexical signature of a document is a set of keywords (computed with a TF/IDF-like algorithm) which, when used as a search query on Google, returns the same document and/or copies of it as the top hit. Since this signature is computed against “Google’s corpus,” the results will skew over time as the corpus changes. But since Citability keeps snapshots of the original documents, their lexical signatures can be re-computed.
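To make the idea concrete, here is a minimal sketch of picking signature terms with a plain TF-IDF ranking. It is not Phelps and Wilensky's exact algorithm: the function names, the `size` parameter, and the small in-memory corpus standing in for a web-scale index are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, size=5):
    """Return up to `size` terms of `document` with the highest TF-IDF score.

    `corpus` is a list of document strings, used only to count in how many
    documents each term occurs (a stand-in for a search engine's index).
    """
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []

    term_counts = Counter(doc_tokens)
    corpus_token_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_token_sets)

    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for tokens in corpus_token_sets if term in tokens)
        # Smoothed IDF: a term present in every corpus document scores 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scores[term] = (count / len(doc_tokens)) * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:size]
```

Re-computing a signature then amounts to calling `lexical_signature` again on the archived snapshot with a fresher corpus.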

I believe that Citability could easily employ this lexical signature technique to locate moved or duplicate documents, and to help detect minor lexical changes (signatures could be more flexible here than raw hashes). Moreover, the intra-document re-attachment algorithm can help detect structural changes in documents and re-attach citations, or simply help observe how documents evolve.
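As a rough illustration of why signatures can be more forgiving than raw hashes, the sketch below compares two snapshots both ways. The function names, the labels, and the 0.8 overlap threshold are hypothetical choices of mine, not part of Citability or of the Phelps and Wilensky paper; the signatures are assumed to be keyword sets computed elsewhere.

```python
import hashlib


def content_hash(text):
    """SHA-256 of the raw text: any single-character edit changes it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def jaccard(a, b):
    """Overlap between two keyword collections, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_change(old_text, new_text, old_signature, new_signature,
                    threshold=0.8):
    """Label the difference between two snapshots of a document.

    `threshold` is an arbitrary, illustrative cut-off for signature overlap.
    """
    if content_hash(old_text) == content_hash(new_text):
        return "unchanged"
    if jaccard(old_signature, new_signature) >= threshold:
        return "minor-lexical"
    return "substantial"
```

A punctuation fix changes the hash but usually leaves the keyword signature intact, so the two snapshots would still be recognised as the same document.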

Over the past years, as part of my experiments in semantic navigation, I used such lexical signatures as input to an ontology search engine (Watson) to discover which ontologies can cover a specific document. This provides a way to discover semantically (and not just lexically) related documents, while steering document discovery through domain-specific ontologies.

I believe that such semantic signatures (think of the lexical ones, but elevated to interlinked concepts), apart from enabling topic-based navigation between documents or measuring semantic similarity, could be an interesting way to detect major changes: “semantic” ones, as opposed to minor lexical ones.

I can­not wait for the Cit­ab­il­ity pro­ject to take off, as their archives would be a valu­able cor­pus for research­ing doc­u­ment sim­il­ar­ity, change detec­tion and doc­u­ment evolution.

Note: The ‘semantic signatures’ I refer to are different from TextWise.com’s.
