Why Data Citation Is a Computational Problem

Posted on October 6, 2016 by guidetopharmacology

By Peter Buneman

The database development team encouraged me to write this off-topic blog on data citation, as it may be of interest to people involved with the IUPHAR/BPS Guide to Pharmacology (GtoPdb).

It must be almost ten years ago that Tony Harmar mentioned that he was thinking of buying digital object identifiers for the then IUPHAR database. It turned out that he was hoping that this would confer some scholarly recognition on the database, but what he really wanted to do was to get people to cite it, just as they would cite any other publication. Among other things, he wanted to ensure that the relevant contributors and curators received proper credit.

I thought about the problem for a while, wrote a rather naive paper about it, and more or less forgot about it for a few years. Then data citation became a hot topic, and with some colleagues I started to think about it again. Here’s a problem: GtoPdb does a passable job of specifying the citation for each page that you see in the Web presentation, but what citation would you provide for some arbitrary SQL query on the underlying data? It turns out that this is a ubiquitous problem in data citation, and one that is tricky to solve in general.
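To make the problem concrete, here is a minimal sketch in Python using an in-memory SQLite database. Everything in it is hypothetical: the tables, the family and curator names, the citation format and the two helper functions are invented for illustration and are not the GtoPdb schema, nor the approach described in the CACM article. It only contrasts a fixed per-page citation with one naive answer for an ad hoc query, namely crediting every family the result happens to touch.

```python
import sqlite3

# Hypothetical miniature schema and data; NOT the real GtoPdb tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE family (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contributor (family_id INTEGER, person TEXT);
    CREATE TABLE target (id INTEGER PRIMARY KEY, family_id INTEGER, name TEXT);
    INSERT INTO family VALUES (1, 'Family A'), (2, 'Family B');
    INSERT INTO contributor VALUES
        (1, 'Curator X'), (1, 'Curator Y'), (2, 'Curator Z');
    INSERT INTO target VALUES
        (10, 1, 'Target a1'), (11, 1, 'Target a2'), (20, 2, 'Target b1');
""")


def page_citation(family_id):
    """Citation for one web page, i.e. a fixed view of a single family."""
    (name,) = conn.execute(
        "SELECT name FROM family WHERE id = ?", (family_id,)
    ).fetchone()
    people = [
        row[0]
        for row in conn.execute(
            "SELECT person FROM contributor WHERE family_id = ? ORDER BY person",
            (family_id,),
        )
    ]
    # The citation format is an assumption made for this sketch.
    return f"{', '.join(people)}. {name}. Example database."


def query_citation(sql, params=()):
    """Naive citation for an ad hoc query that returns target ids:
    credit every family that owns at least one returned target."""
    target_ids = [row[0] for row in conn.execute(sql, params)]
    if not target_ids:
        return []
    marks = ",".join("?" * len(target_ids))
    families = conn.execute(
        "SELECT DISTINCT family_id FROM target "
        f"WHERE id IN ({marks}) ORDER BY family_id",
        target_ids,
    )
    return [page_citation(fid) for (fid,) in families]


# An arbitrary query whose result cuts across two pages.
citations = query_citation(
    "SELECT id FROM target WHERE name LIKE ?", ("Target %1",)
)
for line in citations:
    print(line)
```

Even in this toy setting the naive rule has obvious weaknesses: a query that touches many families yields an unwieldy list of citations, and a query that returns nothing, or only an aggregate, has no clear citation at all.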

My colleagues Susan Davidson, James Frew and I produced a general approach to this and sent it to Communications of the ACM — a publication that is widely read by computer scientists. They liked it to the extent that they made it a cover story and produced a film about it.

So thanks to Tony for the idea and thanks to the curators of GtoPdb for letting us use their database as a guinea pig.

Follow this link to the full CACM article, Why Data Citation Is a Computational Problem.

Follow this link to the video, https://vimeo.com/177314966