the shed
experiments

Out of the cube

For a project that I’m working on at the National Museum of Australia, I’ve started collecting various sources of date-identified data. Most recently I had a go at extracting historical population data from the Australian Bureau of Statistics.

The data can all be downloaded as .xls files, but they’re not simple, flat spreadsheets – they’re data cubes. As the name suggests, data cubes are organised along a number of dimensions. In the case of the population data it’s year, state and gender.

This means that you can’t just export the data to CSV and suck it into your database – first you’ve got to flatten the cube. No doubt there are other ways to do this, but I just wrote a simple Python script. It uses xlrd to read from the spreadsheet, does a bit of reorganisation, then writes the output to a CSV file. The code, for what it’s worth, is available at Bitbucket.
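To give a feel for what flattening involves, here is a minimal sketch. It is not the script on Bitbucket. It assumes a made-up layout with one sheet per sex, a header row of states and one row per year; the real ABS workbooks may be organised quite differently. The function names and sample numbers are invented for illustration. With xlrd, each grid could be built by calling sheet.row_values(i) for every row.

```python
import csv
import io


def flatten_cube(sheets):
    """Turn {sex: grid} into flat (year, state, sex, value) tuples.

    Assumed layout (hypothetical, not the real ABS one): the first row
    of each grid is a header ('Year', state, state, ...), and every
    later row starts with a year followed by one value per state.
    """
    flat = []
    for sex, grid in sheets.items():
        header, *body = grid
        states = header[1:]
        for row in body:
            year = int(row[0])
            for state, value in zip(states, row[1:]):
                if value in ("", None):  # skip empty cells
                    continue
                flat.append((year, state, sex, int(value)))
    return flat


def write_csv(rows, handle):
    """Write the flattened rows, with a header line, to an open file."""
    writer = csv.writer(handle)
    writer.writerow(["year", "state", "sex", "population"])
    writer.writerows(rows)


# Made-up sample values, for illustration only
sample = {
    "males": [["Year", "NSW", "Vic"], [1901, 100, 200], [1902, 110, ""]],
    "females": [["Year", "NSW", "Vic"], [1901, 90, 190]],
}
rows = flatten_cube(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Each cell of the cube becomes one row, so the result can go straight into a single database table with year, state and sex as ordinary columns.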

Once I had the CSV file I just imported it into MySQL and used Django and Piston to build a basic API. So if you want to know the population of NSW in 1856, you just go to:

http://wraggelabs.com/api/json/population/nsw/1856/

The number of infant deaths in Tasmania in 1932:

http://wraggelabs.com/api/json/infantdeaths/tas/1932/

The number of female births in Australia in 1959:

http://wraggelabs.com/api/json/births/australia/females/1959/

I’m sure you get the picture. You can change the ‘json’ to ‘xml’ if you’d like another flavour of data.
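If you’d rather build these addresses in code, the pattern in the examples above can be sketched as follows. The helper names are hypothetical, and the URL pattern is only inferred from the three examples. The fetch function simply returns whatever JSON the API sends back, since the shape of the response isn’t described here.

```python
import json
from urllib.request import urlopen

BASE = "http://wraggelabs.com/api"


def api_url(series, region, year, sex=None, fmt="json"):
    """Build a URL following the pattern of the examples above."""
    parts = [BASE, fmt, series, region]
    if sex:
        parts.append(sex)
    parts.append(str(year))
    return "/".join(parts) + "/"


def fetch(series, region, year, sex=None):
    """Request the JSON flavour and return the parsed result as is."""
    with urlopen(api_url(series, region, year, sex)) as response:
        return json.load(response)


print(api_url("population", "nsw", 1856))
print(api_url("births", "australia", 1959, sex="females", fmt="xml"))
```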

The API in action - a simple population browser

With an API delivering JSON you can start playing around with all sorts of fun AJAX-y stuff. To demonstrate I built a simple population browser using jQuery. Just drag the slider!

I link therefore I am

Let me be clear. I am not Tim Sherratt the sound engineer. Nor, indeed, am I Timothy Sherratt, author of Saints as Citizens: A Guide to Public Responsibilities for Christians. We are three different people, spread across three continents, locked in a deadly battle for global supremacy via Google search rankings. There can be only one…

Of course you probably knew I wasn’t a British sound engineer or an American politics professor. There are plenty of contextual clues within this website, even on this page, to indicate that my interests lie elsewhere. But while we humans are pretty good at picking up such clues, it’s much harder for computers. When Google comes to index my site, how does it know I’m not a sound engineer who likes to dabble in history? Indeed, how does Google, or any computer, know that the words ‘Tim Sherratt’ are actually a person’s name? These are questions of both identity and semantics.

Librarians have been dealing with questions of identity for many, many years, developing detailed name authority records. Such records allow name variations to be cross-referenced and individuals to be uniquely identified. For example, I have a control number of ‘n 2005043272’ in the Library of Congress authorities database, while Timothy R Sherratt, the politics professor, has been assigned ‘n 94106739’.

The National Library of Australia has developed its own name authority file. However, the NLA has realised that reliable identity data has a much broader application than simply cataloguing, and is using its name authority data as the foundation of an exciting new resource – People Australia. People Australia will mesh its own records with biographical data from a variety of outside sources, creating a rich collection of interlinked identities. Already entries from the Australian Dictionary of Biography have been ingested.

So now, thanks to People Australia, if I ever get confused about who I am I just have to remember one little URL – my very own persistent identifier – http://nla.gov.au/nla.party-479364. I’m going to get a t-shirt made up.

But that doesn’t help our new machine overlords very much. How can a computer tell that the words ‘Tim Sherratt’ describe a person and that more information about that person can be found at http://nla.gov.au/nla.party-479364? This is the sort of problem that the semantic web hopes to solve. The semantic web aims to expose the structures that are buried in our documents and databases, to make explicit the contextual clues that humans pick up, but computers ignore. As the slogan goes, it represents a change from a ‘web of documents to a web of data’.

The semantic web uses a variety of tools and standards to encode information in a form that means something to computers. FOAF (Friend of a Friend) is, for example, a machine-readable ontology that describes people and their relationships. A computer visiting this page can in fact find out a fair bit about me, including my NLA persistent identifier, because there is a link to a small XML file in which my details are expressed using FOAF.

But if this seems a little daunting, the semantic web offers another technology which is really just as easy as marking up a page in HTML – it’s called RDFa. This link – Tim Sherratt – is more than it seems. Here is what a computer sees:

<a typeof="foaf:Person" property="foaf:name" content="Sherratt, Tim" rel="foaf:isPrimaryTopicOf" href="http://nla.gov.au/nla.party-479364">Tim Sherratt</a>

This says that Tim Sherratt is a person whose name has the standard form ‘Sherratt, Tim’ and who is the primary topic of the page to be found at http://nla.gov.au/nla.party-479364. There’s a fair bit of semantic goodness in that one little link. If the NLA page also expressed its data in a machine-readable form, this link could send search engines and browsers into a whole new world of associations and inferences.
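Generating a link like this is easy to automate. Here is a small sketch of a helper that builds the same markup from a display name, a standard-form name and an identifier URL. The function name is hypothetical, and this is not the code behind any tool mentioned in this post.

```python
from html import escape


def rdfa_person_link(display_name, sort_name, identifier_url):
    """Return an RDFa-annotated link in the same form as the example above."""
    return (
        '<a typeof="foaf:Person" property="foaf:name" '
        'content="{content}" rel="foaf:isPrimaryTopicOf" '
        'href="{href}">{text}</a>'
    ).format(
        content=escape(sort_name, quote=True),
        href=escape(identifier_url, quote=True),
        text=escape(display_name),
    )


print(rdfa_person_link(
    "Tim Sherratt",
    "Sherratt, Tim",
    "http://nla.gov.au/nla.party-479364",
))
```

Escaping the three values keeps the markup valid even when a name contains an ampersand or a quotation mark.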

But I suppose you’re thinking that the code still looks a bit complicated. Well never fear, this long post is really just an introduction to a new project I’ve been working on – something that will help you generate markup like this with just a couple of clicks.

Introducing Wragge’s identity browser

I’ve been interested in publishing biographical data since the early days of Bright Sparcs and, sad as it may seem, I find the possibilities of People Australia pretty exciting. However, I don’t think we should expect the NLA to do all the work. People Australia provides a framework that we can all use to enrich our own documents, databases, finding aids, and applications.

You can easily access People Australia data through Trove. But to get a better idea of what’s in the database, I’d suggest you spend some time playing with its SRU interface. Using this you can query the database directly, retrieving results in XML – ready for your own application to suck up and use.
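An SRU query is just a URL with a handful of standard parameters. The sketch below builds a searchRetrieve request using the parameter names from the SRU 1.1 standard. The base address is a placeholder, not the real People Australia endpoint, and the bare search term relies on whatever default index the server chooses, so check the service’s own documentation before using it in anger.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder only: substitute the real SRU endpoint
SRU_BASE = "http://example.org/sru"


def sru_search_url(cql_query, start=1, maximum=10, base=SRU_BASE):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return base + "?" + urlencode(params)


url = sru_search_url("Wragge")
print(url)
```

The XML that comes back can then be handed to any XML parser, such as xml.etree.ElementTree from the standard library.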

To make this even easier, I’ve written a People Australia client library in Python. This enables you to quickly extract and use identity information. Using it, your own web application can talk to People Australia directly. I won’t go into the details here – the code is fairly heavily commented – but I welcome any feedback, suggestions or contributions. Copy it, change it, use it!

To try out my library and to provide a tool that might be of use to the average punter I’ve also built:

<TA-DA>Wragge’s identity browser!</TA-DA>

It’s pretty simple. Search for a surname, pick a name from the result list, and view their identity details. For example, here are Clement Wragge’s details.

But there are a couple of extra features that I am rather smugly pleased with. First of all, there’s an 'Identify me!' bookmarklet. Just drag the link to your browser’s bookmarks or favourites toolbar (see below for some further notes).

Once you have the bookmarklet installed all you have to do to find the identity record for someone is to highlight their name on a webpage and click ‘Identify me!’. You could then grab the People Australia ID to store in your own application, allowing you (with the help of my client library) to automatically include links to relevant entries in the Australian Dictionary of Biography, for example.

Even better, Wragge’s identity browser will automagically generate the RDFa markup you need to semantically enrich your document. Whether you’re writing a blog post, publishing an article, drafting a caption, creating a database entry, or preparing a finding aid you can quickly and easily find an individual and then cut and paste the code you need.

To show this in action I used the bookmarklet to help me mark up many of the people named in one of my articles. We humans see a normal page with a few extra links. Computers, however, can extract the embedded RDFa to get at the structured information that’s hidden in the page.
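To get a feel for the computer’s view, here is a rough sketch that pulls FOAF people out of a page using nothing but Python’s standard HTML parser. It is not a real RDFa processor: it only recognises links written in exactly the form shown earlier, and the class name is made up for this example.

```python
from html.parser import HTMLParser


class FoafPersonExtractor(HTMLParser):
    """Collect links marked up with typeof="foaf:Person"."""

    def __init__(self):
        super().__init__()
        self.people = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("typeof") == "foaf:Person":
            self._current = {
                "name": attrs.get("content"),
                "identifier": attrs.get("href"),
                "text": "",
            }

    def handle_data(self, data):
        # Gather the visible link text while inside a person link
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.people.append(self._current)
            self._current = None


page = (
    '<p>Written by <a typeof="foaf:Person" property="foaf:name" '
    'content="Sherratt, Tim" rel="foaf:isPrimaryTopicOf" '
    'href="http://nla.gov.au/nla.party-479364">Tim Sherratt</a>.</p>'
)
parser = FoafPersonExtractor()
parser.feed(page)
parser.close()
print(parser.people)
```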

Now I’ve got to go and semantify the rest of my articles…

Go forth and identify! And in the process help build a better web.

Notes on the bookmarklet

  • Internet Explorer has ‘Favorites’, Firefox has ‘Bookmarks’ – whatever you’re using, first make sure that your Bookmarks/Favourites toolbar is visible. Look under Tools->Toolbars in IE8, View->Toolbars in Firefox.
  • Try dragging the ‘Identify me!’ link to your Bookmarks/Favourites toolbar. If it doesn’t work, try right clicking on the link and choose ‘Bookmark this link’ or ‘Add to Favourites’. Make sure you add it to the toolbar folder. IE will probably give you various warnings – ignore them.
  • You should now have a working bookmarklet – highlight a name, click the bookmarklet, and a new window should open with results from Wragge’s identity browser. IE might complain about opening a pop-up – allow pop-ups and try again.
  • The bookmarklet is pretty clever about working out which part of the highlighted text is the surname, so you can highlight names in a number of formats including:
    • Surname
    • Surname’s
    • Surname, Othernames
    • Othernames Surname
    • Othernames Surname’s
  • For the moment this only works with ‘straight’, ie non-curly, apostrophes – but I’ll fix this asap. Fixed!
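The surname guessing described in the notes above can be sketched in a few lines. The bookmarklet itself runs in the browser, so this Python version is only an illustration of the idea, not the actual code, and the function name is made up. It assumes single-word surnames, so something like ‘von Mueller’ would trip it up.

```python
import re


def guess_surname(selection):
    """Guess the surname from highlighted text in the formats listed above."""
    text = selection.strip()
    if "," in text:
        # 'Surname, Othernames': the surname comes before the comma
        candidate = text.split(",", 1)[0].strip()
    else:
        # 'Surname' or 'Othernames Surname': take the last word
        words = text.split()
        candidate = words[-1] if words else ""
    # Drop a trailing possessive, with a straight or curly apostrophe
    return re.sub("['\u2019]s$", "", candidate)


for example in ["Wragge", "Wragge's", "Wragge, Clement",
                "Clement Wragge", "Clement Wragge\u2019s"]:
    print(example, "->", guess_surname(example))
```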

Notes on RDFa markup

  • You have a choice between visible (ie clickable) links or invisible ones. They look the same to computers, so it’s just a matter of whether you want your human visitors to see them. Click ‘change’ to toggle between the two options.
  • You can just paste the RDFa markup straight into your document. If you’ve used the bookmarklet, the text you highlighted will be automatically inserted as the link text – so just copy and paste. If you haven’t used the bookmarklet you can insert the link text yourself.
  • Somewhere in your document you need to tell computers what the FOAF in your RDFa markup means. You do this by inserting the text xmlns:foaf="http://xmlns.com/foaf/0.1/" inside a tag that contains your marked up text. If you can edit the raw HTML of your page, you can just insert it in the <html> tag itself, so it becomes <html xmlns:foaf="http://xmlns.com/foaf/0.1/">. Otherwise you can wrap your marked up text in a <div> tag and put the extra code in there.
  • If you’re using something like WordPress that strips out or converts any markup that it doesn’t expect, you need to be able to enter the RDFa as ‘raw’ HTML. In WordPress you can do this using the Raw HTML plugin.
  • For more on using RDFa have a look at: RDFa for HTML Authors.

Harvesting context #1: Flickr comments

Instead of idly waiting for visitors to stumble over their holdings on some lonely information by-way, archives are starting to push their content out into the bustling metropolis of the social web. They are going where the people are. Photographic collections, in particular, are gaining new lives and new audiences thanks to Flickr.

But that’s only part of the story. Released into the wild, these photos are slowly picking up the habits of the locals. They are making friends, building connections, even speaking with new accents and dialects. Commented, tagged, organised, linked – they are building new contexts for themselves outside of the cloying control of archival descriptive systems.

Unfortunately it seems there is often a chasm between the old lives of the photos, documented in databases and finding aids, and their new post-institutional careers. This is a pity because the new contexts they are gathering can help us both understand and find them. What can we do to overcome this divide? How could finding aids harvest and display the user-generated content that aggregates around collection items living in the outside world?

The good news is that the tools to start doing this already exist – Flickr has a powerful API that makes it easy to extract photo metadata. Time for a bit of experimenting…
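As a starting point, here is a sketch of how a request for a photo’s comments might be put together. It uses Flickr’s flickr.photos.comments.getList method through the REST endpoint. The photo id and API key are placeholders, the helper name is made up, and this is only one possible approach rather than a description of the experiment itself.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def comments_url(photo_id, api_key):
    """Build a REST URL asking Flickr for the comments on one photo."""
    params = {
        "method": "flickr.photos.comments.getList",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": 1,  # ask for plain JSON rather than JSONP
    }
    return FLICKR_REST + "?" + urlencode(params)


# Placeholder values: use a real photo id and your own API key
url = comments_url("1234567890", "YOUR_API_KEY")
print(url)
```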

Cloudy biographies and portrait walls

With a bit of time to play over Christmas I had a go at applying some of the techniques described at ProgrammingHistorian to the ADB Online. I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had to offer as a way of improving access to the articles.

So I set about learning Python and was soon downloading and scraping the more than 10,000 articles that make up the ADB online.

My first tests revealed that the most frequent words in ADB articles were…

born and died

Who’d have thought it? In a biographical dictionary?

After further refining the stopwords list I started to generate some useful clouds. Finally, after 147 minutes of processing time, I had a word cloud representing the content of all 16 volumes of the Australian Dictionary of Biography.
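The counting behind a word cloud is simple enough to sketch. The example below is an illustration only, with a tiny invented stopword list and a made-up sentence rather than real ADB text. It shows how adding words like ‘born’ and ‘died’ to the stopwords changes what floats to the top.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "and", "of", "in", "a", "was", "he", "to"}


def word_frequencies(text, stopwords=STOPWORDS, extra_stopwords=()):
    """Count the words in text, ignoring case and skipping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    skip = set(stopwords) | set(extra_stopwords)
    return Counter(word for word in words if word not in skip)


# Made-up sample sentence, not taken from the ADB
sample = "He was born in Sydney and died in Melbourne. She was born in Hobart."
freq = word_frequencies(sample)
refined = word_frequencies(sample, extra_stopwords=("born", "died"))
print(freq.most_common(3))
print(refined.most_common(3))
```

The counts can then be scaled to font sizes to draw the cloud itself.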

The complete ADB word cloud
