Exposing the archives of White Australia

I recently gave a presentation in the Institute of Historical Research’s Digital History Seminar series. The time difference between London and Canberra was a bit of a challenge, so I pre-recorded the presentation and then sat in my own Twitter backchannel while it played. For the full podcast information go to HistorySPOT. You can also play with my slides or peruse the #dhist Twitter archive.

Exposing the Archives of White Australia from History SPOT on Vimeo.

Bus trips and building

Last week I took my daughter to Sydney so she could attend a girls-only Minecraft workshop at the Powerhouse Museum (they created some wonderful things). It was a 3½ hour bus journey each way, so to keep myself occupied I set myself the challenge of trying to build something en route. I made a fair bit of progress, but ultimately failed. I had to steal a few extra hours this week to get it to the point where people might find it useful.

The Australian WWI Records Finder

So here it is — a (sort of) aggregated search interface to records about Australian First World War service personnel. Give it a name and it will search:

It’s ‘sort-of’ aggregated because it’s really just a series of separate searches presented on the one page. But even this should make it easier for people to match up records across the different data sets.

Using

Type in a family name and, optionally, a given name or a service number. Hit search. Wait. Wait a bit more. The National Archives’ RecordSearch database can often be pretty slow. Eventually though, each of the databases will be queried in turn and the results added to the page.

Once the results have loaded, click on a title and the little spinny thing will start up again as more details are retrieved from the database. In this ‘detail’ view, all the other results from the database are hidden. This makes it a bit easier to compare records across databases. Just click on the title again to go back to the ‘list’ view.

If your search returns lots of results, you can use the ‘next’ and ‘previous’ links to explore the complete set. They’ll all load in the current page via the magic of AJAX.

It’s not obvious from the interface, but you can feed query parameters directly via the url. For example try http://wraggelabs.com/ww1-records/?family_name=wragge. Why is this useful? Perhaps you’ve got your own database of names on the web. Using this you could easily create links from each name that looked for relevant records in the Finder.
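As a rough sketch of that linking idea, here is how a list of names might be turned into Finder links. Only the base url and the family_name parameter come from the example above; the finder_link helper and the sample names are made up for illustration, and any extra parameters you pass are guesses rather than documented options.

```python
from html import escape
from urllib.parse import urlencode

# Base address of the Finder, taken from the example url in the post.
FINDER_URL = "http://wraggelabs.com/ww1-records/"


def finder_link(family_name, **extra_params):
    """Build a Finder search url for one person.

    Only ``family_name`` is a parameter name confirmed by the post;
    anything passed in ``extra_params`` is the caller's own assumption.
    """
    params = {"family_name": family_name}
    params.update(extra_params)
    # urlencode takes care of spaces, apostrophes and other awkward characters.
    return FINDER_URL + "?" + urlencode(params)


# A made-up list standing in for "your own database of names".
people = ["Wragge", "O'Brien", "Le Fevre"]

for name in people:
    print(f'<a href="{finder_link(name)}">{escape(name)}</a>')
```

Building the query string with urlencode rather than by hand means names with spaces or punctuation still produce valid links.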

That’s about it. It’s just a quick, bus-trip-inspired experiment, so there are many limitations and future possibilities.

Limitations

<!–INSERT USUAL WARNING ABOUT THE FRUSTRATIONS OF SCREEN SCRAPING–>

I’m just using the standard search interfaces of the various databases and screen-scraping the results. Unfortunately they all work slightly differently. For example, the AWM databases don’t distinguish between family names and given names, so if you search for the family name ‘Smith’ you’ll also get results like ‘Jones, Bruce Smith’. The CWGC database, on the other hand, will only match an other name if it comes first, while RecordSearch (or more strictly NameSearch) will also match the names of next-of-kin. Fun fun fun.
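One possible way to smooth over the first of these differences would be to post-filter the scraped results. The sketch below is not something the Finder does; it assumes names arrive in the ‘Family, Given names’ form shown in the example above, and the function name and sample data are hypothetical.

```python
def family_name_matches(display_name, family_name):
    """Return True if the part before the first comma equals family_name.

    Assumes names are formatted as 'Family, Given names'.
    The comparison ignores case and surrounding whitespace.
    """
    family_part = display_name.split(",", 1)[0]
    return family_part.strip().lower() == family_name.strip().lower()


# Made-up results of the kind a search for 'Smith' might return.
results = ["Smith, John", "Jones, Bruce Smith", "SMITH, Mary Ann"]

filtered = [name for name in results if family_name_matches(name, "Smith")]
print(filtered)
```

A filter like this would silently drop any record whose name is formatted differently, so it trades completeness for tidiness.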

I figure anything is better than nothing, but if you’re not getting the results you expect head off to the original interfaces and try your luck there. I’m making no promises.

You’ll also notice that the maximum number of results for each data source varies. The CWGC returns 15 results, while the AWM hands over a whopping 50. These are just the default settings for the original search engines. I could’ve fiddled with the settings, but it didn’t really seem worth it.

And oh yeah… screen scraping… inherently fragile… might fall over and die at any minute.

Possibilities

As you may have guessed from previous posts, I rather like making connections. This experiment grew out of the work I’m doing on the ‘Doing Our Bit’ project with the Mosman Library. I’ve been building a series of forms that will make it easy for contributors to link people in the Mosman project to any of these databases. Just paste in a url from RecordSearch and the system will automagically retrieve all the file metadata and also check for an entry in Mapping our Anzacs. It’s pretty nifty. But of course it made me think about having a way to search across all these different databases.

And then what?

Having found a series of records for an individual it would be good if they could then be permanently linked. If I had the time and money to do more work on this, I’d want to allow people to save the connections they find. And of course then expose these connections as Linked Open Data. It wouldn’t be difficult.

There’s probably also a lot more that could be done with machine matching of records. Perhaps someone’s already working on this for the centenary — it seems like an obvious point of attack. It would be good if the forthcoming centenary commemorations resulted in something that brought all these datasets together and exposed identifiers that could be easily used by community projects like ‘Doing Our Bit’.

Details

Yes, I cheated. I had already done a lot of work on the screen-scrapery bits of this pre bus trip. I’ve been working on a RecordSearch client on and off for a while to use with projects like Invisible Australians. The AWM and CWGC scrapers I wrote for ‘Doing Our Bit’. Feel free to grab the code and play.

The actual application was built using the Python micro-framework Flask. I’m a big fan of Django, but there’s a lot of overhead involved if you just want to throw together a simple app. I’ve been wanting to try Flask for a while and was pleased to find just how quick and fun it was to get something up and running.

To make the whole thing as responsive as possible, the search results are retrieved using AJAX calls to simple APIs I built in Flask on top of my screen scraper code. There’s actually very little code in the Flask app itself. The downside of this is that the Javascript is a bit of a mess. Ah well.
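To give a feel for how little Flask code this kind of thing needs, here is a minimal sketch of one JSON endpoint sitting on top of scraper functions. It is not the Finder’s actual code: the url pattern, the SCRAPERS table and the two search functions (which just return canned data) are all assumptions made for illustration.

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)


def search_awm(family_name):
    """Hypothetical stand-in for a real scraper; returns canned data."""
    return [{"title": f"{family_name.title()}, Example", "source": "awm"}]


def search_cwgc(family_name):
    """Hypothetical stand-in for a second scraper; returns canned data."""
    return [{"title": f"{family_name.upper()}, E", "source": "cwgc"}]


# Map the short source name used in the url to the function that searches it.
SCRAPERS = {"awm": search_awm, "cwgc": search_cwgc}


@app.route("/api/<source>/")
def api_search(source):
    """Run one scraper and hand the results back as JSON for the page to insert."""
    scraper = SCRAPERS.get(source)
    if scraper is None:
        abort(404)
    family_name = request.args.get("family_name", "")
    return jsonify({"source": source, "results": scraper(family_name)})
```

With one endpoint per data source, the page can fire off a separate AJAX request for each database and add results as they arrive, so a slow source doesn’t hold up the others.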

Next

I don’t know whether I can put any more time into this at the moment — too many other projects competing for my time and no more bus trips coming up. But if you think it’s useful or worthwhile please let me know and I’ll see what I can do.

At the very least it shows how with just a little impatience and ingenuity we can find fairly simple ways to integrate records from a variety of sources. We don’t have to wait for some centralised solution.

2012 — The Making

I obviously did a lot of talking in 2012, but I also made a few things…

The evolution of QueryPic

Try QueryPic

At the start of 2012 QueryPic was a fairly messy Python script that scraped data from the Trove newspaper database and generated a local html file. It worked well enough and was generously reviewed in the Journal of Digital Humanities. But QueryPic’s ability to generate a quick visualisation of a newspaper search was undermined by the work necessary to get the script running in the first place. I wanted it to be easy and accessible for everyone.

Fortunately the folks at the National Library of Australia had already started work on an API. Once it became available for beta testing, I started rebuilding QueryPic — replacing the Python and screen-scraping with Javascript and JSON.

In the meantime, I headed over to New Zealand for a digital history workshop and began to wonder about building a NZ version of QueryPic based on the content of Papers Past, available through the DigitalNZ API. The work I’d already done with the Trove API made this remarkably easy and QueryPic NZ was born.

Once the Trove API was publicly released I finished off the new version of QueryPic. Instead of a Python script that had to be downloaded and run from the command line, QueryPic was now a simple web form that generated visualisations on demand.

The new version also included a ‘shareable’ link, but all this really did was regenerate the query. There was no way of citing a visualisation as it existed at a certain point in time. If QueryPic was going to be of scholarly use, it needed to be properly citable. I also wanted to make it possible to visualise more complex queries.

And so the next step in QueryPic’s evolution was to hook the web form to a backend database that would store queries and make them available through persistent urls. With the addition of various other bells and whistles, QueryPic became a fully-fledged web application — a place for people to play, to share and to explore.
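The basic idea of a citable, date-stamped graph can be sketched in a few lines. This is only an illustration under my own assumptions: it uses Python’s built-in sqlite3 rather than the real backend, it assumes that a saved graph stores a snapshot of the data alongside the query, and the table, function names and example url are all hypothetical.

```python
import secrets
import sqlite3
from datetime import datetime, timezone

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE saved_graph ("
    "short_id TEXT PRIMARY KEY, query_url TEXT NOT NULL, "
    "results_json TEXT NOT NULL, created TEXT NOT NULL)"
)


def save_graph(query_url, results_json):
    """Store a query with the data it returned and a date stamp; return a short id."""
    short_id = secrets.token_urlsafe(4)  # short random string, safe to use in a url
    created = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO saved_graph VALUES (?, ?, ?, ?)",
            (short_id, query_url, results_json, created),
        )
    return short_id


def load_graph(short_id):
    """Return (query_url, results_json, created) for a saved graph, or None."""
    return conn.execute(
        "SELECT query_url, results_json, created FROM saved_graph "
        "WHERE short_id = ?",
        (short_id,),
    ).fetchone()


graph_id = save_graph("https://example.org/search?q=drought", '{"1900": 12, "1901": 30}')
print(f"/graph/{graph_id}/", load_graph(graph_id))
```

Storing the results as well as the query is what separates this from a ‘shareable’ link that simply re-runs the search: the saved graph stays the same even if the underlying collection changes.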

Headlines and history

Explore The Front Page

Back in 2011 I started examining ways of finding and extracting editorials from digitised newspapers.  Because the location of editorials is often tied up with the main news stories, this started me thinking about when the news moved to the front page. And of course this meant that I ended up downloading the metadata for four million newspaper articles and building a public web application — The Front Page — to explore the results. ;-)

The Front Page was also the first resource published on my new dhistory site (since joined by the Archives Viewer and QueryPic). dhistory — ‘your digital history workbench’ — is where I hope to collect tools and resources that have graduated from WraggeLabs.

Viewing archives

Try Archives Viewer

In 2012 I also revisited some older projects. After much hair-pulling and head-scratching, I finally managed to get the Zotero translator for the National Archives of Australia’s RecordSearch database working nicely again. I also updated it to work with the latest versions of Zotero, including the new bookmarklet.

My various userscripts for RecordSearch also needed some maintenance. This prompted me to reconsider my hacked together alternative interface for viewing digitised files in RecordSearch. While the userscript worked pretty well, there were limits to what I could do. The alternative was to build a separate web interface… and so the Archives Viewer was born.

Stories and data

Expect bugs ye who enter here...

In the ‘work-in-progress’ category is the demo I put together for my NDF2012 talk, Small stories in a big data world. Expect to see more of this…

My favourite things

Two things I made in 2012 are rather special (to me at least). Instead of responding to particular needs or frustrations, these projects emerged from late night flashes of inspiration — ‘what if…?’ moments. They’re not particularly useful, but both have encouraged me to think about what I do in different ways.

Play!

The Future of the Past is a way of exploring a set of newspaper articles from Trove. I’ve told the story of its creation elsewhere — I simply fell in love with the evocative combinations of words that were being generated by text analysis and wanted to share them. It’s playful, surprising and frustrating. And you can make your own tweetable fridge poetry!

The People Inside

One night I was thinking about The Real Face of White Australia and the work I’d done extracting photos of people from the records of the National Archives of Australia’s database. I wondered what would happen if we went the other way — if we put the people back into RecordSearch. The result was The People Inside – an experiment in rethinking archival interfaces.

 

2012 — the talking

In an attempt to try and figure out where this year went I’ve pulled together a list of my talks, presentations and workshops for 2012…

7 January 2012 — ‘Invisible Australians: Living under the White Australia Policy’, contribution to the Crowdsourcing History: Collaborative Online Transcription and Archives panel, American Historical Association annual conference, Chicago. [slides]

8 January 2012 — ‘Making friends with text mining’, contribution to the A Conversation about Text Mining as a Research Method panel, American Historical Association annual conference, Chicago.

10 January 2012 — ‘Collections, interfaces, power and people’, McGill University.

12 January 2012 — ‘Collections, interfaces, power and people’, University of Western Ontario.

7 February 2012, Mining the treasures of Trove: new approaches and new tools, VALA2012.

23 March 2012 — ‘Mining Trove’, Digital History Workshop, Victoria University of Wellington.

29 March 2012 — ‘Inside the bureaucracy of White Australia’, Digital Humanities 2012, Canberra. [slides]

8 May 2012, Mining for meanings, Harold White Fellowship Lecture, National Library of Australia, Canberra.

27 June 2012 — ‘Beyond the front page’, combined meeting of the Canberra Society of Editors and the Australian and New Zealand Society of Indexers, Canberra. [slides]

19 July 2012 — ‘The responsibilities of data’, Framing Lives: The 8th Biennial Conference of the International Auto/Biography Association, Canberra. [slides]

11 August 2012, Doing Our Bit Build-a-thon, Mosman Library.

12 October 2012, Digital disruptions: Finding new ways to break things, Faculty of Arts eResearch Forum, University of Melbourne.

19 October 2012, Too important not to try, Dipping a toe into Digital Humanities, Deakin University.

25 October 2012 — Digital disruptions: Finding new ways to break things, Australian National University.

1 November 2012 — Digital disruptions: Finding new ways to break things, Digital Humanities Symposium, University of Queensland.

13-15 November 2012, Digital dimensions: A hands-on workshop for the DH curious, University of Queensland.

20 November 2012, Small stories in a big data world, National Digital Forum, New Zealand.

22 November 2012, Learning how to break things, workshop at THATCamp Wellington. [outline]

29 November 2012, Archives of emotion, Reinventing Archival Methods workshop, Sydney.

12 December 2012 — ‘Introducing Digital Humanities’, State Library of New South Wales.

Archives of emotion

Presented at the Reinventing Archival Methods workshop, 29 November 2012, in Sydney.

One weekend, a bit over a year ago, I built this — a wall of faces of people forced to live within the restrictions of the White Australia Policy, drawn from records held by the National Archives of Australia. It created a lot of interest, both here and overseas, particularly after I talked about it at the 2011 National Digital Forum in New Zealand.

My original post was republished in South Africa, and my NDF talk made it into the inaugural edition of the Journal of Digital Humanities. The wall is being studied as part of a digital history course in the US, and was cited by two papers at the Museums and the Web conference this year. It’s also been referenced in discussions on visualisation, serendipity and race.

But perhaps most important was the email we received in which the sender described scrolling through the wall with tears rolling down their face.

It’s also important to note that the project of which the wall forms part — Invisible Australians — is completely unfunded and has no institutional home. It’s a project driven by passion. It’s a project born out of the sense of obligation and responsibility that my partner, Kate Bagnall, and I feel towards the people whose lives are documented in the archives.

Last week I was at NDF 2012, where Courtney Johnston called on us to consider the emotional landscapes in and around our collections. So it started me wondering, what is the role of emotion in the archives?

There is clearly no neutral position. In Archival Methods David Bearman rightly criticises the idea that the value of archivists lies in their political disengagement — as faithful guardians of the accumulated past. And of course archival writers like Verne Harris and Terry Cook have developed this critique in some detail.

Bearman suggests that archives can instead be seen as ‘marshaling centers’, that enable people not to observe some distant past, but to mobilise the past within their own lives — to find connections and meanings.

Recently I was talking to an academic researching the role of historical thinking in education. He argued that an emotional connection had to come first. Only then could rational arguments take root — only then could opinions, ideas and lives be changed.

And yet emotion still seems like something best avoided in public. We try not to ‘inflame’ it, we rarely seek to nurture it. Exposing the rawness of emotion is often seen as cheap or manipulative. And yet it happens, always, in and around our cultural collections.

What user or worker in archives has not been moved? By the voices and stories contained within the records, by the sheer excitement of discovery, or perhaps by the overwhelming burden of responsibility. If, as Bearman argues, ‘the pasts we construct are all discussions with the present’, then these discussions are infused with joy and anger, with fear and longing, with sadness and gratitude.

Why are we so reluctant to acknowledge that archives are repositories of feeling? Is emotion meaningless because it can’t be quantified, dangerous because it can’t be controlled, or does it simply not fit with the professional discourse of evidence, authority and reliability?

As our experience of archives moves further into the online realm, so the possibilities for making emotional connections increase, simply because it’s so much easier to share. From the like button or the retweet, through to a lovingly-tended personal collection in something like Pinterest, we have new opportunities to explore what’s important to us and why.

This is happening now. Voices from the past are finding their way into online conversations. But what voices and whose conversations? Even as we welcome this sort of engagement we have to remember what is not online, what is not accessible, and all the social, technical and political barriers that can prevent someone from joining the discussion.

It worries me too that our emotional connections may be too small, too fragile to survive in the world of big data. We live in an age where our online preferences are monitored, our sentiments analysed; our feelings are harvested and tallied in order to sell us more stuff. The line between expression and consumption is increasingly blurred.

Back in the pre-web era, Bearman imagined access to archives through ‘intelligent artifices’ that would bridge databases and connect vocabularies — responding to, and learning from the activities of users. Twenty-five years later we’re exploring these possibilities at a global scale, through Linked Open Data.

While Linked Open Data is often described like a giant plumbing project, it’s really about making a whole lot of very small connections. To me it offers an opportunity to fight back against the homogenisation of data. We can use it to express complex relationships with the past. But we need to know how, and we need to find the points at which we can plug ourselves in.

Perhaps these are Bearman’s ‘marshaling centers’, short-circuiting our online connections to jack us into the past. Not a fixed or nostalgic past, but a challenging and contested past, both real and yet unknowable. As feeling becomes commodified and neutered through a variety of online filters, perhaps archives can hack us directly into powerful conduits of meaning and emotion.

How might this happen? There’s the technical stuff — persistent identifiers, blah, blah, blah — vitally important of course. But then there’s the relationship stuff. We have to stop talking about users and start talking about collaborators. We need to stop building services to be consumed, and start opening opportunities to create, to play, to break and to hack. We are all making connections.

Most importantly we need to find and support the people, both inside and outside our organisations, who are driven by passion. The people who care. The people who simply give a shit.

Small stories in a big data world

Presented at the National Digital Forum, Wellington, 20 November 2012. You can also watch the video.

Previously at NDF:

As we return to the action, Tim is wondering what happens when we bring stories and data together…

As historians, as cultural heritage professionals, as people — we make connections, we make meanings. That’s just what we do.

What really excites me about Linked Open Data is not the promise of smarter searches, but the possibilities for making connections and meanings in ways that are easier to traverse — to explore, to wander, to linger, or even to stumble.

What really frustrates me about Linked Open Data is that we still tend to talk about it as if it’s all engineering — an international plumbing project to pump data around the globe. Linked Open Data doesn’t have to be an industrial undertaking, it can be a craft, a mode of expression. It can be created with love or in anger.

And anyone can do it.

I’m currently working on a project with the Mosman Library in Sydney to collect information about the World War I experiences of local service people. The web resource we’re building will provide Linked Data all the way down. Every time someone adds a story about a person, uploads a photograph, identifies a place, or includes a link to another resource, they will be minting identifiers, creating relationships, documenting properties — sharing their knowledge as Linked Open Data.

It seems to me that Linked Open Data will be a success not when we’ve standardised on a few vocabularies, or linked everything we possibly can to DBpedia, but when we have thriving online communities creating and sharing structured data about the things that are important to them. Not just the known and notable, but the local, the contested, the endangered, the ephemeral and the oppressed.

Many of us live within a Western tradition which equates knowledge with accumulation. Linked Open Data promises new means of aggregation, new powers of discovery — lots and lots more stuff! But it would be a tragedy if all we ended up with was a bigger database or a better search engine. I want more. I want new ways of using that data, of playing with structures and scales. I want to build rich contexts around my stories.

Last year I talked about this in a keynote I gave to the Australian and New Zealand Society of Indexers. To try and demonstrate some of the possibilities, I created a fancy presentation and added a whole lot of linked data to the text of my talk. But it was a bit of a cheat. The text, the triples and the presentation were still pretty much separate. What I really wanted to do was use the linked data to generate alternative views of the text, to take my story and look at it through a variety of linked data powered filters.

So for NDF this year I thought I’d have another go. I set myself a few ground-rules:

  • Simple tools — should be possible for anyone with a text editor.
  • No platforms — no sneaky server-side stuff, it all had to happen in the browser, on the fly.
  • No markup madness — I wanted there to be a close relationship between the text and the data, but I wanted the markup process to be practical — something like creating a footnote.

So I hacked together a whole lot of existing Javascript libraries. I used them to extract all the triples from my text and follow external identifiers to get extra information. Then I queried the little databank I’d made to generate four different views of my talk…
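The demo itself runs in the browser with Javascript, but the underlying idea of querying a small set of triples to produce different views can be sketched in any language. The Python below is only an illustration: the identifiers, the sample triples and the helper functions are invented, and the schema.org-style property names are my assumption rather than the vocabulary actually used in the talk.

```python
# Each triple is (subject, predicate, object). All values are invented.
triples = [
    ("#alice", "rdf:type", "schema:Person"),
    ("#alice", "schema:name", "Alice Example"),
    ("#wellington", "rdf:type", "schema:Place"),
    ("#wellington", "schema:name", "Wellington"),
    ("#talk", "rdf:type", "schema:Event"),
    ("#talk", "schema:name", "A conference talk"),
    ("#talk", "schema:startDate", "2012-11-20"),
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def value(subject, predicate):
    """Return the first object recorded for subject and predicate, or None."""
    found = match(subject=subject, predicate=predicate)
    return found[0][2] if found else None


def entities_of_type(type_name):
    """A 'filter by type' view: the subjects declared to be of one type."""
    return [s for (s, _, _) in match(predicate="rdf:type", obj=type_name)]


def timeline():
    """A 'timeline' view: (date, name) pairs for anything with a start date."""
    dated = match(predicate="schema:startDate")
    return sorted((date, value(s, "schema:name")) for (s, _, date) in dated)


print(entities_of_type("schema:Person"))
print(timeline())
```

Each view is just a different question asked of the same small databank, which is why adding another one costs so little.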

WARNING WARNING! Very early demo! Expect bugs and general stupidity!

Now, none of this looks terribly exciting. Visually the various components look pretty familiar — and that’s part of the point, I’m showing how you can re-use existing tools and code libraries.

What’s interesting, I think, is the dialogue that’s evolving between text and data — a dialogue that’s taking place within one, just one, html document.

Expect bugs ye who enter here…

So here’s the text of my talk to the indexers last year. As you scroll through the document, each paragraph on the screen is examined and information about related entities (people, places, events, objects) is displayed in a sidebar. The text and the sidebar are linked, so if you click on a link in the text more information about the related entity opens in the sidebar.

If you want to look at the resources separately you can. You can re-order, and filter by type.

Then there are the fairly traditional timeline and map views.

Most of the data that’s being displayed is coming from RDFa within the document, but not all. There are links to GeoNames and DBpedia that are drawing in data on the fly. As more Linked Open Data becomes available these links can become deeper and richer.

It’s a very rough demo and I have a long to-do list — for example better links between the data views and the text, showing their context within the narrative. But hopefully you can get an idea of how it might be possible to build data-rich stories — with layers and views that enrich, inform and engage with the narrative.

And all just with one html page, a bit of RDFa and a few Javascript libraries.

There’s no magic.

You might be wondering about my ground-rules — why did I constrain myself? Well, it has to do with this thing we call ‘access’. Oftentimes when we talk about access we mean the power to consume — the power for people to take what they’re given.

But to really have access, for something to be truly open, people also have to have the power to create. To take what they’re given and build something new — to challenge, to criticise, to offer alternatives.

That means allowing people the space to have ideas, giving them the confidence to experiment, providing useful tools and the knowledge to use them. That’s not a job for any particular institution, or sector, it’s a challenge for all of us who build things to strip away the magic and invite others to join in.

And I think it’s pretty important. I don’t really want to live in a world where data is just something that other people collect about and for us. I want slow data, as Chris described last year. I want us to enjoy the textures and tastes and not get addicted to the processed product. I want to create, enrich, wield and wonder.

So my vision of the future of Linked Open Data, is not of the Giant Global Graph linking all knowledge. But a revolutionary army of data-artisans, hand-crafting their richly contextualised stories into a glorious, messy, confusing, infuriating, WONDERFUL tapestry.

Now I know you’re all just waiting for me to press the BOOM! button.

So let’s blow some shit up!

Teaching by example?

There’s been plenty of discussion within the digital humanities community about the difficulty of getting academic recognition for digital projects. But what about being recognised for alternative forms of teaching? I don’t mean online courses, I mean the sort of peer-to-peer teaching that takes place through blogs, or Twitter, or the comments in our code. We all learn from each other.

I’ve been thinking about this while working on a few job applications recently. My opportunities for formal teaching or supervision have been limited, but over the last few years I’ve worked hard to introduce the digital humanities to a broad range of audiences. I’ve given talks to all sorts of professional and community groups, including librarians, museum curators, archivists and family historians. I’ve organised a couple of THATCamps. I’ve given papers at disciplinary conferences. I’ve blogged about my experiments and my frustrations. I’ve created a series of digital tools and made them available for all to use. Most recently I’ve been visiting universities giving talks and workshops to help staff and students make use of digital tools and resources in their own research. But I don’t ‘teach’ — or do I?

Most of this work is unpaid of course. I do it because I love it, and because I think it’s important. I do it because I want DH to live up to its promise of being open and engaging. I want others to share the excitement, the possibilities and the power. Sometimes it’s hard to know if it really makes any difference, as I usually only hear anecdotally about the way my tools are used. But when I do receive feedback from people it’s often to say how I’ve ‘inspired’ them.

It seems to me that the ability to teach by example, to broaden horizons, and offer inspiration, is something that should find a place in a job application, but where? As I was pondering this the other night I fired off an idle tweet that brought a couple of encouraging responses:

So I’ve adopted @ProfessMoravec’s suggestion and created a Testimonials page. If I’ve managed to inspire or assist you in some way, feel free to leave a comment. Maybe next time I put together a job application I’ll have something to point to to demonstrate my ‘teaching’ credentials.

Too important not to try

On Friday 19 October I joined an enthusiastic group of digital humanities explorers at a Deakin University event entitled Dipping a Toe into the Digital Humanities and Creative Arts. @catspyjamasnz has assembled an excellent summary of the day in Storify.

In the morning I told the story of Invisible Australians. You can view the slides of Too important not to try and listen to my dodgy audio recording via SoundCloud.

In the afternoon I gave a whirlwind workshop which included a headline roulette smackdown and an introduction to the wonders of Zotero.

Digital disruptions: Finding new ways to break things

Recently I gave a presentation at the University of Melbourne’s Faculty of Arts eResearch Forum. The slides for my talk, ‘Digital Disruptions: Finding New Ways to Break Things’, are available online (thanks to reveal.js). I also managed to make a fairly basic recording. I’m intending to create a transcript, but for now you’re welcome to download it or listen via SoundCloud.

Basically I was arguing that as well as making stuff, digital humanities can involve a lot of stretching, twisting, pushing and breaking stuff. The web is not fixed or static, there are many points at which we can intervene and change the way information is presented. What we need is confidence to pull things apart, and the ability to critically examine why things work the way they do (or don’t). And imagine alternatives.

After my talk there were a number of interesting reports from people around the university. Brett Holman has provided a great summary on his Airminded blog, as well as doing his best to find me a job!

For you, with all best wishes…

Yep, there’s a new version of QueryPic.

About 18 months ago I created a little Python script to visualise search results in Trove’s collection of digitised newspapers. After a bit more tweaking, I christened it QueryPic. People started to use it. It was even reviewed in the Journal of Digital Humanities. With the release of the Trove API earlier this year I rewrote the whole thing in Javascript and let it loose on the web. People could make graphs without having to download any code or fire up the command line. Anyone could play.

And now?

The latest version lets you save your QueryPics. As new features go it’s not very revolutionary. But it meant another significant shift. From Python script, to web page, to web app. The Javascript-enabled interface now connects to a Django-powered backend. Save a graph and you can access it via a lovely, short, persistent url (like this). It’s as much a platform as a tool. But to be persistent, the urls need to work for ummm… a long time. Is this even possible for a project that has no funding and a support team of one?

I don’t know.

My enthusiasm for making tools is punctuated by regular bouts of doubt and disillusionment. With millions of dollars being spent on industrial-strength digital research infrastructure why should I devote my evenings to hand-crafting pretty little widgets like QueryPic?

My grandfather made this brass dish. He owned an engineering workshop and forge. My dad was a draftsman, engineer and builder. My mum made fine dresses in the fashion houses of Melbourne. I make things too. It’s what I do. It took me quite a few years to work this out. Years spent wondering why I felt out of place in academia. I’m also a historian, so I research and I write, but without some time to tinker, well… I’m just not happy. Making things is not separate — for me it’s all part of being a historian. I make things that let people connect to the past in different ways. And along the way I learn.

And by people I mean people. Just last week I took part in an online question and answer session organised by Inside History Magazine. It was a lot of fun. Amidst the questioning I unveiled the latest version of QueryPic. Considerable excitement ensued. QueryPic graphs are starting to be included in research publications, but anyone can make and understand them. Local and family historians are enthusiastic users of digital technologies and I’m excited to see them playing around with tools that I’ve made. I want to create things that other people use. Things that help them, and sometimes surprise them.

QueryPic has graduated from WraggeLabs to dhistory — my platform for digital history research. There it joins The Front Page and Archives Viewer. As usual, I have big plans. Are they practical? Probably not. Are they sustainable? I doubt it. Will I keep making things anyway? Of course.

So please accept this gift. I made it for you. I hope you find it useful.

QueryPic — explore digitised newspapers from Australia & New Zealand.

http://dhistory.org/querypic/

Features include:

  • Save and date-stamp your graphs with persistent urls — perfect for citing and sharing
  • Copy and paste query urls from Trove or Digital NZ, or connect automatically with a handy bookmarklet
  • Easily regenerate saved graphs to draw in updated data
  • Explore QueryPics created by others — use them as the starting point for your own visualisations
  • Combine any number of queries, either from Australia or New Zealand
  • Click on the graphs to preview matching articles

All this and more documented on QueryPic’s extensive help page. Code on GitHub.

Old loves, new views…

I’m deeply in love with the collections of the National Archives of Australia. They move me, they inspire me, they make me want to do something. How do I express my love? I’ve written stories about things like atomic bombs, progress, astronomy and weather forecasting — pursuing lives and events documented in the Archives’ rich holdings. I work on projects like Invisible Australians, hoping to bring the compelling remnants of the White Australia Policy to broader public attention. And I build things. I make tools that help other people explore, understand and use the Archives. I do this because these riches need to be used. They need to be shared. They need to be part of the fabric of our lives.

A few years ago I created a little script for Firefox that put a fresh face on the display of digitised records in the National Archives’ RecordSearch database. It’s publicly available and has been installed more than 500 times. Demonstrating this script at the ‘Doing our bit’ Build-a-thon a few weeks ago made me realise again both how useful it was and how much work it still needed.

One of the most exciting features when I first created the script was the ability to display the records on a ‘3D wall’, courtesy of a Firefox plugin called CoolIris. But CoolIris uses Flash and is no longer being supported. Time for a new approach.

Say hello to the Archives Viewer (naming things isn’t really one of my strengths). Instead of rewriting my existing script I decided to create a completely new web application. Why? Mainly because it gave me a lot more flexibility. I could also make use of a variety of existing tools and frameworks like Django, Bootstrap, Isotope and FancyBox. Standing upon the code of giants, I had the whole thing up and running in a single weekend. The code is available on GitHub.

What does it do? Simply put, just feed the Archives Viewer the barcode of a digitised file in RecordSearch and it grabs the metadata and images and displays them in a variety of useful ways. It’s really pretty simple, both in execution and design.

Yep, there’s a wall. It’s not quite as spacey and zoom-y as the CoolIris version, but perhaps that’s a good thing. It’s just a flat wall of page image thumbnails with a bit of lightbox-style magic thrown in. But when I say just, well… look for yourself. There’s something a bit magical about seeing all the pages of a file at once, taking in their shapes and colours as well as their content. This digital wall provides a strangely powerful reminder of the physical object.

National Archives of Australia: ST84/1, 1908/471-480

Of course you can also view the file page by page if you want. Printing is a snap — just type in any combination of pages or page ranges and hit the button. The images and metadata are assembled ready to print. No more wondering ‘which file did this print out come from?’.

But perhaps the most important feature is that each page has its own unique, persistent url. Basic stuff, but oh, so important. With a good url you can share and cite. Find something exciting? Tell the world about it! I’ve included your typical social media share buttons to help you along.

One disadvantage over the original userscript is that the viewer isn’t directly linked to RecordSearch. You probably don’t want to have to cut and paste the barcode every time you view a file. So I’ve also created a couple of connectors that ummm… connect things up.

The first connector is just a bookmarklet. A bookmarklet is just a little piece of javascript code disguised as a browser bookmark. Just drag this link — Archives Viewer — to your browser’s bookmark toolbar. Then when you’re on the item page of a digitised file in RecordSearch, just click the bookmarklet and you’ll be instantly transported to the wall.

The second connector is a bit smarter. It’s an enhanced version of another userscript I wrote to display the number of pages in a digitised file. It still does that, but now it also rewrites the links to the digitised files so that they automatically open in the Archives Viewer. It’s a bit harder to install. You need Chrome or Firefox and the add-ons Greasemonkey (for Firefox) or Tampermonkey (for Chrome). Then just go to the userscript page and hit the big ‘Install’ button.

You might be wondering about Zotero (at least I hope you are). My Zotero-RecordSearch translator lets you capture page images and metadata direct to your own research database, so what happens when you’re transported across to the Archives Viewer? Never fear, I’ve written a new translator that lets you save pages as you could in RecordSearch. Even better, you get a persistent, context-enriched url, and the ability to capture multiple pages at once. Yippee!

But that’s not quite all. Buried within the pages is some lovely Linked Open Data. To be truthful, it’s not really very ‘linked’ yet, but it does expose the basic metadata in a machine-readable form, borrowing from the vocabularies of projects like Locah and the Archival Ontology. It’s an experiment, as is the Archives Viewer itself. We can learn by doing.

I’ve given quite a few talks over recent times encouraging people to take up their tools and start hacking away at the digital collections of our cultural institutions. Yes, I admit it, I’m an impatient historian (and a grumpy one at that). But it’s also because I think it’s important that we recognise that access is never just something you’re given. It’s something that we make through our stories, our projects, and our tools. It’s something that’s grounded in respect and powered by love.

‘Doing our bit’ Build-a-thon

Last Saturday I was amongst a group of enthusiastic and knowledgeable volunteers getting stuck in to the ‘Doing our bit’ project at the Mosman Library. The Build-a-thon was the first stage in creating a new online resource documenting the experiences of World War I service people related to the Mosman area. We’re trying to make the whole process as open as possible, so the Build-a-thon was a way of exploring resources, issues, interfaces and ideas before we lay down too much code. You can read more on the project blog.

To provide some context for our labours, I gave a series of short talks:

  • ‘Small stories in a big data world’ [video] [links]
  • ‘A digital history toolkit’ [video] [links]
  • ‘Telling stories and building interfaces’ [video] [links]
  • ‘Connections and contexts through Linked Open Data’ [links]

You can see how the day unfolded on Storify, and view the participants hard at work on Flickr.

The people inside

[View in Storify]

A little hack to reveal faces in the archives.

4 million articles later…

On 15 April 1944 the Sydney Morning Herald turned inside out. For more than a hundred years, the front page had been dominated by advertisements, but this changed suddenly in 1944 as the newspaper took on a completely new look. In place of the ads were the day’s top stories, headlines and photographs — a ‘front page’ design familiar to modern readers.

The change was, the newspaper explained, partly a response to the demands of war. Advertising had been cut due to the rationing of newsprint and ‘an urgent public demand in these critical days for more papers and more news’. But they were also looking forward to the problems of peace:

It is essential… that we should not only provide the space, but also adopt the manner and methods of presentation which will spread knowledge of these problems yet more widely, and bring them home yet more deeply, among the people of this country.

But the Sydney Morning Herald wasn’t breaking new ground. The design of front pages had been changing across the first half of the twentieth century as advertisements gradually gave way to news. This graph shows the average number of words per issue on the front pages of Australian newspapers devoted to advertising.

You can see a clear decline from about the turn of the century. News articles, on the other hand, were on the way up.

Not all the changes were as sudden as the Sydney Morning Herald’s. The Barrier Miner entered the First World War with the ads on top, but by war’s end the position was reversed. In between was a period of transition as you can see from this graph which plots advertising against news.

If you dig a bit deeper, you find that the amount of advertising follows a regular pattern.

These peaks and troughs in June 1916 are a week apart — Saturday’s front page was all advertising, but the next day brought a ‘Special Sunday Issue’ focused on the ‘Latest War News’.

It’s clear just from these two examples that there are stories behind these changes. There are subtleties and contingencies to be explored along with dramatic shifts.

And now you can explore them…

The Front Page

The Front Page is a database containing details of more than 4 million front page newspaper articles harvested from the National Library of Australia’s Trove service.

Trove divides articles into a series of categories:

  • articles (news)
  • advertising
  • detailed lists, results, guides
  • family notices
  • literature

I’ve simply gone through and added up the numbers of articles and the numbers of words in each category for each issue, and aggregated this across months, years and the full run of each newspaper.
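As a rough illustration of that adding up, here is a minimal Python sketch. It is not the code behind The Front Page: the field names (newspaper, date, category, words) and the sample records are made up for the example, and the real totals are stored in a database rather than a dictionary.

```python
from collections import defaultdict


def total_by_issue(articles):
    """Count articles and sum words for each (newspaper, date, category)."""
    totals = defaultdict(lambda: {"articles": 0, "words": 0})
    for article in articles:
        key = (article["newspaper"], article["date"], article["category"])
        totals[key]["articles"] += 1
        totals[key]["words"] += article["words"]
    return dict(totals)


# Invented sample records, purely for illustration.
sample_articles = [
    {"newspaper": "Example Times", "date": "1916-06-10", "category": "Advertising", "words": 1200},
    {"newspaper": "Example Times", "date": "1916-06-10", "category": "Advertising", "words": 800},
    {"newspaper": "Example Times", "date": "1916-06-10", "category": "Article", "words": 450},
]

issue_totals = total_by_issue(sample_articles)
for key, counts in sorted(issue_totals.items()):
    print(key, counts)
```

The same approach rolls up to months or years by shortening the date in the key, for example using only the first seven characters of an ISO date for a monthly total.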

These totals are presented as a series of linked tables and graphs. Just click on a point to zoom in, or use the navigation controls to go directly to the issue of your choice. It’s pretty straightforward.

Why?

We’re lucky to have rich resources like Trove, but if we’re going to make best use of them we have to move beyond the search box to find new ways of exploring and contextualising their content. That’s why I’ve developed tools like QueryPic, Headline Roulette and even The future of the past. Each lets you engage with the newspaper database in a different way.

But not all newspaper articles are created equal. I’d like to be able to aggregate and analyse the ‘top’ stories for each day, but to do this I need to know more about the structure of the newspapers themselves. I’ve already made a few attempts to find and extract editorials. This is useful because before the main news moved to the front page it was often directly after the editorials. But when did the news shift to the front page?

Now I can find out.

But why create a public web resource? Well, it’s just what I do. I build and I share. It’s what motivates me. It’s how I understand things. It’s where I find both my questions and my answers. Hey, I’m a digital humanist ok?

How?

Everything’s up on GitHub, so you can follow along with my ugly coding. It was all a bit of an experiment, because I simply didn’t know whether I could harvest and use 4 million articles. How long would it take? Would MySQL grind to a halt? Would my laptop blow up?

In my Harold White lecture I wondered whether what I was trying to do was really beyond the reach of ‘an ordinary bloke and his laptop’. I suspect the day is rapidly coming when my work will be superseded by well-funded academic projects with access to supercomputers and a pool of bright young graduate students. But for now I’ll just keep pushing the boundaries of what’s possible over a dodgy home broadband connection.

Of course, this project was only possible because of the Trove API. My screen-scrapers of yore would have been impossibly slow and wasteful of bandwidth. With the API I could simply construct a query and then loop through the 4 million articles in batches of a hundred. These were then fed into MySQL via Django. I quickly worked out that I needed to keep my Django models simple. My clever relational model linking newspapers, issues, pages and articles was just too complex for this sort of operation. I flattened everything out to store all the metadata in a single ‘article’ model.
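The batch loop itself is a simple pattern. Here is a minimal Python sketch of it, not the harvester on GitHub. The fetch_batch function is a hypothetical stand-in for whatever actually calls the API, and fake_fetch just slices a list so the example runs without a network connection or an API key.

```python
def harvest(fetch_batch, batch_size=100):
    """Yield records one at a time, requesting them in fixed-size batches.

    fetch_batch(start, size) is a hypothetical callable that returns a list
    of up to `size` records beginning at offset `start`.
    """
    start = 0
    while True:
        batch = fetch_batch(start, batch_size)
        if not batch:
            break  # nothing left to harvest
        for record in batch:
            yield record
        if len(batch) < batch_size:
            break  # a short batch means we have reached the end
        start += batch_size


# Stand-in for a real API: 250 invented records served from memory.
FAKE_RESULTS = [{"id": n} for n in range(250)]


def fake_fetch(start, size):
    return FAKE_RESULTS[start:start + size]


harvested = list(harvest(fake_fetch))
print(len(harvested))
```

Because harvest is a generator, each record can be saved to the database as it arrives instead of holding millions of them in memory.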

The harvesting operation took about 5 days. Once I had all the metadata I ran a couple of processes to do all the adding up and saved the results to a separate ‘totals’ table.

Then it was just a matter of building a front end. Using Django, Twitter Bootstrap and HighCharts made this amazingly easy. Really. Really truly.

What now?

I built this because I wanted to track changes in the design of front pages, but now I’m wondering what else I can find. The role of war in the examples above is intriguing. Are there other changes in our relationship to ‘news’ that these graphs might reveal?

I hope other people will wonder about this as well.

I have some ideas for future developments. For example, I’d like to add tagging to make it easy to construct timelines of significant changes. But first I just want to see if anybody’s actually interested. If you have any ideas, suggestions or comments please let me know.

Ok, off you go — explore.

The future of the past

[view on Storify]

This is a story about a thing I made. I’m still not sure what to call it. Or what it’s really for.

But I like it.

And I hope other people will too…

Topic modelling in the archives

There seems to be a lot of topic modelling going on at the moment. And why not? Projects like Mining the Dispatch are demonstrating the possibilities. Tools like Mallet are making it easy. And generous DHers like Ted Underwood and Scott Weingart are doing a great job explaining what it is and how it works.

I’ve talked briefly about using topic modelling to explore digitised newspapers, something that the Mapping Texts project has also been investigating. But I’ve also been following with interest Chad Black’s use of algorithmic techniques, including topic modelling, to look for local variations amidst the legal system of the early modern Spanish empire.

As part of the Invisible Australians project, Kate and I are exploring the bureaucracy of the White Australia Policy. In particular, we’re interested in the interaction between policy and practice, between the highly-centralised bureaucracy and the activities of individual port officials. Like Chad, we’re interested in mapping local variations — to try and understand the bureaucracy from the point of view of an individual forced to live within its restrictions.

I recently gave a presentation about the project at Digital Humanities Australasia (post coming soon!), and in preparation I decided to try a few topic modelling experiments. They were very simple, but I was impressed by the possibilities for exploring archival systems.

The problem I started with was this. The workings of the White Australia Policy are well documented by records held by the National Archives of Australia. Some series within the archives are specifically related to the operations of the policy — such as those containing many thousands of CEDTs. But there are also general correspondence series created by the customs offices in each state, as well as the Commonwealth Department of External Affairs which administered the Immigration Restriction Act (responsibility was later taken by the Department of Home and Territories and its successors). These general correspondence series are important, because they often include details of difficult or controversial cases — those that required a policy judgment, or prompted a change in existing practices. But how do you find relevant files within series that can contain large numbers of items?

Series A1, for example, is a correspondence series created by the Department of External Affairs. It contains more than 60,000 items. Past research tells us that amongst these 60,000 files are records of important policy discussions relating to White Australia. But these files tend to be labelled with the names of the people involved, so unless you know the names in advance they can be difficult to find.

Mitchell Whitelaw’s A1 Explorer, part of the Visible Archive project, lets you explore the contents of Series A1 in an easy and engaging way. But while the A1 Explorer provides new opportunities for discovery, it doesn’t offer the fine-grained analysis we need to sift out the files we’re after. And so… topic modelling.

The process was pretty simple. While I can dip into my bag of screen-scrapers to harvest series directly from the NAA’s RecordSearch database, there was already an XML dump of A1 available from data.gov.au. So I extracted the basic file metadata from the XML and wrote the identifiers and titles out to a text file, one item per line. Following the instructions on the website I then loaded this file into Mallet:

/Applications/Mallet/bin/mallet import-file --input ./A1.txt --output A1.mallet --keep-sequence --remove-stopwords

Then it was just a matter of firing up the topic modeller:

/Applications/Mallet/bin/mallet train-topics --input ./A1.mallet --output-state ./A1.gz --output-doc-topics ./A1-topics.txt --output-topic-keys ./A1-keys.txt --num-topics 40

Again, I just followed the examples on the Mallet site.
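For anyone wanting to try something similar, the preparation step (pulling identifiers and titles out of the XML and writing them one item per line) might look roughly like the Python sketch below. The element names item, identifier and title are assumptions made for the example, not the actual structure of the data.gov.au dump, so check the real file and adjust them.

```python
import xml.etree.ElementTree as ET


def items_to_lines(xml_text):
    """Return one 'identifier<TAB>title' line per item in the XML."""
    root = ET.fromstring(xml_text)
    lines = []
    # 'item', 'identifier' and 'title' are assumed element names.
    for item in root.iter("item"):
        identifier = (item.findtext("identifier") or "").strip()
        # Collapse line breaks and repeated spaces so each item stays on one line.
        title = " ".join((item.findtext("title") or "").split())
        if identifier and title:
            lines.append(f"{identifier}\t{title}")
    return lines


# Invented sample data, purely for illustration.
SAMPLE_XML = """<series>
  <item><identifier>1001</identifier><title>Example  Person -
    Naturalisation</title></item>
  <item><identifier>1002</identifier><title>Another Person - Passport</title></item>
</series>"""

lines = items_to_lines(SAMPLE_XML)
print("\n".join(lines))
```

Writing the lines out with something like open("A1.txt", "w", encoding="utf-8") gives a file of the general shape described above. It is worth checking Mallet's import-file documentation to see how it splits each line into fields.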

Once it was finished I opened up A1-keys.txt to browse the ‘topics’ Mallet had found. The results were intriguing. There are a large number of applications for naturalisation in A1, so it’s no surprise that ‘naturalisation’ figures prominently in a number of the topics. What was more interesting was the way Mallet had grouped the naturalisation files. For example:

naturalization christian hans hansen jensen petersen andersen nielsen larsen christensen johannes jens niels pedersen andreas johansen martin jorgensen

and

naturalisation certificate giuseppe salvatore frank la leo samios spina sorbello leonardo fisher natale patane torrisi barbagallo luka rossi ross

Based on the co-occurrence of names within the file titles, Mallet had created groupings that roughly reflected the ethnic origins of applicants. It makes sense when you think about what Mallet is doing, but I still found it pretty amazing.

Mallet also found clusters around the major activities of the department, such as the administration of the territories. But of most interest to us was:

1 0.55539 passport ah student exemption students lee wong chinese young deserter education sing wing chong readmission son hing chin wife

The Chinese names alongside words such as ‘readmission’ and ‘wife’ suggested that this topic revolved around the administration of the White Australia Policy. This was easy to test. In A1-topics.txt was a list of every file in the series and their weightings in relation to each of the topics. I wasn’t sure what was a reasonable cut-off value to use in assessing the weightings, but after a bit of trial and error I fixed on a value of 0.7. I then just extracted the identifiers of every file that had a weighting greater than 0.7 for this topic. I used the identifiers to build a simple web page that Kate and I could browse. I also included links back to RecordSearch so we could explore further.
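The filtering step can be sketched in a few lines of Python. This is an illustration, not the script I actually used. It assumes the doc-topics layout in which each line holds a document number, the document name, and then pairs of topic number and weight. Some versions of Mallet write one column per topic instead, in which case the parsing would need to change. The sample lines are invented.

```python
def docs_above_threshold(lines, topic, threshold=0.7):
    """Return names of documents whose weight for `topic` exceeds `threshold`.

    Assumes each line looks like: doc_number name topic weight topic weight ...
    """
    matches = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        name = fields[1]
        pairs = fields[2:]
        # Walk through the (topic, weight) pairs for this document.
        for topic_id, weight in zip(pairs[0::2], pairs[1::2]):
            if int(topic_id) == topic and float(weight) > threshold:
                matches.append(name)
                break
    return matches


# Invented sample lines in the assumed layout.
SAMPLE_LINES = [
    "#doc name topic proportion ...",
    "0 1001 1 0.82 7 0.10 3 0.08",
    "1 1002 7 0.75 1 0.20 3 0.05",
    "2 1003 1 0.71 3 0.29",
]

relevant = docs_above_threshold(SAMPLE_LINES, topic=1)
print(relevant)
```

Changing the threshold argument is an easy way to repeat the trial and error described above and see how the size of the list changes.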

Browse the full list

It’s a pretty impressive result. Instead of fumbling with the uncertainties of keyword searches, we now have a list of more than 1,300 files that are clearly of relevance to Invisible Australians. There’s a few false positives and there are likely to be other files that we’ll have missed altogether, but now we have a much clearer picture of the types of files that are included and how they are described.

And that was at my first attempt, simply using the default settings. I’m now starting to play around with some of Mallet’s configuration options to see what sort of difference they make. I’m also keen to try out GenSim, a topic modelling package for Python.

I’m really excited about the possibilities of these sorts of tools for analysing the contents of archival descriptive systems, something I mentioned in my Digital Humanities Australasia paper. Much more to come on this I suspect…

Local heroes

Earlier this week it was announced that the Mosman Library had been awarded a Library Development Grant for an innovative project that aims to document stories and artefacts relating to the First World War. I’m very excited to be part of it. As well as working with the local community in the creation of a new resource, the project offers an interesting opportunity to explore how we can link in with the ever-increasing volume of WWI material being published as linked data around the world.

But thinking about this new project has also made me reflect again on the creation of Mapping Our Anzacs — a project that still fills me with great pride and immense frustration. I thought I might as well finally post a couple of things I wrote about the project back in 2009. They’re a bit out of date, but I think there’s still a few useful lessons to be gleaned.

The first is a case-study that focuses on the crowdsourcing aspects of Mapping Our Anzacs. The second looks at the project as an example of a mashup. Thanks to Kate Theimer for initiating and publishing both pieces.

 


Bringing life to records

2009 preprint version of case-study originally published in Kate Theimer (ed), A Different Kind of Web: New Connections between Archives and Our Users, Society of American Archivists, Chicago, 2011. [Order here]

Overview of repository

The National Archives of Australia is responsible for preserving and making accessible the records of the Commonwealth of Australia. It employs more than 400 staff, with offices in Canberra and every state capital. Its holdings include more than 360 shelf km of records – around 69 million items. Through its digitisation program more than 1.6 million items have been fully digitised, making nearly 20 million digital images available online. The National Archives’ website now provides the main point of access for researchers, with more than 2 million images viewed through the online database RecordSearch in the year 2007–8.

Business drivers

Most people now experience the collections of the National Archives of Australia online. With an obligation to provide ‘an accessible, and interpreted, national archival collection’ the Archives is looking to new technologies to enhance access and improve efficiency.

The idea for Mapping our Anzacs arose during planning for a travelling exhibition on the impact of World War I, timed to coincide with the 90th anniversary of the war’s end. Public interest in commemorating Australia’s war effort was as strong as ever, so a website that encouraged local participation seemed a useful way of extending the exhibition and its accompanying education program.

The major focus of both the exhibition and the website was to be the 376,000 service records documenting the experiences of Australian men and women during World War I held by the National Archives. These records had been fully digitised and described as part of a major project entitled ‘A Gift to the Nation’, but were still somewhat buried within our collection database.

Mapping our Anzacs was intended to highlight these records and open them up to local communities. First a map interface would allow service records to be discovered by place of birth or enlistment. Secondly, users would be able to add tributes – online versions of the war memorials that remain a feature of just about every town, large or small.

While the exhibition and the records themselves provided the main drivers for the project, there was also a growing desire within the institution to explore some of the possibilities of Web 2.0 technologies. This desire was tempered somewhat by a range of familiar concerns centred on issues of authority and control. Would user contributions detract from the reliability of the records? Who would take responsibility for any errors in user-created content? Would the potential for abuse demand vigilant moderation? Mapping our Anzacs gave us a chance to start working through such issues.

Setting the stage

We had an idea, a budget and a launch date; what we needed was a plan. While in theory we had around six months to play with, the project had to be fitted in around the ongoing work of our small web team. On the content side we had one person cleaning up the data. At the technical end we had someone connecting up the various components and making it all work within the Archives’ web environment. In the middle there were two of us trying to marry content and technology and create a usable resource. While we had a range of useful skills, none of us had tackled a project quite like this. We all had to learn on the job.

With few models or examples to work from, we began to experiment – researching available technologies, throwing around possibilities. Our first efforts were largely focused on the map interface and before long we had a working prototype using javascript and Google Maps. But what we also needed was a better understanding of how users might interact with the site.

World War I Honour Roll at the Chiltern Atheneum Museum

We started from the idea of the online memorial – a list of names compiled by users that would be linked through to service records. Our example was a local historical society creating a site to commemorate their community’s war effort. But what if they had more information – photographs or family histories – how could this sort of material be incorporated? Further inspiration came from a visit to the local historical museum in the small Victorian town of Chiltern. On one wall was a typical roll of honour, listing the names of those who had served in the war. But underneath were framed portraits of many of those listed. They were people, not just records. Could we create something like this online?

There were some exciting possibilities emerging, but concerns remained. Would anybody actually want to contribute? Strong interest in family history and a growing community desire to commemorate the experience of World War I offered anecdotal support. We just had to ensure that this interest could be translated into engagement – that the barriers to participation were low enough to encourage visitors to become collaborators.

But what of concerns that such material might detract from the authority of the records, or open our institution up to liability? We needed to make it clear where public contributions began and archival data ended.

Welcoming, but separate; open, but managed – a tricky balancing act was required. The answer, we decided, was to create a separate ‘scrapbook’ using the blogging service Tumblr. The ‘scrapbook’ label was intended to be encouraging – this was not a database, or formal register, it was a place to leave your thoughts, comments, information or memorabilia. This was reinforced by our terms of service which simply required contributions to be relevant and respectful.

Scrapbook post

A ‘scrapbook’ was also something quite different to a finding aid. The informality helped to make the boundary clear between record and response. The separation was physical as well as intellectual. While the scrapbook shared many of the design elements of the main site, it was hosted by Tumblr not the National Archives. By using the Tumblr API, however, it was easy to pass information between the two sites. We could also use the API to provide a basic moderation facility.

But this meant that an important part of the site’s functionality would be dependent on an outside service. To make sure we considered fully all the implications of this, we developed a risk analysis and contacted Tumblr staff to inform them of our plans. Our major concern was simply the continuity of the service. While there could be no guarantees, we judged that this risk was manageable. Tumblr staff were interested in the project and offered their assistance if necessary.

Results

Images from scrapbook posts viewed via a media RSS feed in CoolIris.

On 25 April each year, Anzac Day, Australians remember the sacrifices made in war. Over the Anzac Day weekend in 2009, we were astonished to receive more than 200 scrapbook posts. Of course we had expected an increase in use, particularly after the site was featured on the Australian version of the Today Show, but this remarkable response certainly confirmed the site’s success. In the six months since its launch there had been almost 94,000 visitors to Mapping our Anzacs. More than 1,000 scrapbook posts had been contributed and 280 tributes created.

But the greatest success was in the type of posts being contributed rather than their sheer volume. Our ‘scrapbook’ had proved to be just that – as well as photographs of service people and their families, there were pictures of medals, headstones, letters, newspaper clippings, pay books, identity disks, diaries, postcards and certificates. Some people simply commented ‘my grandfather’, while others wrote detailed accounts of family history. Perhaps most moving were those who took the opportunity to leave a message for their loved one: ‘You were the best dad’.

Scrapbook post

Some have taken a systematic approach. Our most frequent contributor is gradually attaching photographs of headstones and memorial plaques that she has gathered from local cemeteries. Others are posting their own contact details in the hope of linking up with family. Perhaps most interesting are the notes that provide links to other people or documents – to family members, for example, or to a later service record. These are helping build a rich web of contextual data. Equally valuable are the corrections and additions that are being offered by eagle-eyed users, pointing out transcription errors or helping us track down elusive locations.

The success of the scrapbook has somewhat overshadowed the tributes, or online memorials, which really provided our starting point. Many tributes have been created and, as we had hoped, schools and other groups are using them to document the impact of war on their local communities. However, some compromises at the implementation stage have meant that it is not as easy to build them as we had hoped. There has also been some confusion by users between the tributes and the scrapbook. This is one area of the site we certainly hope to improve.

Even though the digitised service records had been available online for some time through our collection database, it’s clear that many people are discovering them for the first time through Mapping our Anzacs. It was ‘a stunning find for me and my siblings’ wrote one grateful user. The scrapbook has aided discovery, providing another way into the records. Indeed, with the addition of a MediaRSS feed for CoolIris, the scrapbook provides two new entry points – one of them a 3D wall of faces and families. By embedding the records in these new contexts and making them easier to find, Mapping our Anzacs has successfully garnered extra value from an existing asset.

The site has also been recognised by others for its successful use of Web 2.0 technologies. We were pleased to be joint winners of the Best Archives on the Web Award, and surprised to be cited by the Federal Minister for Finance in a speech launching a Gov 2.0 taskforce. Recognition such as this has helped strengthen the case for future innovation in the Web 2.0 sphere.

Challenges

Success brings its own problems. One of the main challenges has been simply managing the sheer volume of posts and feedback. This was particularly acute of course after the Anzac Day deluge. As a result we have had to consider ways of streamlining our processes.

The Tumblr API allows us to set the status of a new post as ‘private’. We can then examine the post using the Tumblr dashboard before making it public. This works well enough as a basic form of moderation. However, the dashboard is not really designed for this purpose, and it takes several clicks to release each post to the world.

But while moderation takes considerable time, it requires little intellectual effort. Despite concerns about abuse, our contributors have caused us few dilemmas. The only significant questions that have arisen concern the re-use of materials from other sources. This has made us consider whether pre-emptive moderation is necessary or appropriate.

While the site includes detailed help information, it’s clear from the feedback that there are certain aspects that continue to cause difficulty. This provides useful data on how the site might be improved, but it has also made us think about how we communicate with our users. At the moment the content we provide is fairly static – there is no way of informing visitors of recent updates, or developing quick guides to common problems. If we took a more active approach to communication we might be able to decrease the number of help requests, while building a greater sense of community.

Similarly, while we have been excited by the number of corrections submitted by users, we can now see ways in which we might have structured the feedback process to capture their corrections more easily and efficiently. For example, a ‘submit a correction’ link on each individual’s page could automatically capture the person’s details, saving both us and our contributors from potential confusion.
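As a rough illustration of the idea, a ‘submit a correction’ link could carry the person’s details in its query string so the feedback form arrives pre-filled. This is only a sketch: the base URL, the field names and the sample record are all invented for the example, not taken from the site.

```python
from urllib.parse import urlencode


def correction_link(base_url, person):
    """Build a feedback URL that carries the details of the person being corrected.

    The field names used here are hypothetical; a real form would define its own.
    """
    params = {
        "name": person["name"],
        "service_number": person["service_number"],
        "record_id": person["record_id"],
    }
    # urlencode escapes spaces and punctuation so the details survive the round trip.
    return f"{base_url}?{urlencode(params)}"


# Invented sample record, for illustration only.
sample_person = {"name": "SMITH John", "service_number": "1234", "record_id": "12345"}
link = correction_link("https://example.org/correction", sample_person)
```

Because the details travel with the link, the contributor only has to describe the correction itself, and we would not have to work out which record they meant.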

We have suffered through the expected number of software glitches, and have a growing list of things we’d like to improve or develop, but overall the experience has been much more rewarding than painful.

Lessons Learned

Perhaps the most valuable lessons revolve around trust. Having entered into the project uncertain of what to expect from public participation, we have found ourselves in an evolving, creative partnership. Our users have defined what the scrapbook is and have taken an active role in improving and developing the resource. Our trust has been repaid many times over, helping us build something that in many ways has exceeded our expectations.

Trust is also necessary in the support of new ideas. Mapping our Anzacs was a very different type of project for the National Archives, challenging ideas both of access and user engagement. By taking the risk we have not only gained valuable publicity and user support, we have opened up the realm of possibilities for future development.

In terms of technology, the project demonstrated the power of the mashup and the efficiencies that can be gained by using existing web services. Tumblr, Google Maps and their associated APIs gave us a kickstart that enabled us to do a lot with a little.

Next Steps

There are so many exciting possibilities! Obviously our first priority is to improve those areas of the site that continue to cause our users grief. There are a number of navigation and usability tweaks that should improve the overall experience. Similarly, we can now see ways in which we might streamline moderation and management processes.

We hope to build on the success of the scrapbook and tributes by enhancing and extending their functionality. Improved editing and creation tools could assist contributors while also enriching the web of connections they build. We might, for example, provide widgets that make it easier to link the records of family members or friends. Over time this could develop into a complex network of relationships, providing new means of finding and visualising the records. Similarly, there are ways in which we might reuse the existing content of the scrapbook posts to develop new modes of discovery.

We could also do more to feature the labours and passions of our contributors. We could give them the option of exposing a public profile that lists all of their scrapbook posts. This would help foster a sense of community while providing yet another means of exploring connections between records.

Recent developments in geospatial technology and mobile devices perhaps offer the most exciting possibilities. Our original aim was to give the World War I service records back to local communities, to imbue the records with a greater sense of context, locality and belonging. Perhaps we will have succeeded when a tourist exploring a small country town can press a button on their mobile phone to retrieve a list of service people born near their current location.

Perhaps they will take a photo of a name on the local war memorial and use it to automatically retrieve that person’s service record or create an online tribute.

Perhaps they will come across a headstone in the local cemetery and immediately upload a geocoded photograph to the Mapping our Anzacs scrapbook.

Instead of merely being markers on a map, the records will start to overlay and inform the very spaces in which we move. The stories they contain will become part of our journeys, the people they document will have found their way home.


Creating a mashup

2009 preprint version of interview originally published in Kate Theimer, Web 2.0 Tools and Strategies for Archives and Local History Collections, Neal-Schuman, New York, 2010.

What made you interested in creating a mashup?

It really started with the records. We hold the records of more than 375,000 World War I service people, identified by their places of birth and enlistment. With war memorials in just about every town across Australia, the connection between local communities and the memory of war remains strong. So we wondered how we could give the service people in our records back to their communities. Having played around with the Google Maps API the answer seemed obvious: find the places, put them on a map and let people explore the connections for themselves.

What information, tools and processes did you need to begin?

The main thing we needed was the confidence to experiment. The process seemed straightforward in principle: first we had to extract the data we needed from the file titles in our collection database, then we had to find the latitude and longitude of each of the place names we extracted, and finally we had to plot these coordinates on a map with links back to details of the service records themselves. Web services, such as those provided by Google Maps, had the potential to do much of the work for us, and we scoured online documentation, user forums and blog posts for hints. But there were many things we could not know until we actually started. How consistent was our data? How many of the places would we be able to find? How would we be able to display thousands of places at once?

Moving from file titles through to coordinates obviously required a lot of data manipulation and we used Perl for much of the grunt work. Because our data set was large and variations in spelling and formatting were often unpredictable (including 13 different spellings of ‘lieutenant’!), we often had to work by trial and error — seeing what results we obtained and then adjusting our processes accordingly.
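To give a flavour of this kind of clean-up, here is a minimal sketch in Python rather than the Perl we actually used. It assumes a colon-separated title layout with labelled fields, which is an illustrative guess and not a description of the real data, and it uses fuzzy matching against a hypothetical list of canonical ranks as one possible way to fold spelling variants together.

```python
import re
from difflib import get_close_matches

# Assumed title layout (illustrative, not confirmed by the text):
#   "FAMILY Given : SERN 1234 : POB Place : POE Place"
LABELS = {"SERN": "service_number", "POB": "birth_place", "POE": "enlistment_place"}

# Hypothetical list of canonical rank names to match variants against.
CANONICAL_RANKS = ["lieutenant", "captain", "sergeant", "private"]


def parse_title(title):
    """Split a file title into labelled fields."""
    parts = [part.strip() for part in title.split(":")]
    record = {"name": parts[0]}
    for part in parts[1:]:
        label, _, value = part.partition(" ")
        field = LABELS.get(label.upper())
        if field and value.strip():
            # Collapse runs of whitespace so differently spaced names compare equal.
            record[field] = re.sub(r"\s+", " ", value).strip()
    return record


def normalise_rank(raw):
    """Map a spelling variant to the closest canonical rank, or None if nothing is close."""
    candidate = raw.lower().strip(" .")
    matches = get_close_matches(candidate, CANONICAL_RANKS, n=1, cutoff=0.8)
    return matches[0] if matches else None


# Invented sample title, for illustration only.
record = parse_title("SMITH John : SERN 1234 : POB Wagga  Wagga NSW : POE Sydney NSW")
rank = normalise_rank("Leiutenant")
```

Anything that falls below the similarity cutoff comes back as None, which is the cue to look at the value by hand and adjust the process, much as the trial-and-error approach described above.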

Once we had a list of place names in a consistent format we could begin to find latitudes and longitudes through a process known as geocoding. Google’s geocoding service was an easy option: it was well documented, reasonably comprehensive and it worked! We fed it our place names through a Perl script and soon we had a list of coordinates. Of course, many places were not found or returned multiple results, but the basic principle was sound. Our places were no longer just names, but points in space — we could begin making maps.
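The sorting of results into found, ambiguous and missing can be sketched as follows. To keep the example self-contained it does not call Google’s service at all: the small lookup table is a hypothetical stand-in for a geocoder, and its coordinates are approximate sample values.

```python
# Hypothetical stand-in for a geocoding service: place name -> candidate (lat, lon) pairs.
SAMPLE_GAZETTEER = {
    "SYDNEY NSW": [(-33.87, 151.21)],
    "RICHMOND": [(-37.82, 145.00), (-33.60, 150.75)],  # more than one Richmond
}


def geocode(place, lookup):
    """Return every candidate coordinate pair for a place name (possibly none)."""
    return lookup.get(place.upper().strip(), [])


def triage(places, lookup=SAMPLE_GAZETTEER):
    """Sort place names into found, ambiguous and missing."""
    found, ambiguous, missing = {}, {}, []
    for place in places:
        candidates = geocode(place, lookup)
        if len(candidates) == 1:
            found[place] = candidates[0]
        elif candidates:
            ambiguous[place] = candidates  # needs a human to choose
        else:
            missing.append(place)  # misspelt, renamed or vanished
    return found, ambiguous, missing


found, ambiguous, missing = triage(["Sydney NSW", "Richmond", "Atlantis"])
```

Only the first group can go straight onto a map. The other two become the checking list.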

How did you determine what to include?

What we were creating was an archival finding aid, but one which placed the people, their homes and their communities up front rather than the systems that control their records. By browsing from a map a user would be able to find the details of a loved one, read a digitised copy of their service record and then follow a link through to our collection database. These links provide crucial context about the records, but we realised that this project also gave us an opportunity to capture other contexts and meanings. Who were these people? What did they look like? What happened to them after the war? By adding an online ‘scrapbook’ we gave users the chance to enrich the resource by adding notes or photographs about individuals.

This meant we had to deal with three sets of interlinked data: geocoded places, details from our records, and scrapbook posts provided by the public. To bring these all together with limited resources we had to make clever use of what was already out there. Why create our own maps when all we needed to do was write a bit of JavaScript to embed a Google Map? Why build our own scrapbook application when the blogging service Tumblr provides free accounts and a simple API to manipulate posts? While a substantial amount of custom scripting was required to glue everything together, much of the core functionality was provided by free web services, available to anyone.

What challenges did you face?

Perhaps the first challenge to overcome was that of imagination. It was difficult for people to understand what the project was until we had a prototype to show them.

The process of handling and cleaning the data at times threatened to overwhelm us. While the geocoding service got us to the point where we could make maps, it also left us with many place names that needed to be manually checked. Often this was the result of misspellings in the original data, or because places either no longer existed or had changed their names. This data cleanup consumed much effort and continues still, though now with the help of our users who regularly point out errors and inconsistencies.

Once we had our coordinates we had to display them on a map without killing anyone’s computer. Showing thousands of markers on a Google Map is a challenge to slower web browsers and can end up hindering navigation. By dividing up our maps, clustering markers and changing the way they were rendered, we managed to greatly improve performance while maintaining the browsing experience. Once again it was trial and error coupled with the advice of the online community that guided us through the roadblocks.
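Marker clustering can be done in many ways. The sketch below shows one common approach, a simple grid, and is not necessarily how the site itself does it: points that fall in the same grid cell are replaced by a single marker at their average position, with a count. The cell size and sample points are arbitrary choices for the example.

```python
from collections import defaultdict


def cluster_markers(points, cell_size=1.0):
    """Group (lat, lon) points into square grid cells cell_size degrees wide."""
    cells = defaultdict(list)
    for lat, lon in points:
        # Floor division assigns each point to the cell that contains it.
        key = (int(lat // cell_size), int(lon // cell_size))
        cells[key].append((lat, lon))

    clusters = []
    for members in cells.values():
        mean_lat = sum(lat for lat, _ in members) / len(members)
        mean_lon = sum(lon for _, lon in members) / len(members)
        clusters.append({"lat": mean_lat, "lon": mean_lon, "count": len(members)})
    return clusters


# Approximate sample coordinates, for illustration only.
points = [(-33.87, 151.21), (-33.60, 150.75), (-37.82, 145.00)]
clusters = cluster_markers(points, cell_size=2.0)
```

With a two-degree grid the first two points share a cell and collapse into one marker, so the browser draws two markers instead of three. Across thousands of places the saving is what keeps the map usable.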

What kinds of positive results have you had? (And any negative ones?)

From the messages we receive it’s clear that Mapping our Anzacs allows people to find records they didn’t know existed. Some have met a great-great-uncle for the first time. Others have learned about the war experience of a much-loved grandparent. Local communities have embraced the project and the scrapbook has developed into a rich and often moving resource. We wanted to give users a new way to explore and interact with our collection, and it seems we have succeeded.

Our users have also become our collaborators, providing corrections and comments that help us improve our data. They have extended the idea of the scrapbook, using it, for example, as a noticeboard for family history research, or as a way of creating crosslinks between related resources.

Success brings problems of its own and the work of moderating the scrapbook and responding to feedback has proved considerable. Issues with performance remain for people on slow connections, and while many are familiar with the Google Maps interface, some find it difficult to navigate. We are planning a number of enhancements based on this feedback, and hope to take advantage of the technology as it evolves to improve and extend the interface.

About how much time did it take?

While the project as a whole stretched over about eight months, much of this time was taken up cleaning and processing the data. The development of the interface was completed in under two months.

What advice would you give an organization wanting to use something similar?

Start experimenting. The technology is developing so rapidly that if you spend 12 months planning a project it’s likely to be out-of-date even before you start. New web services and data sources are becoming available every day. Perhaps you could use Open Calais to extract people’s names from a collection description, or MetaCarta to find the places. You might use the Google Books API to harvest the details of publications that cite your records. Even if you’re not a coder you can use tools like Yahoo Pipes to see what happens when you start to link data and services. Experimentation brings new ideas and possibilities. It’s all about making connections.

Mining for meanings

Yes, I have a suit. On 8 May at the National Library of Australia I gave my suit an outing as I delivered my Harold White Fellowship presentation. Thanks to everyone who came along.

If you missed it or want to relive the fun, the NLA has made a podcast available. My slides are also online, so you can follow along for the full audio-visual-not-quite-3D experience.

Use your arrow keys to navigate through the slides, and yes the first page is intentionally left blank. If you linger for a bit on slide two or three, you’ll see the Trove API in action. The presentation itself was constructed using deck.js.

The slides also include links to lots of different examples and demos, and introduce my new favourite plaything. I don’t really know what to call it yet, or what it’s actually for, but it makes me happy, and it makes me think. TF-IDF FTW. I’ll write up some more details shortly.

The new QueryPic (or what a difference an API makes)

It seems a bit late to be introducing the newest version of QueryPic. Folks are already using it to explore the contents of digitised newspapers made available through Trove and Papers Past. Some, like the National Library of New Zealand, Andrew S. Bowman and the Carnamah Historical Society are already blogging about it. But I suppose I’d better document a few things…

As I noted in my post about QueryPicNZ (yes I now have a rather confusing proliferation of QueryPics), I was waiting for the Trove API to become public. Last week I noticed a little ‘API’ link pop up in the Trove footer and so I set to work…

"The past" versus "the future" in the new QueryPic

My original version of QueryPic (recently reviewed in the Journal of the Digital Humanities) used a series of Python scripts to harvest and scrape content from the Trove web pages. This meant that you had to download the scripts and be code-confident enough to run them in a terminal. It’s still a useful tool and I’ll be updating it as well, but I wanted to create something quicker and simpler that encouraged people to explore and play.

The latest version of QueryPic (QueryPic+, QueryPic Web, QueryPic 2.0?) simply runs in your browser. It uses jQuery to grab data on the fly from the Trove and DigitalNZ APIs. Like previous versions, it uses the Highcharts library to turn the data into pretty graphs.

What does it do? It’s really pretty basic. QueryPic just displays the number of articles matching your search query over time. By default, these are displayed as a proportion of the total articles available for that year, but a dropdown field lets you switch to view the raw numbers. It’s simple, but it’s also remarkably evocative, suggestive and fun. Just try it!
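The proportion calculation is simple enough to sketch in a few lines. This is an illustration in Python, not the JavaScript that QueryPic runs in the browser, and the yearly counts are made up.

```python
def proportions(matches_by_year, totals_by_year):
    """Return matching articles as a share of all articles, year by year."""
    series = {}
    for year, total in sorted(totals_by_year.items()):
        if total:  # skip years with no digitised articles to avoid dividing by zero
            series[year] = matches_by_year.get(year, 0) / total
    return series


# Invented counts, for illustration only.
matches = {1914: 50, 1915: 200}
totals = {1914: 1000, 1915: 2000, 1916: 0}
series = proportions(matches, totals)
```

Dividing by the yearly total matters because the number of digitised articles varies so much from year to year. A rise in raw numbers may just mean more newspapers were scanned for that period.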

Why stop at just one query? To compare frequency patterns you can add as many as you like. Just keep entering new words or phrases.

If you notice an interesting peak or trough you can just click on it and another API request will be fired off to retrieve the first 20 matching articles. So it’s also a new way of exploring the newspaper databases themselves.

There are plenty of limitations — not all newspapers are digitised, for example, and the quality of the OCR is patchy. The National Library of New Zealand’s post does a great job summing up a number of issues relating to Papers Past. It’s not magic, it’s not perfect, but is it useful? I think so.

Tasks for the future:

  • Create some sort of backend that makes it easy to save, share and cite your query data. The ‘share’ link just regenerates the graph which, of course, might change as new articles are added to the databases.
  • Make it possible to add more complex queries — I want to keep the interface simple, so I’ll probably create a bookmarklet to take any Trove or Papers Past query and display it using QueryPic.
  • As I mentioned over at the WraggeLabs Emporium, I intend to rewrite my various Trove tools to work with the new API. This will include the classic Python version of QueryPic. I still think it’s useful for harvesting your own data.
The code is on my GitHub site and you can also follow updates at the QueryPic page in the WraggeLabs Emporium.

 

QueryPicNZ

You may have noticed I have a bit of an interest in exploring ways of using digitised historical newspapers. In the last year or so I’ve spent a lot of time scraping, mining, processing and visualising content from the Trove collection of digitised Australian newspapers. But what about other countries?

Recently I was invited to a digital history workshop organised by Sydney Shep (@nzsydney) at the Victoria University of Wellington. In between sessions I started to play with the DigitalNZ API guided by Chris McDowall (@fogonwater). In anticipation of the forthcoming Trove API I’d already done a bit of work converting QueryPic to run in the browser. It didn’t take long to adapt this to work with New Zealand newspapers available through Papers Past.

So presenting for your enjoyment and education… QueryPicNZ.

Wind, rain and snow in QueryPicNZ

Like QueryPic, the New Zealand version graphs newspaper search results over time. But thanks to the DigitalNZ API it has a number of advantages:

  • it runs in your browser — no need to download or run any scripts
  • results appear almost instantly
  • easy to combine queries — just search on a new word or phrase
  • easy to remove queries — just use the ‘Clear last’ button
  • easy to share — just copy the provided link or use the Tweet button

It’s limited to simple word or phrase searches at the moment, but eventually I’ll add the ability to process more sophisticated queries. I also want to add a way of saving, sharing and citing graphs. For now the ‘share’ link simply regenerates the graph, so if the content has changed the result could well be different.

The code is available on GitHub.

Ultimately, I want to combine Trove and Papers Past so that you can query and combine content from either Australia or New Zealand… perhaps even other countries?