shoebox
digital humanities

Exposing the archives of White Australia

I recently gave a presentation in the Institute of Historical Research’s Digital History Seminar series. The time difference between London and Canberra was a bit of a challenge, so I pre-recorded the presentation and then sat in my own Twitter backchannel while it played. For the full podcast information go to HistorySPOT. You can also play with my slides or peruse the #dhist Twitter archive.

Exposing the Archives of White Australia from History SPOT on Vimeo.

Bus trips and building

Last week I took my daughter to Sydney so she could attend a girls-only Minecraft workshop at the Powerhouse Museum (they created some wonderful things). It was a 3½ hour bus journey each way, so to keep myself occupied I set myself the challenge of trying to build something en route. I made a fair bit of progress, but ultimately failed. I had to steal a few extra hours this week to get it to the point where people might find it useful.

The Australian WWI Records Finder

So here it is — a (sort of) aggregated search interface to records about Australian First World War service personnel. Give it a name and it will search:

It’s ‘sort-of’ aggregated because it’s really just a series of separate searches presented on the one page. But even this should make it easier for people to match up records across the different data sets.

Using

Type in a family name and, optionally, a given name or a service number. Hit search. Wait. Wait a bit more. The National Archives’ RecordSearch database can often be pretty slow. Eventually though, each of the databases will be queried in turn and the results added to the page.

Once the results have loaded, click on a title and the little spinny thing will start up again as more details are retrieved from the database. In this ‘detail’ view, all the other results from the database are hidden. This makes it a bit easier to compare records across databases. Just click on the title again to go back to the ‘list’ view.

If your search returns lots of results, you can use the ‘next’ and ‘previous’ links to explore the complete set. They’ll all load in the current page via the magic of AJAX.

It’s not obvious from the interface, but you can feed query parameters directly via the url. For example try http://wraggelabs.com/ww1-records/?family_name=wragge. Why is this useful? Perhaps you’ve got your own database of names on the web. Using this you could easily create links from each name that looked for relevant records in the Finder.
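As a rough sketch of how that linking might work, the snippet below builds Finder urls from a list of names. Only the `family_name` parameter appears in the example url above; the `given_name` parameter name and the sample people are assumptions for illustration.

```python
from urllib.parse import urlencode

FINDER_URL = "http://wraggelabs.com/ww1-records/"


def finder_link(family_name, given_name=None):
    """Build a link that runs a search in the Finder for one person."""
    params = {"family_name": family_name}
    if given_name:
        # Assumed parameter name: only family_name is confirmed above.
        params["given_name"] = given_name
    # urlencode takes care of escaping spaces, apostrophes and so on.
    return FINDER_URL + "?" + urlencode(params)


# Hypothetical names, standing in for rows from your own database.
people = [("Wragge", None), ("O'Brien", "Mary Ann")]
links = [finder_link(family, given) for family, given in people]
for link in links:
    print(link)
```

Each generated url could then be dropped into a link next to the matching name on your own site.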

That’s about it. It’s just a quick, bus-trip-inspired experiment, so there are many limitations and future possibilities.

Limitations

<!–INSERT USUAL WARNING ABOUT THE FRUSTRATIONS OF SCREEN SCRAPING–>

I’m just using the standard search interfaces of the various databases and screen-scraping the results. Unfortunately they all work slightly differently. For example, the AWM databases don’t distinguish between family names and given names, so if you search for the family name ‘Smith’ you’ll also get results like ‘Jones, Bruce Smith’. The CWGC database, on the other hand, will only match an other name if it comes first, while RecordSearch (or more strictly NameSearch) will also match the names of next-of-kin. Fun fun fun.
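One way to smooth over these differences would be to filter the scraped results after they come back. The Finder doesn't do this; it's just a minimal sketch, assuming each result title is a 'Family, Given names' string like the example above (the real scrapers may well return something different).

```python
def family_name_matches(title, family_name):
    """Check whether the part before the first comma equals the family name."""
    family_part = title.split(",", 1)[0].strip()
    return family_part.lower() == family_name.strip().lower()


def filter_by_family_name(titles, family_name):
    """Keep only the titles whose family name matches the search."""
    return [t for t in titles if family_name_matches(t, family_name)]


# Made-up result titles, in the assumed 'Family, Given names' format.
titles = ["Smith, John", "Jones, Bruce Smith", "SMITH, Alice Mary"]
print(filter_by_family_name(titles, "Smith"))
```

A filter like this would throw away 'Jones, Bruce Smith', but it would also hide records where the names have been entered in an unexpected order, so it's a trade-off.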

I figure anything is better than nothing, but if you’re not getting the results you expect head off to the original interfaces and try your luck there. I’m making no promises.

You’ll also notice that the maximum number of results for each data source varies. The CWGC returns 15 results, while the AWM hands over a whopping 50. These are just the default settings for the original search engines. I could’ve fiddled with the settings, but it didn’t really seem worth it.

And oh yeah… screen scraping… inherently fragile… might fall over and die at any minute.

Possibilities

As you may have guessed from previous posts, I rather like making connections. This experiment grew out of the work I’m doing on the ‘Doing Our Bit’ project with the Mosman Library. I’ve been building a series of forms that will make it easy for contributors to link people in the Mosman project to any of these databases. Just paste in a url from RecordSearch and the system will automagically retrieve all the file metadata and also check for an entry in Mapping our Anzacs. It’s pretty nifty. But of course it made me think about having a way to search across all these different databases.

And then what?

Having found a series of records for an individual it would be good if they could then be permanently linked. If I had the time and money to do more work on this, I’d want to allow people to save the connections they find. And of course then expose these connections as Linked Open Data. It wouldn’t be difficult.

There’s probably also a lot more that could be done with machine matching of records. Perhaps someone’s already working on this for the centenary — it seems like an obvious point of attack. It would be good if the forthcoming centenary commemorations resulted in something that brought all these datasets together and exposed identifiers that could be easily used by community projects like ‘Doing Our Bit’.

Details

Yes, I cheated. I had already done a lot of work on the screen-scrapery bits of this pre bus trip. I’ve been working on a RecordSearch client on and off for a while to use with projects like Invisible Australians. The AWM and CWGC scrapers I wrote for ‘Doing Our Bit’. Feel free to grab the code and play.

The actual application was built using the Python micro-framework Flask. I’m a big fan of Django, but there’s a lot of overhead involved if you just want to throw together a simple app. I’ve been wanting to try Flask for a while and was pleased to find just how quick and fun it was to get something up and running.

To make the whole thing as responsive as possible, the search results are retrieved using AJAX calls to simple APIs I built in Flask on top of my screen scraper code. There’s actually very little code in the Flask app itself. The downside of this is that the Javascript is a bit of a mess. Ah well.

Next

I don’t know whether I can put any more time into this at the moment — too many other projects competing for my time and no more bus trips coming up. But if you think it’s useful or worthwhile please let me know and I’ll see what I can do.

At the very least it shows how with just a little impatience and ingenuity we can find fairly simple ways to integrate records from a variety of sources. We don’t have to wait for some centralised solution.

2012 — The Making

I obviously did a lot of talking in 2012, but I also made a few things…

The evolution of QueryPic

Try QueryPic

At the start of 2012 QueryPic was a fairly messy Python script that scraped data from the Trove newspaper database and generated a local html file. It worked well enough and was generously reviewed in the Journal of Digital Humanities. But QueryPic’s ability to generate a quick visualisation of a newspaper search was undermined by the work necessary to get the script running in the first place. I wanted it to be easy and accessible for everyone.

Fortunately the folks at the National Library of Australia had already started work on an API. Once it became available for beta testing, I started rebuilding QueryPic — replacing the Python and screen-scraping with Javascript and JSON.

In the meantime, I headed over to New Zealand for a digital history workshop and began to wonder about building a NZ version of QueryPic based on the content of Papers Past, available through the DigitalNZ API. The work I’d already done with the Trove API made this remarkably easy and QueryPic NZ was born.

Once the Trove API was publicly released I finished off the new version of QueryPic. Instead of a Python script that had to be downloaded and run from the command line, QueryPic was now a simple web form that generated visualisations on demand.

The new version also included a ‘shareable’ link, but all this really did was regenerate the query. There was no way of citing a visualisation as it existed at a certain point in time. If QueryPic was going to be of scholarly use, it needed to be properly citable. I also wanted to make it possible to visualise more complex queries.

And so the next step in QueryPic’s evolution was to hook the web form to a backend database that would store queries and make them available through persistent urls. With the addition of various other bells and whistles, QueryPic became a fully-fledged web application — a place for people to play, to share and to explore.

Headlines and history

Explore The Front Page

Back in 2011 I started examining ways of finding and extracting editorials from digitised newspapers.  Because the location of editorials is often tied up with the main news stories, this started me thinking about when the news moved to the front page. And of course this meant that I ended up downloading the metadata for four million newspaper articles and building a public web application — The Front Page — to explore the results. ;-)

The Front Page was also the first resource published on my new dhistory site (since joined by the Archives Viewer and QueryPic). dhistory — ‘your digital history workbench’ — is where I hope to collect tools and resources that have graduated from WraggeLabs.

Viewing archives

Try Archives Viewer

In 2012 I also revisited some older projects. After much hair-pulling and head-scratching, I finally managed to get the Zotero translator for the National Archives of Australia’s RecordSearch database working nicely again. I also updated it to work with the latest versions of Zotero, including the new bookmarklet.

My various userscripts for RecordSearch also needed some maintenance. This prompted me to reconsider my hacked together alternative interface for viewing digitised files in RecordSearch. While the userscript worked pretty well, there were limits to what I could do. The alternative was to build a separate web interface… and so the Archives Viewer was born.

Stories and data

Expect bugs ye who enter here...

In the ‘work-in-progress’ category is the demo I put together for my NDF2012 talk, Small stories in a big data world. Expect to see more of this…

My favourite things

Two things I made in 2012 are rather special (to me at least). Instead of responding to particular needs or frustrations, these projects emerged from late night flashes of inspiration — ‘what if…?’ moments. They’re not particularly useful, but both have encouraged me to think about what I do in different ways.

Play!

The Future of the Past is a way of exploring a set of newspaper articles from Trove. I’ve told the story of its creation elsewhere — I simply fell in love with the evocative combinations of words that were being generated by text analysis and wanted to share them. It’s playful, surprising and frustrating. And you can make your own tweetable fridge poetry!

The People Inside

One night I was thinking about The Real Face of White Australia and the work I’d done extracting photos of people from the records of the National Archives of Australia’s database. I wondered what would happen if we went the other way — if we put the people back into RecordSearch. The result was The People Inside – an experiment in rethinking archival interfaces.

 

Teaching by example?

There’s been plenty of discussion within the digital humanities community about the difficulty of getting academic recognition for digital projects. But what about being recognised for alternative forms of teaching? I don’t mean online courses, I mean the sort of peer-to-peer teaching that takes place through blogs, or Twitter, or the comments in our code. We all learn from each other.

I’ve been thinking about this while working on a few job applications recently. My opportunities for formal teaching or supervision have been limited, but over the last few years I’ve worked hard to introduce the digital humanities to a broad range of audiences. I’ve given talks to all sorts of professional and community groups, including librarians, museum curators, archivists and family historians. I’ve organised a couple of THATCamps. I’ve given papers at disciplinary conferences. I’ve blogged about my experiments and my frustrations. I’ve created a series of digital tools and made them available for all to use. Most recently I’ve been visiting universities giving talks and workshops to help staff and students make use of digital tools and resources in their own research. But I don’t ‘teach’ — or do I?

Most of this work is unpaid of course. I do it because I love it, and because I think it’s important. I do it because I want DH to live up to its promise of being open and engaging; I want others to share the excitement, the possibilities and the power. Sometimes it’s hard to know if it really makes any difference, as I usually only hear anecdotally about the way my tools are used. But when I do receive feedback from people it’s often to say how I’ve ‘inspired’ them.

It seems to me that the ability to teach by example, to broaden horizons, and offer inspiration, is something that should find a place in a job application, but where? As I was pondering this the other night I fired off an idle tweet that brought a couple of encouraging responses:

So I’ve adopted @ProfessMoravec’s suggestion and created a Testimonials page. If I’ve managed to inspire or assist you in some way, feel free to leave a comment. Maybe next time I put together a job application I’ll have something to point to to demonstrate my ‘teaching’ credentials.

Too important not to try

On Friday 19 October I joined an enthusiastic group of digital humanities explorers at a Deakin University event entitled Dipping a Toe into the Digital Humanities and Creative Arts. @catspyjamasnz has assembled an excellent summary of the day in Storify.

In the morning I told the story of Invisible Australians. You can view the slides of Too important not to try and listen to my dodgy audio recording via SoundCloud.

In the afternoon I gave a whirlwind workshop which included a headline roulette smackdown and an introduction to the wonders of Zotero.

Digital disruptions: Finding new ways to break things

Recently I gave a presentation at the University of Melbourne’s Faculty of Arts eResearch Forum. The slides for my talk, ‘Digital Disruptions: Finding New Ways to Break Things’, are available online (thanks to reveal.js). I also managed to make a fairly basic recording. I’m intending to create a transcript, but for now you’re welcome to listen via SoundCloud.

Basically I was arguing that as well as making stuff, digital humanities can involve a lot of stretching, twisting, pushing and breaking stuff. The web is not fixed or static, there are many points at which we can intervene and change the way information is presented. What we need is confidence to pull things apart, and the ability to critically examine why things work the way they do (or don’t). And imagine alternatives.

After my talk there were a number of interesting reports from people around the university. Brett Holman has provided a great summary on his Airminded blog, as well as doing his best to find me a job!

For you, with all best wishes…

Yep, there’s a new version of QueryPic.

About 18 months ago I created a little Python script to visualise search results in Trove’s collection of digitised newspapers. After a bit more tweaking, I christened it QueryPic. People started to use it. It was even reviewed in the Journal of Digital Humanities. With the release of the Trove API earlier this year I rewrote the whole thing in Javascript and let it loose on the web. People could make graphs without having to download any code or fire up the command line. Anyone could play.

And now?

The latest version lets you save your QueryPics. As new features go it’s not very revolutionary. But it meant another significant shift. From Python script, to web page, to web app. The Javascript-enabled interface now connects to a Django-powered backend. Save a graph and you can access it via a lovely, short, persistent url (like this). It’s as much a platform as a tool. But to be persistent, the urls need to work for ummm… a long time. Is this even possible for a project that has no funding and a support team of one?

I don’t know.

My enthusiasm for making tools is punctuated by regular bouts of doubt and disillusionment. With millions of dollars being spent on industrial-strength digital research infrastructure why should I devote my evenings to hand-crafting pretty little widgets like QueryPic?

My grandfather made this brass dish. He owned an engineering workshop and forge. My dad was a draftsman, engineer and builder. My mum made fine dresses in the fashion houses of Melbourne. I make things too. It’s what I do. It took me quite a few years to work this out. Years spent wondering why I felt out of place in academia. I’m also a historian, so I research and I write, but without some time to tinker, well… I’m just not happy. Making things is not separate — for me it’s all part of being a historian. I make things that let people connect to the past in different ways. And along the way I learn.

And by people I mean people. Just last week I took part in an online question and answer session organised by Inside History Magazine. It was a lot of fun. Amidst the questioning I unveiled the latest version of QueryPic. Considerable excitement ensued. QueryPic graphs are starting to be included in research publications, but anyone can make and understand them. Local and family historians are enthusiastic users of digital technologies and I’m excited to see them playing around with tools that I’ve made. I want to create things that other people use. Things that help them, and sometimes surprise them.

QueryPic has graduated from WraggeLabs to dhistory, my platform for digital history research. There it joins The Front Page and Archives Viewer. As usual, I have big plans. Are they practical? Probably not. Are they sustainable? I doubt it. Will I keep making things anyway? Of course.

So please accept this gift. I made it for you. I hope you find it useful.

QueryPic — explore digitised newspapers from Australia & New Zealand.

http://dhistory.org/querypic/

Features include:

  • Save and date-stamp your graphs with persistent urls — perfect for citing and sharing
  • Copy and paste query urls from Trove or Digital NZ, or connect automatically with a handy bookmarklet
  • Easily regenerate saved graphs to draw in updated data
  • Explore QueryPics created by others — use them as the starting point for your own visualisations
  • Combine any number of queries, either from Australia or New Zealand
  • Click on the graphs to preview matching articles

All this and more documented on QueryPic’s extensive help page. Code on Github.

Old loves, new views…

I’m deeply in love with the collections of the National Archives of Australia. They move me, they inspire me, they make me want to do something. How do I express my love? I’ve written stories about things like atomic bombs, progress, astronomy and weather forecasting — pursuing lives and events documented in the Archives’ rich holdings. I work on projects like Invisible Australians, hoping to bring the compelling remnants of the White Australia Policy to broader public attention. And I build things. I make tools that help other people explore, understand and use the Archives. I do this because these riches need to be used. They need to be shared. They need to be part of the fabric of our lives.

A few years ago I created a little script for Firefox that put a fresh face on the display of digitised records in the National Archives’ RecordSearch database. It’s publicly available and has been installed more than 500 times. Demonstrating this script at the ‘Doing our bit’ Build-a-thon a few weeks ago made me realise again both how useful it was and how much work it still needed.

One of the most exciting features when I first created the script was the ability to display the records on a ‘3D wall’, courtesy of a Firefox plugin called CoolIris. But CoolIris uses Flash and is no longer being supported. Time for a new approach.

Say hello to the Archives Viewer (naming things isn’t really one of my strengths). Instead of rewriting my existing script I decided to create a completely new web application. Why? Mainly because it gave me a lot more flexibility. I could also make use of a variety of existing tools and frameworks like Django, Bootstrap, Isotope and FancyBox. Standing upon the code of giants, I had the whole thing up and running in a single weekend. The code is available on GitHub.

What does it do? Simply put, just feed the Archives Viewer the barcode of a digitised file in RecordSearch and it grabs the metadata and images and displays them in a variety of useful ways. It’s really pretty simple, both in execution and design.

Yep, there’s a wall. It’s not quite as spacey and zoom-y as the CoolIris version, but perhaps that’s a good thing. It’s just a flat wall of page image thumbnails with a bit of lightbox-style magic thrown in. But when I say just, well… look for yourself. There’s something a bit magical about seeing all the pages of a file at once, taking in their shapes and colours as well as their content. This digital wall provides a strangely powerful reminder of the physical object.

National Archives of Australia: ST84/1, 1908/471-480

Of course you can also view the file page by page if you want. Printing is a snap — just type in any combination of pages or page ranges and hit the button. The images and metadata are assembled ready to print. No more wondering ‘which file did this print out come from?’.
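To give a feel for what handling ‘any combination of pages or page ranges’ involves, here's a minimal sketch of a page range parser. It isn't the Archives Viewer's actual code, and the comma-and-hyphen syntax (`1-3, 7, 10-12`) and the function name are assumptions.

```python
def parse_pages(spec, total_pages):
    """Turn a string like '1-3, 7, 10-12' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
        else:
            start = end = int(part)
        if start > end:
            # Be forgiving about ranges typed back to front.
            start, end = end, start
        # Ignore anything outside the file's actual page count.
        pages.update(p for p in range(start, end + 1) if 1 <= p <= total_pages)
    return sorted(pages)


print(parse_pages("1-3, 7, 10-12", total_pages=11))
```

Using a set means overlapping ranges don't produce duplicate pages, and clamping to the page count quietly drops page numbers that don't exist in the file.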

But perhaps the most important feature is that each page has its own unique, persistent url. Basic stuff, but oh, so important. With a good url you can share and cite. Find something exciting? Tell the world about it! I’ve included your typical social media share buttons to help you along.

One disadvantage over the original userscript is that the viewer isn’t directly linked to RecordSearch. You probably don’t want to have to cut and paste the barcode every time you view a file. So I’ve also created a couple of connectors that ummm… connect things up.

The first connector is just a bookmarklet. A bookmarklet is just a little piece of javascript code disguised as a browser bookmark. Just drag this link — Archives Viewer — to your browser’s bookmark toolbar. Then when you’re on the item page of a digitised file in RecordSearch, just click the bookmarklet and you’ll be instantly transported to the wall.

The second connector is a bit smarter. It’s an enhanced version of another userscript I wrote to display the number of pages in a digitised file. It still does that, but now it also rewrites the links to the digitised files so that they automatically open in the Archives Viewer. It’s a bit harder to install. You need Chrome or Firefox and the add-ons Greasemonkey (for Firefox) or Tampermonkey (for Chrome). Then just go to the userscript page and hit the big ‘Install’ button.

You might be wondering about Zotero (at least I hope you are). My Zotero-RecordSearch translator lets you capture page images and metadata direct to your own research database, so what happens when you’re transported across to the Archives Viewer? Never fear, I’ve written a new translator that lets you save pages as you could in RecordSearch. Even better, you get a persistent, context-enriched url, and the ability to capture multiple pages at once. Yippee!

But that’s not quite all. Buried within the pages is some lovely Linked Open Data. To be truthful, it’s not really very ‘linked’ yet, but it does expose the basic metadata in a machine-readable form, borrowing from the vocabularies of projects like Locah and the Archival Ontology. It’s an experiment, as is the Archives Viewer itself. We can learn by doing.

I’ve given quite a few talks over recent times encouraging people to take up their tools and start hacking away at the digital collections of our cultural institutions. Yes, I admit it, I’m an impatient historian (and a grumpy one at that). But it’s also because I think it’s important that we recognise that access is never just something you’re given. It’s something that we make through our stories, our projects, and our tools. It’s something that’s grounded in respect and powered by love.

‘Doing our bit’ Build-a-thon

Last Saturday I was amongst a group of enthusiastic and knowledgeable volunteers getting stuck in to the ‘Doing our bit’ project at the Mosman Library. The Build-a-thon was the first stage in creating a new online resource documenting the experiences of World War I service people related to the Mosman area. We’re trying to make the whole process as open as possible, so the Build-a-thon was a way of exploring resources, issues, interfaces and ideas before we lay down too much code. You can read more on the project blog.

To provide some context for our labours, I gave a series of short talks:

  • ‘Small stories in a big data world’ [video] [links]
  • ‘A digital history toolkit’ [video] [links]
  • ‘Telling stories and building interfaces’ [video] [links]
  • ‘Connections and contexts through Linked Open Data’ [links]

You can see how the day unfolded on Storify, and view the participants hard at work on Flickr.

4 million articles later…

On 15 April 1944 the Sydney Morning Herald turned inside out. For more than a hundred years, the front page had been dominated by advertisements, but this changed suddenly in 1944 as the newspaper took on a completely new look. In place of the ads were the day’s top stories, headlines and photographs — a ‘front page’ design familiar to modern readers.

The change was, the newspaper explained, partly a response to the demands of war. Advertising had been cut due to the rationing of newsprint and ‘an urgent public demand in these critical days for more papers and more news’. But they were also looking forward to the problems of peace:

It is essential… that we should not only provide the space, but also adopt the manner and methods of presentation which will spread knowledge of these problems yet more widely, and bring them home yet more deeply, among the people of this country.

But the Sydney Morning Herald wasn’t breaking new ground. The design of front pages had been changing across the first half of the twentieth century as advertisements gradually gave way to news. This graph shows the average number of words per issue on the front pages of Australian newspapers devoted to advertising.

You can see a clear decline from about the turn of the century. News articles, on the other hand, were on the way up.

Not all the changes were as sudden as the Sydney Morning Herald’s. The Barrier Miner entered the First World War with the ads on top, but by war’s end the position was reversed. In between was a period of transition as you can see from this graph which plots advertising against news.

If you dig a bit deeper, you find that the amount of advertising follows a regular pattern.

These peaks and troughs in June 1916 are a week apart — Saturday’s front page was all advertising, but the next day brought a ‘Special Sunday Issue’ focused on the ‘Latest War News’.

It’s clear just from these two examples that there are stories behind these changes. There are subtleties and contingencies to be explored along with dramatic shifts.

And now you can explore them…

The Front Page

The Front Page is a database containing details of more than 4 million front page newspaper articles harvested from the National Library of Australia’s Trove service.

Trove divides articles into a series of categories:

  • articles (news)
  • advertising
  • detailed lists, results, guides
  • family notices
  • literature

I’ve simply gone through and added up the numbers of articles and the numbers of words in each category for each issue, and aggregated this across months, years and the full run of each newspaper.
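The adding up can be sketched in a few lines. This is an illustration rather than the code behind The Front Page: the field names and sample records are made up, the category labels are simplified, and it assumes one issue per newspaper per day.

```python
from collections import defaultdict

# Made-up article records; the real field names may differ.
articles = [
    {"newspaper": "Barrier Miner", "date": "1916-06-03", "category": "advertising", "words": 900},
    {"newspaper": "Barrier Miner", "date": "1916-06-03", "category": "advertising", "words": 400},
    {"newspaper": "Barrier Miner", "date": "1916-06-04", "category": "news", "words": 1200},
    {"newspaper": "Barrier Miner", "date": "1917-01-10", "category": "news", "words": 700},
]


def add_up(articles, date_length):
    """Total articles and words per newspaper, period and category.

    date_length trims the ISO date: 10 = issue (day), 7 = month, 4 = year.
    """
    totals = defaultdict(lambda: {"articles": 0, "words": 0})
    for article in articles:
        key = (article["newspaper"], article["date"][:date_length], article["category"])
        totals[key]["articles"] += 1
        totals[key]["words"] += article["words"]
    return dict(totals)


by_issue = add_up(articles, 10)
by_month = add_up(articles, 7)
by_year = add_up(articles, 4)
print(by_month[("Barrier Miner", "1916-06", "advertising")])
```

Trimming the date string is a cheap way to get the issue, month and year totals from the same function. Totals for the full run of a newspaper would just drop the date from the key.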

These totals are presented as a series of linked tables and graphs. Just click on a point to zoom in, or use the navigation controls to go directly to the issue of your choice. It’s pretty straightforward.

Why?

We’re lucky to have rich resources like Trove, but if we’re going to make best use of them we have to move beyond the search box to find new ways of exploring and contextualising their content. That’s why I’ve developed tools like QueryPic, Headline Roulette and even The future of the past. Each lets you engage with the newspaper database in a different way.

But not all newspaper articles are created equal. I’d like to be able to aggregate and analyse the ‘top’ stories for each day, but to do this I need to know more about the structure of the newspapers themselves. I’ve already made a few attempts to find and extract editorials. This is useful because before the main news moved to the front page it was often directly after the editorials. But when did the news shift to the front page?

Now I can find out.

But why create a public web resource? Well, it’s just what I do. I build and I share. It’s what motivates me. It’s how I understand things. It’s where I find both my questions and my answers. Hey, I’m a digital humanist ok?

How?

Everything’s up on GitHub, so you can follow along with my ugly coding. It was all a bit of an experiment, because I simply didn’t know whether I could harvest and use 4 million articles. How long would it take? Would MySQL grind to a halt? Would my laptop blow up?

In my Harold White lecture I wondered whether what I was trying to do was really beyond the reach of ‘an ordinary bloke and his laptop’. I suspect the day is rapidly coming where my work will be superseded by well-funded academic projects with access to supercomputers and a pool of bright young graduate students. But for now I’ll just keep pushing the boundaries of what’s possible over a dodgy home broadband connection.

Of course, this project was only possible because of the Trove API. My screen-scrapers of yore would have been impossibly slow and wasteful of bandwidth. With the API I could simply construct a query and then loop through the 4 million articles in batches of a hundred. These were then fed into MySQL via Django. I quickly worked out that I needed to keep my Django models simple. My clever relational model linking newspapers, issues, pages and articles was just too complex for this sort of operation. I flattened everything out to store all the metadata in a single ‘article’ model.
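The shape of that harvesting loop is roughly as follows. It's a sketch only: `fetch_batch` is a hypothetical stand-in for the real API request, the field names are invented, and a fake data source is used so it runs without touching the network.

```python
def harvest(fetch_batch, batch_size=100):
    """Yield every article by requesting one batch after another.

    fetch_batch(start, size) is a stand-in for the real API request. It
    should return a list of article dicts, shorter than size (or empty)
    once the results run out.
    """
    start = 0
    while True:
        batch = fetch_batch(start, batch_size)
        for article in batch:
            # One flat record per article, echoing the single 'article' model.
            yield {
                "id": article["id"],
                "newspaper": article["newspaper"],
                "date": article["date"],
                "category": article["category"],
                "words": article["words"],
            }
        if len(batch) < batch_size:
            break
        start += batch_size


# A fake data source with invented records.
FAKE_RESULTS = [
    {"id": n, "newspaper": "Example Times", "date": "1944-04-15",
     "category": "news", "words": 100 + n}
    for n in range(250)
]


def fake_fetch(start, size):
    """Pretend to be the API by slicing the fake results."""
    return FAKE_RESULTS[start:start + size]


harvested = list(harvest(fake_fetch))
print(len(harvested))
```

In a real harvest each flat record would be saved to the database as it arrives rather than collected in a list, so that a dropped connection doesn't mean starting again from scratch.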

The harvesting operation took about 5 days. Once I had all the metadata I ran a couple of processes to do all the adding up and saved the results to a separate ‘totals’ table.

Then it was just a matter of building a front end. Using Django, Twitter Bootstrap and HighCharts made this amazingly easy. Really. Really truly.

What now?

I built this because I wanted to track changes in the design of front pages, but now I’m wondering what else I can find. The role of war in the examples above is intriguing. Are there other changes in our relationship to ‘news’ that these graphs might reveal?

I hope other people will wonder about this as well.

I have some ideas for future developments. For example, I’d like to add tagging to make it easy to construct timelines of significant changes. But first I just want to see if anybody’s actually interested. If you have any ideas, suggestions or comments please let me know.

Ok, off you go — explore.

The future of the past

[view on Storify]

This is a story about a thing I made. I’m still not sure what to call it. Or what it’s really for.

But I like it.

And I hope other people will too…

Topic modelling in the archives

There seems to be a lot of topic modelling going on at the moment. And why not? Projects like Mining the Dispatch are demonstrating the possibilities. Tools like Mallet are making it easy. And generous DHers like Ted Underwood and Scott Weingart are doing a great job explaining what it is and how it works.

I’ve talked briefly about using topic modelling to explore digitised newspapers, something that the Mapping Texts project has also been investigating. But I’ve also been following with interest Chad Black’s use of algorithmic techniques, including topic modelling, to look for local variations amidst the legal system of the early modern Spanish empire.

As part of the Invisible Australians project, Kate and I are exploring the bureaucracy of the White Australia Policy. In particular, we’re interested in the interaction between policy and practice, between the highly-centralised bureaucracy and the activities of individual port officials. Like Chad, we’re interested in mapping local variations — to try and understand the bureaucracy from the point of view of an individual forced to live within its restrictions.

I recently gave a presentation about the project at Digital Humanities Australasia (post coming soon!), and in preparation I decided to try a few topic modelling experiments. They were very simple, but I was impressed by the possibilities for exploring archival systems.

The problem I started with was this. The workings of the White Australia Policy are well documented by records held by the National Archives of Australia. Some series within the archives are specifically related to the operations of the policy — such as those containing many thousands of CEDTs. But there are also general correspondence series created by the customs offices in each state, as well as the Commonwealth Department of External Affairs which administered the Immigration Restriction Act (responsibility was later taken by the Department of Home and Territories and its successors). These general correspondence series are important, because they often include details of difficult or controversial cases — those that required a policy judgment, or prompted a change in existing practices. But how do you find relevant files within series that can contain large numbers of items?

Series A1, for example, is a correspondence series created by the Department of External Affairs. It contains more than 60,000 items. Past research tells us that amongst these 60,000 files are records of important policy discussions relating to White Australia. But these files tend to be labelled with the names of the people involved, so unless you know the names in advance they can be difficult to find.

Mitchell Whitelaw’s A1 Explorer, part of the Visible Archive project, lets you explore the contents of Series A1 in an easy and engaging way. But while the A1 Explorer provides new opportunities for discovery, it doesn’t offer the fine-grained analysis we need to sift out the files we’re after. And so… topic modelling.

The process was pretty simple. While I can dip into my bag of screen-scrapers to harvest series directly from the NAA’s RecordSearch database, there was already an XML dump of A1 available from data.gov.au. So I extracted the basic file metadata from the XML and wrote the identifiers and titles out to a text file, one item per line. Following the instructions on the website I then loaded this file into Mallet:

/Applications/Mallet/bin/mallet import-file --input ./A1.txt --output A1.mallet --keep-sequence --remove-stopwords

Then it was just a matter of firing up the topic modeller:

/Applications/Mallet/bin/mallet train-topics --input ./A1.mallet --output-state ./A1.gz --output-doc-topics ./A1-topics.txt --output-topic-keys ./A1-keys.txt --num-topics 40

Again, I just followed the examples on the Mallet site.
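The extraction step that comes before the import (pulling identifiers and titles out of the XML and writing one item per line) might look something like the sketch below. The element names `item`, `identifier` and `title` are assumptions made for illustration, not the schema of the actual dump, and the sample records are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the element names are assumptions, not the real schema.
SAMPLE_XML = """<series>
  <item><identifier>1</identifier><title>John Smith - Naturalisation</title></item>
  <item><identifier>2</identifier><title>Ah Wong - Exemption
     certificate</title></item>
</series>"""


def items_to_lines(xml_text):
    """Return one 'identifier<TAB>title' line per item."""
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iter("item"):
        identifier = (item.findtext("identifier") or "").strip()
        # Collapse internal whitespace so each title stays on a single line.
        title = " ".join((item.findtext("title") or "").split())
        if identifier and title:
            lines.append(f"{identifier}\t{title}")
    return lines


lines = items_to_lines(SAMPLE_XML)
print("\n".join(lines))
```

Writing `"\n".join(lines)` out to a text file gives the one-item-per-line input described above.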

Once it was finished I opened up A1-keys.txt to browse the ‘topics’ Mallet had found. The results were intriguing. There are a large number of applications for naturalisation in A1, so it’s no surprise that ‘naturalisation’ figures prominently in a number of the topics. What was more interesting was the way Mallet had grouped the naturalisation files. For example:

naturalization christian hans hansen jensen petersen andersen nielsen larsen christensen johannes jens niels pedersen andreas johansen martin jorgensen

and

naturalisation certificate giuseppe salvatore frank la leo samios spina sorbello leonardo fisher natale patane torrisi barbagallo luka rossi ross

Based on the co-occurrence of names within the file titles, Mallet had created groupings that roughly reflected the ethnic origins of applicants. It makes sense when you think about what Mallet is doing, but I still found it pretty amazing.

Mallet also found clusters around the major activities of the department, such as the administration of the territories. But of most interest to us was:

1 0.55539 passport ah student exemption students lee wong chinese young deserter education sing wing chong readmission son hing chin wife

The Chinese names alongside words such as ‘readmission’ and ‘wife’ suggested that this topic revolved around the administration of the White Australia Policy. This was easy to test. In A1-topics.txt was a list of every file in the series and their weightings in relation to each of the topics. I wasn’t sure what was a reasonable cut-off value to use in assessing the weightings, but after a bit of trial and error I fixed on a value of 0.7. I then just extracted the identifiers of every file that had a weighting greater than 0.7 for this topic. I used the identifiers to build a simple web page that Kate and I could browse. I also included links back to RecordSearch so we could explore further.
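The filtering step can be sketched as follows. This assumes a doc-topics layout in which each row holds a document number, a document name, and then alternating topic and weight values; the layout differs between Mallet versions, so check your own file before relying on it. The sample rows and item names are invented.

```python
def files_above_threshold(lines, topic, threshold=0.7):
    """Return document names whose weight for `topic` exceeds `threshold`."""
    matches = []
    for line in lines:
        # Skip blank lines and the header row.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        name, pairs = fields[1], fields[2:]
        # Walk the (topic, weight) pairs two fields at a time.
        for topic_id, weight in zip(pairs[0::2], pairs[1::2]):
            if int(topic_id) == topic and float(weight) > threshold:
                matches.append(name)
                break
    return matches


# Invented rows in the assumed layout: doc number, name, topic/weight pairs.
sample = [
    "#doc name topic proportion ...",
    "0 item-0001 1 0.82 7 0.10",
    "1 item-0002 7 0.91 1 0.05",
    "2 item-0003 1 0.70 3 0.20",
]
selected = files_above_threshold(sample, topic=1)
print(selected)
```

Raising or lowering `threshold` is the quickest way to repeat the trial and error described above.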

Browse the full list

It’s a pretty impressive result. Instead of fumbling with the uncertainties of keyword searches, we now have a list of more than 1,300 files that are clearly of relevance to Invisible Australians. There are a few false positives and there are likely to be other files that we’ll have missed altogether, but now we have a much clearer picture of the types of files that are included and how they are described.

And that was at my first attempt, simply using the default settings. I’m now starting to play around with some of Mallet’s configuration options to see what sort of difference they make. I’m also keen to try out GenSim, a topic modelling package for Python.

I’m really excited about the possibilities of these sorts of tools for analysing the contents of archival descriptive systems, something I mentioned in my Digital Humanities Australasia paper. Much more to come on this I suspect…

Mining for meanings

Yes, I have a suit. On 8 May at the National Library of Australia I gave my suit an outing as I delivered my Harold White Fellowship presentation. Thanks to everyone who came along.

If you missed it or want to relive the fun, the NLA has made a podcast available. My slides are also online, so you can follow along for the full audio-visual-not-quite-3D experience.

Use your arrow keys to navigate through the slides, and yes the first page is intentionally left blank. If you linger for a bit on slide two or three, you’ll see the Trove API in action. The presentation itself was constructed using deck.js.

The slides also include links to lots of different examples and demos, and introduce my new favourite plaything. I don’t really know what to call it yet, or what it’s actually for, but it makes me happy, and it makes me think. TF-IDF FTW. I’ll write up some more details shortly.

The new QueryPic (or what a difference an API makes)

It seems a bit late to be introducing the newest version of QueryPic. Folks are already using it to explore the contents of digitised newspapers made available through Trove and Papers Past. Some, like the National Library of New Zealand, Andrew S. Bowman and the Carnamah Historical Society are already blogging about it. But I suppose I’d better document a few things…

As I noted in my post about QueryPicNZ (yes I now have a rather confusing proliferation of QueryPics), I was waiting for the Trove API to become public. Last week I noticed a little ‘API’ link pop up in the Trove footer and so I set to work…

"The past" versus "the future" in the new QueryPic

My original version of QueryPic (recently reviewed in the Journal of the Digital Humanities) used a series of Python scripts to harvest and scrape content from the Trove web pages. This meant that you had to download the scripts and be code-confident enough to run them in a terminal. It’s still a useful tool and I’ll be updating it as well, but I wanted to create something quicker and simpler that encouraged people to explore and play.

The latest version of QueryPic (QueryPic+, QueryPic Web, QueryPic 2.0?) simply runs in your browser. It uses JQuery to grab data on the fly from the Trove and DigitalNZ APIs. Like previous versions, it uses the HighCharts library to turn the data into pretty graphs.

What does it do? It’s really pretty basic. QueryPic just displays the number of articles matching your search query over time. By default, these are displayed as a proportion of the total articles available for that year, but a dropdown field lets you switch to view the raw numbers. It’s simple, but it’s also remarkably evocative, suggestive and fun. Just try it!

Why stop at just one query? To compare frequency patterns you can add as many as you like. Just keep entering new words or phrases.

If you notice an interesting peak or trough you can just click on it and another API request will be fired off to retrieve the first 20 matching articles. So it’s also a new way of exploring the newspaper databases themselves.

There are plenty of limitations — not all newspapers are digitised, for example, and the quality of the OCR is patchy. The National Library of New Zealand’s post does a great job summing up a number of issues relating to Papers Past. It’s not magic, it’s not perfect, but is it useful? I think so.

Tasks for the future:

  • Create some sort of backend that makes it easy to save, share and cite your query data. The ‘share’ link just regenerates the graph which, of course, might change as new articles are added to the databases.
  • Make it possible to add more complex queries — I want to keep the interface simple, so I’ll probably create a bookmarklet to take any Trove or Papers Past query and display it using QueryPic.
  • As I mentioned over at the WraggeLabs Emporium, I intend to rewrite my various Trove tools to work with the new API. This will include the classic Python version of QueryPic. I still think it’s useful for harvesting your own data.
The code is on my GitHub site and you can also follow updates at the QueryPic page in the WraggeLabs Emporium.

 

QueryPicNZ

You may have noticed I have a bit of an interest in exploring ways of using digitised historical newspapers. In the last year or so I’ve spent a lot of time scraping, mining, processing and visualising content from the Trove collection of digitised Australian newspapers. But what about other countries?

Recently I was invited to a digital history workshop organised by Sydney Shep (@nzsydney) at the Victoria University of Wellington. In between sessions I started to play with the DigitalNZ API guided by Chris McDowall (@fogonwater). In anticipation of the forthcoming Trove API I’d already done a bit of work converting QueryPic to run in the browser. It didn’t take long to adapt this to work with New Zealand newspapers available through Papers Past.

So presenting for your enjoyment and education… QueryPicNZ.

Wind, rain and snow in QueryPicNZ

Like QueryPic, the New Zealand version graphs newspaper search results over time. But thanks to the DigitalNZ API it has a number of advantages:

  • it runs in your browser — no need to download or run any scripts
  • results appear almost instantly
  • easy to combine queries — just search on a new word or phrase
  • easy to remove queries — just use the ‘Clear last’ button
  • easy to share — just copy the provided link or use the Tweet button

It’s limited to simple word or phrase searches at the moment, but eventually I’ll add the ability to process more sophisticated queries. I also want to add a way of saving, sharing and citing graphs. For now the ‘share’ link simply regenerates the graph, so if the content has changed the result could well be different.

The code is available on GitHub.

Ultimately, I want to combine Trove and Papers Past so that you can query and combine content from either Australia or New Zealand… perhaps even other countries?

Mining the treasures of Trove

In February I made a quick dash to Melbourne to talk at VALA2012.

The paper I originally submitted, ‘Mining the treasures of Trove: New approaches and new tools’, provided a general introduction to the use of digitised historical newspapers and the possibilities of digital history. You can download the pdf from the VALA2012 proceedings, or view online at Scribd.

I ended up presenting something a little different, focusing on my recent work around 1913 and extracting editorials from the Trove newspaper database. You can view the slides on Slideshare or watch a video of the whole presentation on the VALA2012 site.

Extracting editorials #3

By my own criteria I’ve already failed… I started this series of posts with the intention of documenting the process of finding and extracting editorials as I was actually doing the work. But here I am about to describe some work I finished a few weeks back. Oh well…

In my previous instalments (here and here), I focused on the Sydney Morning Herald. Having continued the hunt for missing editorials I started in the last post, I’ve now got a CSV file with the urls of the first editorial published in every edition of the SMH from 1913. Good-o, I thought, I can now start harvesting and analysing some content.

But then ensued a crisis of faith. The whole point of this exercise was to be able to build up some comparisons  – between newspapers, between states, between the city and the bush. But the process of actually finding the editorials seemed beset with difficulties. Could the rules I developed for the SMH be applied elsewhere? Could I ever assemble a useful set of editorials without large amounts of human intervention? I decided to try a few quick experiments to see whether the whole project was worth pursuing.

I started with a few assumptions:

  1. The first (and only the first) editorial in any issue is headed with the name of the newspaper.
  2. Editorials are published on even numbered pages.
  3. Editorials vary in length between about 100 and 1500 words.

These assumptions were based on my own experience as a long-time newspaper researcher and on some preliminary poking around. For example, when I looked at The Argus I noticed that editorials were typically followed by news summaries. Unfortunately, these are treated as a single article in Trove, resulting in large blocks of text that are only part editorial. By specifying an upper word limit I hoped to filter these sorts of articles out. Similarly, there are sometimes brief announcements or publication details headed with the name of the newspaper. The lower word limit was intended to exclude these.

The next step was to harvest every article from 1913 that was headed with the name of its publication. I created a script to generate a list of all the newspapers that published issues in 1913. Then I called my existing harvester to download all the matching articles and save the details to a series of CSV files — one CSV file per newspaper.

In the previous instalment of this series I created a script to check the CSV output of my harvester for missing or duplicate dates. I extended this to perform a series of tests on each article based on the assumptions above. First, I filtered out articles on odd-numbered pages, then articles that were too short or too long. Finally I checked the remainder for missing or duplicate issue dates.
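As a rough illustration of those tests (not the script itself), the sketch below sorts articles into categories using the page and word-length assumptions, then checks the surviving candidates against a list of expected issue dates. The dictionary keys and the idea of passing in `expected_dates` are assumptions made for the example, and the sample articles are invented.

```python
from collections import Counter
from datetime import date

MIN_WORDS, MAX_WORDS = 100, 1500


def classify(article):
    """Sort one article into a category using the page and length assumptions."""
    if article["page"] % 2 != 0:
        return "odd_page"
    if article["words"] < MIN_WORDS:
        return "too_short"
    if article["words"] > MAX_WORDS:
        return "too_long"
    return "candidate"


def check_dates(candidates, expected_dates):
    """Report expected issue dates with no candidate, and dates with several."""
    counts = Counter(article["date"] for article in candidates)
    missing = [d for d in expected_dates if counts[d] == 0]
    duplicates = [d for d in expected_dates if counts[d] > 1]
    return missing, duplicates


# Invented sample articles covering three issue dates.
articles = [
    {"date": date(1913, 1, 1), "page": 6, "words": 800},
    {"date": date(1913, 1, 1), "page": 7, "words": 900},
    {"date": date(1913, 1, 2), "page": 4, "words": 40},
    {"date": date(1913, 1, 3), "page": 8, "words": 1200},
    {"date": date(1913, 1, 3), "page": 8, "words": 700},
    {"date": date(1913, 1, 3), "page": 10, "words": 3000},
]
expected = [date(1913, 1, 1), date(1913, 1, 2), date(1913, 1, 3)]

categories = [classify(article) for article in articles]
candidates = [a for a, c in zip(articles, categories) if c == "candidate"]
missing, duplicates = check_dates(candidates, expected)
print(categories, missing, duplicates)
```

Keeping the rejected categories separate, rather than simply discarding them, is what makes it possible to browse them later and see where the assumptions break down.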

The details of the articles in each category were written out to JSON files. Using these files and a bit of JQuery magic I could quickly build a simple web interface that allowed me to explore the results.

Summary details of each newspaper

You can browse the summary results for the full list of newspapers, or you can drill down to view the actual articles assigned to each category.

Full details

I’ll save the full analysis for the next post, but if you play around with the results you quickly notice a few things. First, letters to the editor often include the name of the newspaper! If you look at The Mercury, for example, you’ll notice I’ve identified 1057 potential editorials — most of which are letters. Fortunately they should be fairly easy to filter out. In most cases the ‘even numbers only’ assumption worked pretty well, and the word length filters did remove quite a lot of false positives. There are still plenty of problems, but I’m encouraged enough to continue. Yes, there will be a Part #4!

 

2011 — the year of little sleep

2011 was a busy year. It’s hard to believe that it was only February when I first posted about my experiments mining the contents of the Trove newspaper database. Since then I’ve developed a set of digital tools, organised THATCamp Canberra, given a series of presentations on the possibilities of digital history, pushed ahead with Invisible Australians, and tried to develop my own digital research program. Oh yes, and endeavoured to earn enough money to feed the kids and pay the mortgage…

It looks like 2012 could be even busier, so before I lose track completely, I thought I’d pull together some of the past year’s exploits for handy reference. So here’s (most of) my presentations for 2011…

8 June 2011 — ‘Confessions of an impatient historian’
Scholars’ Lab, University of Virginia

18 August — ‘Digital history: new tools and techniques’
National Museum of Australia

24 August — ‘Hacking the archives’
Archival description in an online world, Recordkeeping Roundtable, Sydney

5 September 2011 — Digital research methods
Cultural heritage students, University of Canberra

14 September 2011 — ‘Every story has a beginning’
Keynote presentation at the Indexing See Change Conference (Australian and New Zealand Society of Editors)

13 November 2011 — ‘Digital history: new tools and techniques’
Dragontails 2011: 2nd Australasian conference on overseas Chinese history & heritage, Museum of Chinese Australian History, Melbourne

30 November 2011 — ‘It’s all about the stuff’
National Digital Forum, Wellington, New Zealand

7 December 2011 — ‘An introduction to digital history’
Digital December, State Library of NSW

It’s all about the stuff — the movie

Videos from NDF2011 are now available online. Here’s the movie version of my talk It’s all about the stuff. I seem to spend a lot of time in the shadows…

QueryPic

Back when I was looking at ‘When did the Great War become the First World War?’ I promised a detailed post on how I constructed the graphs. But of course I got distracted. Then I started adding new features to the script and redesigning the graphs, so…

Anyway, the result is a rather neat little gizmo henceforth named QueryPic (I got a bit sick of ‘search summariser’ and ‘graph-maker thing’). The first version just harvested data and left all the graph-making to you. But QueryPic does it all! It harvests the data and makes the graph. Woohoo.

Here’s an example showing ‘drought’ versus ‘flood’:

QueryPic features

  • Explore your Trove newspaper query over time in the form of a simple line graph.
  • Interactive — click on a point to retrieve sample articles from that date.
  • Combine data sources to compare queries.
  • Choose your interval — plot by year or month.
  • Switch views between total results and the proportion of all articles.

Running QueryPic

Yes, it’s a Python script and yes it runs on the command line. Let’s get that out of the way now. I don’t think I have the time and energy to develop cross-platform gui versions of all my tools. I’d rather spend the time adding new features or exploring new possibilities. Sorry, but until I have a wealthy benefactor or a technical support team, I think that’s the way it has to be. In any case, the code is all there – so build your own gui!

Actually, if I did have the time and energy I don’t think I’d build a standalone gui anyway. What would be much cooler would be a web service, where people could run, share and combine their queries. Social graph-making! A celebration of serendipity! A historical playground! Hmmm…

But for now there’s this python script. It’s dead easy to use. Starting from the beginning…

  1. Do you have Python installed? If you have a Mac or Linux the answer is yes. Fire up a terminal and type ‘python -V’ — see, I told you. If you have Windows you can get a handy installer. Do it.
  2. Get the source code. Just download this zip file and open it into a new folder.
  3. Open a terminal and cd into the new folder.
  4. Run ‘python do_totals.py [your Trove query]’.
  5. Watch in excitement as the script chugs away retrieving data from Trove.
  6. Once the script is finished, go to the ‘graphs’ directory, where you’ll find your newly-created html page complete with fancy interactive graph.
  7. Open the html page in the web browser of your choice.
  8. Enjoy! Celebrate! Drink a toast in my honour!

Customising QueryPic

There are a number of optional arguments that you can add to the command line to customise your results:

-n (or --name) [a query name]
Give a name to your query. The name is used to create filenames for the html and data files, and it is also used in the legend of the graph. The default is to use the search keywords as the name.

-d (or --directory) [a directory path]
The full pathname of the directory/folder for your results. The default is a ‘graphs’ sub-directory in the current directory.

-g (or --graph) [a graph name]
Specify the name of the html file that’s created. This is useful for displaying multiple queries on a single graph. Just run QueryPic for each query, using the same graph name each time. The default is either the value specified by the -n parameter or a name derived from the search keywords.

-m (or --monthly)
Plot the query at monthly intervals. The default interval is a year.
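If you’re building your own version, the options above could be wired up along these lines. This is a sketch using Python’s standard argparse module, not a copy of what do_totals.py does internally. In particular, pulling the default name from the `q` parameter of the query url is an assumption about how ‘search keywords’ might be derived.

```python
import argparse
from urllib.parse import parse_qs, urlparse


def parse_args(argv):
    """Parse a Trove query url plus the optional flags described above."""
    parser = argparse.ArgumentParser(description="Graph a Trove newspaper query.")
    parser.add_argument("query", help="the url of a Trove newspaper search")
    parser.add_argument("-n", "--name", help="a name for the query")
    parser.add_argument("-d", "--directory", default="graphs",
                        help="directory for the results")
    parser.add_argument("-g", "--graph", help="name of the html file")
    parser.add_argument("-m", "--monthly", action="store_true",
                        help="plot at monthly rather than yearly intervals")
    args = parser.parse_args(argv)
    if args.name is None:
        # Assumed fallback: use the value of the 'q' parameter as the name.
        keywords = parse_qs(urlparse(args.query).query).get("q", ["query"])
        args.name = keywords[0]
    if args.graph is None:
        args.graph = args.name
    return args


args = parse_args(["http://trove.nla.gov.au/newspaper/result?q=cat", "-g", "animals", "-m"])
print(args.name, args.graph, args.directory, args.monthly)
```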

What QueryPic actually does

QueryPic builds a simple visualisation of your search query in the Trove newspaper database. A list of search results is difficult to interpret and offers little context. QueryPic shows you the number of articles matching your query over time, enabling you to reframe your questions, pursue hunches, or simply play around.

QueryPic takes your Trove newspaper query and looks for a date range. If it doesn’t find one, it assumes you want your graph to go from 1803 to 1954 (the complete contents of the newspaper database — except for the Women’s Weekly). QueryPic then strips out any date parameters from the query, so it can fire off the query within the start and end dates, at the specified date interval.

Date interval? In the previous version of this script you could only plot points at yearly intervals, so it was impossible to zoom in and see what might be happening over the span of a single year or two. But amazing advances in QueryPic technology mean you can now plot changes by month. Here for example is a new version of my Great War/First World War graph, focused on 1938–1946 and plotted at monthly intervals.

So for each interval within the date range QueryPic fires off a request to Trove. From the response it scrapes out the total number of results for that date. If the total is greater than zero, it then fires off a second request to find the total number of newspaper articles for that year. Your query results divided by the total number of articles gives the proportion of articles for that date matching your search query.
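In outline, that loop looks something like the sketch below. It isn’t the real script: `count_matches` and `count_all` are hypothetical stand-ins for the two requests to Trove, and the numbers in the example are invented.

```python
def intervals(start_year, end_year, monthly=False):
    """Yield (year, month) points; month is None for yearly intervals."""
    for year in range(start_year, end_year + 1):
        if monthly:
            for month in range(1, 13):
                yield (year, month)
        else:
            yield (year, None)


def build_series(points, count_matches, count_all):
    """Pair each point with its result total and proportion of all articles."""
    series = []
    for point in points:
        total = count_matches(point)
        # Only ask for the overall count when there are results to divide.
        proportion = total / count_all(point) if total > 0 else 0.0
        series.append({"point": point, "total": total, "proportion": proportion})
    return series


# Invented counts standing in for the responses from Trove.
matches = {(1920, None): 50, (1921, None): 0}
everything = {(1920, None): 1000, (1921, None): 2000}

series = build_series(intervals(1920, 1921), matches.get, everything.get)
print(series)
```

Skipping the second lookup when there are no results saves a request for every empty interval, which adds up quickly when plotting by month.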

The number of results and the proportion are written to a javascript file, together with some other important information including the original query and the date the harvest was performed. Remember, the Trove newspapers database is always changing! QueryPic then grabs a copy of its own special html template and inserts a reference to this javascript file. For good measure, it also inserts a link to your original query. The file is saved under a new name, ready for you to open and explore.

The html file contains everything necessary to take your data and turn it into a graph. It does this using the HighCharts javascript library. Please note that while licence conditions allow HighCharts to be redistributed as part of a non-commercial package, it is not free for commercial use. Check the HighCharts website for details.

Some examples

Plot ‘cat’ against ‘dog’ in a graph called ‘animals’:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -g "animals"
python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=dog" -g "animals"

Specify a directory for your results:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -d "/User/bill/Documents/graphs"

Plot results at monthly intervals:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat&fromyyyy=1920&toyyyy=1921" -m

Specify a name:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -n "Felines"