the shed
hacks

QueryPic

Back when I was looking at ‘When did the Great War become the First World War?’ I promised a detailed post on how I constructed the graphs. But of course I got distracted. Then I started adding new features to the script and redesigning the graphs, so…

Anyway, the result is a rather neat little gizmo henceforth named QueryPic (I got a bit sick of ‘search summariser’ and ‘graph-maker thing’). The first version just harvested data and left all the graph-making to you. But QueryPic does it all! It harvests the data and makes the graph. Woohoo.

Here’s an example showing ‘drought’ versus ‘flood’:

QueryPic features

  • Explore your Trove newspaper query over time in the form of a simple line graph.
  • Interactive — click on a point to retrieve sample articles from that date.
  • Combine data sources to compare queries.
  • Choose your interval — plot by year or month.
  • Switch views between total results and the proportion of all articles.

Running QueryPic

Yes, it’s a Python script and yes it runs on the command line. Let’s get that out of the way now. I don’t think I have the time and energy to develop cross-platform gui versions of all my tools. I’d rather spend the time adding new features or exploring new possibilities. Sorry, but until I have a wealthy benefactor or a technical support team, I think that’s the way it has to be. In any case, the code is all there – so build your own gui!

Actually, if I did have the time and energy I don’t think I’d build a standalone gui anyway. What would be much cooler would be a web service, where people could run, share and combine their queries. Social graph-making! A celebration of serendipity! A historical playground! Hmmm…

But for now there’s this python script. It’s dead easy to use. Starting from the beginning…

  1. Do you have Python installed? If you have a Mac or Linux the answer is yes. Fire up a terminal and type ‘python -V’ — see, I told you. If you have Windows you can get a handy installer. Do it.
  2. Get the source code. Just download this zip file and unzip it into a new folder.
  3. Open a terminal and cd into the new folder.
  4. Run ‘python do_totals.py [your Trove query]’.
  5. Watch in excitement as the script chugs away retrieving data from Trove.
  6. Once the script is finished, go to the ‘graphs’ directory, where you’ll find your newly-created html page complete with fancy interactive graph.
  7. Open the html page in the web browser of your choice.
  8. Enjoy! Celebrate! Drink a toast in my honour!

Customising QueryPic

There are a number of optional arguments that you can add to the command line to customise your results:

-n (or --name) [a query name]
Give a name to your query. The name is used to create filenames for the html and data files, and it also appears in the legend of the graph. The default is to use the search keywords as the name.

-d (or --directory) [a directory path]
The full pathname of the directory/folder for your results. The default is a ‘graphs’ sub-directory in the current directory.

-g (or --graph) [a graph name]
Specify the name of the html file that’s created. This is useful for displaying multiple queries on a single graph. Just run QueryPic for each query, using the same graph name each time. The default is either the value specified by the -n parameter or a name derived from the search keywords.

-m (or --monthly)
Plot the query at monthly intervals. The default interval is a year.
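If you're curious how options like these are usually wired up, here's a minimal sketch using Python's standard argparse module. It isn't lifted from do_totals.py: the function name build_parser and the help strings are assumptions made purely for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options described above."""
    parser = argparse.ArgumentParser(
        description="Graph a Trove newspaper query over time.")
    parser.add_argument("query", help="url of a Trove newspaper search")
    parser.add_argument("-n", "--name",
                        help="name used for filenames and the graph legend")
    parser.add_argument("-d", "--directory", default="graphs",
                        help="directory for the results")
    parser.add_argument("-g", "--graph",
                        help="name of the html file to create")
    parser.add_argument("-m", "--monthly", action="store_true",
                        help="plot by month instead of by year")
    return parser
```

Calling build_parser().parse_args() would then hand back an object with query, name, directory, graph and monthly attributes, ready to pass on to the harvesting code.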

What QueryPic actually does

QueryPic builds a simple visualisation of your search query in the Trove newspaper database. A list of search results is difficult to interpret and offers little context. QueryPic shows you the number of articles matching your query over time, enabling you to reframe your questions, pursue hunches, or simply play around.

QueryPic takes your Trove newspaper query and looks for a date range. If it doesn’t find one, it assumes you want your graph to go from 1803 to 1954 (the complete contents of the newspaper database — except for the Women’s Weekly). QueryPic then strips out any date parameters from the query, so it can fire off the query within the start and end dates, at the specified date interval.
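Here's a rough sketch of that first step, written for Python 3. It assumes the date range lives in the fromyyyy and toyyyy parameters (as in the example urls further down); the function name split_query is made up, and the real script may well go about it differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Defaults described above: the full span of the newspaper database.
DEFAULT_START, DEFAULT_END = 1803, 1954
# Assumed names of the date parameters in a Trove search url.
DATE_PARAMS = ("fromyyyy", "toyyyy")


def split_query(url):
    """Return (url_without_date_params, start_year, end_year)."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    found = dict(params)
    # Fall back to the defaults when a date parameter is missing or empty.
    start = int(found.get("fromyyyy") or DEFAULT_START)
    end = int(found.get("toyyyy") or DEFAULT_END)
    # Keep everything except the date parameters.
    kept = [(key, value) for key, value in params if key not in DATE_PARAMS]
    clean = urlunsplit(parts._replace(query=urlencode(kept)))
    return clean, start, end
```

With the dates stripped out, the cleaned url can be reused for every interval, with fresh date parameters added each time.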

Date interval? In the previous version of this script you could only plot points at yearly intervals, so it was impossible to zoom in and see what might be happening over the span of a single year or two. But amazing advances in QueryPic technology mean you can now plot changes by month. Here for example is a new version of my Great War/First World War graph, focused on 1938–1946 and plotted at monthly intervals.

So for each interval within the date range QueryPic fires off a request to Trove. From the response it scrapes out the total number of results for that date. If the total is greater than zero, it then fires off a second request to find the total number of newspaper articles for that year. Your query results divided by the total number of articles gives the proportion of articles for that date matching your search query.
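The arithmetic is simple enough to sketch without touching the network. In the snippet below the two totals are just passed in as numbers; the function names intervals and proportion are hypothetical, not the ones used in the script.

```python
def intervals(start_year, end_year, monthly=False):
    """Yield (year, month) pairs; month is None when plotting by year."""
    for year in range(start_year, end_year + 1):
        if monthly:
            for month in range(1, 13):
                yield year, month
        else:
            yield year, None


def proportion(matches, total_articles):
    """Share of all articles in an interval that match the query."""
    # No matches means no second request was needed, and nothing to divide.
    if matches == 0 or total_articles == 0:
        return 0.0
    return matches / total_articles
```

So a query that matches 5 of the 200 articles in an interval gets a proportion of 0.025, which is the value plotted when you switch the graph to the proportional view.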

The number of results and the proportion are written to a javascript file, together with some other important information including the original query and the date the harvest was performed. Remember, the Trove newspapers database is always changing! QueryPic then grabs a copy of its own special html template and inserts a reference to this javascript file. For good measure, it also inserts a link to your original query. The file is saved under a new name, ready for you to open and explore.
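To give a feel for what such a data file could contain, here's a small sketch that serialises a harvest as a javascript variable. The variable name querypic_data, the field names and the function make_data_js are all invented for this example; the files QueryPic really writes may be laid out quite differently.

```python
import datetime
import json


def make_data_js(name, query, series, harvested=None):
    """Return javascript source that assigns the harvested data to a variable.

    `series` is a list of (date_label, total, proportion) tuples.
    The variable name and object layout here are hypothetical.
    """
    harvested = harvested or datetime.date.today()
    data = {
        "name": name,
        "query": query,
        "harvested": harvested.isoformat(),
        "totals": [[label, total] for label, total, _ in series],
        "ratios": [[label, ratio] for label, _, ratio in series],
    }
    return "var querypic_data = %s;\n" % json.dumps(data, indent=2)
```

Because the result is plain javascript, the html template only needs a script tag pointing at the file for the charting code to pick the data up.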

The html file contains everything necessary to take your data and turn it into a graph. It does this using the HighCharts javascript library. Please note that while the licence conditions allow HighCharts to be redistributed as part of a non-commercial package, it is not free for commercial use. Check the HighCharts website for details.

Some examples

Plot ‘cat’ against ‘dog’ in a graph called ‘animals’:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -g "animals"
python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=dog" -g "animals"

Specify a directory for your results:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -d "/User/bill/Documents/graphs"

Plot results at monthly intervals:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat&fromyyyy=1920&toyyyy=1921" -m

Specify a name:

python do_totals.py "http://trove.nla.gov.au/newspaper/result?q=cat" -n "Felines"

Extracting editorials #2

As I explained in the first of this series, I’m documenting my efforts to extract every editorial published in the Sydney Morning Herald in 1913 from the Trove newspaper database. It’s an experiment both in text mining and historical writing — an attempt to put the method up front.

While I didn’t think there was anything very thrilling in the first instalment, recording my thoughts and assumptions in this way has already proved useful. In a comment, Owen Stephens noted that his attempt to reproduce my search query produced fewer results. After a little bit of poking around I realised that the fulltext modifier, which I often use to switch off fuzzy matching, counteracts the ‘search headings only’ flag. So my query was returning results that had the string ‘The Sydney Morning Herald’ anywhere in the article.

Try it for yourself.

Here’s my original query – searching for fulltext:“The Sydney Morning Herald” in headings only (supposedly). You’ll notice that it returns 335 results and it’s clear from a quick scan that a number are false positives (they don’t follow the pattern for editorials).

Here’s Owen’s query — searching for “The Sydney Morning Herald” in headings only. It returns 294 results, without any obvious false positives.

So my attempt to disable fuzzy matching actually produced a less accurate result! Weird.

Actually, I think one important benefit of this sort of text mining is that it helps you understand how the search engines you’re using actually work. Once you start poking and prodding, the idiosyncrasies start to emerge.

Anyway, I harvested Owen’s cleaner result set and opened up the resulting csv file. As it seemed in Trove, there were very few false positives. Indeed there were only two articles that didn’t seem to follow the standard editorial format, and these were notes added to the editorial page. On the other hand, there were obviously about 20 editorials missing. I could have manually worked through the csv file to identify the missing dates, but I thought I’d try to create some tools that would do the work for me.

What I wanted was the details of the first editorial in every edition of the newspaper in 1913 — so there should be one, and only one, article for each day on which the newspaper was published. I needed a tool that would analyse the csv file and do two things:

  • identify dates that occur multiple times (false positive alert!)
  • identify dates that are absent from the result set (missing in action!)

The resulting code is all on GitHub if you want to follow along. I wrote a Python script that opens up the csv file, extracts all the date strings, converts them to datetime objects and then saves them to a list. Once that’s done it’s pretty easy to loop through and find duplicates:

def find_duplicates(items):
    '''
    Check a list for duplicate values.
    Returns a list of the duplicates.
    '''
    seen = set()
    duplicates = []
    for item in items:
        # Anything already seen once is a duplicate.
        if item in seen:
            duplicates.append(item)
        seen.add(item)
    return duplicates
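Going back a step, the date-extraction stage could look something like the sketch below. The column name 'date' and the yyyy-mm-dd format are assumptions (the harvested csv may use different ones), and read_dates is a made-up name rather than the function in the GitHub repository.

```python
import csv
import datetime
import io


def read_dates(csv_file, column="date", fmt="%Y-%m-%d"):
    """Return a sorted list of datetime.date objects from one csv column."""
    reader = csv.DictReader(csv_file)
    dates = [
        datetime.datetime.strptime(row[column], fmt).date()
        for row in reader
        if row.get(column)  # skip rows with no date at all
    ]
    return sorted(dates)


# A tiny in-memory file standing in for the harvested results.
sample = io.StringIO(
    "date,title\n"
    "1913-01-02,The Sydney Morning Herald\n"
    "1913-01-01,The Sydney Morning Herald\n"
    "1913-01-02,The Sydney Morning Herald\n"
)
article_dates = read_dates(sample)
```

Sorting the list up front means the first and last entries can double as start and end dates later on.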

Finding missing dates was a little more complicated, but Google came to the rescue with some handy code samples. All I had to do was set a start and end date (in this case 1 January 1913 and 31 December 1913) and create a timedelta object equal to a day. Then it’s just a matter of adding the timedelta to the start date, comparing the new date to the dates extracted from the csv file, and continuing on until you hit the end. If the new date isn’t in the csv file, then it gets added to the missing list.

if year:
    start_date = datetime.date(year, 1, 1)
    end_date = datetime.date(year, 12, 31)
else:
    start_date = article_dates[0]
    end_date = article_dates[-1]
one_day = datetime.timedelta(days=1)
this_day = start_date
# Loop through each day in the specified period to see if there's an article.
# If not, add it to the missing_dates list.
while this_day <= end_date:
    if this_day.weekday() not in exclude:  # exclude Sunday
        if this_day not in article_dates:
            missing_dates.append(this_day)
    this_day += one_day

I’ve tried to make the code as reusable as possible, so you can either supply a year, or the script will read start and end dates from the csv file itself.

All that left me with two more lists of dates: ‘duplicates’ and ‘missing’. At first I just wrote these out to a text file, but then I decided it would be useful to write the results to an html page. That way I could add links that would take me to the actual issue within Trove, helping me to quickly find the missing editorial.

Unfortunately there’s no direct way to go from a date to an issue — you first need to find the issue identifier. How do you do this? If you dig around in the code beneath the page for each newspaper title, you’ll find that the ajax interface pulls in a json file with issue information. You can access this through a url like: http://trove.nla.gov.au/ndp/del/titlesOverDates/[year]/[month]. Here’s an example for January 1913.

The json includes all issues for all titles in the specified month. So you then have to loop through to find a specific title and day. Once you have the issue identifier you can just attach it to a url:

import json
from urllib.request import urlopen

def get_issue_url(date, title_id):
    '''
    Gets the issue url given a title and date.
    Returns None if no matching issue is found.
    '''
    year, month, day = date.timetuple()[:3]
    url = 'http://trove.nla.gov.au/ndp/del/titlesOverDates/%s/%02d' % (year, month)
    issues = json.load(urlopen(url))
    # The json lists every issue of every title for the month,
    # so look for the one matching this title and day.
    for issue in issues:
        if issue['t'] == title_id and int(issue['p']) == day:
            return 'http://trove.nla.gov.au/ndp/del/issue/%s' % issue['iss']
    return None

My results file with links to Trove

Finally, to save myself having to cut and paste the missing dates back into the csv file, I added a few lines to write them in automatically.

So now I have a handy little html page, complete with dates and links, that I’m working through to find all the missing editorials. All I need for the next stage are the urls for the editorial and the page on which it’s published. I’m just cutting and pasting these from the citation box in Trove into the csv file. Once this is done I can start trying to find all the editorials.

PS: I noted in my first post that one benefit in finding the editorials was that the main news articles usually appeared on the page after the editorials. I’ve been thinking some more about ways to identify ‘major’ news stories. Word length perhaps? But not always. Hmmm, but major stories do seem to be published at the top of the page. After a bit more poking around in the code I found that there’s a ‘y value’ assigned to each article that indicates its position on the page. So if I harvest all the articles on the page after the editorials and then rank them by their y values? Interesting…

Embedded archives

Some of you may have noticed that my Hacking a research project post featured a file from the National Archives of Australia embedded as a Cooliris widget. Huh? To jog your memory, here it is again:

These certificates allowed non-white Australians travelling overseas to re-enter the country. NAA: ST84/1, 1906/21-30

No, it’s not just an image, it’s a little 3D wall. You can pan and zoom to your heart’s content. You can enlarge an image, view fullscreen — you can even share an image via Twitter. Fun for all the family!

Regular viewers will recall my previous encounters with CoolIris — Archives in 3D and CoolIris enabled scrapbook — but these relied on having the CoolIris plugin installed. The embeddable Flash version wouldn’t work when the images were coming from the NAA because it upset Flash’s cross-domain settings.

So how did I get it to work? For various other projects I’ve been playing with simple image proxies using Python and Django, so I just applied the same principles. The image proxy makes it seem as if the images are coming from a local source, thus keeping Flash happy. Hurrah!

I’ve added a few little tweaks, so you can now view any digitised file in the National Archives of Australia in a CoolIris wall. Just go to the file browser page and enter a barcode. Even better, you can install a bookmarklet. Just drag this link to your bookmarks bar (or save as a favourite) – View on wall. Then go to an item page in RecordSearch and click on the bookmarklet for 3D magic.

If you want to share a link to a file displayed in the 3D file browser, just use a url of the form:

http://wraggelabs.com/recordsearch/wall/[barcode]

— where [barcode] is fairly obviously the barcode of the file you want to view. For example:

If you want to embed one of the mini-walls in your blog post it’s easy. Just go to the CoolIris Express site and create your own wall. When it asks you for content source, click on ‘Media RSS’ and then in the ‘Feed URL’ box put:

http://wraggelabs.com/recordsearch/rss/[barcode]

– where [barcode] is… well, you know…

I think this is a pretty interesting way to view, browse and navigate digitised files. Using Flash rather than a browser plugin makes it more accessible, but I’d still rather have something based on open software and standards. I think it won’t be too long before we see something similar using Canvas and Javascript. That’ll be really exciting.

Doing it yourself

I was doing some research using the National Archives of Australia’s RecordSearch database the other day and became frustrated that there is no way of seeing how many pages are in a digitised file without clicking on the ‘Display digital copy’ link. So I fixed it.

As a userscript it’s hardly worthy of a blog post. All it does is find out how many pages are in the file and insert the number in the link text. It’s very simple. But I think it’s also a useful illustration of the changing balance of power between archives and their users.

William E Landis argued that archivists were ‘guilty as a profession of fetishising the outputs of our descriptive systems’. The design of finding aids has often been determined not by the needs of users but by a desire to faithfully represent the underlying archival architecture. But now users don’t have to just take what they’re given.

Technologies such as Greasemonkey are useful for sketching out alternatives. For organisations with IT systems that inhibit experimentation, Greasemonkey (or Mozilla’s Jetpack) provides a way of playing with interfaces without touching any of the underlying code. My rewrite of the way RecordSearch displays digitised files is an example of this.

But no one interface is ever going to meet the needs of all archive users. Fortunately, there are a growing number of ways in which archives can work in partnership with their users to help them create the interfaces they want and need.

Archives are starting to expose their data directly using APIs and linked open data. This gives users the power to create whole new applications. But I still think there’ll be a place for the little tweak – a simple hack that meets some small but specific need. I can imagine communities of interest building and sharing a range of tools, hacks, applications and interfaces specifically tailored to their research habits.

So if you don’t like it, fix it.

Some archives hacking

It’s great to see that the National Archives of Australia has released a large swag of data through the new data.australia.gov.au site. In the Commonwealth Agencies zip file you can find xml dumps of all the publicly accessible agency and series data in RecordSearch, as well as item data for series A1. This is the same data that Mitchell Whitelaw visualised so brilliantly in his Visible Archive project. There’s also item data and images from series A3560 – the Mildenhall photographs of early Canberra.

What’s even more exciting is that people are already using this data. At the recent GovHack event in Canberra the What The Federal Government Does team worked on visualising the activities of government by using functions data pulled from the agencies file. Another group has generated a really nice tag cloud and photo gallery from the Mildenhall data. With further GovHack sessions to follow and the MashupAustralia contest open until 13 November, let’s hope for some more inspired archives hacking.

Seeing RecordSearch data out in the world like this reminded me of a little project I started a while back and then set aside. It was a simple PHP script that scraped data from RecordSearch and spat it out either as XML or JSON. Mitchell used a version of this script in his A1 Explorer in order to find out the number of pages in each digitised file.

I’ve now expanded and improved the script so that it provides data on items, series, agencies and persons. The output includes all the basic fields as well as links between entities – such as related series, controlling agencies etc. As an added bonus you also get some useful totals (where they’re available): items include the number of pages, series include the number of items described on RecordSearch, and agencies include the number of series recorded. I’ve also fiddled with mod_rewrite to provide a more rest-ful interface.

For XML output use the url http://discontents.com.au/shed/rs/xml/ followed by the appropriate identifier – a barcode for an item, a CA number for an agency, a CP number for a person or a series number.

Some examples:

As you might have guessed, to get JSON output you just substitute ‘json’ for ‘xml’ in the url.
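If you wanted to call the service from a script, building the urls is just string assembly. The sketch below follows the pattern described above; rs_url is a made-up helper name, and the identifier is passed through untouched because how the service expects unusual characters to be encoded isn't something this sketch can vouch for.

```python
BASE_URL = "http://discontents.com.au/shed/rs/"


def rs_url(identifier, fmt="xml"):
    """Build a url for an item, series, agency or person identifier."""
    if fmt not in ("xml", "json"):
        raise ValueError("fmt must be 'xml' or 'json'")
    return "%s%s/%s" % (BASE_URL, fmt, identifier)
```

So rs_url('A1') gives the XML url for series A1, and rs_url('A1', 'json') gives the JSON equivalent.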

Being dependent on screen scraping, it’s inherently a bit fragile, but I’m hoping it might be of some use. My intention was to use it to start exploring some new ways of using and interacting with the data. The code itself is available at BitBucket. It’s not very elegant, but I don’t want to spend much time cleaning it up at the moment. If it seems like it might be useful, I’ll probably rewrite the whole thing in python and publish it through Google’s AppEngine.

Playing with pipes

The ever-informative Twitter alerted me recently to the History Trust of South Australia’s object of the month. It made me think that it would be nice if there was some way of bringing together all those objects, photos and documents featured by our cultural institutions. Some sort of combined RSS feed perhaps?

Something like this…

Well, yes… I couldn’t resist having a go. My tool of choice for this was Yahoo Pipes which has various modules for manipulating and creating RSS feeds. Check out my script on the Yahoo Pipes site to create a badge like this, play some more or inspect its innards. If you’re feeling adventurous you can even clone the script and tinker away yourself – it’s the best way to learn.

Cooliris-enabled scrapbook

There’s more 3D goodness for you to enjoy now that the Mapping our Anzacs scrapbook is Cooliris-enabled. If you have Cooliris installed, you’ll notice that the Cooliris icon on your browser toolbar lights up when you visit the site. Just click on the icon to browse all the photos posted to the scrapbook on a glorious 3D wall.

Scrapbook posts in 3D

(If you don’t have Cooliris then go and get it. It can be used both in Internet Explorer and Firefox, though you’ll probably need to have admin rights to install for IE.)

Having given the 3D treatment to digitised files from the National Archives of Australia and portrait images from the Australian Dictionary of Biography, it wasn’t too hard to do. The scrapbook is a Tumblr site and the api makes it easy to extract all the photos. So I created a php file to gather all the details and then write them to a media-rss file. Then it was just a matter of inserting a link to it in the scrapbook.

ADB DIY RSS

So I was thinking, wouldn’t it be nice if the Australian Dictionary of Biography’s ‘born on this day’ feature could be made available as an RSS feed? Every morning you’d get a new list of biographies delivered direct to your feed reader. And so…

[sounds of xpath wrangling and PHP coding]

here it is.

It’s pretty simple – it harvests all the links of people born on the current day, then loops through the links to gather the first paragraph of each biography. Then it’s just a matter of writing everything to an RSS file.
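The original was written in PHP, but the last step (writing everything to an RSS file) is easy to sketch in Python with the standard library. This is only an illustration: the function name build_rss and the shape of the entries list are assumptions, and the harvesting of links and first paragraphs is left out altogether.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, entries):
    """Return an RSS 2.0 document as a string.

    `entries` is a list of (name, url, first_paragraph) tuples.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    # One item per biography, with the first paragraph as its description.
    for name, url, first_paragraph in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = name
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "description").text = first_paragraph
    return ET.tostring(rss, encoding="unicode")
```

ElementTree takes care of escaping characters such as ampersands, which is one less thing to get wrong when the text comes from scraped pages.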

MoA buttons galore

Mapping our Anzacs, in case you don’t know, provides a Google map interface to the 375,000+ WWI service records held by the National Archives of Australia. Amongst other things, you can add scrapbook posts to individual entries and create tributes. It’s meant to encourage exploration, so go on… explore!

If you do, you’ll notice that there are direct links into the National Archives’ database RecordSearch. However, there are currently no links going the other way. Why does this matter? Well perhaps you’d like to use NameSearch to find an individual record, but then add a scrapbook post in Mapping our Anzacs. Up until now you had to find them all over again. But not any more…

Introducing our new range of ‘View in Mapping our Anzacs’ buttons:

  • For the discerning Firefox devotee we have a Greasemonkey userscript which adds a button to the RecordSearch item details page.
  • For the fashion-challenged IE user we have a bookmarklet. Just right click on this link – View in Mapping our Anzacs – and save it as a favourite in your ‘Links’ folder (you may need to enable the ‘Links’ toolbar first by checking Tools > Toolbars > Links).

Yes, it’s true… you could use the Bookmarklet with Firefox (just drag it to your bookmarks toolbar), but Greasemonkey is so much more chic.

Once you’re fully button-enabled just head into RecordSearch, find an item in series B2455 (the WWI service records) and click! Hurrah! You will be instantly transported to Mapping our Anzacs.

You can test out your new button by heading here:

Archives in 3D

All dressed up – RecordSearch has a new look

The new version of my Greasemonkey userscript, RecordSearch Image Tools, gives RecordSearch’s digital image pages a rather new look. My previous version had done away with the tired ol’ ‘lemon-chiffon’ background colour, but I decided it was time to get a bit more adventurous, so I blitzed the old design and rebuilt the page from the beginning.

As you can see from the screenshot, I’ve tried to give the images as much of the screen as possible. I’ve also created a consistent set of navigation buttons, and improved the functionality in various ways.

RecordSearch tools broken!?

BREAKING NEWS (2.00pm, Monday, 8 December): RecordSearch seems to be back on the old subdomain, so now the userscript fix is not working! To be safe, I’ve updated the userscript again so that it will work on both the old and new subdomains. I’ll do the same with the Zotero translator, though for the time being it should be working. If you updated the userscript in the last few hours, you’d better do it again – sorry…