One of the advantages of building something yourself is that if you’re not happy with it you can tweak, change, modify and adapt until you are. But one of the disadvantages is that sometimes you get so caught up in all the tweaking, changing and adapting that you overlook a much simpler solution.
So I had a harvester that could save the publication details and content of all the newspaper articles in a search on Trove. But the warm glow of self-satisfaction quickly began to fade as I started to think about how I wanted to use the content I was harvesting.
The harvester saved the text of articles organised in directories by newspaper title. This seemed to make sense. It meant that you could easily analyse and compare the content of different newspapers. But what if you wanted to examine changes over time? In that case it’d be much easier if the articles were organised by year — then I could just pull out the a folder from a particular year, feed it to VoyeurTools, and start tracking the trends.
There ensued some minor tinkering. As a result, you can now you can pass an additional option to the harvest script, telling it whether to save the article texts and pdfs in directories by year or newspaper. Simply set the ‘zip-directory-structure’ option in harvest.ini to either ‘title’ or ‘year’. If you’re using the command-line you can use the ‘-d’ flag to set your preference. Easy.
But that set me wondering whether it might be possible to generate an overview, showing the number of articles matching a search over time. So I started on a modification of my harvest script that did just that — cycling through the search results, adding up the numbers. It wasn’t until I ran the new script for the first time that I realised there was a much simpler alternative.
All I needed to do was repeat the search for each year in the search span and grab the total results value from the page. D’uh…
So instead of sending hundreds or perhaps thousands of requests to Trove, all I needed was one for each year. From there it was easy and soon I had my first graph.
I was pretty pleased with that, but of course the raw numbers of articles on their own are rather misleading. The more interesting question was what proportion of the total number of articles for that year the search represents. Another quick tweak and I was grabbing the overall totals and calculating the proportions.
At this point I invited my Twitter followers to suggest some possible topics — you can see the results on Flickr.
But what do the peaks and troughs represent? I wanted to use the graphs as a way of exploring the content itself. This was possible as I’d saved the data as JSON and used jqPlot to create the graphs in an ordinary HTML page. Courtesy of some clever hooks in the backend of jqPlot I could capture the value of any point as it was clicked. That gave me the year, so all I had to do was combine this with the search keyword values and send off a request to Trove.
So now instead of just looking at the graphs, you could explore them.
Perhaps you’re wondering how I managed to pull the Trove results into the page? Just a bit of simple AJAX magic combined with my own unofficial Trove API. (More about that in the next exciting installment!)
I’ve created a little gallery of graphs to explore. I’m still open to suggestions!
The code for gathering the data is all on Bitbucket, so start building your own. Just run the ‘do_totals.py’ script in the bin directory from the command line. The script takes two flags:
- -q (–query) the url of your Trove search (compulsory)
- -f (–filename) the path and filename for your data file (don’t include an extension)
Of course it would be really nice to create a web service where people could create, share, compare and combine their graphs — but that might have to await a generous benefactor…