Mining the treasures of Trove (part 1)

Some time ago a well-meaning optometrist told me I had the eyes of a 60-year-old. I lay the blame for this premature ocular degeneration upon the many tiring hours I spent squinting at the screens of dodgy microfilm readers. Newspapers were a major source of my PhD research, and back then that meant learning a little too much about films, spools and lenses. Not to mention the unending struggle to capture and hold the best machines.

Now it’s different. Instead of spending some weeks, as I did, sampling the Australian Women’s Weekly in the hope of finding relevant articles, I can go to the National Library’s Australian Newspapers database in Trove and do a keyword search. Easy. The eyesight of future historians is safe.

But ready access to millions of newspaper articles across 150 years brings new challenges. Used to coaxing evidence from a meager array of sources, historians now, as Dan Cohen notes, have to ‘grapple with abundance’. How do we use and understand our new documentary riches?

Fortunately there is a growing array of tools to help. Zotero helps us manage our sources. Voyeur Tools brings sophisticated text analysis techniques within the grasp of all. And where the tools we need do not exist, we can make them.

So that’s what I did.

I’m interested in the way we talk about the weather. Wouldn’t it be good, I was thinking a few weeks back, if I could harvest the content of newspaper articles about weather or climate and start to analyse it — looking for patterns and shifts, mapping correlations or divergences against the actual climatic record.

Well… why not?

There’s currently no API that allows you to access Trove in this way, though one is apparently under development. But being the impatient sod that I am, I’d already built most of the parts I needed. Some time ago I wrote a screen scraper to query the newspaper database and extract article details from the returned HTML. This scraper is used in the History Wall to display random newspaper articles. It’s also sitting behind my little Headline Roulette game.

I’ve been continuing to improve and refine the scraper and recently used it to create my own API to Trove hosted on Google’s AppEngine. I’ll be posting some more details about this soon. And yes, I also developed a Zotero translator for the newspapers site, which I promise to finish off!

So to make a harvester, all I needed to do was run my scraper over all the results in a search and save them in some useful form. It took me less than an hour to develop a working prototype. Since then I’ve been adding a few bells and whistles…

The other night I harvested about 1400 articles that included the phrase ‘climate change’. The harvester had saved the text content of the articles in a zip file, so I uploaded it to Voyeur Tools. Here’s a simple word cloud:

[Embedded Voyeur Tools word cloud of the ‘climate change’ corpus]

Or what about the ‘atomic age’ seen through major Australian newspapers in 1945-46?

[Embedded Voyeur Tools word cloud of the ‘atomic age’ corpus]

Hmmm, this is fun…

It’s harvest time

But all you really want to know is how to do it, right? So after that overlong introduction, here’s everything you need to know.

What does the harvester do?

You feed the harvester the url of a search you’ve constructed in Trove. The harvester then loops through all the results pages, extracting the article details. These details are saved in a CSV (comma separated values) file that you should be able to open as a spreadsheet or import into a database (there’s a short example of reading it with Python after the list). The fields in the CSV file are currently:

  • article id
  • article title
  • article url
  • newspaper
  • newspaper life dates and location
  • newspaper id
  • issue date
  • page reference
  • page url
  • number of user corrections to the OCR output
  • text of the article (including paragraph breaks)
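If you’d rather read the CSV file with a script than a spreadsheet, Python’s built-in csv module is all you need. Here’s a minimal sketch; it assumes the file is called trove-newspapers.csv and that the columns appear in the order listed above, so check both against your own harvest:

import csv

# A minimal sketch: print the date, newspaper and title of each harvested article.
# Column positions follow the field order listed above; check them against your
# own file, and skip the first row if it turns out to be a header.
with open('trove-newspapers.csv', 'rb') as csv_file:
    for row in csv.reader(csv_file):
        article_title, newspaper, issue_date = row[1], row[3], row[6]
        print '%s | %s | %s' % (issue_date, newspaper, article_title)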

In addition to the CSV file, the harvester can create two other data files for you. The first is a zip file that contains the text of all the articles, organised by newspaper and article. The internal structure of the zip is something like this:

[newspaper1 id]-[newspaper1 title]/
     [article1 id]-[article1 issue date]-p[article1 page reference].txt
     [article2 id]-[article2 issue date]-p[article2 page reference].txt
[newspaper2 id]-[newspaper2 title]/
     [article3 id]-[article3 issue date]-p[article3 page reference].txt
     [article4 id]-[article4 issue date]-p[article4 page reference].txt

For example:

35-The-Sydney-Morning-Herald/
     29765619-Friday-24-May-1946-p5.txt
     29763575-Friday-10-May-1946-p2.txt

Why is this useful? Once you have the texts organised in this format you can start feeding them to text-analysis programs. Voyeur Tools makes it easy, but there are other options like Mallet or NLTK. (Read about my first attempts at using NLTK over at NMA Labs.) As well as simple word frequencies and collocations you might want to investigate the possibilities of entity extraction, topic modelling or sentiment analysis.
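As a taste of what that might look like in Python, here’s a rough sketch that reads the article texts straight out of the zip file and tallies simple word frequencies using the standard library. The filename trove-texts.zip is just an assumption (yours will be named after your CSV file), and you’d want to swap in NLTK’s tokenisers and stopword lists before doing anything serious:

import re
import zipfile
from collections import Counter

# A rough sketch: tally word frequencies across all the harvested article texts.
# Assumes the zip of texts is called trove-texts.zip.
counts = Counter()
with zipfile.ZipFile('trove-texts.zip') as texts:
    for name in texts.namelist():
        if name.endswith('.txt'):
            words = re.findall(r'[a-z]+', texts.read(name).lower())
            counts.update(words)

# Print the 20 most common words.
for word, freq in counts.most_common(20):
    print word, freq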

The harvester also gives you the option of downloading a PDF version of every article. These are also saved in a zip file for convenience, with the same structure as above. Of course, if you’re harvesting a large number of articles this zip file might get very big.

Quick start (for the impatient)

If you just want to dive straight in here’s what you need to do:

  1. If you don’t have it already, install Python (v.2.7 is recommended, but other versions should work ok)
  2. Download my Trove Newspapers code
  3. Unzip the TroveNewspapers file and put the contents somewhere handy.
  4. Navigate to the trovenewspapers/bin directory.
  5. Open the file ‘harvest.ini’ with a text editor (Notepad will do).
  6. Change the default settings of ‘harvest.ini’ as instructed.
  7. Save ‘harvest.ini’.
  8. Find and run ‘do_harvest.py’.
  9. Sit back and watch as your harvest chugs away.

The boring details (for the cautious)

Install Python

My scraper and harvester are written in Python, so you’ll need to have it installed on your system. If you’re working in Linux you should already have it. Downloads are available for all other platforms. For example, if you’re in Windows just download and run the Windows x86 MSI Installer.

Download my code

Now download my Trove Newspapers code. It’s available as a zip file, so unzip it and put the contents somewhere you can find it again. The contents look something like this:

trovenewspapers/
     bin/
     data/
     harvests/
     __init__.py
     harvest.py
     retrieve.py
     utilities.py
     LICENSE.txt

We’ll talk about the bin directory shortly. The data directory contains information about the newspaper holdings available through Trove. This is used by the scraper. If you ever want to update this data, you can use the save_titles function in utilities.py, but it’s not important for the harvester.

The harvests directory is empty, but if you start a harvest without specifying an output location, this is where it’ll end up.

BeautifulSoup is an extremely useful Python library for screen scraping. The scraper relies heavily on it, so I’ve included a copy in the package for your convenience.

The two files that do all the work are harvest.py and retrieve.py. Open them up in an editor and have a look if you’re interested. The scraper logic is all in retrieve.py, while harvest.py builds and runs the harvester.
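If you haven’t met BeautifulSoup before, here’s the general flavour of what screen scraping involves. This is not the code in retrieve.py; the page address, element name and class below are invented purely for illustration, and the import assumes the bundled BeautifulSoup 3 copy is on your path:

import urllib2
from BeautifulSoup import BeautifulSoup

# Purely illustrative: fetch a page of HTML and pull out some headings.
# The url, element name and class here are made up, not the real structure
# of Trove's result pages.
html = urllib2.urlopen('http://example.com/results').read()
soup = BeautifulSoup(html)
for heading in soup.findAll('h4', {'class': 'title'}):
    print heading.string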

But if all you want to do is start harvesting you can ignore all this and head straight to the bin directory. Here you’ll find two files, do_harvest.py and harvest.ini. You’ll also find a README file which contains another version of these instructions and some added documentation.

Set your harvest options

Open up harvest.ini in any old text editor. You’ll see it contains some instructions and a series of configuration options with default values. If you run a harvest without changing the defaults you’ll generate a fascinating set of 28 articles that contain the phrase ‘Inclement Wragge’.

The options you can set are:

  1. query — the url of your Trove search
  2. filename — where you want to save the CSV file
  3. include-text — do you want to save the texts in a zip file (yes or no)?
  4. include-pdf — do you want to save pdfs of the articles in a zip file (yes or no)?
  5. start — the result number to start at (leave at 0 for a new harvest)

At this point you need to think — what do I actually want to harvest? Head over to the advanced search page for the newspapers database and start playing with the options until you get the results you want. Try to be as precise as possible — you don’t want to download lots of irrelevant articles.

Once you’re happy with your search, just copy the url in your browser’s location box. This url contains all the search parameters the harvester needs to find and process your results. Just paste the complete url into harvest.ini next to the ‘query’ option.

Set the filename option to tell the harvester where to save your CSV file. The harvester will use the filename you supply to build the filenames for the zip files (if you want them). If you don’t include a path the files will be saved in the bin directory. If you don’t set a filename, the harvester will create a default name — trove-newspapers-[timestamp].csv.

The ‘include-text’ and ‘include-pdf’ options should be pretty obvious. Set them to ‘yes’ if you want to save texts and pdfs, or ‘no’ if you don’t.

The ‘start’ option allows you to start your harvest somewhere other than the beginning of your results set. This is useful if your harvest is interrupted for any reason (more on this below). Just set it to the result number you want to start at.
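Putting that together, a filled-in harvest.ini might look something like the sketch below. The section header and exact layout here are assumptions on my part, so treat the instructions and option names in the copy of harvest.ini that comes with the code as the final word:

[settings]
query = http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge
filename = /home/wragge/harvests/wragge-harvest.csv
include-text = yes
include-pdf = no
start = 0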

Once your options are set, just save harvest.ini. It’s launch time!

Start your harvest

Remember the do_harvest.py file? It contains a little script that reads your configuration settings in harvest.ini and sends them off to the harvester. So to get things going all you need to do is run do_harvest.py.

How you actually do this depends a bit on your operating system and its settings. If you’re on Windows, then the Python installation program should have told the OS to treat any file with a .py extension as a Python script. So you should just be able to double-click it.

On Linux, the easiest way is to open up a terminal, cd to the bin directory and type ‘python do_harvest.py’.

That should be it. The script will let you know what’s going on, listing the articles as it processes them. Enjoy!

For lovers of the command line

The do_harvest script can also be run from the command line, with the various options supplied as arguments:

  • -q (or --query) [full url of Trove newspapers search].
  • -f (or --filename) [file and path name for the CSV output].
  • -t (or --text) Create a zip file containing the text of articles.
  • -p (or --pdf) Create a zip file containing pdfs of articles.
  • -s (or --start) The result number to start at.

For example:

python do_harvest.py -q http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge -f /home/wragge/trove-output.csv -t -p

Command line arguments will override any of the settings in harvest.ini.

If you’re using Windows you’ll need to make sure that the location of your Python installation is included in your Windows path variable.

If something goes wrong

If there are problems at the Trove end, the harvester will take a little 10 second nap before trying again. It’ll do this 10 times before it finally gives up. Just before it dies, the script will write some details out to an error file ([your filename]_error.txt), including some instructions on what to do next.

This error file will include the number of the last completed record. Simply insert this as the ‘start’ value in harvest.ini (or include on the command line with the -s flag) and run do_harvest.py again. The harvester will spring back into life.
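For example, if the error file reported that the last completed record was number 520 (an invented number, just for illustration), you could pick the harvest up again from the command line like this:

python do_harvest.py -q http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge -f /home/wragge/trove-output.csv -t -p -s 520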

You’ll probably have a duplicate row in your CSV file at the point where the harvest failed, but that’s easy to delete.

What’s next?

Please have a go and let me know how you fare. You can add comments here, or raise issues over at my Bitbucket repository.

I’m thinking about building a little GUI version if there’s enough interest, and I have a few other improvements in mind.

I’ll be posting more about my adventures hacking Trove, and also about my efforts to analyse the results of my harvests (hence the part 1).

This work is licensed under a Creative Commons Attribution 4.0 International License.

Written by Tim Sherratt

I'm a historian and hacker who researches the possibilities and politics of digital cultural collections.

11 Comments

  1. […] results of applying his tools.  If you want to do this yourself check out Wragge’s posts, Mining the Treasures of Trove (Part 1) and (Part 2). Firstly let’s look at Wragge’s graph of a topic that I have been writing […]

  2. Oliver
    June 16, 2011

    I’m desperate to try this out but cannot for the life of me figure it out (computer illiterate). Tell me, does your program work with macs? Nothing happens after I double-click the file and you list instructions (that I could see) for Windows and Linux only.

    Any chance of a quick hand in this?

    Thanks in advance.

    • June 17, 2011

      Oliver,

      I don’t have access to a Mac so it’s hard for me to test, but I have had reports from people who have got it working on Macs. I’d suggest you try running the script from a terminal:

      • Open terminal
      • Move to directory containing script – cd trovenewspapers/bin
      • Run script – python do_harvest.py

      If that gives you any errors post them here and I’ll try and work out what’s going wrong.

  3. […] year. It’s hard to believe that it was only February when I first posted about my experiments mining the contents of the Trove newspaper database. Since then I’ve developed a set of digital tools, organised […]

  4. Jonathan O'Donnell
    February 9, 2012

Oliver, if you are still trying to do this, I can help you out. I’m on a Mac, and have successfully downloaded articles with PDFs.

    It is fantastic!

  5. Tony Brady
    August 10, 2012

    Hi Tim,

    Love your work. The Trove Tools make searching Newspapers for data supporting archival documents a breeze. My PhD will be all the better for your assistance and I will be sure to mention you in my introduction.

My question relates to the Excel file produced in a search. I would like to use this file to bulk import references into EndNote XI (I know, an old version, but with the trouble changing versions can cause I am willing to stick with it for a few more months).

    Let me know if you have any advice on my question.

    Keep up the excellent work.

    Tony
