Extracting editorials #3

By my own criteria I’ve already failed… I started this series of posts with the intention of documenting the process of finding and extracting editorials as I was actually doing the work. But here I am about to describe some work I finished a few weeks back. Oh well…

In my previous instalments (here and here), I focused on the Sydney Morning Herald. Having continued the hunt for missing editorials I started in the last post, I’ve now got a CSV file with the URLs of the first editorial published in every edition of the SMH from 1913. Good-o, I thought, I can now start harvesting and analysing some content.

But then ensued a crisis of faith. The whole point of this exercise was to be able to build up some comparisons: between newspapers, between states, between the city and the bush. But the process of actually finding the editorials seemed beset with difficulties. Could the rules I developed for the SMH be applied elsewhere? Could I ever assemble a useful set of editorials without large amounts of human intervention? I decided to try a few quick experiments to see whether the whole project was worth pursuing.

I started with a few assumptions:

  1. The first (and only the first) editorial in any issue is headed with the name of the newspaper.
  2. Editorials are published on even-numbered pages.
  3. Editorials vary in length between about 100 and 1500 words.

These assumptions were based on my own experience as a long-time newspaper researcher and on some preliminary poking around. For example, when I looked at The Argus I noticed that editorials were typically followed by news summaries. Unfortunately, these are treated as a single article in Trove, resulting in large blocks of text that are only part editorial. By specifying an upper word limit I hoped to filter these sorts of articles out. Similarly, there are sometimes brief announcements or publication details headed with the name of the newspaper. The lower word limit was intended to exclude these.
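As a rough illustration, the page and length assumptions could be turned into a simple test along these lines. This is only a sketch, not the actual script: it assumes each harvested article is a dict with hypothetical `page` and `words` fields, and the category names are made up for the example.

```python
# Word limits taken from assumption 3 (treated here as inclusive bounds).
MIN_WORDS = 100
MAX_WORDS = 1500


def classify(article):
    """Return the first assumption an article fails, or 'candidate'.

    `article` is assumed to be a dict with integer `page` and `words`
    fields; these field names are hypothetical.
    """
    if article["page"] % 2 != 0:
        return "odd_page"   # assumption 2: editorials sit on even pages
    if article["words"] < MIN_WORDS:
        return "too_short"  # probably an announcement or publication details
    if article["words"] > MAX_WORDS:
        return "too_long"   # probably an editorial merged with news summaries
    return "candidate"
```

For example, `classify({"page": 6, "words": 800})` returns `"candidate"`, while the same article on page 7 would be set aside as `"odd_page"`.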

The next step was to harvest every article from 1913 that was headed with the name of its publication. I created a script to generate a list of all the newspapers that published issues in 1913. Then I called my existing harvester to download all the matching articles and save the details to a series of CSV files — one CSV file per newspaper.

In the previous instalment of this series I created a script to check the CSV output of my harvester for missing or duplicate dates. I extended this to perform a series of tests on each article based on the assumptions above. First, I filtered out articles on odd-numbered pages, then articles that were too short or too long. Finally I checked the remainder for missing or duplicate issue dates.
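The final check for missing or duplicate issue dates might look something like the sketch below. It assumes you already have a list of the dates on which the newspaper actually published an issue (the `issue_dates` argument is hypothetical, as is the function name), since not every paper appeared every day.

```python
from collections import Counter
from datetime import date


def check_dates(candidate_dates, issue_dates):
    """Compare the dates of candidate editorials with known issue dates.

    Returns (missing, duplicates): issue dates with no candidate at all,
    and dates that have more than one candidate.
    """
    counts = Counter(candidate_dates)
    # Counter returns 0 for dates it has never seen.
    missing = sorted(d for d in set(issue_dates) if counts[d] == 0)
    duplicates = sorted(d for d, n in counts.items() if n > 1)
    return missing, duplicates


# Small made-up example: three issues, one with no candidate, one with two.
issues = [date(1913, 1, 1), date(1913, 1, 2), date(1913, 1, 3)]
candidates = [date(1913, 1, 1), date(1913, 1, 1), date(1913, 1, 3)]
missing, duplicates = check_dates(candidates, issues)
```

A missing date suggests the filters were too strict (or the heading was mangled by OCR), while a duplicate suggests something other than the editorial slipped through.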

The details of the articles in each category were written out to JSON files. Using these files and a bit of jQuery magic I could quickly build a simple web interface that allowed me to explore the results.
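Writing the categories out could be as simple as the following sketch. It assumes each article dict already carries a hypothetical `category` field, and the one-file-per-category layout is an illustration rather than a description of the real output.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_category(articles):
    """Group article dicts by their (hypothetical) `category` field."""
    grouped = defaultdict(list)
    for article in articles:
        grouped[article["category"]].append(article)
    return dict(grouped)


def write_category_files(grouped, out_dir):
    """Write one JSON file per category and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for category, articles in grouped.items():
        path = out_dir / f"{category}.json"
        path.write_text(json.dumps(articles, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

Plain JSON files like these are easy for a browser-based interface to load without any server-side code.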

Summary details of each newspaper

You can browse the summary results for the full list of newspapers, or you can drill down to view the actual articles assigned to each category.

Full details

I’ll save the full analysis for the next post, but if you play around with the results you quickly notice a few things. First, letters to the editor often include the name of the newspaper! If you look at The Mercury, for example, you’ll notice I’ve identified 1057 potential editorials — most of which are letters. Fortunately they should be fairly easy to filter out. In most cases the ‘even numbers only’ assumption worked pretty well, and the word length filters did remove quite a lot of false positives. There are still plenty of problems, but I’m encouraged enough to continue. Yes, there will be a Part #4!


This work is licensed under a Creative Commons Attribution 4.0 International License.

Written by: Tim Sherratt

I'm a historian and hacker who researches the possibilities and politics of digital cultural collections.
