Extracting editorials #1

In their chapter in Writing History in the Digital Age, Trevor Owens and Fred Gibbs encourage historians to write about the ways they work with data — to document their methods, their working assumptions, their dead ends and their discoveries. It’s an important argument and one that makes me wonder again about forms of publication that might integrate narrative, methods and sources.

In the meantime though we have blogs. My problem is that I’m easily bored so by the time I get to the end of a project or experiment I’m already thinking about the next one. Going back and trying to write things up seems a bit of a chore (which is why I’m always way behind in my blog writing). Also leaving the writing to the end means that I tend to take shortcuts — leaving out some of the ‘boring’ procedural stuff or the ‘stupid’ ideas that just didn’t work.

But Trevor and Fred’s chapter has made me think I should be a bit more diligent, so as I start a new series of text-mining experiments I’ve decided to write things up as I’m doing them. So be warned, this could get messy…

So what do I want to do? You might not be surprised to learn that it’s another Trove newspaper database experiment. I want to see if I can harvest newspaper editorials over a certain period and then analyse these to build up a picture of what issues, events or ideas were perceived as important. As I’m currently looking at ways of harvesting digital sources relating to 1913 for an exhibition being developed by the National Museum of Australia, I’m going to start by focusing on 1913.

But editorials are opinion pieces. Wouldn’t it be better to harvest ‘news’ articles?

First of all, I’m thinking that editorials will be fairly easy to identify and extract — there’s no real way in Trove to separate out current news from other sorts of articles. Secondly, I’m assuming that the issues that make it into editorials have some importance attached to them. Attached by whom, you may well ask — whose voice is being represented in the editorial? This is an important question and I’m thinking that it could be explored in interesting ways by harvesting editorials from a range of papers and regions. Thirdly, finding the editorials might actually help me find the major news articles, simply because in this period the main news stories were often on the page after the editorials.

So how do I find them? Looking at the Sydney Morning Herald for 1913, you can see that the editorials follow a regular pattern:

  • the first editorial is always headed with the name of the paper and the date, followed by the title
  • subsequent editorials have a title but no subtitle (most other types of articles have a subtitle)
  • editorials are published on an even-numbered page, usually about half way through the newspaper

To check this I conducted a search for articles including ‘The Sydney Morning Herald’ in their title. The search returns 335 results. Of course we’d expect there to be about 312 (6 x 52), but it looks like there are quite a few false positives and some days missing altogether (presumably due to OCR errors). You can see there’s a fair bit of consistency in the pages that editorials appear on, but it isn’t quite consistent enough to rely on. So I’ve decided that as a first step I’ll harvest all the articles from this query. I’ll then do some manual cleaning to remove the articles that aren’t editorials and try to identify and retrieve the missing days.
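Finding the missing days is easy enough to sketch in code. This is only a rough illustration, assuming the paper came out every Monday to Saturday and that each harvested article carries a date; the function names are made up for the example. A calendar count for 1913 actually gives 313 Monday-to-Saturday dates, one more than the 6 x 52 estimate, and any public holidays without an issue would lower it again.

```python
from datetime import date, timedelta


def expected_issue_dates(year):
    """Return every Monday-to-Saturday date in the given year.

    Assumption: the paper published six days a week, never on Sundays.
    """
    day = date(year, 1, 1)
    dates = []
    while day.year == year:
        if day.weekday() != 6:  # weekday() == 6 means Sunday
            dates.append(day)
        day += timedelta(days=1)
    return dates


def missing_issue_dates(harvested_dates, year):
    """Return the expected issue dates that have no harvested article."""
    seen = set(harvested_dates)
    return [d for d in expected_issue_dates(year) if d not in seen]


# Invented sample: pretend only these two days turned up in the harvest.
harvested = [date(1913, 1, 1), date(1913, 1, 3)]
missing = missing_issue_dates(harvested, 1913)
print(len(expected_issue_dates(1913)), "expected,", len(missing), "missing")
```

The list of missing dates could then drive a second round of searching, one day at a time.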

Remember, this won’t give me all the editorials, only the first editorial from each day. To get all the editorials, I’ll have to write a new script that will take this first result set, retrieve all the articles from the editorial page and then try to work out which of the articles are editorials — they should be the ones that come after the first editorial and have no subtitle. Or that’s the theory.
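The selection rule could look something like the sketch below. It’s an untested idea rather than a working script: it assumes the articles on a page arrive in page order as dictionaries with `heading` and `subtitle` fields, and those field names are hypothetical, not anything Trove promises.

```python
def pick_editorials(page_articles, paper_name):
    """Return the articles on one page that look like editorials.

    page_articles: list of dicts in page order, each with a 'heading'
    and an optional 'subtitle' key (hypothetical field names).
    paper_name: the masthead text that heads the first editorial.
    """
    editorials = []
    found_first = False
    for article in page_articles:
        heading = article.get("heading", "")
        if not found_first:
            # The first editorial is headed with the name of the paper.
            if paper_name.lower() in heading.lower():
                found_first = True
                editorials.append(article)
            continue
        # After the first editorial, keep only articles with no subtitle.
        if not article.get("subtitle"):
            editorials.append(article)
    return editorials


# Invented sample page, just to show the shape of the data.
page = [
    {"heading": "SHIPPING NOTES", "subtitle": "ARRIVALS"},
    {"heading": "THE SYDNEY MORNING HERALD. EXAMPLE DATE. EXAMPLE TITLE"},
    {"heading": "SECOND EXAMPLE TITLE", "subtitle": ""},
    {"heading": "EXAMPLE NEWS ITEM", "subtitle": "EXAMPLE SUBTITLE"},
]
for article in pick_editorials(page, "The Sydney Morning Herald"):
    print(article["heading"])
```

Anything without a subtitle that isn’t an editorial (a notice or an advertisement, say) would slip through, so the results would still need checking by eye.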

I’ve harvested the query. You can view the spreadsheet on Google Docs if you feel so moved.

[After I wrote the sentence above I checked the CSV file properly and realised I’d stuffed up. There’s a bit of a bug in my harvester that means if the query string you use includes a start value, the harvester will retrieve the same page of results over and over again… I really need to fix that. 🙂 I’m now running it again. You wanted warts and all, right?]

[After I wrote the paragraph above I checked my new harvest and realised I’d stuffed up again. There were only half as many results as there should have been! So I poked around and realised some recent changes I’d made to the harvest script meant I was only getting odd-numbered results (I was incrementing the row value twice). A lesson in what happens when you do this stuff late at night… Trying again.]

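Both stuff-ups come down to how the paging offset is handled. The sketch below is not the harvester’s actual code, just one way to avoid both problems: strip any start value already in the query, then step the offset in exactly one place. The parameter name `start` and the example URL are assumptions for illustration.

```python
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl


def page_urls(query_url, total_results, page_size, start_param="start"):
    """Build one URL per page of results.

    start_param is a hypothetical name for the offset parameter.
    """
    parts = urlparse(query_url)
    # Drop any start value already in the query, so a pasted-in URL
    # can't pin every request to the same page of results.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != start_param
    ]
    urls = []
    # range() is the only place the offset advances, so it can't be
    # incremented twice by accident.
    for start in range(0, total_results, page_size):
        query = urlencode(params + [(start_param, str(start))])
        urls.append(urlunparse(parts._replace(query=query)))
    return urls


# Example with a made-up search URL that already carries a start value.
for url in page_urls("http://example.org/search?q=herald&start=40", 45, 20):
    print(url)
```

A quick sanity check after harvesting, comparing the number of rows saved against the total the search reports, would have caught both bugs straight away.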
I’m not sure when I’ll have time to do the cleaning. But hey folks this is what research is like for people like me who have to try and fit it in around the edges of their lives. You can expect posts to come in sudden bursts and then dry up altogether for long periods as other priorities intrude.

 

This work is licensed under a Creative Commons Attribution 4.0 International License.

Written by: Tim Sherratt

I'm a historian and hacker who researches the possibilities and politics of digital cultural collections.

6 Comments

  1. November 22, 2011

    I hope you have time to keep up blogging your process as well as your results – always interesting.

    I don’t know if any of this is helpful but I had some questions/thoughts as I read the above:

    A) I’ve tried to recreate your search but get different results depending on the approach I take – and I’ve not yet managed to recreate your exact search!

    For example I note that this search http://bit.ly/tphXME gets 294 hits (fewer than yours), but includes for example the editorial from the 2nd January which your search omits for some reason.

    I also found variance depending on whether I used the date range in the advanced search or specified the date via the faceted filtering.

    B) Length of article looks like it might be another possible filter criterion – it seems on relatively brief inspection that editorials are longer than other articles

    C) How do you tell an article is actually an Editorial if reading the paper? It seems impossible to actually tell systematically? (is it just a matter of ‘guessing’ when you read it?)

    D) I’m not familiar with how editorials would be structured, but I notice that there can be multiple ‘sections’ within an Editorial column with a single heading. For example this editorial http://trove.nla.gov.au/ndp/del/article/15388548 is headed ‘Traffic Growth’, but that is actually only the first part of the article. There is then a double rule, followed by (I think) 11 ‘sketches’ – short paras on different topics which are headed only by a short title followed by an em-dash (I think – could be en-dash?). Not sure how much this matters once you get into the textual analysis stage – but it seems a shame these aren’t separated out into articles.

    Just dumping some thoughts really – sorry if not helpful!

    • November 22, 2011

      Thanks Owen, that’s very useful. In fact you’ve alerted me to another oddity in the way Trove’s search works. The difference between your query and mine is that I use the ‘fulltext’ modifier on the phrase. Fulltext switches off fuzzy matching (stemming etc) — I’ve gotten into the habit of using it because usually I’m looking for exact strings. However, it seems that the fulltext modifier overrides the ‘Search headings only’ option in the advanced search and so looks for matches in the whole article (perhaps that’s why it’s called ‘fulltext’???). This behaviour doesn’t seem to be documented. I wonder if that means there’s no way of searching for exact phrases in just the headings? I think I’ll post something to the Trove forums and see if I can get clarification. Anyway, that explains the false positives I was getting! Your search is cleaner, so I’ll start again and harvest that.

  2. November 22, 2011

    Glad it was helpful 🙂

    If you can’t do an exact phrase search in just the headings, you could always add a quick check into the harvester (or whatever post-processing after the harvester has run) to do a check for the exact phrase using a regular expression – I’ve had to resort to this in the past when dealing with imprecise search interfaces.
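A minimal sketch of that kind of post-harvest check, assuming each harvested article has its heading available as a plain string (the function name is made up for the example):

```python
import re


def heading_has_phrase(heading, phrase):
    """Return True if the heading contains the exact phrase.

    Case is ignored, and any run of whitespace between words is
    allowed, since OCR output often has stray spaces or line breaks.
    """
    words = [re.escape(word) for word in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return re.search(pattern, heading, flags=re.IGNORECASE) is not None


# Invented headings, just to show the check in use.
headings = [
    "THE SYDNEY  MORNING HERALD. EXAMPLE DATE.",
    "SYDNEY MORNING HERALD FUND",
]
kept = [h for h in headings if heading_has_phrase(h, "The Sydney Morning Herald")]
print(kept)
```

This only catches headings where the OCR got the phrase right, so it trims false positives but can’t recover days the search missed.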

  3. November 24, 2011

    Hi Tim, this looks like an interesting first stage. But what are you trying to find out from the editorials? I know of someone who did some similar work; ‘mapping’ arguments in Op Ed articles. I’ll try and find it.

  4. […] recording my thoughts and assumptions in this way has already proved useful. In a comment, Owen Stephens noted that his attempt to reproduce my search query produced fewer results. After a little bit of poking […]
