<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>discontents &#187; the shed</title>
	<atom:link href="http://discontents.com.au/sections/shed/feed" rel="self" type="application/rss+xml" />
	<link>http://discontents.com.au</link>
	<description>working for the triumph of content over form, ideas over control, people over systems</description>
	<lastBuildDate>Tue, 24 Jan 2012 20:57:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>QueryPic</title>
		<link>http://discontents.com.au/shed/hacks/querypic</link>
		<comments>http://discontents.com.au/shed/hacks/querypic#comments</comments>
		<pubDate>Sat, 31 Dec 2011 15:08:12 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[hacks]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[Trove]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1546</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=QueryPic&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2012-01-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/querypic&amp;rft.language=English"></span>
Back when I was looking at &#8216;When did the Great War become the First World War?&#8217; I promised a detailed post on how I constructed the graphs. But of course I got distracted. Then I started adding new features to the script and redesigning the graphs, so&#8230; Anyway, the result is a rather neat little [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=QueryPic&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2012-01-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/querypic&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1546"><!-- &nbsp; --></abbr>
<p>Back when I was looking at &#8216;<a title="When did the ‘Great War’ become the ‘First World War’?" href="http://discontents.com.au/shed/experiments/when-did-the-great-war-become-the-first-world-war">When did the Great War become the First World War?</a>&#8217; I promised a detailed post on how I constructed the graphs. But of course I got distracted. Then I started adding new features to the script and redesigning the graphs, so&#8230;</p>
<p>Anyway, the result is a rather neat little gizmo henceforth named <a href="http://wraggelabs.com/emporium/trove-tools/newspaper-search-summariser/">QueryPic</a> (I got a bit sick of &#8216;search summariser&#8217; and &#8216;graph-maker thing&#8217;). <a title="Mining the treasures of Trove (part 2)" href="http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2">The first version</a> just harvested data and left all the graph-making to you. But QueryPic does it all! It harvests the data <em>and</em> makes the graph. Woohoo.</p>
<p>Here&#8217;s an example showing &#8216;drought&#8217; versus &#8216;flood&#8217;:</p>
<p><a href="http://wraggelabs.com/shed/trove/newgraphs/flood_drought.html"><img class="aligncenter size-medium wp-image-1551" title="Screen Shot 2012-01-01 at 1.53.28 AM" src="http://discontents.com.au/wp-content/uploads/2012/01/Screen-Shot-2012-01-01-at-1.53.28-AM-250x166.png" alt="" width="250" height="166" /></a></p>
<h4>QueryPic features</h4>
<ul>
<li>Explore your Trove newspaper query over time in the form of a simple line graph.</li>
<li>Interactive &#8212; click on a point to retrieve sample articles from that date.</li>
<li>Combine data sources to compare queries.</li>
<li>Choose your interval &#8212; plot by year or month.</li>
<li>Switch views between total results and the proportion of all articles.</li>
</ul>
<h4>Running QueryPic</h4>
<p>Yes, it&#8217;s a Python script and yes it runs on the command line. Let&#8217;s get that out of the way now. I don&#8217;t think I have the time and energy to develop cross-platform gui versions of all my tools. I&#8217;d rather spend the time adding new features or exploring new possibilities. Sorry, but until I have a wealthy benefactor or a technical support team, I think that&#8217;s the way it has to be. In any case, <a href="https://github.com/wragge/Trove-newspapers">the code is all there </a>&#8211; so build your own gui!</p>
<p>Actually, if I did have the time and energy I don&#8217;t think I&#8217;d build a standalone gui anyway. What would be much cooler would be a web service, where people could run, share and combine their queries. Social graph-making! A celebration of serendipity! A historical playground! Hmmm&#8230;</p>
<p>But for now there&#8217;s this python script. It&#8217;s dead easy to use. Starting from the beginning&#8230;</p>
<ol>
<li>Do you have Python installed? If you have a Mac or Linux the answer is yes. Fire up a terminal and type &#8216;python -V&#8217; &#8212; see, I told you. If you have Windows you can get a <a href="http://www.python.org/getit/windows/">handy installer</a>. Do it.</li>
<li>Get the source code. Just <a href="https://github.com/wragge/Trove-newspapers/zipball/master">download this zip file</a> and unzip it into a new folder.</li>
<li>Open a terminal and cd into the new folder.</li>
<li>Run &#8216;python do_totals.py [your Trove query]&#8217;.</li>
<li>Watch in excitement as the script chugs away retrieving data from Trove.</li>
<li>Once the script is finished, go to the &#8216;graphs&#8217; directory, where you&#8217;ll find your newly-created html page complete with fancy interactive graph.</li>
<li>Open the html page in the web browser of your choice.</li>
<li>Enjoy! Celebrate! Drink a toast in my honour!</li>
</ol>
<h4>Customising QueryPic</h4>
<p>There are a number of optional arguments that you add to the command line to customise your results:</p>
<p><strong>-n (or --name) [a query name]<br />
</strong>Give a name to your query. The name is used to create filenames for the html and data files, and it also appears in the legend of the graph. The default is to use the search keywords as the name.</p>
<p><strong>-d (or --directory) [a directory path]</strong><br />
The full pathname of the directory/folder for your results. The default is a &#8216;graphs&#8217; sub-directory in the current directory.</p>
<p><strong>-g (or --graph) [a graph name]</strong><br />
Specify the name of the html file that&#8217;s created. This is useful for displaying multiple queries on a single graph. Just run QueryPic for each query, using the same graph name each time. The default is either the value specified by the -n parameter or a name derived from the search keywords.</p>
<p><strong>-m (or --monthly)</strong><br />
Plot the query at monthly intervals. The default interval is a year.</p>
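<p>As a rough illustration of how these options fit together, here is a minimal sketch of a command-line parser that accepts them. It is not the actual do_totals.py code: the use of argparse, the function name build_parser and the help strings are all assumptions, and it is written for Python 3.</p>
<pre class="brush: python">
import argparse


def build_parser():
    """Build a parser that accepts the options described above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Graph a Trove newspaper query over time.")
    # The Trove search url is the only required argument.
    parser.add_argument("query", help="url of a Trove newspaper search")
    parser.add_argument("-n", "--name", help="label for the query (default: the search keywords)")
    parser.add_argument("-d", "--directory", default="graphs", help="folder for the results")
    parser.add_argument("-g", "--graph", help="name of the html file to create")
    parser.add_argument("-m", "--monthly", action="store_true", help="plot by month rather than by year")
    return parser


# Parse an example command line without touching sys.argv.
args = build_parser().parse_args(["http://trove.nla.gov.au/newspaper/result?q=cat", "-g", "animals", "-m"])
print(args.graph, args.monthly, args.directory)
</pre>
<p>Because -m is a simple on/off switch it takes no value, while the other options each expect one.</p>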
<h4>What QueryPic actually does</h4>
<p>QueryPic builds a simple visualisation of your search query in the Trove newspaper database. A list of search results is difficult to interpret and offers little context. QueryPic shows you the number of articles matching your query over time, enabling you to reframe your questions, pursue hunches, or simply play around.</p>
<p>QueryPic takes your Trove newspaper query and looks for a date range. If it doesn&#8217;t find one, it assumes you want your graph to go from 1803 to 1954 (the complete contents of the newspaper database &#8212; except for the Women&#8217;s Weekly). QueryPic then strips out any date parameters from the query, so it can fire off the query within the start and end dates, at the specified date interval.</p>
<p>Date interval? In the previous version of this script you could only plot points at yearly intervals, so it was impossible to zoom in and see what might be happening over the span of a single year or two. But amazing advances in QueryPic technology mean you can now plot changes <em>by month</em>. Here for example is a new version of my Great War/First World War graph, focused on 1938&#8211;1946 and plotted at monthly intervals.</p>
<p><a href="http://wraggelabs.com/shed/trove/newgraphs/great_war_1938_46.html"><img class="aligncenter size-medium wp-image-1552" title="Screen Shot 2012-01-01 at 1.55.22 AM" src="http://discontents.com.au/wp-content/uploads/2012/01/Screen-Shot-2012-01-01-at-1.55.22-AM-250x166.png" alt="" width="250" height="166" /></a></p>
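<p>The date-range handling described above might be sketched like this. It is an assumption-laden illustration rather than the script&#8217;s own code: the function name split_date_range is made up, the fromyyyy and toyyyy parameter names are taken from the example urls in this post, and it targets Python 3.</p>
<pre class="brush: python">
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Default span of the newspaper database, as given in the text above.
DEFAULT_START, DEFAULT_END = 1803, 1954


def split_date_range(query_url):
    """Return (url_without_dates, start_year, end_year) for a Trove search url."""
    parts = urlsplit(query_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    found = dict(params)
    # Fall back to the defaults when a date parameter is missing or blank.
    start = int(found.get("fromyyyy") or DEFAULT_START)
    end = int(found.get("toyyyy") or DEFAULT_END)
    # Drop the date parameters so they can be added back for each interval.
    kept = [(key, value) for key, value in params if key not in ("fromyyyy", "toyyyy")]
    clean_url = urlunsplit(parts._replace(query=urlencode(kept)))
    return clean_url, start, end


print(split_date_range("http://trove.nla.gov.au/newspaper/result?q=cat"))
</pre>
<p>Note that rebuilding the query string may change how some characters are escaped, although the parameters themselves stay the same.</p>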
<p>So for each interval within the date range QueryPic fires off a request to Trove. From the response it scrapes out the total number of results for that date. If the total is greater than zero, it then fires off a second request to find the total number of newspaper articles for that year. Your query results divided by the total number of articles gives the proportion of articles for that date matching your search query.</p>
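<p>In outline, that loop could look something like the sketch below. The two callables, count_matches and count_all, are hypothetical stand-ins for the requests to Trove (the scraping itself is not shown), and the function and key names are invented for illustration. Python 3 is assumed.</p>
<pre class="brush: python">
def intervals(start_year, end_year, monthly=False):
    """Yield (year, month) pairs; month is None when plotting by year."""
    for year in range(start_year, end_year + 1):
        if monthly:
            for month in range(1, 13):
                yield year, month
        else:
            yield year, None


def proportion(matches, total_articles):
    """Share of all articles in an interval that match the query."""
    if total_articles == 0:
        return 0.0
    return matches / total_articles


def harvest(start_year, end_year, count_matches, count_all, monthly=False):
    """Collect one data point per interval.

    count_matches and count_all are hypothetical callables standing in for
    the two Trove requests; each takes (year, month) and returns an integer.
    """
    points = []
    for year, month in intervals(start_year, end_year, monthly):
        matches = count_matches(year, month)
        # Only ask for the overall total when there is something to compare.
        total = count_all(year, month) if matches else 0
        points.append({"year": year, "month": month, "total": matches,
                       "ratio": proportion(matches, total)})
    return points
</pre>
<p>Skipping the second request when there are no matches saves a call to Trove for every empty interval.</p>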
<p>The number of results and the proportion are written to a javascript file, together with some other important information including the original query and the date the harvest was performed. Remember, the Trove newspapers database is always changing! QueryPic then grabs a copy of its own special html template and inserts a reference to this javascript file. For good measure, it also inserts a link to your original query. The file is saved under a new name, ready for you to open and explore.</p>
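<p>One way to produce a data file of this kind is to serialise the results as JSON and wrap them in a line of javascript. The sketch below is only a guess at the general shape: the variable name dataSources, the field names and the function name build_data_js are all invented here, and the real file format may well differ. Python 3 is assumed.</p>
<pre class="brush: python">
import datetime
import json


def build_data_js(name, query_url, points, harvested=None):
    """Return javascript source that adds one harvested query to a shared list."""
    harvested = harvested or datetime.date.today()
    payload = {
        "name": name,
        "query": query_url,
        "harvested": harvested.isoformat(),
        "data": points,
    }
    # 'dataSources' is a made-up variable name for this sketch.
    template = "var dataSources = dataSources || [];\ndataSources.push(%s);\n"
    return template % json.dumps(payload, indent=2)


print(build_data_js("cat", "http://trove.nla.gov.au/newspaper/result?q=cat",
                    [{"year": 1920, "total": 5, "ratio": 0.25}]))
</pre>
<p>Pushing each query onto a shared list is one way several data files could feed the same graph, with the html page only needing one script reference per file.</p>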
<p>The html file contains everything necessary to take your data and turn it into a graph. It does this using the HighCharts javascript library. Please note that while licence conditions allow HighCharts to be redistributed as part of a non-commercial package, it is not free for commercial use. Check the <a href="http://www.highcharts.com/">HighCharts website</a> for details.</p>
<h4>Some examples</h4>
<p>Plot &#8216;cat&#8217; against &#8216;dog&#8217; in a graph called &#8216;animals&#8217;:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&quot; -g &quot;animals&quot;
python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=dog&quot; -g &quot;animals&quot;</pre>
<p>Specify a directory for your results:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&quot; -d &quot;/User/bill/Documents/graphs&quot;</pre>
<p>Plot results at monthly intervals:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&amp;fromyyyy=1920&amp;toyyyy=1921&quot; -m</pre>
<p>Specify a name:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&quot; -n &quot;Felines&quot;</pre>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/hacks/querypic/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Extracting editorials #2</title>
		<link>http://discontents.com.au/shed/hacks/extracting-editorials-2</link>
		<comments>http://discontents.com.au/shed/hacks/extracting-editorials-2#comments</comments>
		<pubDate>Mon, 19 Dec 2011 13:18:49 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[hacks]]></category>
		<category><![CDATA[1913editorials]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[Trove]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1515</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Extracting+editorials+%232&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2011-12-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/extracting-editorials-2&amp;rft.language=English"></span>
As I explained in the first of this series, I&#8217;m documenting my efforts to extract every editorial published in the Sydney Morning Herald in 1913 from the Trove newspaper database. It&#8217;s an experiment both in text mining and historical writing &#8212; an attempt to put the method up front. While I didn&#8217;t think there was anything [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Extracting+editorials+%232&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2011-12-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/extracting-editorials-2&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1515"><!-- &nbsp; --></abbr>
<p>As I explained in <a title="Extracting editorials #1" href="http://discontents.com.au/shoebox/digital-humanities/extracting-editorials-1">the first of this series</a>, I&#8217;m documenting my efforts to extract every editorial published in the <em>Sydney Morning Herald</em> in 1913 from the Trove newspaper database. It&#8217;s an experiment both in text mining and historical writing &#8212; an attempt to put the method up front.</p>
<p>While I didn&#8217;t think there was anything very thrilling in the first instalment, recording my thoughts and assumptions in this way has already proved useful. In a comment, <a href="http://discontents.com.au/shoebox/digital-humanities/extracting-editorials-1#comment-2371">Owen Stephens noted</a> that his attempt to reproduce my search query produced fewer results. After a little bit of poking around I realised that the fulltext modifier, which I often use to switch off fuzzy matching, counteracts the &#8216;search headings only&#8217; flag. So my query was returning results that had the string &#8216;The Sydney Morning Herald&#8217; anywhere in the article.</p>
<p>Try it for yourself.</p>
<p><a href="http://trove.nla.gov.au/newspaper/result?l-textSearchScope=headings+only%7Cscope%3Aheadings&amp;l-title=The+Sydney+Morning+Herald...%7Ctitleid%3A35&amp;l-word=*ignore*%7C*ignore*&amp;fromyyyy=1913&amp;toyyyy=1913&amp;sortby=dateAsc&amp;q=fulltext%3A%22The+Sydney+Morning+Herald%22&amp;l-category=Article%7Ccategory%3AArticle&amp;s=0">Here&#8217;s my original query</a> &#8212; searching for fulltext:&quot;The Sydney Morning Herald&quot; in headings only (supposedly). You&#8217;ll notice that it returns 335 results and it&#8217;s clear from a quick scan that a number are false positives (they don&#8217;t follow the pattern for editorials).</p>
<p><a href="http://trove.nla.gov.au/newspaper/result?l-textSearchScope=headings+only%7Cscope%3Aheadings&amp;l-title=The+Sydney+Morning+Herald...%7Ctitleid%3A35&amp;l-word=*ignore*%7C*ignore*&amp;fromyyyy=1913&amp;toyyyy=1913&amp;sortby=dateAsc&amp;l-category=Article%7Ccategory%3AArticle&amp;q=%22The+Sydney+Morning+Herald%22">Here&#8217;s Owen&#8217;s query</a> &#8212; searching for &#8220;The Sydney Morning Herald&#8221; in headings only. It returns 294 results, without any obvious false positives.</p>
<p>So my attempt to disable fuzzy matching actually produced a less accurate result! Weird.</p>
<p>Actually, I think one important benefit of this sort of text mining is that it helps you understand how the search engines you&#8217;re using actually work. Once you start poking and prodding, the idiosyncrasies start to emerge.</p>
<p>Anyway, I harvested Owen&#8217;s cleaner result set and opened up the resulting csv file. As it seemed in Trove, there were very few false positives. Indeed there were only two articles that didn&#8217;t seem to follow the standard editorial format, and these were notes added to the editorial page. On the other hand, there were obviously about 20 editorials missing. I could have manually worked through the csv file to identify the missing dates, but I thought I&#8217;d try to create some tools that would do the work for me.</p>
<p>What I wanted was the details of the first editorial in every edition of the newspaper in 1913 &#8212; so there should be one, and only one, article for each day on which the newspaper was published. I needed a tool that would analyse the csv file and do two things:</p>
<ul>
<li>identify dates that occur multiple times (false positive alert!)</li>
<li>identify dates that are absent from the result set (missing in action!)</li>
</ul>
<p>The resulting code is <a href="https://github.com/wragge/Trove-newspapers">all on GitHub</a> if you want to follow along. I wrote a Python script that opens up the csv file, extracts all the date strings, converts them to datetime objects and then saves them to a list. Once that&#8217;s done it&#8217;s pretty easy to loop through and find duplicates:</p>
<pre class="brush: python">
def find_duplicates(items):
    &#039;&#039;&#039;
    Check a list for duplicate values.
    Returns a list of the duplicates.
    &#039;&#039;&#039;
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        seen.add(item)
    return duplicates
</pre>
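<p>For completeness, the loading step mentioned earlier (reading the csv file and turning the date strings into date objects) might look roughly like this. The column name &#8216;date&#8217; and the date format are assumptions about the harvested file, not details taken from the actual script, and the sketch is written for Python 3.</p>
<pre class="brush: python">
import csv
import datetime
import io


def read_article_dates(csv_file, column="date", date_format="%Y-%m-%d"):
    """Return a sorted list of datetime.date objects, one per csv row."""
    reader = csv.DictReader(csv_file)
    dates = [datetime.datetime.strptime(row[column], date_format).date() for row in reader]
    return sorted(dates)


# A tiny in-memory file stands in for the harvested csv.
sample = io.StringIO("date,title\n1913-01-02,First\n1913-01-01,Second\n1913-01-02,Third\n")
print(read_article_dates(sample))
</pre>
<p>Sorting the list means its first and last entries can serve as the start and end of the period being checked.</p>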
<p>Finding missing dates was a little more complicated, but Google came to the rescue with some handy code samples. All I had to do was set a start and end date (in this case 1 January 1913 and 31 December 1913) and create a timedelta object equal to a day. Then it&#8217;s just a matter of adding the timedelta to the start date, comparing the new date to the dates extracted from the csv file, and continuing on until you hit the end. If the new date isn&#8217;t in the csv file, then it gets added to the missing list.</p>
<pre class="brush: python">
# Excerpt: year, article_dates, exclude and missing_dates are set earlier in the script.
if year:
    start_date = datetime.date(year, 1, 1)
    end_date = datetime.date(year, 12, 31)
else:
    start_date = article_dates[0]
    end_date = article_dates[-1]
one_day = datetime.timedelta(days=1)
this_day = start_date
# Loop through each day in the specified period to see if there&#039;s an article.
# If not, add it to the missing_dates list.
while this_day &lt;= end_date:
    if this_day.weekday() not in exclude:  # e.g. exclude = [6] skips Sundays
        if this_day not in article_dates:
            missing_dates.append(this_day)
    this_day += one_day
</pre>
<p>I&#8217;ve tried to make the code as reusable as possible, so you can either supply a year, or the script will read start and end dates from the csv file itself.</p>
<p>All that left me with two more lists of dates: &#8216;duplicates&#8217; and &#8216;missing&#8217;. At first I just wrote these out to a text file, but then I decided it would be useful to write the results to an html page. That way I could add links that would take me to the actual issue within Trove, helping me to quickly find the missing editorial.</p>
<p>Unfortunately there&#8217;s no direct way to go from a date to an issue &#8212; you first need to find the issue identifier. How do you do this? If you dig around in the code beneath <a href="http://trove.nla.gov.au/ndp/del/title/35">the page for each newspaper title</a>, you&#8217;ll find that the ajax interface pulls in a json file with issue information. You can access this through a url like: http://trove.nla.gov.au/ndp/del/titlesOverDates/[year]/[month]. Here&#8217;s an example for <a href="http://trove.nla.gov.au/ndp/del/titlesOverDates/1913/01">January 1913</a>.</p>
<p>The json includes all issues for all titles in the specified month. So you then have to loop through to find a specific title and day. Once you have the issue identifier you can just attach it to a url:</p>
<pre class="brush: python">
import json
import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead


def get_issue_url(date, title_id):
    &#039;&#039;&#039;
    Gets the issue url given a title and date.
    Returns None if no matching issue is found.
    &#039;&#039;&#039;
    year, month, day = date.timetuple()[:3]
    url = &#039;http://trove.nla.gov.au/ndp/del/titlesOverDates/%s/%02d&#039; % (year, month)
    issues = json.load(urllib2.urlopen(url))
    for issue in issues:
        # Match on the title id (&#039;t&#039;) and the day of the month (&#039;p&#039;).
        if issue[&#039;t&#039;] == title_id and int(issue[&#039;p&#039;]) == day:
            return &#039;http://trove.nla.gov.au/ndp/del/issue/%s&#039; % issue[&#039;iss&#039;]
    return None
</pre>
<div id="attachment_1533" class="wp-caption alignright" style="width: 260px"><a href="http://discontents.com.au/wp-content/uploads/2011/12/Screen-Shot-2011-12-19-at-4.43.15-PM1.png"><img src="http://discontents.com.au/wp-content/uploads/2011/12/Screen-Shot-2011-12-19-at-4.43.15-PM1-250x469.png" alt="" title="Screen Shot 2011-12-19 at 4.43.15 PM" width="250" height="469" class="size-medium wp-image-1533" /></a><p class="wp-caption-text">My results file with links to Trove</p></div>
<p>Finally, to save myself having to cut and paste the missing dates back into the csv file, I added a few lines to write them in automatically.</p>
<p>So now I have a handy little html page, complete with dates and links, that I&#8217;m working through to find all the missing editorials. All I need for the next stage are the urls for the editorial and the page on which it&#8217;s published. I&#8217;m just cutting and pasting these from the citation box in Trove into the csv file. Once this is done I can start trying to find <strong>all</strong> the editorials.</p>
<p>PS: I noted in my first post that one benefit in finding the editorials was that the main news articles usually appeared on the page after the editorials. I&#8217;ve been thinking some more about ways to identify &#8216;major&#8217; news stories. Word length perhaps? But not always. Hmmm, but major stories do seem to be published at the top of the page. After a bit more poking around in the code I found that there&#8217;s a &#8216;y value&#8217; assigned to each article that indicates its position on the page. So if I harvest all the articles on the page after the editorials and then rank them by their y values? Interesting&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/hacks/extracting-editorials-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>the real face of white australia</title>
		<link>http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia</link>
		<comments>http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia#comments</comments>
		<pubDate>Tue, 20 Sep 2011 14:42:16 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[archives]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[facial detection]]></category>
		<category><![CDATA[invisibleaustralians]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1323</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=the+real+face+of+white+australia&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-09-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia&amp;rft.language=English"></span>
In many of the presentations I&#8217;ve given in recent times I&#8217;ve managed to include a question raised by Tim Hitchcock in his chapter in The Virtual Representation of the Past. Tim asks: What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=the+real+face+of+white+australia&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-09-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1323"><!-- &nbsp; --></abbr>
<p>In many of the presentations I&#8217;ve given in recent times I&#8217;ve managed to include a question raised by Tim Hitchcock in his chapter in <em>The Virtual Representation of the Past</em>. Tim asks:</p>
<blockquote><p>What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?</p></blockquote>
<p>The idea of turning archival systems on their head to expose the people rather than the bureaucracy is what motivates Kate Bagnall and me in our attempts to make the <a href="http://invisibleaustralians.org">Invisible Australians</a> project into a reality.</p>
<p><em>Invisible Australians</em> aims to liberate the lives of those who suffered under the restrictions of the White Australia Policy from the rich archival holdings of the National Archives of Australia and elsewhere.</p>
<p>We always knew that the portrait photographs, included on a range of government documents, would provide a compelling perspective on these lives, but we weren&#8217;t quite sure how we were going to extract them. Up until last weekend, I&#8217;d assumed that we&#8217;d develop a crowdsourcing tool that contributors would use to mark up the photos.</p>
<p>Now I&#8217;m not so sure.</p>
<p>In the space of a couple of days I&#8217;ve extracted over 7,000 photographs and built an application to browse them &#8212; here is <a href="http://invisibleaustralians.org/faces/">the real face of White Australia</a>&#8230;</p>
<p><a href="http://invisibleaustralians.org/faces/"><img src="http://discontents.com.au/wp-content/uploads/2011/09/real_face-250x182.jpg" alt="" title="real_face" width="250" height="182" class="aligncenter size-medium wp-image-1325" /></a></p>
<p>How did I do it? Paul Hagon, at the National Library of Australia, <a href="http://www.paulhagon.com/blog/2010/03/11/everything-i-know-about-cataloguing-i-learned-from-watching-james-bond/">gave a presentation</a> last year in which he explored the possibilities of facial detection in developing access to photographic collections. The idea lodged in my brain somewhere and a few days ago I started to poke around looking to see how practical it might be for <em>Invisible Australians</em>.</p>
<p>It didn&#8217;t take long to find <a href="http://creatingwithcode.com/howto/face-detection-in-static-images-with-python/">a python script</a> that used the <a href="http://sourceforge.net/projects/opencvlibrary/">OpenCV library</a> to detect faces in photographs. I tried the script on a few of the NAA documents and was impressed &#8212; there were a few false positives, but the faces were being found!</p>
<p>So then the excitement kicked in. I modified the script so that instead of just finding the coordinates of faces it would enlarge the selected area by 50px on each side and then crop the image. This did a great job of extracting the portraits. I tweaked a few of the settings as well to try and reduce the number of false positives. Eventually, I developed a two-pass system that repeated the detection process after the image had been cropped and its contrast adjusted. This seemed to weed out a few more errors. You can <a href="https://github.com/wragge/Facial-detection">find the code</a> on GitHub.</p>
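<p>The enlarge-and-crop step comes down to a little arithmetic on the detected rectangle. Here is a minimal sketch of that one step, leaving out the OpenCV detection itself. The function name pad_box is invented, and clamping the box to the image edges is an assumption about how faces near the border would be handled. Python 3 is assumed.</p>
<pre class="brush: python">
def pad_box(x, y, width, height, image_width, image_height, padding=50):
    """Grow a detected face box by `padding` pixels on each side.

    Returns (left, top, right, bottom), clamped to the image so the crop
    never reaches outside the page.
    """
    left = max(x - padding, 0)
    top = max(y - padding, 0)
    right = min(x + width + padding, image_width)
    bottom = min(y + height + padding, image_height)
    return left, top, right, bottom


# A face found at (100, 120), 80 pixels square, on a 1000 x 1400 page image.
print(pad_box(100, 120, 80, 80, 1000, 1400))
</pre>
<p>The four numbers returned are in the left, top, right, bottom order that many imaging libraries expect for a crop.</p>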
<p>Once the script was working I had to assemble the documents. I already had a basic harvester that would retrieve both the file metadata and digitised images for any series in the NAA database. Acting on Kate&#8217;s advice, I pointed it at series <a href="http://www.naa.gov.au/cgi-bin/Search?Number=ST84/1">ST84/1</a> and downloaded 12,502 page images.</p>
<p>All I then had to do was loop the facial detection script over the images. Simple! The only problem was that my 3-year-old laptop wasn&#8217;t quite up to the task. As its CPU temperature rose and rose, I was forced to employ a special high-tech cooling system.</p>
<div id="attachment_1329" class="wp-caption aligncenter" style="width: 260px"><a href="http://discontents.com.au/wp-content/uploads/2011/09/cooling.jpg"><img src="http://discontents.com.au/wp-content/uploads/2011/09/cooling-250x186.jpg" alt="" title="cooling" width="250" height="186" class="size-medium wp-image-1329" /></a><p class="wp-caption-text">Keeping my laptop alive...</p></div>
<p>But after running for several hours, my faithful old laptop finally worked its way through all the documents. The result was a directory full of 11,170 cropped images.</p>
<div id="attachment_1332" class="wp-caption aligncenter" style="width: 260px"><a href="http://discontents.com.au/wp-content/uploads/2011/09/faces_dir.jpg"><img src="http://discontents.com.au/wp-content/uploads/2011/09/faces_dir-250x147.jpg" alt="" title="faces_dir" width="250" height="147" class="size-medium wp-image-1332" /></a><p class="wp-caption-text">The results</p></div>
<p>There were still quite a lot of false positives and so I simply worked my way through the files, manually deleting the errors. I ended up with 7,247 photos of people. That&#8217;s a strike rate of nearly 65% which seems pretty good. The classifier, which does the actual facial detection, was probably trained on conventional photographs rather than on the mixed-format documents I was feeding it.</p>
<p>Then it was just a matter of building a web app to display the portraits. I used Django for the backend work of managing the metadata and delivering the content, while the interface was built using a combination of <a href="http://isotope.metafizzy.co/index.html">Isotope</a>, <a href="http://www.infinite-scroll.com/">Infinite Scroll</a> and <a href="http://fancybox.net/">FancyBox</a>.</p>
<p>It&#8217;s important to note that the portraits provide a way of exploring the records themselves. If you click on a face you see a copy of the document from which the photo was extracted. A link is provided to examine the full context of the image in RecordSearch. This is not just an exhibition, it&#8217;s a finding aid.</p>
<p>What next? There are many more of these documents to be harvested and processed (and many more still yet to be digitised). I will be adding more series as I can (though I might have to wait until I can afford a new computer!). I&#8217;d also like to explore the possibilities of facial or object detection a bit more. Could I train my own classifier? Could I detect handprints, or even classify the type of form?</p>
<p>In the meantime, I think our experimental browser helps us to understand why the <em>Invisible Australians</em> project is so important &#8212; you look at their faces and you simply want to know more. Who are they? What were their lives like?</p>
<p>UPDATE: For more on the photos and the issues they raise, see <a href="http://chineseaustralia.org/?cat=62">Kate Bagnall&#8217;s posts</a> over at the <a href="http://chineseaustralia.org/">Tiger&#8217;s Mouth</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia/feed</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>When did the &#8216;Great War&#8217; become the &#8216;First World War&#8217;?</title>
		<link>http://discontents.com.au/shed/experiments/when-did-the-great-war-become-the-first-world-war</link>
		<comments>http://discontents.com.au/shed/experiments/when-did-the-great-war-become-the-first-world-war#comments</comments>
		<pubDate>Mon, 29 Aug 2011 13:38:38 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[Trove]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1259</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=When+did+the+%26%238216%3BGreat+War%26%238217%3B+become+the+%26%238216%3BFirst+World+War%26%238217%3B%3F&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-08-29&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/when-did-the-great-war-become-the-first-world-war&amp;rft.language=English"></span>
I&#8217;m interested in time &#8212; in the way we imagine, manipulate, experience and describe time, particularly in the service of ideas such as &#8216;progress&#8217;. This was one of the themes of Atomic Wonderland, but beyond constructing a few case studies it&#8217;s not all that easy to study. Or at least it wasn&#8217;t. Now projects such [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=When+did+the+%26%238216%3BGreat+War%26%238217%3B+become+the+%26%238216%3BFirst+World+War%26%238217%3B%3F&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-08-29&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/when-did-the-great-war-become-the-first-world-war&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1259"><!-- &nbsp; --></abbr>
<div id="attachment_1293" class="wp-caption alignright" style="width: 260px"><a href="http://nla.gov.au/nla.news-article62826197"><img src="http://discontents.com.au/wp-content/uploads/2011/08/townsville-daily-bulletin-9-Dec-1939-250x322.png" alt="" title="townsville-daily-bulletin-9-Dec-1939" width="250" height="322" class="size-medium wp-image-1293" /></a><p class="wp-caption-text">Townsville Daily Bulletin, 9 December 1939</p></div>
<p>I&#8217;m interested in time &#8212; in the way we imagine, manipulate, experience and describe time, particularly in the service of ideas such as &#8216;progress&#8217;.</p>
<p>This was one of the themes of <a title="Atomic wonderland" href="http://discontents.com.au/shoebox/history-of-australian-science/atomic-wonderland">Atomic Wonderland</a>, but beyond constructing a few case studies it&#8217;s not all that easy to study. Or at least it wasn&#8217;t. Now projects such as <a href="http://victorianbooks.org/">Victorian Books</a> are showing how we can explore the changing weights of ideas across times and cultures by analysing the contents of large textual collections.</p>
<p>Returning visitors will probably be aware of <a href="http://discontents.com.au/tag/trove">my own experiments</a> mining the contents of the National Library of Australia&#8217;s digitised newspapers database, available through <a href="http://trove.nla.gov.au/newspaper">Trove</a>. So far I&#8217;ve focused on the development of generic tools and techniques, but I thought it would be interesting to apply these to my study of &#8216;progress&#8217;. Happily the NLA agreed and have awarded me a <a href="http://www.nla.gov.au/harold-white-fellowships/2012-national-library-of-australia-fellowships-announced">Harold White Fellowship for 2012</a> to do just that. Yippee!</p>
<p>I&#8217;ll be taking up the fellowship in February, but in preparation I&#8217;ve started to develop a few little sketches that prod at our fondness for periodisation. Labels such as &#8216;the Roaring Twenties&#8217;, &#8216;the Great Depression&#8217; or even &#8216;the First World War&#8217; are so familiar that we sometimes forget that they themselves have a history.</p>
<p>To begin with I decided to examine the question of when &#8216;the Great War&#8217; became &#8216;the First World War&#8217;. At some point we realised that the Great War was not the final act in a centuries-long drama of European jealousy and jostling, but the first in a series of global conflicts. Can newspapers tell us when?</p>
<p>I <a href="http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2">already had a script</a> that would generate a basic time series from a Trove query string. It simply takes the query, fires off a separate search for each year and grabs the number of matching articles. If the number of matches is more than zero, it also retrieves the total number of articles for that year and calculates the proportion matching the query. The results are saved in a json file which can be easily visualised using something like <a href="http://www.highcharts.com/">HighCharts</a>. The original script needed a few tweaks to streamline the process, but I&#8217;ll describe these in detail in my next post.</p>
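<p>As a rough illustration of that loop (not the actual script), here&#8217;s a minimal Python 3 sketch. The two counting functions are hypothetical stand-ins for the Trove searches, and the record layout is just one plausible shape for the JSON.</p>

```python
import json


def build_series(years, count_matches, count_all):
    """Return one record per year with the match count and proportion.

    count_matches(year) and count_all(year) are hypothetical callables
    standing in for the two Trove searches described above.
    """
    series = []
    for year in years:
        matches = count_matches(year)
        record = {"year": year, "total": matches, "ratio": 0.0}
        if matches:
            # Only ask for the yearly total when there is something to divide.
            record["ratio"] = matches / count_all(year)
        series.append(record)
    return series


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the output.
    fake_matches = {1938: 0, 1939: 50, 1940: 20}
    fake_totals = {1938: 1000, 1939: 1000, 1940: 800}
    demo = build_series(range(1938, 1941), fake_matches.get, fake_totals.get)
    print(json.dumps(demo, indent=2))
```

<p>Skipping the second search when there are no matches saves a request for every empty year.</p>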
<p>For this experiment I constructed two queries. The first simply searched for the phrase &#8216;<a href="http://trove.nla.gov.au/newspaper/result?q=&#038;exactPhrase=the+great+war&#038;l-category=Article|category%3AArticle">the great war</a>&#8217; between 1900 and 1954. The second was a bit more complicated &#8212; it searched for <a href="http://trove.nla.gov.au/newspaper/result?l-category=Article|category%3AArticle&#038;sortby=dateAsc&#038;q=%22the+first+world+war%22+OR+%22world+war+one%22+OR+%22world+war+i%22+OR+%22world+war+1%22">any of the phrases</a> &#8216;first world war&#8217;, &#8216;world war one&#8217;, &#8216;world war 1&#8217; or &#8216;world war i&#8217; across the same period. I fed the queries to my script and after a bit of ker-chugging, whirring and clunking I ended up with a graph.</p>
<div id="attachment_1278" class="wp-caption alignright" style="width: 260px"><a href="http://wraggelabs.com/shed/time/the_great_war-2011-08-16.html"><img src="http://discontents.com.au/wp-content/uploads/2011/08/great_war_graph-252x300.jpg" alt="" title="When did the Great War become the First World War?" width="250" height="297" class="size-medium wp-image-1278" /></a><p class="wp-caption-text">Click to view the full interactive graph.</p></div>
<p>The result is not really surprising. As you can see <a href="http://wraggelabs.com/shed/time/the_great_war-2011-08-16.html">on the full graph</a>, the two lines cross late in 1941. With German victories across Europe and North Africa, the opening of the Eastern Front and, finally, the Japanese attack on Pearl Harbour, 1941 seems to make sense. But it&#8217;s interesting to see this reflected so clearly in such a rough and ready analysis.</p>
<p>What is perhaps more intriguing is the huge spike in 1939. Of course it makes sense that people would be referring back to the Great War as the prospect of a new conflict loomed, but it does make you wonder about the context of these discussions and how they might have developed as war edged closer.</p>
<p>Notable too are the earlier blips in the First World War count &#8212; the first centred on 1916 and the second on 1935. The peak in 1916 is actually due to the tags and comments added by Trove users. The standard &#8216;search everything&#8217; option in Trove includes these as well as the text of the articles themselves. By using other search options you can choose to exclude the tags that match your query, but that seems rather messy. It would be nicer if Trove gave you the option of ignoring these matches from the start.</p>
<div id="attachment_1286" class="wp-caption alignright" style="width: 260px"><a href="http://nla.gov.au/nla.news-article32886350"><img src="http://discontents.com.au/wp-content/uploads/2011/08/first_world_war-300x298.jpg" alt="" title="first_world_war" width="250" height="248" class="size-medium wp-image-1286" /></a><p class="wp-caption-text">The West Australian, 24 May 1935</p></div>
<p>The second blip is a bit more interesting. By clicking on the graph and exploring the results from Trove, you can see that it&#8217;s due to the screening of a documentary film called &#8216;<a href="http://www.imdb.com/title/tt0976117/">The First World War</a>&#8217;. The film used archival footage drawn from a number of nations and was based on Laurence Stallings&#8217;s book <em>The First World War: A Photographic History</em>. As one newspaper article noted: &#8216;this picture presents war, stripped of its gaudy trappings, and fearful in its grim reality&#8217;.</p>
<p>By way of comparison I <a href="http://ngrams.googlelabs.com/graph?content=the+Great+War%2Cthe+First+World+War&#038;year_start=1900&#038;year_end=1954&#038;corpus=0&#038;smoothing=0">tried a similar query</a> using the Google Books Ngram viewer. The crossover point seems a little later, but of course books take longer to publish than newspapers. There is, however, no peak in 1939 for &#8216;the Great War&#8217; &#8212; at least not if you use the combined &#8216;English&#8217; corpus. If you examine the British-English and American-English corpora separately it&#8217;s a rather different story. Querying the British-English corpus produces <a href="http://ngrams.googlelabs.com/graph?content=the+Great+War%2Cthe+First+World+War&#038;year_start=1900&#038;year_end=1954&#038;corpus=6&#038;smoothing=0">something much closer</a> to our Trove graph, complete with a spike around 1939. Again, this is only as we&#8217;d expect given the lesser significance of the First World War in American history. </p>
<p>This is, of course, only a sketch &#8212; something to prompt new questions or suggest avenues for attack. It&#8217;s made me want to find out a bit more about the nature of discussions in 1939, so I&#8217;ve fired up my <a href="http://wraggelabs.com/emporium/trove-tools/harvester/">Trove Newspaper Harvester</a> and downloaded the text of all 6,582 articles from 1939 that include the phrase &#8216;the Great War&#8217;. More about that soon&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/when-did-the-great-war-become-the-first-world-war/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Mining the treasures of Trove (part 2)</title>
		<link>http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2</link>
		<comments>http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2#comments</comments>
		<pubDate>Sun, 06 Mar 2011 13:44:02 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[Trove]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1174</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Mining+the+treasures+of+Trove+%28part+2%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-03-06&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2&amp;rft.language=English"></span>
One of the advantages of building something yourself is that if you&#8217;re not happy with it you can tweak, change, modify and adapt until you are. But one of the disadvantages is that sometimes you get so caught up in all the tweaking, changing and adapting that you overlook a much simpler solution. So I [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Mining+the+treasures+of+Trove+%28part+2%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-03-06&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1174"><!-- &nbsp; --></abbr>
<p>One of the advantages of building something yourself is that if you&#8217;re not happy with it you can tweak, change, modify and adapt until you are. But one of the disadvantages is that sometimes you get so caught up in all the tweaking, changing and adapting that you overlook a much simpler solution.</p>
<p>So <a href="http://discontents.com.au/shed/mining-the-treasures-of-trove-part-1">I had a harvester</a> that could save the publication details and content of all the newspaper articles in a search on <a href="http://trove.nla.gov.au/newspaper?q=">Trove</a>. But the warm glow of self-satisfaction quickly began to fade as I started to think about how I wanted to use the content I was harvesting.</p>
<p>The harvester saved the text of articles organised in directories by newspaper title. This seemed to make sense. It meant that you could easily analyse and compare the content of different newspapers. But what if you wanted to examine changes over time? In that case it&#8217;d be much easier if the articles were organised by year &#8212; then I could just pull out the folder for a particular year, feed it to <a href="http://voyeurtools.org/">VoyeurTools</a>, and start tracking the trends.</p>
<p>There ensued some minor tinkering. As a result, you can now pass an additional option to <a href="https://bitbucket.org/wragge/trove-tools/overview">the harvest script</a>, telling it whether to save the article texts and pdfs in directories by year or newspaper. Simply set the &#8216;zip-directory-structure&#8217; option in harvest.ini to either &#8216;title&#8217; or &#8216;year&#8217;. If you&#8217;re using the command-line you can use the &#8216;-d&#8217; flag to set your preference. Easy.</p>
<p>But that set me wondering whether it might be possible to generate an overview, showing the number of articles matching a search over time. So I started on a modification of my harvest script that did just that &#8212; cycling through the search results, adding up the numbers. It wasn&#8217;t until I ran the new script for the first time that I realised there was a much simpler alternative.</p>
<p>All I needed to do was repeat the search for each year in the search span and grab the total results value from the page. D&#8217;uh&#8230;</p>
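<p>To make the idea concrete, here&#8217;s a hedged Python 3 sketch of building one search url per year. It assumes the &#8216;fromyyyy&#8217; and &#8216;toyyyy&#8217; parameters that show up in Trove&#8217;s advanced search urls control the date range; the function names are made up for illustration and aren&#8217;t taken from the harvest script.</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def query_for_year(search_url, year):
    """Return search_url restricted to a single year.

    Assumes 'fromyyyy' and 'toyyyy' set the date range; any values
    already present in the url are replaced.
    """
    parts = urlsplit(search_url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("fromyyyy", "toyyyy")
    ]
    params += [("fromyyyy", str(year)), ("toyyyy", str(year))]
    return urlunsplit(parts._replace(query=urlencode(params)))


def yearly_queries(search_url, first_year, last_year):
    """One url per year, so one request per year."""
    return [
        query_for_year(search_url, year)
        for year in range(first_year, last_year + 1)
    ]
```

<p>A span of 1900 to 1954 would then need 55 requests, however many articles match.</p>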
<p>So instead of sending hundreds or perhaps thousands of requests to Trove, all I needed was one for each year. From there it was easy and soon I had my first graph.</p>
<div class="wp-caption aligncenter" style="width: 510px"><a href="http://www.flickr.com/photos/55336121@N00/5455553450/"><img title="Chinese in Australia - Trove graph" src="http://farm6.static.flickr.com/5180/5455553450_9fbd539d2f.jpg" alt="" width="500" height="330" /></a><p class="wp-caption-text">My first graph: Chinese in Australia (The Chinese Australian expert in my house predicted the 1888 peak.) </p></div>
<p>I was pretty pleased with that, but of course the raw numbers of articles on their own are rather misleading. The more interesting question was what proportion of the total number of articles for that year the search represents. Another quick tweak and I was grabbing the overall totals and calculating the proportions.</p>
<div class="wp-caption aligncenter" style="width: 510px"><a href="http://www.flickr.com/photos/55336121@N00/5455948202/"><img title="Trove graph - Chinese in Australia with proportions" src="http://farm6.static.flickr.com/5100/5455948202_5174a7a3de.jpg" alt="" width="500" height="336" /></a><p class="wp-caption-text">Total numbers versus proportions -- Chinese in Australia #2</p></div>
<p>At this point I invited my Twitter followers to suggest some possible topics &#8212; you can <a href="http://www.flickr.com/photos/55336121@N00/sets/72157626078999182/">see the results on Flickr</a>.</p>
<p>But what do the peaks and troughs represent? I wanted to use the graphs as a way of exploring the content itself. This was possible as I&#8217;d saved the data as JSON and used <a href="http://www.jqplot.com/">jqPlot</a> to create the graphs in an ordinary HTML page. Courtesy of some clever hooks in the backend of jqPlot I could capture the value of any point as it was clicked. That gave me the year, so all I had to do was combine this with the search keyword values and send off a request to <a href="http://trove.nla.gov.au/newspaper?q=">Trove</a>.</p>
<p>So now instead of just looking at the graphs, you could explore them.</p>
<div id="attachment_1185" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/shed/trove/graphs/chinese.html"><img class="size-medium wp-image-1185" title="chinese-graph" src="http://discontents.com.au/wp-content/uploads/2011/03/chinese-graph-300x218.png" alt="" width="300" height="218" /></a><p class="wp-caption-text">Explore -- Chinese in Australia #3</p></div>
<p>Perhaps you&#8217;re wondering how I managed to pull the Trove results into the page? Just a bit of simple AJAX magic combined with my own <a href="http://wraggelabs.appspot.com/api/newspapers/">unofficial Trove API</a>. (More about that in the next exciting installment!)</p>
<p>I&#8217;ve created a <a href="http://wraggelabs.com/shed/trove/graphs/">little gallery of graphs</a> to explore. I&#8217;m still open to suggestions!</p>
<p>The code for gathering the data is all on <a href="https://bitbucket.org/wragge/trove-tools/overview">Bitbucket</a>, so start building your own. Just run the &#8216;do_totals.py&#8217; script in the bin directory from the command line. The script takes two flags:</p>
<ul>
<li>-q (&#8211;query) the url of your Trove search (compulsory)</li>
<li>-f (&#8211;filename) the path and filename for your data file (don&#8217;t include an extension)</li>
</ul>
<p>The script will create a javascript file containing two JSON objects, &#8216;totals&#8217; and &#8216;ratios&#8217;. These can then be fed to jqPlot. View the source of one of my interactive graphs to see how.</p>
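<p>For anyone curious what such a file might contain, here&#8217;s a small Python 3 sketch that writes two variables to a .js file. The [year, value] pair layout and the &#8216;var&#8217; syntax are assumptions about what a jqPlot page could consume; this isn&#8217;t the real output format of do_totals.py.</p>

```python
import json


def write_graph_data(path, totals, ratios):
    """Write 'totals' and 'ratios' as JavaScript variables.

    path is given without an extension, mirroring the -f flag;
    '.js' is added here. The data layout is an assumption.
    """
    lines = [
        "var totals = %s;" % json.dumps(totals),
        "var ratios = %s;" % json.dumps(ratios),
    ]
    target = path + ".js"
    with open(target, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return target
```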
<p>Of course it would be really nice to create a web service where people could create, share, compare and combine their graphs &#8212; but that might have to await a generous benefactor&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Mining the treasures of Trove (part 1)</title>
		<link>http://discontents.com.au/shed/mining-the-treasures-of-trove-part-1</link>
		<comments>http://discontents.com.au/shed/mining-the-treasures-of-trove-part-1#comments</comments>
		<pubDate>Mon, 07 Feb 2011 15:07:10 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[the shed]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[screen scraping]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[Trove]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1088</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Mining+the+treasures+of+Trove+%28part+1%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2011-02-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/mining-the-treasures-of-trove-part-1&amp;rft.language=English"></span>
Some time ago a well-meaning optometrist told me I had the eyes of a 60 year-old. I lay the blame for this premature ocular degeneration upon the many tiring hours I spent squinting at the screens of dodgy microfilm readers. Newspapers were a major source of my PhD research, and back then that meant learning [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Mining+the+treasures+of+Trove+%28part+1%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2011-02-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/mining-the-treasures-of-trove-part-1&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1088"><!-- &nbsp; --></abbr>
<p>Some time ago a well-meaning optometrist told me I had the eyes of a 60-year-old. I lay the blame for this premature ocular degeneration upon the many tiring hours I spent squinting at the screens of dodgy microfilm readers. Newspapers were a major source of my <a href="http://discontents.com.au/shoebox/history-of-australian-science/atomic-wonderland">PhD research</a>, and back then that meant learning a little too much about films, spools and lenses. Not to mention the unending struggle to capture and hold the best machines.</p>
<p>Now it&#8217;s different. Instead of spending some weeks, as I did, sampling the <em>Australian Women&#8217;s Weekly</em> in the hope of finding relevant articles, I can go to the National Library&#8217;s <a href="http://trove.nla.gov.au/newspaper">Australian Newspapers</a> database in Trove and do a <a href="http://trove.nla.gov.au/newspaper/result?q=&amp;exactPhrase=atomic+age&amp;anyWords=&amp;notWords=&amp;l-textSearchScope=*ignore*|*ignore*&amp;fromdd=&amp;frommm=&amp;fromyyyy=&amp;todd=&amp;tomm=&amp;toyyyy=&amp;l-title=|112&amp;l-word=*ignore*|*ignore*&amp;sortby=">keyword search</a>. Easy. The eyesight of future historians is safe.</p>
<p>But ready access to millions of newspaper articles across 150 years brings new challenges. Used to coaxing evidence from a meager array of sources, historians now, as <a href="http://www.journalofamericanhistory.org/issues/952/interchange/index.html">Dan Cohen notes</a>,  have to &#8216;grapple with abundance&#8217;. How do we use and understand our new documentary riches?</p>
<p>Fortunately there is a growing array of tools to help. <a href="http://www.zotero.org/">Zotero</a> helps us manage our sources. <a href="http://voyeurtools.org/">Voyeur Tools</a> brings sophisticated text analysis techniques within the grasp of all. And where the tools we need do not exist, <a href="http://niche-canada.org/programming-historian">we can make them</a>.</p>
<p>So that&#8217;s what I did.</p>
<p>I&#8217;m interested in the way we <a href="http://discontents.com.au/sections/shoebox/weather-research-topics">talk about the weather</a>. Wouldn&#8217;t it be good, I was thinking a few weeks back, if I could harvest the content of newspaper articles about weather or climate and start to analyse it &#8212; looking for patterns and shifts, mapping correlations or divergences against the actual climatic record.</p>
<p>Well&#8230; why not?</p>
<p>There&#8217;s <a href="http://trove.nla.gov.au/forum/showthread.php?81-New-section-suggestion-Hacking-Trove">currently no API</a> that allows you to access Trove in this way, though one is apparently under development. But being the impatient sod that I am, I&#8217;d already built most of the parts I needed. Some time ago I <a href="http://discontents.com.au/shed/experiments/headline-roulette">wrote a screen scraper</a> to query the newspaper database and extract article details from the returned HTML. This scraper is used in the <a href="http://labs.nma.gov.au/wall/">History Wall</a> to display random newspaper articles. It&#8217;s also sitting behind my little <a href="http://wraggelabs.com/newsroulette/">Headline Roulette</a> game.</p>
<p>I&#8217;ve been continuing to improve and refine the scraper and recently used it to create my own API to Trove hosted on Google&#8217;s App Engine. I&#8217;ll be posting some more details about this soon. And yes, I also developed a <a href="http://trove.nla.gov.au/forum/showthread.php?76-Zotero-translator-for-Australian-Newspapers-beta-version-for-testing">Zotero translator</a> for the newspapers site, which I promise to finish off!</p>
<p>So to make a harvester, all I needed to do was run my scraper over all the results in a search and save them in some useful form. It took me less than an hour to develop a working prototype. Since then I&#8217;ve been adding a few bells and whistles&#8230;</p>
<p>The other night I harvested about 1400 articles that included the phrase &#8216;climate change&#8217;. The harvester had saved the text content of the articles in a zip file, so I uploaded it to <a href="http://voyeurtools.org/">Voyeur Tools</a>. Here&#8217;s a simple word cloud:</p>
<p style="text-align: center;"><a href="http://voyeurtools.org/tool/Cirrus/?corpus=trove-climage-change&amp;stopList=stop.en.taporware.txt"><img class="aligncenter size-medium wp-image-1092" title="climate-change-cloud" src="http://discontents.com.au/wp-content/uploads/2011/02/climate-change-cloud-300x143.png" alt="" width="300" height="143" /></a> (<a href="http://voyeurtools.org/?corpus=trove-climage-change&amp;stopList=stop.en.taporware.txt">corpus</a>)</p>
<p>Or what about the &#8216;atomic age&#8217; seen through major Australian newspapers in 1945-46?</p>
<p style="text-align: center;"><a href="http://voyeurtools.org/tool/Cirrus/?corpus=1297051644653.4125&#038;stopList=stop.en.taporware.txt"><img class="aligncenter size-medium wp-image-1131" title="atomic_age" src="http://discontents.com.au/wp-content/uploads/2011/02/atomic_age-300x147.png" alt="" width="300" height="147" /></a>(<a href="http://voyeurtools.org/tool/CorpusSummary/?corpus=1297051644653.4125">corpus</a>)</p>
<p>Hmmm, this is fun&#8230;</p>
<h2>It&#8217;s harvest time</h2>
<p>But all you really want to know is how to do it, right? So after that overlong introduction, here&#8217;s everything you need to know.</p>
<h4>What does the harvester do?</h4>
<p>You feed the harvester the url of a search you&#8217;ve constructed in Trove. The harvester then loops through all the results pages, extracting the article details. These details are saved in a CSV (comma separated values) file that you should be able to open as a spreadsheet or import into a database. The fields in the CSV file currently are:</p>
<ul>
<li>article id</li>
<li>article title</li>
<li>article url</li>
<li>newspaper</li>
<li>newspaper life dates and location</li>
<li>newspaper id</li>
<li>issue date</li>
<li>page reference</li>
<li>page url</li>
<li>number of user corrections to the OCR output</li>
<li>text of the article (including paragraph breaks)</li>
</ul>
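<p>The loop described above can be sketched in Python 3 along these lines. This is an illustration rather than the code in harvest.py: fetch_page is a hypothetical stand-in for the scraper, and the column names are invented labels for the fields just listed.</p>

```python
import csv

# Invented labels for the eleven fields listed above, in the same order.
FIELDS = [
    "article_id", "title", "url", "newspaper", "newspaper_details",
    "newspaper_id", "issue_date", "page", "page_url", "corrections", "text",
]


def harvest(fetch_page, csv_path, start=0):
    """Page through search results, appending one CSV row per article.

    fetch_page(start) is hypothetical: it should return a list of dicts
    keyed by FIELDS, or an empty list once the results run out.
    """
    saved = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        while True:
            articles = fetch_page(start)
            if not articles:
                break
            writer.writerows(articles)
            # Move on by however many results this page actually held.
            start += len(articles)
            saved += len(articles)
    return saved
```

<p>Opening the file in append mode means an interrupted harvest could, in principle, be picked up again by passing the result number it stopped at as start.</p>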
<p>In addition to the CSV file, the harvester can create two other data files for you. The first is a zip file that contains the text of all the articles, organised by newspaper and article. The internal structure of the zip is something like this:</p>
<pre>[newspaper1 id]-[newspaper1 title]/
     [article1 id]-[article 1 issue date]-p[article1 page reference].txt
     [article2 id]-[article 2 issue date]-p[article2 page reference].txt
[newspaper2 id]-[newspaper2 title]/
     [article3 id]-[article 3 issue date]-p[article3 page reference].txt
     [article4 id]-[article 4 issue date]-p[article4 page reference].txt</pre>
<p>For example:</p>
<pre>35-The-Sydney-Morning-Herald/
     29765619-Friday-24-May-1946-p5.txt
     29763575-Friday-10-May-1946-p2.txt</pre>
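<p>Because the names are so regular, they&#8217;re easy to pull apart again. Here&#8217;s a minimal Python 3 sketch that assumes every entry follows the pattern in the example exactly (a page reference containing a hyphen would break it). The function name and dictionary keys are my own inventions.</p>

```python
from datetime import datetime


def parse_entry(path):
    """Split a zip entry such as
    '35-The-Sydney-Morning-Herald/29765619-Friday-24-May-1946-p5.txt'
    into its parts. Assumes the entry follows that pattern exactly.
    """
    folder, filename = path.split("/")
    newspaper_id, newspaper = folder.split("-", 1)
    stem = filename.rsplit(".", 1)[0]
    article_id, weekday, day, month, year, page = stem.split("-")
    # The weekday is redundant, so only day, month and year are parsed.
    issue_date = datetime.strptime(" ".join([day, month, year]), "%d %B %Y")
    return {
        "newspaper_id": newspaper_id,
        "newspaper": newspaper.replace("-", " "),
        "article_id": article_id,
        "issue_date": issue_date.date(),
        "page": page.lstrip("p"),
    }
```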
<p>Why is this useful? Once you have the texts organised in this format you can start feeding them to text-analysis programs. <a href="http://voyeurtools.org">Voyeur Tools</a> makes it easy, but there are other options like <a href="http://mallet.cs.umass.edu/">Mallet</a> or <a href="http://www.nltk.org/">NLTK</a>. (Read about <a href="http://labs.nma.gov.au/blog/2010/12/word-frequencies/">my first attempts at using NLTK</a> over at NMA Labs.) As well as simple word frequencies and collocations you might want to investigate the possibilities of entity extraction, topic modelling or sentiment analysis.</p>
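<p>If you&#8217;d rather start without any extra tools, simple word frequencies need nothing beyond the Python 3 standard library. This is only a sketch: splitting on runs of letters is a deliberately crude assumption, and no stop words are removed.</p>

```python
import re
import zipfile
from collections import Counter


def word_counts(zip_source):
    """Count words across every .txt file in a harvested zip.

    zip_source is a path or a file-like object. Tokenising on runs of
    letters is crude; anything more careful belongs in a real toolkit.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts
```

<p>Calling most_common(20) on the result gives the twenty most frequent words, which will mostly be &#8216;the&#8217; and friends until you filter them.</p>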
<p>The harvester also gives you the option of downloading a PDF version of every article. These are also saved in a zip file for convenience, with the same structure as above. Of course, if you&#8217;re harvesting a large number of articles this zip file might get <em>very</em> big.</p>
<h4>Quick start (for the impatient)</h4>
<p>If you just want to dive straight in here&#8217;s what you need to do:</p>
<ol>
<li>If you don&#8217;t have it already, <a href="http://www.python.org/download/releases/2.7.1/">install Python</a> (v.2.7 is recommended, but other versions should work ok)</li>
<li>Download my <a href="https://github.com/wragge/Trove-newspapers">Trove Newspapers code</a></li>
<li>Unzip the TroveNewspapers file and put the contents somewhere handy.</li>
<li>Navigate to the top-level trovenewspapers directory.</li>
<li>Open the file &#8216;harvest.ini&#8217; with a text editor (Notepad will do).</li>
<li>Change the default settings of &#8216;harvest.ini&#8217; as instructed.</li>
<li>Save &#8216;harvest.ini&#8217;.</li>
<li>Find and run &#8216;do_harvest.py&#8217;.</li>
<li>Sit back and watch as your harvest chugs away.</li>
</ol>
<h4>The boring details (for the cautious)</h4>
<h5>Install Python</h5>
<p>My scraper and harvester are written in Python, so you&#8217;ll need to have it installed on your system. If you&#8217;re working in Linux you should already have it. <a href="http://www.python.org/download/releases/2.7.1/">Downloads are available</a> for all other platforms. For example, if you&#8217;re in Windows just download and run the <a href="http://www.python.org/ftp/python/2.7.1/python-2.7.1.msi">Windows x86 MSI Installer</a>.</p>
<h5>Download my code</h5>
<p>Now download <a href="https://github.com/wragge/Trove-newspapers">my Trove Newspapers code</a>. It&#8217;s available as a zip file, so unzip it and put the contents somewhere you can find them again. The contents look something like this:</p>
<pre>trovenewspapers/
     data/
     harvests/
     __init__.py
     harvest.py
     retrieve.py
     utilities.py
     LICENSE.txt</pre>
<p><del datetime="2012-01-24T06:17:19+00:00">We&#8217;ll talk about the bin directory shortly.</del> The data directory contains information about the newspaper holdings available through Trove. This is used by the scraper. If you ever want to update this data, you can use the save_titles function in utilities.py, but it&#8217;s not important for the harvester.</p>
<p>The harvests directory is empty, but if you start a harvest without specifying an output location, this is where it&#8217;ll end up.</p>
<p><a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> is an extremely useful Python library for screen scraping. The scraper relies heavily on it, so I&#8217;ve included a copy in the package for your convenience.</p>
<p>The two files that do all the work are harvest.py and retrieve.py. Open them up in an editor and have a look if you&#8217;re interested. The scraper logic is all in retrieve.py, while harvest.py builds and runs the harvester.</p>
<p>But if all you want to do is start harvesting you can ignore all this and head straight to the <del datetime="2012-01-24T06:17:19+00:00">bin</del> top-level directory. Here you&#8217;ll find two files, do_harvest.py and harvest.ini. You&#8217;ll also find a README file which contains another version of these instructions and some added documentation.</p>
<h5>Set your harvest options</h5>
<p>Open up harvest.ini in any old text editor. You&#8217;ll see it contains some instructions and a series of configuration options with default values. If you run a harvest without changing the defaults you&#8217;ll generate a fascinating set of 28 articles that contain the phrase &#8216;Inclement Wragge&#8217;.</p>
<p>The options you can set are:</p>
<ol>
<li>query &#8212; the url of your Trove search</li>
<li>filename &#8212; where you want to save the CSV file</li>
<li>include-text &#8212; do you want to save the texts in a zip file (yes or no)?</li>
<li>include-pdf &#8212; do you want to save pdfs of the articles in a zip file (yes or no)?</li>
<li>start &#8212; the result number to start at (leave at 0 for a new harvest)</li>
</ol>
<p>At this point you need to think &#8212; what do I actually want to harvest? Head over to the <a href="http://trove.nla.gov.au/ndp/del/search?adv=y">advanced search page</a> for the newspapers database and start playing with the options until you get the results you want. Try to be as precise as possible &#8212; you don&#8217;t want to download lots of irrelevant articles.</p>
<p>Once you&#8217;re happy with your search, just copy the url in your browser&#8217;s location box. This url contains all the search parameters the harvester needs to find and process your results. Just paste the complete url into harvest.ini next to the &#8216;query&#8217; option.</p>
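<p>If you&#8217;re curious about what is actually tucked away in that url, here&#8217;s a minimal sketch that pulls the search parameters out using Python&#8217;s standard library. It&#8217;s written for Python 3 (<code>urllib.parse</code>), and <code>search_params</code> is a name made up for this illustration, not a function from the harvester.</p>

```python
from urllib.parse import urlparse, parse_qs


def search_params(query_url):
    """Return the search parameters in a Trove search url as a dict of lists."""
    # Everything after the '?' carries the search settings.
    query_string = urlparse(query_url).query
    # parse_qs decodes '+' as a space and groups repeated keys into lists.
    return parse_qs(query_string)


if __name__ == "__main__":
    url = "http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge"
    print(search_params(url))
```

<p>Each value comes back as a list because a search url can repeat the same parameter more than once.</p>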
<p>Set the filename option to tell the harvester where to save your CSV file. The harvester will use the filename you supply to build the filenames for the zip files (if you want them). If you don&#8217;t include a path the files will be saved in the bin directory. If you don&#8217;t set a filename, the harvester will create a default name &#8212; trove-newspapers-[timestamp].csv.</p>
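<p>As a rough sketch of that naming logic (not the harvester&#8217;s actual code), something like the following would do the job. The function name <code>output_names</code>, the use of whole seconds for the timestamp, and the <code>-text.zip</code> and <code>-pdf.zip</code> suffixes are all assumptions made for the example.</p>

```python
import os
import time


def output_names(csv_filename=None, now=None):
    """Work out the CSV name and the matching zip names for a harvest."""
    if not csv_filename:
        # Default described above: trove-newspapers-[timestamp].csv
        # (whole seconds since the epoch is an assumption).
        stamp = int(time.time() if now is None else now)
        csv_filename = "trove-newspapers-{}.csv".format(stamp)
    # Reuse the CSV name, minus its extension, as the base for the zip files.
    base, _ext = os.path.splitext(csv_filename)
    return {
        "csv": csv_filename,
        "text_zip": base + "-text.zip",  # suffix is an assumption
        "pdf_zip": base + "-pdf.zip",    # suffix is an assumption
    }
```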
<p>The &#8216;include-text&#8217; and &#8216;include-pdf&#8217; options should be pretty obvious. Set them to &#8216;yes&#8217; if you want to save texts and pdfs, or &#8216;no&#8217; if you don&#8217;t.</p>
<p>The &#8216;start&#8217; option allows you to start your harvest at someplace other than the beginning of your results set. This is useful if your harvest is interrupted for any reason (more below). Just set it to the result number you want to start at.</p>
<p>Once your options are set, just save harvest.ini. It&#8217;s launch time!</p>
<h5>Start your harvest</h5>
<p>Remember the do_harvest.py file? It contains a little script that reads your configuration settings in harvest.ini and sends them off to the harvester. So to get things going all you need to do is run do_harvest.py.</p>
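<p>For the curious, reading settings like these takes only a few lines. This is a sketch rather than a copy of do_harvest.py: it uses Python 3&#8217;s <code>configparser</code> module, the <code>[harvest]</code> section name is an assumption, and <code>read_settings</code> is a hypothetical helper. Only the option names come from the list above.</p>

```python
import configparser

# Stand-in for the contents of harvest.ini; the section name is assumed.
SAMPLE_INI = """
[harvest]
query = http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge
filename =
include-text = yes
include-pdf = no
start = 0
"""


def read_settings(ini_text, section="harvest"):
    """Turn the text of a harvest.ini style file into a settings dict."""
    # interpolation=None stops '%' characters in an encoded url being
    # treated as configparser placeholders.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    options = parser[section]
    return {
        "query": options.get("query", ""),
        # An empty filename means 'use the default name'.
        "filename": options.get("filename", "") or None,
        "text": options.getboolean("include-text", fallback=False),
        "pdf": options.getboolean("include-pdf", fallback=False),
        "start": options.getint("start", fallback=0),
    }


if __name__ == "__main__":
    print(read_settings(SAMPLE_INI))
```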
<p>How you actually do this depends a bit on your operating system and its settings. If you&#8217;re on Windows, then the Python installer should have told the OS to treat any file with a .py extension as a Python script. So you should just be able to double click it.</p>
<p>On Linux, the easiest way is to open up a terminal, cd to the directory containing do_harvest.py and type &#8216;python do_harvest.py&#8217;.</p>
<p>That should be it. The script will let you know what&#8217;s going on, listing the articles as it processes them. Enjoy!</p>
<h5>For lovers of the command line</h5>
<p>The do_harvest script can also be run from the command line, with the various options supplied as arguments:</p>
<ul>
<li>-q (or --query) [full url of Trove newspapers search].</li>
<li>-f (or --filename) [file and path name for the CSV output].</li>
<li>-t (or --text) Create a zip file containing the text of articles.</li>
<li>-p (or --pdf) Create a zip file containing pdfs of articles.</li>
<li>-s (or --start) The result number to start at.</li>
</ul>
<p>For example:<br />
<code><br />
python do_harvest.py -q "http://trove.nla.gov.au/newspaper/result?exactPhrase=inclement+wragge" -f /home/wragge/trove-output.csv -t -p<br />
</code></p>
<p>Command line arguments will override any of the settings in harvest.ini.</p>
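<p>Here&#8217;s one way that override behaviour could be put together with <code>argparse</code>. Treat it as a sketch under assumptions: the flags are the ones listed above, but <code>parse_args</code>, <code>merge_settings</code> and the shape of the settings dictionary are invented for the example and may not match what do_harvest.py does internally.</p>

```python
import argparse


def parse_args(argv=None):
    """Parse the command line flags listed above."""
    parser = argparse.ArgumentParser(description="Harvest Trove newspaper articles")
    parser.add_argument("-q", "--query", help="full url of Trove newspapers search")
    parser.add_argument("-f", "--filename", help="file and path name for the CSV output")
    # default=None (rather than False) lets us tell 'not given' from 'given'.
    parser.add_argument("-t", "--text", action="store_true", default=None)
    parser.add_argument("-p", "--pdf", action="store_true", default=None)
    parser.add_argument("-s", "--start", type=int, help="result number to start at")
    return parser.parse_args(argv)


def merge_settings(ini_settings, args):
    """Command line values win; anything not supplied falls back to the ini file."""
    merged = dict(ini_settings)
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value
    return merged
```

<p>Leaving the defaults as <code>None</code> is the design choice that matters here: it means a flag you didn&#8217;t type can never clobber a value you set in harvest.ini.</p>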
<p>If you&#8217;re using Windows you&#8217;ll need to make sure that the location of your Python<br />
installation is included in your Windows path variable.</p>
<h5>If something goes wrong</h5>
<p>If there are problems at the Trove end, the harvester will take a little 10-second nap before trying again. It&#8217;ll do this 10 times before it finally gives up. Just before it dies, the script will write some details out to an error file ([your filename]_error.txt), including some instructions on what to do next.</p>
<p>This error file will include the number of the last completed record. Simply insert this as the &#8216;start&#8217; value in harvest.ini (or include it on the command line with the -s flag) and run do_harvest.py again. The harvester will spring back into life.</p>
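<p>The nap-and-retry pattern is simple enough to sketch. What follows is an illustration, not the harvester&#8217;s own code: <code>fetch_with_retries</code>, <code>write_error_file</code> and <code>HarvestFailed</code> are hypothetical names, I&#8217;ve assumed that 10 attempts in total is what &#8216;10 times&#8217; means, and I&#8217;ve assumed the error file name drops the .csv extension before adding _error.txt.</p>

```python
import os
import time


class HarvestFailed(Exception):
    """Raised when every retry has been used up."""


def fetch_with_retries(fetch, tries=10, nap=10, sleep=time.sleep):
    """Call fetch() until it succeeds, napping between failed attempts."""
    for attempt in range(1, tries + 1):
        try:
            return fetch()
        except IOError:
            if attempt == tries:
                raise HarvestFailed("gave up after {} attempts".format(tries))
            # Give the server a rest before the next attempt.
            sleep(nap)


def write_error_file(csv_filename, last_completed):
    """Record where the harvest stopped so it can be restarted with 'start'."""
    base, _ext = os.path.splitext(csv_filename)
    error_name = base + "_error.txt"
    with open(error_name, "w") as handle:
        handle.write("Last completed record: {}\n".format(last_completed))
    return error_name
```

<p>Passing <code>sleep</code> in as an argument is just a convenience so the retry logic can be tried out without actually waiting 10 seconds each time.</p>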
<p>You&#8217;ll probably have a duplicate row in your CSV file at the point where the harvest failed, but that&#8217;s easy to delete.</p>
<h3>What&#8217;s next?</h3>
<p>Please have a go and let me know how you fare. You can add comments here, or raise issues over at <a href="https://bitbucket.org/wragge/trove-tools">my Bitbucket repository</a>.</p>
<p>I&#8217;m thinking about building a little GUI version if there&#8217;s enough interest, and I have a few other improvements in mind.</p>
<p>I&#8217;ll be posting more about my adventures hacking Trove, and also about my efforts to analyse the results of my harvests (hence the part 1).</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/mining-the-treasures-of-trove-part-1/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>THATCamp is coming to Australia</title>
		<link>http://discontents.com.au/shed/events/thatcamp-is-coming-to-australia</link>
		<comments>http://discontents.com.au/shed/events/thatcamp-is-coming-to-australia#comments</comments>
		<pubDate>Wed, 21 Jul 2010 23:21:33 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[events]]></category>
		<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[thatcamp]]></category>
		<category><![CDATA[unconference]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=960</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=THATCamp+is+coming+to+Australia&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=events&amp;rft.source=discontents&amp;rft.date=2010-07-22&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/events/thatcamp-is-coming-to-australia&amp;rft.language=English"></span>
One of the things that&#8217;s keeping me busy at the moment is THATCamp Canberra. Yes, I got sick of missing out on all the THATCamp fun happening elsewhere and decided we should have our own. THATCamp Canberra is a user-generated unconference on the digital humanities. It&#8217;ll be held at the University of Canberra on 28–29 [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=THATCamp+is+coming+to+Australia&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=events&amp;rft.source=discontents&amp;rft.date=2010-07-22&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/events/thatcamp-is-coming-to-australia&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=960"><!-- &nbsp; --></abbr>
<p>One of the things that&#8217;s keeping me busy at the moment is <a href="http://thatcampcanberra.org/">THATCamp Canberra</a>. Yes, I got sick of missing out on all the <a href="http://thatcamp.org/">THATCamp fun</a> happening elsewhere and decided we should have our own.</p>
<p><a href="http://thatcampcanberra.org"><img class="aligncenter size-medium wp-image-963" title="thatcamp_cbr_logo" src="http://discontents.com.au/wp-content/uploads/2010/07/thatcamp_cbr_logo-300x250.jpg" alt="" width="300" height="250" /></a></p>
<p>THATCamp Canberra is a user-generated unconference on the digital humanities. It&#8217;ll be held at the University of Canberra on 28–29 August. We&#8217;re getting a great mix of applications and I&#8217;m really looking forward to learning about what&#8217;s going on around Australia.</p>
<p>Applications close on 23 July, so get yours in soon!</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/events/thatcamp-is-coming-to-australia/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Embedded archives</title>
		<link>http://discontents.com.au/shed/hacks/embedded-archives</link>
		<comments>http://discontents.com.au/shed/hacks/embedded-archives#comments</comments>
		<pubDate>Sun, 27 Jun 2010 12:00:17 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[hacks]]></category>
		<category><![CDATA[archives]]></category>
		<category><![CDATA[Cooliris]]></category>
		<category><![CDATA[recordsearch]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=932</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Embedded+archives&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2010-06-27&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/embedded-archives&amp;rft.language=English"></span>
Some of you may have noticed that my Hacking a research project post featured a file from the National Archives of Australia embedded as a Cooliris widget. Huh? To jog your memory, here it is again: No, it&#8217;s not just an image, it&#8217;s a little 3D wall. You can pan and zoom to your heart&#8217;s [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Embedded+archives&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2010-06-27&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/embedded-archives&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=932"><!-- &nbsp; --></abbr>
<p>Some of you may have noticed that my <a href="http://discontents.com.au/shed/experiments/hacking-a-research-project">Hacking a research project</a> post featured a file from the <a href="http://naa.gov.au/">National Archives of Australia</a> embedded as a <a href="http://cooliris.com/">Cooliris</a> widget. Huh? To jog your memory, here it is again:</p>
<div class="wp-caption aligncenter" style="width: 470px">
<img style="visibility:hidden;width:0px;height:0px;" border=0 width=0 height=0 src="http://counters.gigya.com/wildfire/IMP/CXNID=2000002.11NXC/bT*xJmx*PTEyNzY3NzEwMDA5MjQmcHQ9MTI3Njc3MTAwNTYyOSZwPTkwMjA1MSZkPSZnPTEmb2Y9MA==.gif" /><object id="ci_10145_o" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" width="460" height="300"><param name="movie" value="http://apps.cooliris.com/embed/cooliris.swf"/><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><param name="bgColor" value="#121212" /><param name="flashvars" value="feed=http%3A%2F%2Fwraggelabs.com%2Frecordsearch%2Frss%2F7473965%2F%3Fpages%3D70%26ref%3DST84%2F1%2C%25201906%2F221-230&numrows=2" /><param name="wmode" value="opaque" /><embed id="ci_10145_e" type="application/x-shockwave-flash" src="http://apps.cooliris.com/embed/cooliris.swf" width="460" height="300" allowFullScreen="true" allowScriptAccess="always" bgColor="#121212" flashvars="feed=http%3A%2F%2Fwraggelabs.com%2Frecordsearch%2Frss%2F7473965%2F%3Fpages%3D70%26ref%3DST84%2F1%2C%25201906%2F221-230&numrows=2" wmode="opaque"></embed></object>
<p class="wp-caption-text">These certificates allowed non-white Australians travelling overseas to re-enter the country. NAA: ST84/1, 1906/21-30</p></div>
<p>No, it&#8217;s not just an image, it&#8217;s a little 3D wall. You can pan and zoom to your heart&#8217;s content. You can enlarge an image, view fullscreen &#8212; you can even share an image via Twitter. Fun for all the family!</p>
<p>Regular viewers will recall my previous encounters with CoolIris &#8212; <a href="http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d">Archives in 3D</a> and <a href="http://discontents.com.au/shed/hacks/cooliris-enabled-scrapbook">CoolIris enabled scrapbook</a> &#8212; but these relied on having the CoolIris plugin installed. The embeddable Flash version wouldn&#8217;t work when the images were coming from the NAA because it upset Flash&#8217;s cross-domain settings.</p>
<p>So how did I get it to work? For various other projects I&#8217;ve been playing with simple image proxies using Python and Django, so I just applied the same principles. The image proxy makes it seem as if the images are coming from a local source, thus keeping Flash happy. Hurrah!</p>
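<p>To give a feel for the principle, here&#8217;s a bare-bones image proxy. Note that this is not my Django code: it&#8217;s a standard-library WSGI sketch swapped in so the example stands on its own, and the remote url pattern in <code>REMOTE_TEMPLATE</code> is a placeholder, not the real location of the NAA&#8217;s images.</p>

```python
from urllib.request import urlopen

# Hypothetical remote location for page images; the real url pattern
# is not given in this post.
REMOTE_TEMPLATE = "http://images.example.org/{barcode}/{page}.jpg"


def fetch_remote(url):
    """Download the image bytes from the remote server."""
    with urlopen(url) as response:
        return response.read()


def make_proxy(fetch=fetch_remote):
    """Build a WSGI app that re-serves remote images from the local domain."""
    def app(environ, start_response):
        # Expect paths of the form /[barcode]/[page]
        parts = environ.get("PATH_INFO", "").strip("/").split("/")
        if len(parts) != 2 or not all(part.isdigit() for part in parts):
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not found"]
        image = fetch(REMOTE_TEMPLATE.format(barcode=parts[0], page=parts[1]))
        start_response("200 OK", [
            ("Content-Type", "image/jpeg"),
            ("Content-Length", str(len(image))),
        ])
        return [image]
    return app
```

<p>Because the browser (or Flash) only ever talks to the proxy&#8217;s own domain, the cross-domain check never sees the remote server. To try it locally you could hand <code>make_proxy()</code> to <code>wsgiref.simple_server.make_server</code>.</p>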
<p>I&#8217;ve added a few little tweaks, so you can now view any digitised file in the National Archives of Australia in a CoolIris wall. Just go to the <a href="http://wraggelabs.com/recordsearch/wall/">file browser page</a> and enter a barcode. Even better, you can install a bookmarklet. Just drag this link to your bookmarks bar (or save as a favourite): <a href="javascript:(function(){window.location='http://wraggelabs.com/recordsearch/wall/'+document.evaluate('//td[b=&quot;Barcode&quot;]',document,null,XPathResult.FIRST_ORDERED_NODE_TYPE,null).singleNodeValue.lastChild.textContent})();">View on wall</a>. Then go to an item page in <a href="http://naa.gov.au/collection/recordsearch/index.aspx">RecordSearch</a> and click on the bookmarklet for 3D magic.</p>
<p>If you want to share a link to a file displayed in the 3D file browser, just use a url of the form:</p>
<p><code>http://wraggelabs.com/recordsearch/wall/[barcode]</code></p>
<p> &#8212; where [barcode] is fairly obviously the barcode of the file you want to view. For example:</p>
<ul>
<li><a href="http://wraggelabs.com/recordsearch/wall/3445411/">http://wraggelabs.com/recordsearch/wall/3445411/</a></li>
</ul>
<p>If you want to embed one of the mini-walls in your blog post it&#8217;s easy. Just go to the <a href="http://www.cooliris.com/yoursite/express/">CoolIris Express</a> site and create your own wall. When it asks you for content source, click on &#8216;Media RSS&#8217; and then in the &#8216;Feed URL&#8217; box put:</p>
<p><code>http://wraggelabs.com/recordsearch/rss/[barcode]</code></p>
<p>&#8211; where [barcode] is&#8230; well, you know&#8230;</p>
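<p>If you have a whole list of barcodes to link to, a couple of tiny helpers will save some typing. The two url patterns are the ones given above; the function names <code>wall_url</code> and <code>feed_url</code> are just made up for this sketch.</p>

```python
BASE = "http://wraggelabs.com/recordsearch"


def wall_url(barcode):
    """Link to a digitised file displayed in the 3D file browser."""
    return "{}/wall/{}/".format(BASE, barcode)


def feed_url(barcode):
    """Media RSS feed to paste into the CoolIris 'Feed URL' box."""
    return "{}/rss/{}".format(BASE, barcode)


if __name__ == "__main__":
    for barcode in ["3445411", "7473965"]:
        print(wall_url(barcode))
        print(feed_url(barcode))
```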
<p>I think this is a pretty interesting way to view, browse and navigate digitised files. Using Flash rather than a browser plugin makes it more accessible, but I&#8217;d still rather have something based on open software and standards. I think it won&#8217;t be too long before we see something similar using Canvas and Javascript. That&#8217;ll be really exciting.</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/hacks/embedded-archives/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hacking a research project</title>
		<link>http://discontents.com.au/shed/experiments/hacking-a-research-project</link>
		<comments>http://discontents.com.au/shed/experiments/hacking-a-research-project#comments</comments>
		<pubDate>Thu, 17 Jun 2010 13:49:22 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[archives]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[invisibleaustralians]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[White Australia]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=878</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Hacking+a+research+project&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2010-06-17&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/hacking-a-research-project&amp;rft.language=English"></span>
Amongst the holdings of the National Archives of Australia are some of the most visually arresting documents you&#8217;ll see &#8212; thousands and thousands of forms from the early decades of the twentieth century, each with a portrait photograph and palm print, each documenting the movements of a non-white resident. Along with many other certificates, regulations, [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Hacking+a+research+project&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2010-06-17&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/hacking-a-research-project&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=878"><!-- &nbsp; --></abbr>
<p>Amongst the holdings of the National Archives of Australia are some of the most visually arresting documents you&#8217;ll see &#8212; thousands and thousands of forms from the early decades of the twentieth century, each with a portrait photograph and palm print, each documenting the movements of a non-white resident. Along with many other certificates, regulations, correspondence and case files, these forms are part of the massive bureaucratic legacy of the White Australia Policy.</p>
<div class="wp-caption aligncenter" style="width: 470px">
<img style="visibility:hidden;width:0px;height:0px;" border=0 width=0 height=0 src="http://counters.gigya.com/wildfire/IMP/CXNID=2000002.11NXC/bT*xJmx*PTEyNzY3NzEwMDA5MjQmcHQ9MTI3Njc3MTAwNTYyOSZwPTkwMjA1MSZkPSZnPTEmb2Y9MA==.gif" /><object id="ci_10145_o" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" width="460" height="300"><param name="movie" value="http://apps.cooliris.com/embed/cooliris.swf"/><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><param name="bgColor" value="#121212" /><param name="flashvars" value="feed=http%3A%2F%2Fwraggelabs.com%2Frecordsearch%2Frss%2F7473965%2F%3Fpages%3D70%26ref%3DST84%2F1%2C%25201906%2F221-230&numrows=2" /><param name="wmode" value="opaque" /><embed id="ci_10145_e" type="application/x-shockwave-flash" src="http://apps.cooliris.com/embed/cooliris.swf" width="460" height="300" allowFullScreen="true" allowScriptAccess="always" bgColor="#121212" flashvars="feed=http%3A%2F%2Fwraggelabs.com%2Frecordsearch%2Frss%2F7473965%2F%3Fpages%3D70%26ref%3DST84%2F1%2C%25201906%2F221-230&numrows=2" wmode="opaque"></embed></object>
<p class="wp-caption-text">These certificates allowed non-white Australians travelling overseas to re-enter the country. NAA: ST84/1, 1906/21-30</p></div>
<p>But these are more than just interesting looking pieces of paper, they are snapshots of people&#8217;s lives. The forms capture data about an individual&#8217;s place of birth, physical characteristics and more. Over time a person might have submitted several of these forms, so by bringing them together we could trace their history, we could map their journeys &#8212; we could even watch them age.</p>
<p>The system which sought to render non-whites invisible has captured and preserved the outlines of their lives. By extracting and linking this data we could build a picture of another Australia, an Australia in which non-white residents lived, loved, struggled and succeeded, despite the impositions of a repressive regime.</p>
<p>I talked about these records at the <a href="http://theaahc.org/conferences/2009conference/">AAHC conference</a> last year, inspired in part by Tim Hitchcock&#8217;s chapter in the <em>Virtual Representation of the Past</em>. Tim Hitchcock argues that technology can allow us to restructure archives, looking beyond institutional hierarchies to the lives of individuals contained within:</p>
<blockquote><p>What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?
</p></blockquote>
<p>I don&#8217;t know, but I&#8217;d like to find out.</p>
<p>During my AAHC talk, Dave Lester suggested that the extraction of data from these forms might make a good crowdsourcing project. It&#8217;s a great idea. As you can see, the data is generally well-structured and legible, so it should be possible to construct a simple series of forms that would allow volunteers to transcribe the data. The next stage would be to try and match identities across forms. That&#8217;s more complicated, but projects such as Tim Hitchcock&#8217;s <a href="http://www.londonlives.org/">London Lives</a> show how users can construct identities by connecting a range of historical documents.</p>
<p>Then there are connections to resources outside of the archives &#8212; photographs, local histories, newspapers, genealogies, cemetery registers and more. By keeping our system open and extensible, and by working with others to help them expose their information in standard ways, it should be possible to develop the framework for an evolving mesh of biographical data.</p>
<p>So, how do we get started? This is the point when you usually have to start thinking about money &#8212; how can I fund this? In Australia that generally means a journey into the arcane world of the Australian Research Council. The ARC suffers from all the problems of a peer-reviewed system, but added to this is a rather antiquated notion of what research is.</p>
<p>In the rules covering each of the main schemes it&#8217;s clearly stated that the &#8216;compilation of data&#8217; and the &#8216;development of research aids or tools&#8217; are not supported. I spend part of my life working for the <a href="http://ands.org.au/">Australian National Data Service</a>, an organisation that seeks to highlight how the sharing and reuse of data can open up new research possibilities. The ARC, however, seems to think that data has little value beyond its original research context.</p>
<p>Of course you can still mount a case for such activities. Applicants for a &#8216;Discovery&#8217; grant can argue that data creation is integral to their project and provide details of the &#8216;specific research questions to be addressed&#8217;. But what if you don&#8217;t yet know what the questions are? Part of the point of a project such as this is to try and find out what questions <em>we are able</em> to ask. Until we start to compile, link and explore the data, the &#8216;specific research questions&#8217; will be little more than convenient fictions, dreamt up to satisfy the prodding of peer reviewers.</p>
<p>Tom Scheinfeldt wrote a <a href="http://www.foundhistory.org/2010/05/12/wheres-the-beef-does-digital-humanities-have-to-answer-questions/">fantastic blog post</a> recently, responding to concerns about the failure of many digital humanities projects to make arguments or answer questions. Drawing examples from the history of science, Tom argues:</p>
<blockquote><p>we need to make room for both kinds of digital humanities, the kind that seeks to make arguments and answer questions now and the kind that builds tools and resources with questions in mind, but only in the back of its mind and only for later. We need time to experiment and even&#8230; time to play.</p></blockquote>
<p>The ARC does not fund play.</p>
<p>You might imagine that the ARC&#8217;s infrastructure funding scheme would offer more hope for a project such as this. And yes, there are many worthy projects involving databases and online tools that have been supported in this way (and I have benefited from some of them!). But it seems that in the minds of research funders infrastructure is always BIG. Grants start at $150,000, and applications are expected to involve multiple institutional partners. Projects have to be scaled up to fit the ARC&#8217;s definition of infrastructure, often resulting in complex, lumbering, long-term projects whose products are out of date by the time of their release.</p>
<p>There is no room in our current infrastructure models for agile, innovative, user-focused digital toolmakers seeking small amounts to experiment with apps, prototypes, datasets or visualisations. I often look with envy upon the US National Endowment for the Humanities <a href="http://www.neh.gov/grants/guidelines/digitalhumanitiesstartup.html">Digital Humanities Start-Up Grants</a>.</p>
<p>In any case, neither I nor my partner in this endeavour, Kate Bagnall (<a href="http://twitter.com/baibi">@baibi</a>), are currently in academic positions, so our chances of gaining any sort of research funding are next to none. We have the expertise &#8212; Kate has spent many years researching Australian-Chinese families and knows the records back-to-front, while I just can&#8217;t help playing with biographical data &#8212; but is that enough? How can you mount an ongoing research project without institutional support, research funding and the various badges and signifiers of academic authority?</p>
<p>I don&#8217;t know that either, but I have some ideas.</p>
<div id="attachment_918" class="wp-caption aligncenter" style="width: 222px"><a href="http://discontents.com.au/wp-content/uploads/2010/06/cedt.jpeg"><img src="http://discontents.com.au/wp-content/uploads/2010/06/cedt_photo-212x300.jpg" alt="Ah Yin Pak Chong" title="cedt_photo" width="212" height="300" class="size-medium wp-image-918" /></a><p class="wp-caption-text">Mrs Ah Yin Pak Chong. NAA: ST84/1, 1907/321-330</p></div>
<p>I didn&#8217;t manage to get a contribution together for Dan Cohen and Tom Scheinfeldt&#8217;s crowdsourced-in-a-week book, <a href="http://hackingtheacademy.org/">Hacking the Academy</a>, but watching the process from afar I did begin to wonder about how we might hack the way we build and run major research projects. This is what I have in mind:</p>
<ul>
<li>To strip down the large, lumbering beasts and design projects that are modular and opportunistic &#8212; able to grow quickly when resources allow, to bolt on related projects, to absorb existing tools.</li>
<li>To follow the data freely across technological and institutional boundaries, developing open networks that invite participation and use.</li>
<li>To develop a floating pool of collaborators, both inside and outside of academia, who are able to come and go, contributing whatever and whenever they can.</li>
<li>To make everything public, accessible and standards-compliant, so that even if the project stalls it could be picked up and developed by someone else.</li>
</ul>
<p>Most of all I just want to be able to do it. I don&#8217;t want to second-guess the ARC. I don&#8217;t want to spend months negotiating with potential partners or begging for an institutional home. I want to build, experiment and play. I want to make a start.</p>
<p>So that&#8217;s what we&#8217;re going to do.</p>
<p>We have a topic, plenty of raw materials, some basic principles and the beginnings of a plan. We even have a name &#8212; <em>Invisible Australians: Living under the White Australia Policy</em>. </p>
<p>As the project develops, I&#8217;ll be blogging here about some of the technical stuff, while Kate will be exploring the content over at <a href="http://chineseaustralia.org/">the tiger&#8217;s mouth</a>. I hope to have a prototype of the transcription tool ready to demo at <a href="http://thatcampcanberra.org/">THATCamp Canberra</a>, while Kate is already at work putting together guides on using the records and developing an <a href="http://omeka.org">Omeka</a> site that follows a number of Chinese-Australian families through the archives.</p>
<p>Can we hack together a major research project? Let&#8217;s find out. </p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/hacking-a-research-project/feed</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>(a not so) Quick catch up</title>
		<link>http://discontents.com.au/shed/a-not-so-quick-catch-up</link>
		<comments>http://discontents.com.au/shed/a-not-so-quick-catch-up#comments</comments>
		<pubDate>Fri, 07 May 2010 15:37:13 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[the shed]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[games]]></category>
		<category><![CDATA[greasemonkey]]></category>
		<category><![CDATA[identities]]></category>
		<category><![CDATA[machine tags]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[People Australia]]></category>
		<category><![CDATA[semantic web]]></category>
		<category><![CDATA[userscripts]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=843</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=%28a+not+so%29+Quick+catch+up&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2010-05-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/a-not-so-quick-catch-up&amp;rft.language=English"></span>
The trained guinea pigs in the Wragge Labs bunker have been churning out all sorts of stuff in the last few months, and I&#8217;m way behind in my attempts to document their activities. So this is a bit of a catch-up post to try and commit a few pertinent details to the collective memory bank [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=%28a+not+so%29+Quick+catch+up&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2010-05-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/a-not-so-quick-catch-up&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=843"><!-- &nbsp; --></abbr>
<p>The trained guinea pigs in the Wragge Labs bunker have been churning out all sorts of stuff in the last few months, and I&#8217;m way behind in my attempts to document their activities. So this is a bit of a catch-up post to try and commit a few pertinent details to the collective memory bank before they are lost forever in the sleep-deprived fog of day-to-day existence.</p>
<h3>Identity upgrades</h3>
<p>There have been a number of major improvements to <a href="http://wraggelabs.com/identities/">Wragge&#8217;s Identity Browser</a>. Regular viewers will recall that the Identity Browser is built on top of the <a href="http://www.nla.gov.au/apps/srw/search/peopleaustralia">People Australia SRU interface</a>. You might not realise, however, that People Australia contains details of many organisations as well as people. We can only be thankful that it wasn&#8217;t called Entity Australia.</p>
<p>The first version of my Identity Browser only searched for people, but now all your corporate-entity-identification needs are also met, with only a few minor changes to the interface so-beloved by numerous generations of identity seekers. To be specific, through the wonders of drop-down technology you can choose whether you want to search for a person or an organisation. Or not. You can also just ignore that and search for everything and get back sensible results anyway. It&#8217;s your choice. Or not.</p>
<div id="attachment_864" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/identities/"><img class="size-medium wp-image-864" title="identities" src="http://discontents.com.au/wp-content/uploads/2010/05/identities-300x77.jpg" alt="" width="300" height="77" /></a><p class="wp-caption-text">Gaze in awe at the power of my dropdown</p></div>
<p>Ah, pattern matching&#8230; there are few phrases so redolent of warm summer days, hidden pleasures, and the subtle delights of wildcard characters. The People Australia SRU interface was sadly lacking in the pattern matching department, but this has now been rectified. So now you can mix your stems and asterisks with wild abandon. Searching for &#8216;Curtin, J*&#8217; will now retrieve all those Curtins whose first names begin with &#8216;J&#8217;. Amazing, isn&#8217;t it?</p>
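<p>If you&#8217;d like to see what a trailing asterisk does without bothering the SRU server, here is a minimal local sketch in Python. It is only an illustration: the list of names is invented, <code>match_names</code> is a hypothetical helper, and Python&#8217;s <code>fnmatch</code> module stands in for whatever matching the People Australia service actually performs (it also gives special meaning to <code>?</code> and square brackets, which the service may not).</p>

```python
from fnmatch import fnmatchcase

# Invented sample headings, purely for illustration.
NAMES = ["Curtin, John", "Curtin, Elsie", "Curtin, James", "Chifley, Ben"]


def match_names(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


matches = match_names("Curtin, J*", NAMES)
print(matches)
```

<p>The stem before the asterisk has to match exactly, and the asterisk soaks up whatever follows.</p>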
<p>Astonishing too is the fact that the accompanying &#8216;Identify me!&#8217; bookmarklet continues to function with nary a murmur of protest. There is, however, a little bit of cleverness built in to enhance your bookmarklet experience. If the text that you highlight has a comma in it, the Identity Browser will conclude that you&#8217;re feeding it the name of a person – ie Surname, Firstname – and will treat the Firstname as a stem. So if you highlight &#8216;Whitlam, G&#8217; and click on the bookmarklet, the Identity Browser will be kick-started into life, searching for everything that matches surname equals &#8216;Whitlam&#8217; and firstname is like &#8216;G*&#8217;. If there&#8217;s no comma – ie firstname secondname – then it heads off to look for either a person whose surname equals &#8216;secondname&#8217; and whose firstname is like &#8216;firstname*&#8217;, or an organisation whose name includes both &#8216;firstname&#8217; and &#8216;secondname&#8217;. Got all that?</p>
<p>Basically the idea was to try and provide some sensible defaults so you really don&#8217;t have to think about it too much.</p>
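<p>For those who prefer their defaults spelled out, the comma rule can be sketched in a few lines of Python. This is not the Identity Browser&#8217;s actual code: <code>interpret_highlight</code> and the dictionary keys are hypothetical names, the result is a plain description rather than a real SRU query, and treating the last word as the surname when there are more than two words is my own assumption.</p>

```python
def interpret_highlight(text):
    """Describe the searches a highlighted string might trigger.

    Returns a list of plain dicts. The keys are hypothetical labels,
    not real People Australia index names.
    """
    text = " ".join(text.split())  # tidy up stray whitespace
    if "," in text:
        # 'Surname, Firstname': exact surname, first name used as a stem.
        surname, firstname = [part.strip() for part in text.split(",", 1)]
        return [{"type": "person", "surname": surname, "firstname": firstname + "*"}]
    # No comma: assume the last word is the surname (an assumption for
    # anything other than exactly two words).
    words = text.split()
    if not words:
        return []
    firstname, surname = " ".join(words[:-1]), words[-1]
    return [
        {"type": "person", "surname": surname, "firstname": firstname + "*"},
        {"type": "organisation", "name_includes": words},
    ]


for query in interpret_highlight("Whitlam, G"):
    print(query)
```

<p>With a comma you get a single person search; without one you get a person search and an organisation search side by side.</p>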
<p>I have it in my head to prepare a long and rapturous homage to the wonders of machine tags. With their sly semantic ways and easy-going nature, they offer some exciting possibilities not just for user-generated content, but user-generated meanings and user-generated relationships. But for the full, ripe pleasure of that post you will have to wait another day. For now I shall simply say that as well as RDFa, the Identity Browser provides automagically-generated machine tags.</p>
<p>Where might you use them? Flickr&#8217;s a good place to start. Try identifying the subjects and creators of Flickr photos. At the NSW Reference and Information Services Group Seminar the other day I challenged those in attendance to go forth and machine tag. Already more than 100 machine tags have been added to Flickr using my Identity Browser. Expect to hear more about the Great Flickr Machine Tag Challenge soon&#8230;</p>
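<p>In case machine tags are new to you, a Flickr machine tag has three parts, written as <code>namespace:predicate=value</code>. The Python sketch below builds one and takes it apart again. The namespace and predicate shown are placeholders I&#8217;ve made up for illustration, not necessarily the ones the Identity Browser generates, and both function names are hypothetical.</p>

```python
def build_machine_tag(namespace, predicate, value):
    """Join the three parts into a namespace:predicate=value machine tag."""
    return f"{namespace}:{predicate}={value}"


def parse_machine_tag(tag):
    """Split a machine tag into (namespace, predicate, value)."""
    # Split on the first ':' and then the first '=' only, because the
    # value is often a URL containing both characters.
    namespace, colon, rest = tag.partition(":")
    predicate, equals, value = rest.partition("=")
    if not (colon and equals and namespace and predicate):
        raise ValueError(f"not a machine tag: {tag!r}")
    return namespace, predicate, value


# Hypothetical namespace and predicate; the value is an identity
# record URL from the Identity Browser.
tag = build_machine_tag(
    "example", "subject", "http://wraggelabs.com/identities/person/612109"
)
print(tag)
print(parse_machine_tag(tag))
```

<p>Because the value carries a full identifier, anyone (or anything) reading the tag can follow it back to the record it points to.</p>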
<p>One more thing&#8230; try adding &#8216;.rdf&#8217; on to the end of an identity record – eg <a href="http://wraggelabs.com/identities/person/612109.rdf">http://wraggelabs.com/identities/person/612109.rdf</a>. Just an experiment at the moment&#8230;</p>
<h3>More machine tag love</h3>
<p>One night on Twitter, <a href="http://twitter.com/lifeasdaddy">@lifeasdaddy</a> pointed out that someone had started using fragments of URLs from the <a href="http://trove.nla.gov.au/newspaper">NLA newspapers site</a> as tags in the <a href="http://www.powerhousemuseum.com/collection/database/?irn=244414">Powerhouse Museum&#8217;s collection database</a>. In the conversation that ensued with <a href="http://twitter.com/sebchan">@sebchan</a> and others, I suggested that the PHM could encourage this sort of rich tagging by supporting machine tags, with all their wonderful juicy semantic goodness. The guinea pigs got excited as well, and before I knew it, they&#8217;d constructed a little <a href="http://semweb-helper.appspot.com/">Semweb Helper app</a>.</p>
<p>The Semweb Helper comes with its very own custom-tailored bookmarklet. If you find an article on the NLA newspapers site that you&#8217;d like to point to, just click on the bookmarklet and marvel as a range of useful machine tags are automagically generated. Then you just pick the appropriate tag, copy and paste et voila – instant semantic gratification.</p>
<div id="attachment_861" class="wp-caption aligncenter" style="width: 310px"><a href="http://semweb-helper.appspot.com/"><img class="size-medium wp-image-861" title="semweb-helper" src="http://discontents.com.au/wp-content/uploads/2010/05/semweb-helper-300x147.jpg" alt="Screenshot" width="300" height="147" /></a><p class="wp-caption-text">Try out the Semweb Helper</p></div>
<p>It&#8217;s a very simple little app, and really just a demonstration of how semantic web technologies might be made available to the masses. It was also the first time the guinea pigs had been allowed to play with Google App Engine.</p>
<h3>Who am I?</h3>
<p>This short catch-up post has become something quite long and rambling. Did I mention that I&#8217;m sleep-deprived? Anyway, a recent addition to the Wragge Labs range of lifestyle accessories is <a href="http://wraggelabs.com/whoami/">&#8216;Who am I?&#8217; </a>– a simple little game that is something like a cross between hangman and Wheel of Fortune. Choosing a person at random from People Australia and the <em>Australian Dictionary of Biography</em>, &#8216;Who am I?&#8217; tests your powers of logic, stamina and historical guesstimation.</p>
<p>Your challenge is to figure out the surname of the mystery historical personage. To help you there are a series of clues, such as their birthplace and known associates. With each guess you also see a little bit more of their portrait. But beware! For ten wrong guesses are all that are permitted to any so brave as to enter upon this quest. Not eleven or twelve, but ten and ten only. To ignore this limit is to invite ridicule and disdain – do so at your peril.</p>
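<p>The rules are simple enough to sketch in Python. This is an illustration rather than the game&#8217;s real code: <code>WhoAmIRound</code> is a hypothetical class, and I&#8217;m assuming hangman-style guessing one letter at a time, with repeated guesses costing nothing. The only detail taken from the rules above is the limit of ten wrong guesses.</p>

```python
MAX_WRONG_GUESSES = 10  # ten and ten only


class WhoAmIRound:
    """One round of a hangman-style surname game (illustrative only)."""

    def __init__(self, surname):
        self.surname = surname.upper()
        self.guessed = set()
        self.wrong = 0

    def masked(self):
        """Show guessed letters and hide the rest behind underscores."""
        return "".join(
            ch if ch in self.guessed or not ch.isalpha() else "_"
            for ch in self.surname
        )

    def solved(self):
        return all(ch in self.guessed or not ch.isalpha() for ch in self.surname)

    def lost(self):
        return self.wrong >= MAX_WRONG_GUESSES

    def guess(self, letter):
        """Record a letter guess and report whether it was correct."""
        if self.solved() or self.lost():
            raise RuntimeError("this round is already over")
        letter = letter.upper()
        if letter in self.guessed:
            # Repeats cost nothing (an assumption, not a stated rule).
            return letter in self.surname
        self.guessed.add(letter)
        if letter in self.surname:
            return True
        self.wrong += 1
        return False


game = WhoAmIRound("Whitlam")
game.guess("w")
game.guess("z")
print(game.masked(), game.wrong)
```

<p>Once <code>lost()</code> returns true, further guesses raise an error, which is about as close to ridicule and disdain as Python gets.</p>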
<div id="attachment_858" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/whoami/"><img class="size-medium wp-image-858" title="whoami" src="http://discontents.com.au/wp-content/uploads/2010/05/whoami-300x137.jpg" alt="Who am I screenshot" width="300" height="137" /></a><p class="wp-caption-text">Play Who am I?</p></div>
<p>&#8216;Who am I?&#8217; builds upon some work I&#8217;ve been doing for the National Museum of Australia – looking at ways of mashing together various types of date-identified data. As part of that project I&#8217;ve built a series of APIs and have scraped, pummelled and munged data from a variety of sources.</p>
<p>What&#8217;s the point? I wonder this myself sometimes, particularly after I fling such things off into the aethernet and hear naught but a rare retweet. I am, after all, only in it for the glory, oh and the money of course. (Hmmm, I must look again at that business plan.) The point is twofold: first to highlight possibilities for the re-use and remixing of cultural data; second, to play with game-based models for discovery and exploration of cultural resources; and&#8230; err&#8230; thirdly just to try building something a little different.</p>
<p>Of course, if you like &#8216;Who am I?&#8217; you will probably also want to try <a href="http://wraggelabs.com/newsroulette/">Headline Roulette</a>&#8230;</p>
<h3>Headline Roulette Reprieve</h3>
<p>At the end of <a href="http://discontents.com.au/shed/experiments/headline-roulette">our last instalment</a>, the future of <a href="http://wraggelabs.com/newsroulette/">Headline Roulette</a> seemed in dire peril. Changes to the National Library of Australia web site threatened its very existence. Did it have a future? Could it survive? And did anybody care?</p>
<p>As we pick up the story, oblivion looms. The feared changes are confirmed, but just as all seems lost&#8230; is it? Could it be? Yes, an advanced search facility is added to the newspapers site within Trove. Sensing this may be their only opportunity, the guinea pigs leap into action, building <a href="http://bitbucket.org/wragge/nla-newspapers-scraper">a new screen-scraper</a>, saving Headline Roulette from doom, and setting the world upon the path to a safer, happier future.</p>
<p>In short, Headline Roulette will live on&#8230; so enjoy.</p>
<h3>Handing out some presents</h3>
<p>My head is easily turned by flattery and praise. Yes, I really am so shallow and so vain. But this means that if people say nice things to me, I&#8217;m inclined to give them presents.</p>
<p>As well as doing exciting things in the web 2.0 realm for the PROV, <a href="http://twitter.com/asaletourneau">@asaletourneau</a> leaves nice comments on this blog. So he earned himself a present. It&#8217;s not much, but I <a href="http://userscripts.org/scripts/show/71421">built a userscript</a> that displays photos from the PROV site in a neat little slideshow (it&#8217;s the non-3D JavaScript version of CoolIris). Install Greasemonkey, get the userscript and <a href="http://proarchives.imagineering.com.au/index_search.asp?searchid=41">try it out</a> (just do a search, then click on the &#8216;Browse as slideshow&#8217; button).</p>
<div id="attachment_852" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2010/05/prov-slideshow.jpg"><img class="size-medium wp-image-852" title="prov-slideshow" src="http://discontents.com.au/wp-content/uploads/2010/05/prov-slideshow-300x187.jpg" alt="Screen capture of slideshow" width="300" height="187" /></a><p class="wp-caption-text">PROV transport photos in a pretty slideshow</p></div>
<p>The State Library of NSW, or more specifically <a href="http://www.twitter.com/ellenforsyth">@ellenforsyth</a>, also earned my favour by inviting me to rave on about Linked Data at the afore-mentioned NSW RISG seminar. As a result, I added support for the SLNSW photo collections to my <a href="http://discontents.com.au/shoebox/archives-shoebox/harvesting-context-1">Flickr Context Harvester</a> userscript. Well&#8230; it&#8217;s the thought that counts, right? Once again – install Greasemonkey, <a href="http://userscripts.org/scripts/show/56135">get the userscript</a> and then <a href="http://acms.sl.nsw.gov.au/item/itemDetailPaged.aspx?itemID=447435">try it out</a>.</p>
<div id="attachment_855" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2010/05/slnsw-flickr.jpg"><img class="size-medium wp-image-855" title="slnsw-flickr" src="http://discontents.com.au/wp-content/uploads/2010/05/slnsw-flickr-300x181.jpg" alt="Flickr Context Harvester screenshot" width="300" height="181" /></a><p class="wp-caption-text">The Flickr Context Harvester in action</p></div>
<h3>And coming up&#8230;</h3>
<p>Stay tuned for more on the Great Flickr Machine Tag Challenge, screencasts demonstrating my Identity Browser, some playing with relationships, and much much more. But right now the squirming baby on my lap needs a nappy change&#8230;</p>
<p>Did I mention that I&#8217;m sleep deprived?</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/a-not-so-quick-catch-up/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

