<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>discontents &#187; visualisation</title>
	<atom:link href="http://discontents.com.au/tag/visualisation/feed" rel="self" type="application/rss+xml" />
	<link>http://discontents.com.au</link>
	<description>working for the triumph of content over form, ideas over control, people over systems</description>
	<lastBuildDate>Mon, 21 May 2012 13:27:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>The new QueryPic (or what a difference an API makes)</title>
		<link>http://discontents.com.au/shed/experiments/the-new-querypic-or-what-a-difference-an-api-makes</link>
		<comments>http://discontents.com.au/shed/experiments/the-new-querypic-or-what-a-difference-an-api-makes#comments</comments>
		<pubDate>Tue, 17 Apr 2012 13:06:55 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[Papers Past]]></category>
		<category><![CDATA[QueryPic]]></category>
		<category><![CDATA[Trove]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1655</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=The+new+QueryPic+%28or+what+a+difference+an+API+makes%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2012-04-17&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/the-new-querypic-or-what-a-difference-an-api-makes&amp;rft.language=English"></span>
It seems a bit late to be introducing the newest version of QueryPic. Folks are already using it to explore the contents of digitised newspapers made available through Trove and Papers Past. Some, like the National Library of New Zealand, Andrew S. Bowman and the Carnamah Historical Society are already blogging about it. But I suppose [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=The+new+QueryPic+%28or+what+a+difference+an+API+makes%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2012-04-17&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/the-new-querypic-or-what-a-difference-an-api-makes&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1655"><!-- &nbsp; --></abbr>
<p>It seems a bit late to be introducing the newest version of <a href="http://wraggelabs.com/shed/querypic/">QueryPic</a>. Folks are already using it to explore the contents of digitised newspapers made available through <a href="http://trove.nla.gov.au/newspaper/">Trove</a> and <a href="http://paperspast.natlib.govt.nz/cgi-bin/paperspast">Papers Past</a>. Some, like the <a href="http://beta.natlib.govt.nz/blog/a-tale-of-two-islands">National Library of New Zealand</a>, <a href="http://andrew-s-bowman.blogspot.com.au/2012/04/querypic-new-tool-for-historical.html">Andrew S. Bowman</a> and the <a href="http://carnamah.blogspot.com.au/2012/04/mentions-of-carnamah-in-australian.html">Carnamah Historical Society</a> are already blogging about it. But I suppose I&#8217;d better document a few things&#8230;</p>
<p>As I noted in my <a title="QueryPicNZ" href="http://discontents.com.au/shed/experiments/querypicnz">post about QueryPicNZ</a> (yes I now have a rather confusing proliferation of QueryPics), I was waiting for the Trove API to become public. Last week I noticed a little &#8216;API&#8217; link pop up in the Trove footer and so I set to work&#8230;</p>
<div id="attachment_1662" class="wp-caption aligncenter" style="width: 530px"><a href="http://wraggelabs.com/shed/querypic/?q=%22the%20past%22|aus&amp;q=%22the%20future%22|aus"><img class="size-large wp-image-1662" title="new_querypic" src="http://discontents.com.au/wp-content/uploads/2012/04/new_querypic-520x477.png" alt="" width="520" height="477" /></a><p class="wp-caption-text">&quot;The past&quot; versus &quot;the future&quot; in the new QueryPic</p></div>
<p>My <a title="QueryPic" href="http://discontents.com.au/shed/hacks/querypic">original version of QueryPic</a> (<a href="http://journalofdigitalhumanities.org/1-1/reviews/querypic/">recently reviewed</a> in the <em>Journal of Digital Humanities</em>) used a series of Python scripts to harvest and scrape content from the Trove web pages. This meant that you had to download the scripts and be code-confident enough to run them in a terminal. It&#8217;s still a useful tool and I&#8217;ll be updating it as well, but I wanted to create something quicker and simpler that encouraged people to explore and play.</p>
<p>The latest version of <a href="http://wraggelabs.com/shed/querypic/">QueryPic</a> (QueryPic+, QueryPic Web, <del>QueryPic 2.0</del>?) simply runs in your browser. It uses jQuery to grab data on the fly from the <a href="http://trove.nla.gov.au/general/api">Trove</a> and <a href="http://digitalnz.org.nz/">DigitalNZ</a> APIs. Like previous versions, it uses the <a href="http://www.highcharts.com/">Highcharts</a> library to turn the data into pretty graphs.</p>
<p>What does it do? It&#8217;s really pretty basic. QueryPic just displays the number of articles matching your search query over time. By default, these are displayed as a proportion of the total articles available for that year, but a dropdown field lets you switch to view the raw numbers. It&#8217;s simple, but it&#8217;s also remarkably evocative, suggestive and fun. <strong><a href="http://wraggelabs.com/shed/querypic/">Just try it!</a></strong></p>
<p>Why stop at just one query? To compare frequency patterns you can add as many as you like. Just keep entering new words or phrases.</p>
<p>If you notice an interesting peak or trough you can just click on it and another API request will be fired off to retrieve the first 20 matching articles. So it&#8217;s also a new way of exploring the newspaper databases themselves.</p>
<p>There are plenty of limitations &#8212; not all newspapers are digitised, for example, and the quality of the OCR is patchy. The <a href="http://beta.natlib.govt.nz/blog/a-tale-of-two-islands">National Library of New Zealand&#8217;s post</a> does a great job summing up a number of issues relating to Papers Past. It&#8217;s not magic, it&#8217;s not perfect, but is it useful? I think so.</p>
<p>Tasks for the future:</p>
<ul>
<li>Create some sort of backend that makes it easy to save, share and cite your query data. The &#8216;share&#8217; link just regenerates the graph which, of course, might change as new articles are added to the databases.</li>
<li>Make it possible to add more complex queries &#8212; I want to keep the interface simple, so I&#8217;ll probably create a bookmarklet to take any Trove or Papers Past query and display it using QueryPic.</li>
<li>As I mentioned over at the <a href="http://wraggelabs.com/emporium/2012/04/the-new-api-powered-future/">WraggeLabs Emporium</a>, I intend to rewrite my various Trove tools to work with the new API. This will include the classic Python version of QueryPic. I still think it&#8217;s useful for harvesting your own data.</li>
</ul>
<div>The <a href="https://github.com/wragge/QueryPic">code</a> is on my GitHub site and you can also follow updates at the <a href="http://wraggelabs.com/emporium/trove-tools/newspaper-search-summariser/">QueryPic page</a> in the WraggeLabs Emporium.</div>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/the-new-querypic-or-what-a-difference-an-api-makes/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>QueryPic</title>
		<link>http://discontents.com.au/shed/hacks/querypic</link>
		<comments>http://discontents.com.au/shed/hacks/querypic#comments</comments>
		<pubDate>Sat, 31 Dec 2011 15:08:12 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[hacks]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[Trove]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1546</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=QueryPic&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2012-01-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/querypic&amp;rft.language=English"></span>
Back when I was looking at &#8216;When did the Great War become the First World War?&#8217; I promised a detailed post on how I constructed the graphs. But of course I got distracted. Then I started adding new features to the script and redesigning the graphs, so&#8230; Anyway, the result is a rather neat little [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=QueryPic&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2012-01-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/querypic&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1546"><!-- &nbsp; --></abbr>
<p>Back when I was looking at &#8216;<a title="When did the ‘Great War’ become the ‘First World War’?" href="http://discontents.com.au/shed/experiments/when-did-the-great-war-become-the-first-world-war">When did the Great War become the First World War?</a>&#8217; I promised a detailed post on how I constructed the graphs. But of course I got distracted. Then I started adding new features to the script and redesigning the graphs, so&#8230;</p>
<p>Anyway, the result is a rather neat little gizmo henceforth named <a href="http://wraggelabs.com/emporium/trove-tools/newspaper-search-summariser/">QueryPic</a> (I got a bit sick of &#8216;search summariser&#8217; and &#8216;graph-maker thing&#8217;). <a title="Mining the treasures of Trove (part 2)" href="http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2">The first version</a> just harvested data and left all the graph-making to you. But QueryPic does it all! It harvests the data <em>and</em> makes the graph. Woohoo.</p>
<p>Here&#8217;s an example showing &#8216;drought&#8217; versus &#8216;flood&#8217;:</p>
<p><a href="http://wraggelabs.com/shed/trove/newgraphs/flood_drought.html"><img class="aligncenter size-medium wp-image-1551" title="Screen Shot 2012-01-01 at 1.53.28 AM" src="http://discontents.com.au/wp-content/uploads/2012/01/Screen-Shot-2012-01-01-at-1.53.28-AM-250x166.png" alt="" width="250" height="166" /></a></p>
<h4>QueryPic features</h4>
<ul>
<li>Explore your Trove newspaper query over time in the form of a simple line graph.</li>
<li>Interactive &#8212; click on a point to retrieve sample articles from that date.</li>
<li>Combine data sources to compare queries.</li>
<li>Choose your interval &#8212; plot by year or month.</li>
<li>Switch views between total results and the proportion of all articles.</li>
</ul>
<h4>Running QueryPic</h4>
<p>Yes, it&#8217;s a Python script and yes it runs on the command line. Let&#8217;s get that out of the way now. I don&#8217;t think I have the time and energy to develop cross-platform GUI versions of all my tools. I&#8217;d rather spend the time adding new features or exploring new possibilities. Sorry, but until I have a wealthy benefactor or a technical support team, I think that&#8217;s the way it has to be. In any case, <a href="https://github.com/wragge/Trove-newspapers">the code is all there</a> &#8211; so build your own GUI!</p>
<p>Actually, if I did have the time and energy I don&#8217;t think I&#8217;d build a standalone GUI anyway. What would be much cooler would be a web service, where people could run, share and combine their queries. Social graph-making! A celebration of serendipity! A historical playground! Hmmm&#8230;</p>
<p>But for now there&#8217;s this Python script. It&#8217;s dead easy to use. Starting from the beginning&#8230;</p>
<ol>
<li>Do you have Python installed? If you have a Mac or Linux the answer is yes. Fire up a terminal and type &#8216;python -V&#8217; &#8212; see, I told you. If you have Windows you can get a <a href="http://www.python.org/getit/windows/">handy installer</a>. Do it.</li>
<li>Get the source code. Just <a href="https://github.com/wragge/Trove-newspapers/zipball/master">download this zip file</a> and unzip it into a new folder.</li>
<li>Open a terminal and cd into the new folder.</li>
<li>Run &#8216;python do_totals.py [your Trove query]&#8217;.</li>
<li>Watch in excitement as the script chugs away retrieving data from Trove.</li>
<li>Once the script is finished, go to the &#8216;graphs&#8217; directory, where you&#8217;ll find your newly-created html page complete with fancy interactive graph.</li>
<li>Open the html page in the web browser of your choice.</li>
<li>Enjoy! Celebrate! Drink a toast in my honour!</li>
</ol>
<h4>Customising QueryPic</h4>
<p>There are a number of optional arguments that you can add to the command line to customise your results:</p>
<p><strong>-n (or --name) [a query name]<br />
</strong>Give a name to your query. The name is used to create filenames for the html and data files, and it also appears in the legend of the graph. The default is to use the search keywords as the name.</p>
<p><strong>-d (or --directory) [a directory path]</strong><br />
The full pathname of the directory/folder for your results. The default is a &#8216;graphs&#8217; sub-directory in the current directory.</p>
<p><strong>-g (or --graph) [a graph name]</strong><br />
Specify the name of the html file that&#8217;s created. This is useful for displaying multiple queries on a single graph. Just run QueryPic for each query, using the same graph name each time. The default is either the value specified by the -n parameter or a name derived from the search keywords.</p>
<p><strong>-m (or --monthly)</strong><br />
Plot the query at monthly intervals. The default interval is a year.</p>
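<p>For the curious, options like these map neatly onto Python&#8217;s standard argparse module. What follows is only a minimal sketch, assuming Python 3. The positional &#8216;query&#8217; argument, the help strings and the build_parser function are illustrative names, and the real do_totals.py may well handle its arguments differently.</p>
<pre class="brush: python; gutter: false">import argparse

def build_parser():
    # Hypothetical parser mirroring the four options described above.
    parser = argparse.ArgumentParser(prog="do_totals.py")
    parser.add_argument("query", help="URL of a Trove newspaper search")
    parser.add_argument("-n", "--name", help="name used for filenames and the graph legend")
    parser.add_argument("-d", "--directory", default="graphs", help="directory for the results")
    parser.add_argument("-g", "--graph", help="name of the html file to create")
    parser.add_argument("-m", "--monthly", action="store_true", help="plot by month, not year")
    return parser

# Parse an explicit argument list so the example runs anywhere.
args = build_parser().parse_args(
    ["http://trove.nla.gov.au/newspaper/result?q=cat", "-g", "animals", "-m"]
)
print(args.graph, args.monthly, args.directory)  # animals True graphs</pre>
<p>Using store_true for the monthly flag means the yearly interval stays the default unless -m is given, which matches the behaviour described above.</p>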
<h4>What QueryPic actually does</h4>
<p>QueryPic builds a simple visualisation of your search query in the Trove newspaper database. A list of search results is difficult to interpret and offers little context. QueryPic shows you the number of articles matching your query over time, enabling you to reframe your questions, pursue hunches, or simply play around.</p>
<p>QueryPic takes your Trove newspaper query and looks for a date range. If it doesn&#8217;t find one, it assumes you want your graph to go from 1803 to 1954 (the complete contents of the newspaper database &#8212; except for the Women&#8217;s Weekly). QueryPic then strips out any date parameters from the query, so it can fire off the query within the start and end dates, at the specified date interval.</p>
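<p>The date handling just described could be sketched roughly as follows. This is an illustration in Python 3, not the code from do_totals.py. The parameter names &#8216;fromyyyy&#8217; and &#8216;toyyyy&#8217; are borrowed from the monthly example further down, any other date parameters are ignored, and filling in a missing start or end year separately is an assumption of this sketch.</p>
<pre class="brush: python; gutter: false">from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_START = 1803
DEFAULT_END = 1954
DATE_PARAMS = ("fromyyyy", "toyyyy")  # assumed to be the only date parameters

def split_date_range(query_url):
    """Return (start_year, end_year, url_without_date_params)."""
    parts = urlsplit(query_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    found = dict((key, value) for key, value in params if key in DATE_PARAMS)
    # Fall back to the full span of the database when a year is missing or blank.
    start = int(found.get("fromyyyy") or DEFAULT_START)
    end = int(found.get("toyyyy") or DEFAULT_END)
    remaining = [(key, value) for key, value in params if key not in DATE_PARAMS]
    stripped = urlunsplit(parts._replace(query=urlencode(remaining)))
    return start, end, stripped

# Build a sample query with a date range, then pull the range back out.
url = "http://trove.nla.gov.au/newspaper/result?" + urlencode(
    {"q": "cat", "fromyyyy": 1920, "toyyyy": 1921}
)
print(split_date_range(url))</pre>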
<p>Date interval? In the previous version of this script you could only plot points at yearly intervals, so it was impossible to zoom in and see what might be happening over the span of a single year or two. But amazing advances in QueryPic technology mean you can now plot changes <em>by month</em>. Here for example is a new version of my Great War/First World War graph, focused on 1938&#8211;1946 and plotted at monthly intervals.</p>
<p><a href="http://wraggelabs.com/shed/trove/newgraphs/great_war_1938_46.html"><img class="aligncenter size-medium wp-image-1552" title="Screen Shot 2012-01-01 at 1.55.22 AM" src="http://discontents.com.au/wp-content/uploads/2012/01/Screen-Shot-2012-01-01-at-1.55.22-AM-250x166.png" alt="" width="250" height="166" /></a></p>
<p>So for each interval within the date range QueryPic fires off a request to Trove. From the response it scrapes out the total number of results for that date. If the total is greater than zero, it then fires off a second request to find the total number of newspaper articles for that year. Your query results divided by the total number of articles gives the proportion of articles for that date matching your search query.</p>
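<p>In outline, that loop might look something like the sketch below (Python 3, and only an illustration). The two callables stand in for the requests to Trove, so nothing here touches the network, and the function names, the dictionary keys and the sample numbers are all made up.</p>
<pre class="brush: python; gutter: false">def intervals(start_year, end_year, monthly=False):
    """Yield (year, month) pairs; month is None when plotting by year."""
    for year in range(start_year, end_year + 1):
        if monthly:
            for month in range(1, 13):
                yield year, month
        else:
            yield year, None

def harvest(start_year, end_year, count_matches, count_all, monthly=False):
    """count_matches and count_all are stand-ins for the two Trove requests."""
    points = []
    for year, month in intervals(start_year, end_year, monthly):
        total = count_matches(year, month)
        ratio = 0.0
        if total != 0:
            # Only make the second request when there is something to divide.
            ratio = total / count_all(year, month)
        points.append({"year": year, "month": month, "total": total, "ratio": ratio})
    return points

# Invented numbers, purely to show the arithmetic.
fake_matches = {1920: 50, 1921: 0}
fake_all = {1920: 1000, 1921: 2000}
points = harvest(
    1920, 1921,
    lambda year, month: fake_matches[year],
    lambda year, month: fake_all[year],
)
print(points[0]["ratio"])  # 0.05</pre>
<p>Skipping the second request when there are no matches saves a round trip for every empty interval, which adds up quickly when plotting by month.</p>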
<p>The number of results and the proportion are written to a JavaScript file, together with some other important information including the original query and the date the harvest was performed. Remember, the Trove newspapers database is always changing! QueryPic then grabs a copy of its own special html template and inserts a reference to this JavaScript file. For good measure, it also inserts a link to your original query. The file is saved under a new name, ready for you to open and explore.</p>
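<p>As a rough idea of what such a data file could contain, here is a small Python 3 sketch that serialises the results with the standard json module. The layout of the record, its key names and the to_javascript function are assumptions made for illustration; the files QueryPic actually writes may be organised differently.</p>
<pre class="brush: python; gutter: false">import json
from datetime import date

def to_javascript(var_name, query_url, points, harvested=None):
    """Build the text of a data file holding one query's results."""
    harvested = harvested or date.today()
    record = {
        "query": query_url,
        # Record the harvest date, because the database keeps changing.
        "harvested": harvested.isoformat(),
        "totals": [[p["year"], p["total"]] for p in points],
        "ratios": [[p["year"], p["ratio"]] for p in points],
    }
    return "var " + var_name + " = " + json.dumps(record, indent=2) + ";\n"

text = to_javascript(
    "cat",
    "http://trove.nla.gov.au/newspaper/result?q=cat",
    [{"year": 1920, "total": 50, "ratio": 0.05}],
    harvested=date(2012, 1, 1),
)
print(text)</pre>
<p>The returned text can then be saved next to the html page and pulled in with an ordinary script tag.</p>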
<p>The html file contains everything necessary to take your data and turn it into a graph. It does this using the Highcharts JavaScript library. Please note that while licence conditions allow Highcharts to be redistributed as part of a non-commercial package, it is not free for commercial use. Check the <a href="http://www.highcharts.com/">Highcharts website</a> for details.</p>
<h4>Some examples</h4>
<p>Plot &#8216;cat&#8217; against &#8216;dog&#8217; in a graph called &#8216;animals&#8217;:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&quot; -g &quot;animals&quot;
python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=dog&quot; -g &quot;animals&quot;</pre>
<p>Specify a directory for your results:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&quot; -d &quot;/Users/bill/Documents/graphs&quot;</pre>
<p>Plot results at monthly intervals:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&amp;fromyyyy=1920&amp;toyyyy=1921&quot; -m</pre>
<p>Specify a name:</p>
<pre class="brush: bash; gutter: false">python do_totals.py &quot;http://trove.nla.gov.au/newspaper/result?q=cat&quot; -n &quot;Felines&quot;</pre>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/hacks/querypic/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Mining the treasures of Trove (part 2)</title>
		<link>http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2</link>
		<comments>http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2#comments</comments>
		<pubDate>Sun, 06 Mar 2011 13:44:02 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[Trove]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1174</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Mining+the+treasures+of+Trove+%28part+2%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-03-06&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2&amp;rft.language=English"></span>
One of the advantages of building something yourself is that if you&#8217;re not happy with it you can tweak, change, modify and adapt until you are. But one of the disadvantages is that sometimes you get so caught up in all the tweaking, changing and adapting that you overlook a much simpler solution. So I [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Mining+the+treasures+of+Trove+%28part+2%29&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=digital+humanities&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-03-06&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1174"><!-- &nbsp; --></abbr>
<p>One of the advantages of building something yourself is that if you&#8217;re not happy with it you can tweak, change, modify and adapt until you are. But one of the disadvantages is that sometimes you get so caught up in all the tweaking, changing and adapting that you overlook a much simpler solution.</p>
<p>So <a href="http://discontents.com.au/shed/mining-the-treasures-of-trove-part-1">I had a harvester</a> that could save the publication details and content of all the newspaper articles in a search on <a href="http://trove.nla.gov.au/newspaper?q=">Trove</a>. But the warm glow of self-satisfaction quickly began to fade as I started to think about how I wanted to use the content I was harvesting.</p>
<p>The harvester saved the text of articles organised in directories by newspaper title. This seemed to make sense. It meant that you could easily analyse and compare the content of different newspapers. But what if you wanted to examine changes over time? In that case it&#8217;d be much easier if the articles were organised by year &#8212; then I could just pull out a folder from a particular year, feed it to <a href="http://voyeurtools.org/">VoyeurTools</a>, and start tracking the trends.</p>
<p>There ensued some minor tinkering. As a result, you can now pass an additional option to <a href="https://bitbucket.org/wragge/trove-tools/overview">the harvest script</a>, telling it whether to save the article texts and pdfs in directories by year or newspaper. Simply set the &#8216;zip-directory-structure&#8217; option in harvest.ini to either &#8216;title&#8217; or &#8216;year&#8217;. If you&#8217;re using the command-line you can use the &#8216;-d&#8217; flag to set your preference. Easy.</p>
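<p>To give a feel for how a setting like this can be read and applied, here is a minimal Python 3 sketch using the standard configparser module. Only the option name comes from the paragraph above. The &#8216;harvest&#8217; section name, the fallback to &#8216;title&#8217; and both function names are guesses made for the sake of the example, not details of the real harvest.ini.</p>
<pre class="brush: python; gutter: false">import configparser
import os

# Made-up sample file; only the option name is taken from the post.
SAMPLE_INI = """
[harvest]
zip-directory-structure = year
"""

def directory_structure(ini_text, section="harvest"):
    """Return 'title' or 'year', falling back to 'title' if the option is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    value = parser.get(section, "zip-directory-structure", fallback="title")
    if value not in ("title", "year"):
        raise ValueError("expected 'title' or 'year', got %r" % value)
    return value

def article_folder(base_dir, structure, newspaper_title, year):
    """Pick the folder an article would be saved in."""
    name = str(year) if structure == "year" else newspaper_title
    return os.path.join(base_dir, name)

structure = directory_structure(SAMPLE_INI)
print(article_folder("harvest", structure, "Argus", 1888))</pre>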
<p>But that set me wondering whether it might be possible to generate an overview, showing the number of articles matching a search over time. So I started on a modification of my harvest script that did just that &#8212; cycling through the search results, adding up the numbers. It wasn&#8217;t until I ran the new script for the first time that I realised there was a much simpler alternative.</p>
<p>All I needed to do was repeat the search for each year in the search span and grab the total results value from the page. D&#8217;uh&#8230;</p>
<p>So instead of sending hundreds or perhaps thousands of requests to Trove, all I needed was one for each year. From there it was easy and soon I had my first graph.</p>
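<p>The one-request-per-year idea can be sketched in a few lines of Python 3. Treat this as an illustration under assumptions: the &#8216;fromyyyy&#8217; and &#8216;toyyyy&#8217; parameter names are taken from a later QueryPic example, setting both to the same year is assumed to limit the search to that year, and yearly_urls is a made-up helper rather than part of the harvest script.</p>
<pre class="brush: python; gutter: false">from urllib.parse import urlencode

BASE = "http://trove.nla.gov.au/newspaper/result"

def yearly_urls(keywords, start_year, end_year):
    """Build one search URL per year in the span."""
    urls = []
    for year in range(start_year, end_year + 1):
        # Assumed parameter names; the same year is used for both ends of the range.
        params = {"q": keywords, "fromyyyy": year, "toyyyy": year}
        urls.append((year, BASE + "?" + urlencode(params)))
    return urls

for year, url in yearly_urls("chinese", 1887, 1889):
    print(year, url)</pre>
<p>Each of those pages then only needs to be fetched once, with the total results value scraped from it.</p>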
<div class="wp-caption aligncenter" style="width: 510px"><a href="http://www.flickr.com/photos/55336121@N00/5455553450/"><img title="Chinese in Australia - Trove graph" src="http://farm6.static.flickr.com/5180/5455553450_9fbd539d2f.jpg" alt="" width="500" height="330" /></a><p class="wp-caption-text">My first graph: Chinese in Australia (The Chinese Australian expert in my house predicted the 1888 peak.) </p></div>
<p>I was pretty pleased with that, but of course the raw numbers of articles on their own are rather misleading. The more interesting question was what proportion of the total number of articles for that year the search represents. Another quick tweak and I was grabbing the overall totals and calculating the proportions.</p>
<div class="wp-caption aligncenter" style="width: 510px"><a href="http://www.flickr.com/photos/55336121@N00/5455948202/"><img title="Trove graph - Chinese in Australia with proportions" src="http://farm6.static.flickr.com/5100/5455948202_5174a7a3de.jpg" alt="" width="500" height="336" /></a><p class="wp-caption-text">Total numbers versus proportions -- Chinese in Australia #2</p></div>
<p>At this point I invited my Twitter followers to suggest some possible topics &#8212; you can <a href="http://www.flickr.com/photos/55336121@N00/sets/72157626078999182/">see the results on Flickr</a>.</p>
<p>But what do the peaks and troughs represent? I wanted to use the graphs as a way of exploring the content itself. This was possible as I&#8217;d saved the data as JSON and used <a href="http://www.jqplot.com/">jqPlot</a> to create the graphs in an ordinary HTML page. Courtesy of some clever hooks in the backend of jqPlot I could capture the value of any point as it was clicked. That gave me the year, so all I had to do was combine this with the search keyword values and send off a request to <a href="http://trove.nla.gov.au/newspaper?q=">Trove</a>.</p>
<p>So now instead of just looking at the graphs, you could explore them.</p>
<div id="attachment_1185" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/shed/trove/graphs/chinese.html"><img class="size-medium wp-image-1185" title="chinese-graph" src="http://discontents.com.au/wp-content/uploads/2011/03/chinese-graph-300x218.png" alt="" width="300" height="218" /></a><p class="wp-caption-text">Explore -- Chinese in Australia #3</p></div>
<p>Perhaps you&#8217;re wondering how I managed to pull the Trove results into the page? Just a bit of simple AJAX magic combined with my own <a href="http://wraggelabs.appspot.com/api/newspapers/">unofficial Trove API</a>. (More about that in the next exciting installment!)</p>
<p>I&#8217;ve created a <a href="http://wraggelabs.com/shed/trove/graphs/">little gallery of graphs</a> to explore. I&#8217;m still open to suggestions!</p>
<p>The code for gathering the data is all on <a href="https://bitbucket.org/wragge/trove-tools/overview">Bitbucket</a>, so start building your own. Just run the &#8216;do_totals.py&#8217; script in the bin directory from the command line. The script takes two flags:</p>
<ul>
<li>-q (--query) the url of your Trove search (compulsory)</li>
<li>-f (--filename) the path and filename for your data file (don&#8217;t include an extension)</li>
</ul>
<p>The script will create a javascript file containing two JSON objects, &#8216;totals&#8217; and &#8216;ratios&#8217;. These can then be fed to jqPlot. View the source of one of my interactive graphs to see how.</p>
<p>Of course it would be really nice to create a web service where people could create, share, compare and combine their graphs &#8212; but that might have to await a generous benefactor&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/mining-the-treasures-of-trove-part-2/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Cloudy biographies and portrait walls</title>
		<link>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls</link>
		<comments>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls#comments</comments>
		<pubDate>Sat, 24 Jan 2009 08:26:12 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[ADB Online]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Cooliris]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[word clouds]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=409</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
With a bit of time to play over Christmas I had a go at applying some of the techniques described at ProgrammingHistorian to the ADB Online.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=409"><!-- &nbsp; --></abbr>
<p>With a bit of time to play over Christmas I had a go at applying some of the techniques described at <a href="http://niche.uwo.ca/programming-historian/index.php"><em>ProgrammingHistorian</em></a> to the <a href="http://www.adb.online.anu.edu.au/adbonline.htm">ADB Online</a>.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had to offer as a way of improving access to the articles.</p>
<p>So I set about learning Python and was soon downloading and scraping the more than 10,000 articles that make up the ADB online.</p>
<p>My first tests revealed that the most frequent words in ADB articles were&#8230;</p>
<p style="text-align: center;"><strong>born</strong> and <strong>died</strong></p>
<p style="text-align: left;">Who&#8217;d have thought it? In a biographical dictionary?</p>
<p style="text-align: left;">After further refining the stopwords list I started to generate some useful clouds. Finally, after 147 minutes of processing time, I had a <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html">word cloud</a> representing the content of all 16 volumes of the <em>Australian Dictionary of Biography</em>.</p>
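<p>The counting behind a word cloud is simple enough to sketch in Python 3 with collections.Counter. This is a toy illustration rather than the script that processed the ADB: the stopword list is tiny and invented (adding &#8216;born&#8217; and &#8216;died&#8217; to it is an assumption), the sample sentence is made up, and the cut-off of 200 words simply mirrors the size of the finished cloud.</p>
<pre class="brush: python; gutter: false">import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "in", "of", "a", "was", "he", "she", "born", "died"}

def word_frequencies(text, stopwords=STOPWORDS, top=200):
    """Return the most common words in text, ignoring stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(top)

sample = "He was born in Sydney and died in Sydney. She was born in Melbourne."
print(word_frequencies(sample))  # [('sydney', 2), ('melbourne', 1)]</pre>
<p>The resulting counts are what determine the relative size of each word in a cloud.</p>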
<div id="attachment_559" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html"><img class="size-medium wp-image-559" title="adb-cloud-complete" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-complete-300x195.jpg" alt="The complete ADB word cloud" width="300" height="195" /></a><p class="wp-caption-text">The complete ADB word cloud</p></div>
<p style="text-align: left;"><span id="more-409"></span>The words in the cloud are linked back to the ADB&#8217;s own search engine, allowing the cloud to be used as a way of exploring the articles themselves.</p>
<p style="text-align: left;">It shows the top 200 words, but if you want to see the rest you can download the <a href="http://discontents.com.au/shed/adb/clouds/wordfreqs.txt">raw word frequency file</a> (&gt;1 MB text file).</p>
<p style="text-align: left;">What can you see? Amongst other things, John is obviously the most popular name, Sydney just edges out Melbourne as the most popular place, and burial beats cremation as the most common mode of dispatch. It&#8217;s fun to explore.</p>
<p style="text-align: left;">But of course this then set me wondering about how these frequencies might change with the development of the ADB and changes in its subjects. So I generated word clouds for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html">each volume</a> and for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html">each chronological series</a>.</p>
<div id="attachment_563" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html"><img class="size-medium wp-image-563" title="adb-cloud-volumes" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-volumes-300x195.jpg" alt="Word clouds by volume" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by volume</p></div>
<div id="attachment_564" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html"><img class="size-medium wp-image-564" title="adb-cloud-series" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-series-300x195.jpg" alt="Word clouds by series" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by series</p></div>
<p>I even added some simple Javascript slideshows so you could watch the clouds evolve.</p>
<p>One of the most obvious features in the series clouds is the gradual disappearance of &#8216;land&#8217;. It&#8217;s one of the most prominent words in the first series, but gradually fades until it disappears completely in the last.</p>
<p>After this successful foray into the world of word clouds, I began to think about other ways of visualising the ADB&#8217;s content. Many of the articles have portrait images. Wouldn&#8217;t it be interesting to use the images themselves as the entry point to the biographical articles?</p>
<p>I&#8217;d already been <a href="http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d">playing with CoolIris</a>, so I decided to harvest all the portrait references and use them to create a 3D wall. The <a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html">result is pretty spectacular</a>.</p>
<div id="attachment_569" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html"><img class="size-medium wp-image-569" title="gallery" src="http://discontents.com.au/wp-content/uploads/2009/01/gallery-300x66.jpg" alt="ADB portrait browser" width="300" height="66" /></a><p class="wp-caption-text">ADB portrait browser</p></div>
<p>Some technical details about the clouds and the portrait browser follow, for those interested in such things&#8230;</p>
<h3>Gathering your words</h3>
<p>Conveniently for me, <em>ProgrammingHistorian</em> uses the <em>Dictionary of Canadian Biography</em> as its main example, so there was much code that I could <span style="text-decoration: line-through;">just cut and paste</span> carefully examine and utilise. As the examples show, it&#8217;s easy to grab a webpage and analyse its content on the fly. But I wanted to process more than 10,000 pages and I knew that I was unlikely to get it working the first time round, so I decided to download the files first and then work on them locally. PH provided a basic example, to which I added some error-handling and the necessary loops to cycle through the ADB files. Because I had a bit of inside knowledge I cheated and hard-coded the number of articles in each volume. If I hadn&#8217;t known this I would have had to scrape all the browse pages, pulling out the links and creating a list of individual ids – not hard, but a bit tedious. Anyway, this is how it ended up:</p>
<pre class="brush: python">
# download_adb.py

import urllib2, time, os, sys
import dh
items = (565, 575, 607, 526, 614, 533, 543, 723, 737, 742, 737, 759, 755, 721, 703, 714, 694, 126)
if os.path.exists(&#039;adb&#039;) == 0: os.mkdir(&#039;adb&#039;)

for v in range(0,18):
    for i in range (1,(items[v]+1)):
        if v == 0:
            filename = &#039;AS1%04db.htm&#039; % i
        else:
            filename = &#039;A%02d%04db.htm&#039; % (v, i)
        if os.path.isfile(&#039;adb/&#039; + filename) == 0:
            print &#039;Processing: &#039; + filename
            url = &#039;http://adbonline.anu.edu.au/biogs/&#039; + filename
            try:
                response = urllib2.urlopen(url)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                html = response.read()
                f = open(&#039;adb/&#039; + filename, &#039;w&#039;)
                f.write(html)
                f.close()
                time.sleep(2)
        else:
            print &quot;File already downloaded&quot;
        sys.stdout.flush()</pre>
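<p>For anyone wanting to reuse the file-naming pattern without the download loop, here is a minimal sketch in Python 3 (the script above is Python 2). The function name <code>article_filenames</code> and the tiny counts are made up for illustration; only the two filename patterns come from <code>download_adb.py</code>.</p>

```python
# Sketch only: rebuild the list of ADB article filenames from per-volume counts.
# The two naming patterns are copied from download_adb.py above;
# the function name and the sample counts are hypothetical.

def article_filenames(counts):
    """Yield one filename per article, volume by volume."""
    for volume, count in enumerate(counts):
        for number in range(1, count + 1):
            if volume == 0:
                # Index 0 uses the 'AS1' prefix, as in the original script.
                yield 'AS1%04db.htm' % number
            else:
                yield 'A%02d%04db.htm' % (volume, number)


# Made-up counts, just to show the pattern.
names = list(article_filenames((2, 1, 3)))
print(names)
```

<p>Keeping the name generation separate from the fetching makes it easy to check the list before hitting the server.</p>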
<h3>Learning to count</h3>
<p>Before too long I had a directory full of about 11,000 little html files just waiting for me to begin my evil experiments. First I had to slice them up and pull out all the interesting bits. By examining the code of the pages I could see that the main content was inside a div with the id of &#8216;content&#8217;. Using the Beautiful Soup Python library, I was easily able to extract this div. But the content div also usually included a portrait image and a bibliography. Once again I dipped into Beautiful Soup to discard all the unwanted bits. The slicing and dicing went something like this:</p>
<pre class="brush: python">
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
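    footer = soup.findAll(id=&quot;selectbib&quot;)
    if len(footer) &gt; 0:
        # discard the &#039;selectbib&#039; element and the paragraphs that follow it
        paras = footer[0].findNextSiblings(&#039;p&#039;)
        for para in paras:
            para.extract()
        footer[0].extract()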
    content = soup.findAll(id=&quot;content&quot;)</pre>
<p>Now I had the text of the article to play with. Following the PH examples it wasn&#8217;t long before I could extract word-frequency tables from a few files at a time. However, when I tried to process all the articles from a particular volume it took a verrry long time. I fiddled a bit with the code and amazed myself by dramatically improving the performance. I replaced the <em>wordListToFreqDict</em> function provided by PH with my own modified version:</p>
<pre class="brush: python">
def wordListToFreqDict2(wordlist):
    worddict = dict.fromkeys(wordlist)
    wordfreq = [wordlist.count(p) for p in worddict.keys()]
    return dict(zip(worddict,wordfreq))
</pre>
<p>The <code>worddict = dict.fromkeys(wordlist)</code> line made all the difference. It builds a dictionary keyed by the unique words, so each unique word only has to be counted against the full word list once.  With this hack in place I was able to process a complete volume in a few minutes.</p>
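<p>The same counts can also be gathered in a single pass over the word list. This is only a sketch in Python 3, not the code used for the clouds: it relies on <code>collections.Counter</code> from the standard library (Python 2.7 and later), and the function name and sample words are made up.</p>

```python
from collections import Counter


def word_list_to_freq_dict(wordlist):
    """Count every word in one pass and return a plain word -> count dict."""
    return dict(Counter(wordlist))


# Made-up sample, standing in for the words of an article.
sample = ['land', 'john', 'land', 'sydney', 'land', 'john']
freqs = word_list_to_freq_dict(sample)

# Most frequent first.
ranked = sorted(freqs.items(), key=lambda pair: -pair[1])
print(ranked)
```

<p>Because each word is looked at only once, the running time grows with the length of the list rather than with the length multiplied by the number of unique words.</p>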
<p>I was already using a list of stopwords provided by PH to exclude things such as &#8216;such&#8217;, &#8216;as&#8217; and &#8216;and&#8217;, but obviously a few additions were necessary. To the list of stopwords I added:</p>
<pre class="brush: python">
stopwords += [&#039;january&#039;, &#039;february&#039;, &#039;march&#039;, &#039;april&#039;, &#039;may&#039;, &#039;june&#039;, &#039;july&#039;, &#039;august&#039;, &#039;september&#039;, &#039;october&#039;, &#039;november&#039;, &#039;december&#039;]
stopwords += [&#039;new&#039;, &#039;south&#039;, &#039;wales&#039;, &#039;australia&#039;, &#039;australian&#039;, &#039;victoria&#039;, &#039;south&#039;, &#039;western&#039;, &#039;queensland&#039;, &#039;tasmania&#039;]
#stopwords += [&#039;sydney&#039;, &#039;melbourne&#039;, &#039;brisbane&#039;, &#039;adelaide&#039;, &#039;perth&#039;, &#039;hobart&#039;]
stopwords += [&#039;died&#039;, &#039;born&#039;, &#039;life&#039;, &#039;lived&#039;, &#039;married&#039;, &#039;father&#039;, &#039;wife&#039;, &#039;children&#039;, &#039;son&#039;, &#039;sons&#039;, &#039;daughter&#039;, &#039;daughters&#039;, &#039;brother&#039;, &#039;brothers&#039;]
stopwords += [&#039;street&#039;, &#039;st&#039;, &#039;year&#039;, &#039;years&#039;, &#039;months&#039;, &#039;acre&#039;, &#039;acres&#039;, &#039;ha&#039;]
stopwords += [&#039;e&#039;, &#039;m&#039;, &#039;b&#039;, &#039;c&#039;, &#039;w&#039;, &#039;j&#039;, &#039;d&#039;, &#039;n&#039;, &#039;f&#039;, &#039;g&#039;, &#039;h&#039;, &#039;i&#039;, &#039;ii&#039;, &#039;l&#039;, &#039;o&#039;, &#039;p&#039;, &#039;th&#039;, &#039;r&#039;, &#039;t&#039;, &#039;u&#039;, &#039;r&#039;, &#039;nd&#039;]
</pre>
<p>The first two lines should be pretty obvious. As you can see, I originally excluded names of the capital cities, but then realised that you could watch Sydney and Melbourne battle it out for pre-eminence, so I excluded the exclusion. Also out were family relations and various other words that turned up in almost every article. Cleaning out all the non-alphabetical characters from the text had left a lot of orphaned letters that had once been things like £ signs, so I had to dispose of them as well.</p>
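<p>To see where the orphaned letters come from, here is a rough Python 3 sketch. It assumes the non-alphabetic clean-up behaves like a simple regular expression, which may not match what the PH helpers actually do; <code>strip_non_alpha</code>, <code>remove_stopwords</code> and the sample sentence are stand-ins invented for this example.</p>

```python
import re


def strip_non_alpha(text):
    """Lower-case the text and keep only runs of the letters a-z (an assumption)."""
    return re.findall('[a-z]+', text.lower())


def remove_stopwords(words, stopwords):
    """Drop every word that appears in the stopword list."""
    stop = set(stopwords)  # set membership keeps the filtering fast
    return [w for w in words if w not in stop]


# Made-up sentence: initials and abbreviations leave single letters behind.
text = 'J. P. Smith, b. 1850, paid 5 pounds 10s. for land in Sydney'
tokens = strip_non_alpha(text)

# Made-up additions to the stopword list.
extra_stopwords = ['j', 'p', 'b', 's', 'born', 'died']
kept = remove_stopwords(tokens, extra_stopwords)
print(tokens)
print(kept)
```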
<p>The modules for actually generating the clouds were mostly just copied from PH with a few minor changes. My complete script is here:</p>
<pre class="brush: python">
# adb-text-count.py

import urllib2
import dh, os, sys, time
from BeautifulSoup import BeautifulSoup
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
filelist = dh.getFileNames(dir)

f = open(&#039;wordlist.txt&#039;, &#039;w&#039;)
for file in filelist:
    print &#039;Processing &#039; + file
    sys.stdout.flush()
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
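    footer = soup.findAll(id=&quot;selectbib&quot;)
    if len(footer) &gt; 0:
        # discard the &#039;selectbib&#039; element and the paragraphs that follow it
        paras = footer[0].findNextSiblings(&#039;p&#039;)
        for para in paras:
            para.extract()
        footer[0].extract()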
    content = soup.findAll(id=&quot;content&quot;)
    text = dh.stripTags(str(content[0]))
    fullwordlist = dh.stripNonAlpha(text.lower())
    wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
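    # trailing space keeps the last word of one article apart from the first word of the next
    f.write(&quot; &quot;.join(wordlist) + &quot; &quot;)
f.close()
f = open(&#039;wordlist.txt&#039;)
words = f.read()
f.close()
# split() with no argument ignores the trailing space
wordlist = words.split()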
dictionary = dh.wordListToFreqDict2(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
f = open(&#039;wordfreqs.txt&#039;, &#039;w&#039;)
for s in sorteddict: f.write(str(s)+&quot;\n&quot;)
f.close()
print &#039;Dictionary created&#039;
sys.stdout.flush()
# create tag cloud and open in Firefox
cloudsize = 200
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize][0]
freqrange = maxfreq - minfreq
outstring = &#039;&#039;
resorteddict = dh.reSortFreqDictAlpha(sorteddict[:cloudsize])
print &#039;Creating cloud&#039;
sys.stdout.flush()
for k in resorteddict:
    kfreq = k[0]
    klabel = k[1]
    klabel = dh.undecoratedHyperlink(&#039;http://adbonline.anu.edu.au/scripts/adbp-ent_search.php?ranktext=&#039; + k[1] + &#039;&amp;amp;search=Go!&#039;, k[1])
    scalingfactor = (kfreq - minfreq) / float(freqrange)
    outstring += &#039; &#039; + dh.scaledFontSizeSpan(klabel, scalingfactor) + &#039; &#039;
dh.wrapStringInHTML(&quot;html-to-tag-cloud&quot;, dh.defaultCSSDiv(outstring), &quot;Complete&quot;)
finish = time.time()
print &quot;Finished at: &quot;, time.asctime(time.localtime(finish))
print &quot;Total time: &quot;, finish - start
</pre>
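<p>The scaling step in the script maps each word&#8217;s frequency onto a value between 0 and 1 before it is turned into a font size. Here is a small Python 3 sketch of that idea. The pixel sizes and both function names are made up, the minimum is taken from the sample itself rather than from the next-ranked word as in the script, and the way <code>dh.scaledFontSizeSpan</code> really converts the factor is not shown here.</p>

```python
def scaling_factors(freqs):
    """Map raw counts onto 0..1, guarding against a zero range."""
    maxfreq = max(freqs)
    minfreq = min(freqs)
    freqrange = maxfreq - minfreq
    if freqrange == 0:
        # Every word is equally frequent, so give them all the same weight.
        return [1.0 for _ in freqs]
    return [(f - minfreq) / float(freqrange) for f in freqs]


def font_size(factor, smallest=10, largest=40):
    """Linear interpolation between two pixel sizes (both values are made up)."""
    return int(round(smallest + factor * (largest - smallest)))


# Made-up counts for four words.
counts = [120, 60, 30, 30]
sizes = [font_size(f) for f in scaling_factors(counts)]
print(sizes)
```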
<h3>Biographies in 3D</h3>
<p>To display all the portrait images in CoolIris I had to harvest all the image details and then write them to a Media RSS file for CoolIris to read.</p>
<p>Extracting the details of all the thumbnail versions of the portraits in the ADB was easy using Beautiful Soup. But I also needed the paths to the larger versions of the portraits stored on the sites of the repositories that hold the originals. All of these sites present the images differently, so a different scraper was needed for each of them. As yet I&#8217;ve only included major libraries and archives – I may add some more if I get the time.</p>
<p>Once the paths to the thumbnails and large versions had been harvested, it was just a matter of writing the RSS feed. Actually, I created a series of RSS files, one for each volume, linked using &#8216;rel=previous&#8217; and &#8216;rel=next&#8217; attributes. This helped speed up the loading of the gallery. For what it&#8217;s worth, the complete code is here:</p>
<pre class="brush: python">
# adb-portraits.py

import socket, urllib2, urllib
import dh, os, sys, time, re
from BeautifulSoup import BeautifulSoup
# timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
for i in range(8,18):
    if (i == 17): vol = &quot;AS1&quot;
    else: vol = &quot;A%02d&quot; % i
    filelist = dh.getFileNamesByVol2(dir, vol)
    f = open(&#039;adb-portraits-%s.rss&#039; % i, &#039;w&#039;)
    f.write(&quot;&lt;?xml version=&#039;1.0&#039; encoding=&#039;utf-8&#039; standalone=&#039;yes&#039;?&gt;\n&quot;)
    f.write(&quot;&lt;rss version=&#039;2.0&#039; xmlns:media=&#039;http://search.yahoo.com/mrss/&#039; xmlns:atom=&#039;http://www.w3.org/2005/Atom&#039;&gt;\n&quot;)
    f.write(&quot;&lt;channel&gt;\n&quot;)
    f.write(&quot;&lt;title&gt;ADB Online Portrait Browser&lt;/title&gt;\n&quot;)
    f.write(&quot;&lt;description&gt;Portraits of individuals included in the Australian Dictionary of Biography&lt;/description&gt;\n&quot;)
    f.write (&quot;&lt;link&gt;http://www.adb.online.anu.edu.au&lt;/link&gt;\n&quot;)
    if (i &gt; 1):
        f.write (&quot;&lt;atom:link rel=&#039;previous&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i-1))
    if (i &lt; 17):
        f.write (&quot;&lt;atom:link rel=&#039;next&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i+1))
    for file in filelist:
        print str(file)
        sys.stdout.flush()
        g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
        html = g.read()
        g.close()
        #print html
        sys.stdout.flush()
        soup = BeautifulSoup(html)
        imagediv = soup.findAll(id=&quot;imagebox&quot;)
        if len(imagediv) &gt; 0 :
            print &quot;Found an image&quot;
            sys.stdout.flush()
            links = imagediv[0].findAll(&#039;a&#039;)
            if len(links) &gt; 1:
                link = urllib.unquote(links[(len(links)-1)][&#039;href&#039;][31:])
            else:
                link = urllib.unquote(links[0][&#039;href&#039;][31:])
            print link
            sys.stdout.flush()
            try:
                response = urllib2.urlopen(link)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                id = str(file)[:7]
                thumbnail = &#039;http://www.adb.online.anu.edu.au&#039; + imagediv[0].img[&#039;src&#039;].lstrip(&#039;.&#039;)
                # print thumbnail
                title = imagediv[0].p.contents[0].split(&#039;,&#039;)[0].strip().replace(&#039; - &#039;, &#039;-&#039;)
                title = title.encode(&#039;utf-8&#039;)
                print &quot;Processing: &quot; + title
                sys.stdout.flush()
                html = response.read()
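                imgsoup = BeautifulSoup(html)
                # start each article with empty values so unmatched links are skipped
                img = &quot;&quot;
                repository = &quot;&quot;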
                if (link.find(&#039;sl.nsw&#039;) &gt; -1):
                    if (link.find(&#039;ebindshow.pl&#039;) == -1): # Not thumbnail pages - see John Bingle
                        if (html.find(&#039;Higher quality image&#039;) != -1):
                            img = imgsoup.findAll(alt=&quot;Higher quality image&quot;)[0].parent[&#039;href&#039;].split(&#039;?&#039;)[1]
                            #img = imgsoup.td.a[&#039;href&#039;].split(&#039;?&#039;)[1]
                        else:
                            img = imgsoup.table.findAll(&#039;tr&#039;)[2].img[&#039;src&#039;]
                        repository = &quot;State Library of NSW&quot;
                elif (link.find(&#039;slv.vic&#039;) &gt; -1):
                    img = imgsoup.findAll(id=&#039;ImageDisplay&#039;)[0].img[&#039;src&#039;]
                    repository = &quot;State Library of Victoria&quot;
                elif (link.find(&#039;slsa.sa&#039;) &gt; -1):
                    img = imgsoup.findAll(&#039;td&#039;)[1].img[&#039;src&#039;]
                    img = link[:link.rfind(&#039;/&#039;)+1] + img
                    repository = &quot;State Library of SA&quot;
                elif (link.find(&#039;nla.gov&#039;) &gt; -1):
                    img = link + &#039;-v&#039;
                    repository = &quot;National Library of Australia&quot;
                elif (link.find(&#039;naa.gov&#039;) &gt; -1):
                    barcode = link[link.rfind(&#039;=&#039;)+1:]
                    img = &quot;http://naa16.naa.gov.au/rs_images/ShowImage.php?B=%s&amp;T=P&quot; % barcode
                    repository = &quot;National Archives of Australia&quot;
                elif (link.find(&#039;territorystories.nt.gov&#039;) &gt; -1):
                    img = imgsoup.table.img[&#039;src&#039;]
                    repository = &quot;Northern Territory Library&quot;
                elif (link.find(&#039;statelibrary.tas.gov&#039;) &gt; -1):
                    if (html.find(&#039;No matches were found&#039;) == -1):
                        img =imgsoup.blockquote.img[&#039;src&#039;]
                        repository = &quot;State Library of Tasmania&quot;
                elif (link.find(&#039;slq.qld.gov&#039;) &gt; -1):
                    img = imgsoup.findAll(attrs={&quot;class&quot;:&quot;pictureback&quot;})[0].a[&#039;onclick&#039;]
                    #img = img[img.find(&#039;http&#039;):img.find(]
                    img = re.search(r&#039;http://[\w/.]*\.jpg&#039;, img).group()
                    repository = &quot;State Library of Queensland&quot;
                if (len(img) &gt; 0):
                    f.write(&quot;&lt;item&gt;\n&quot;)
                    f.write(&quot;&lt;guid isPermaLink=&#039;false&#039;&gt;%s&lt;/guid&gt;\n&quot; % id)
                    f.write(&quot;&lt;title&gt;%s -- %s&lt;/title&gt;\n&quot; % (title, repository))
                    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au/biogs/%sb.htm&lt;/link&gt;\n&quot; % id)
                    f.write(&quot;&lt;media:thumbnail url=&#039;%s&#039; /&gt;\n&quot; % thumbnail.replace(&#039;&amp;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;media:content url=&#039;%s&#039; type=&#039;image/jpeg&#039; /&gt;\n&quot; % img.replace(&#039;&amp;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;/item&gt;\n&quot;)
                    f.flush()
                    print &quot;Success!&quot;
                    sys.stdout.flush()
                img = &quot;&quot;
    f.write(&quot;&lt;/channel&gt;\n&quot;)
    f.write(&quot;&lt;/rss&gt;\n&quot;)
    f.close()
</pre>
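<p>The script escapes ampersands in the image URLs by hand. A slightly more defensive way to build each item, sketched in Python 3 with the standard library&#8217;s <code>xml.sax.saxutils</code>, is shown below. The function name <code>media_rss_item</code> and all the sample values are invented for the example; the element names follow the feed written above.</p>

```python
from xml.sax.saxutils import escape, quoteattr


def media_rss_item(guid, title, link, thumbnail, image):
    """Return one Media RSS <item> as a string, with text and attributes escaped."""
    lines = [
        "<item>",
        "<guid isPermaLink='false'>%s</guid>" % escape(guid),
        "<title>%s</title>" % escape(title),
        "<link>%s</link>" % escape(link),
        # quoteattr() adds the surrounding quotes and escapes &, < and >
        "<media:thumbnail url=%s />" % quoteattr(thumbnail),
        "<media:content url=%s type='image/jpeg' />" % quoteattr(image),
        "</item>",
    ]
    return "\n".join(lines) + "\n"


# Made-up values; example.org stands in for a repository site.
item = media_rss_item(
    'A080001',
    'Smith & Sons -- Example Library',
    'http://example.org/biogs/A080001b.htm',
    'http://example.org/thumb.jpg',
    'http://example.org/ShowImage.php?B=123&T=P',
)
print(item)
```

<p>Escaping the title as well as the URLs means a stray ampersand in a name is less likely to break the feed.</p>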
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A wordled Constitution</title>
		<link>http://discontents.com.au/shoebox/web/a-wordled-constitution</link>
		<comments>http://discontents.com.au/shoebox/web/a-wordled-constitution#comments</comments>
		<pubDate>Mon, 22 Dec 2008 11:42:59 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[web]]></category>
		<category><![CDATA[constitution]]></category>
		<category><![CDATA[federation]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=394</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=A+wordled+Constitution&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=web&amp;rft.source=discontents&amp;rft.date=2008-12-22&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/web/a-wordled-constitution&amp;rft.language=English"></span>
If you haven&#8217;t played with Wordle yet, you should. Feed it your latest article, your thesis, your blog and see what emerges from the cloud. Some months ago I wordled the Australian Constitution (as you do). Wordle&#8217;s expert legal analysis offers a fairly positive assessment of our federal system, suggesting that Commonwealth and state powers [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=A+wordled+Constitution&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=web&amp;rft.source=discontents&amp;rft.date=2008-12-22&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/web/a-wordled-constitution&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=394"><!-- &nbsp; --></abbr>
<div class="wp-caption alignright" style="width: 170px"><a href="http://www.wordle.net/gallery/wrdl/85208/Constitution_of_Australia"><img alt="Wordle&#39;s interpretation of the Australian Constitution" src="http://www.wordle.net/thumb/wrdl/85208/Constitution_of_Australia" title="The Australian Constitution" width="160" height="120" /></a><p class="wp-caption-text">Wordle&#39;s interpretation of the Australian Constitution</p></div>
<p>If you haven&#8217;t played with <a href="http://www.wordle.net">Wordle</a> yet, you should. Feed it your latest article, your thesis, your blog and see what emerges from the cloud.</p>
<p>Some months ago I wordled the Australian Constitution (as you do). Wordle&#8217;s expert legal analysis offers a fairly positive assessment of our federal system, suggesting that Commonwealth and state powers are fairly well balanced. Who needs a High Court when you can just count words?</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shoebox/web/a-wordled-constitution/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Archives in 3D</title>
		<link>http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d</link>
		<comments>http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d#comments</comments>
		<pubDate>Wed, 17 Dec 2008 03:01:43 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[archives]]></category>
		<category><![CDATA[hacks]]></category>
		<category><![CDATA[Cooliris]]></category>
		<category><![CDATA[greasemonkey]]></category>
		<category><![CDATA[recordsearch]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=376</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Archives+in+3D&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2008-12-17&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d&amp;rft.language=English"></span>
The new version of my Greasemonkey userscript, RecordSearch Image Tools, gives RecordSearch&#8217;s digital image pages a rather new look. My previous version had done away with the tired ol &#8216;lemon-chiffon&#8217; background colour, but I decided it was time to get a bit more adventurous, so I blitzed the old design and rebuilt the page from [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Archives+in+3D&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2008-12-17&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=376"><!-- &nbsp; --></abbr>
<div id="attachment_377" class="wp-caption alignright" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2008/12/userscript-screenshot1.jpg"><img class="size-medium wp-image-377" title="userscript-screenshot1" src="http://discontents.com.au/wp-content/uploads/2008/12/userscript-screenshot1-300x288.jpg" alt="All dressed up – RecordSearch has a new look" width="300" height="288" /></a><p class="wp-caption-text">All dressed up – RecordSearch has a new look</p></div>
<p>The new version of my Greasemonkey userscript, <a href="http://userscripts.org/scripts/show/33485">RecordSearch Image Tools</a>, gives RecordSearch&#8217;s digital image pages a rather new look. My previous version had done away with the tired ol&#8217; &#8216;lemon-chiffon&#8217; background colour, but I decided it was time to get a bit more adventurous, so I blitzed the old design and rebuilt the page from the beginning.</p>
<p>As you can see from the screenshot, I&#8217;ve tried to give the images as much of the screen as possible. I&#8217;ve also created a consistent set of navigation buttons, and improved the functionality in various ways.<span id="more-376"></span></p>
<div id="attachment_379" class="wp-caption alignright" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2008/12/3dwall-screenshot1.jpg"><img class="size-medium wp-image-379" title="3dwall-screenshot1" src="http://discontents.com.au/wp-content/uploads/2008/12/3dwall-screenshot1-300x187.jpg" alt="Archives in 3D – CEDTs from NAA: ST84/1, 1906/21-30" width="300" height="187" /></a><p class="wp-caption-text">Archives in 3D – CEDTs from NAA: ST84/1, 1906/21-30</p></div>
<p>But the most exciting thing is that I&#8217;ve worked out how to feed the images to the fabulous CoolIris 3D wall. My previous version used the javascript version of CoolIris, which displayed the images as a flat (but still very nice) slideshow. But now, if you have the CoolIris plugin installed you can zoom, pan, fly through the file, dipping in and out as you so desire. It&#8217;s a new way of looking at archives.</p>
<div id="attachment_380" class="wp-caption alignright" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2008/12/3dwall-screenshot2.jpg"><img class="size-medium wp-image-380" title="3dwall-screenshot2" src="http://discontents.com.au/wp-content/uploads/2008/12/3dwall-screenshot2-300x187.jpg" alt="You can zoom in and out, even see a complete file on a single screen – B2455, WRAGGE C L E" width="300" height="187" /></a><p class="wp-caption-text">You can zoom in and out, even see a complete file on a single screen – NAA: B2455, WRAGGE C L E</p></div>
<p>To try for yourself you need to have <a href="http://www.mozilla.com/firefox/">Firefox</a> with the <a href="http://cooliris.com/">Cooliris plugin</a> installed. Then you need to get the <a href="https://addons.mozilla.org/firefox/addon/748">Greasemonkey extension</a> and, finally, install <a href="http://userscripts.org/scripts/show/33485">my userscript</a>. Then just dive into RecordSearch, find a digitised file and enjoy!</p>
<p><em>File links:</em></p>
<ul>
<li><a href="http://www.aa.gov.au/cgi-bin/Search?O=I&amp;Number=7473965">NAA: ST84/1, 1906/21-30</a></li>
<li><a href="http://www.aa.gov.au/cgi-bin/Search?O=I&amp;Number=3445411">NAA: B2455, WRAGGE C L E</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

