<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>discontents &#187; biographies</title>
	<atom:link href="http://discontents.com.au/tag/biographies/feed" rel="self" type="application/rss+xml" />
	<link>http://discontents.com.au</link>
	<description>working for the triumph of content over form, ideas over control, people over systems</description>
	<lastBuildDate>Wed, 21 Jul 2010 23:24:54 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>(a not so) Quick catch up</title>
		<link>http://discontents.com.au/shed/a-not-so-quick-catch-up</link>
		<comments>http://discontents.com.au/shed/a-not-so-quick-catch-up#comments</comments>
		<pubDate>Fri, 07 May 2010 15:37:13 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[the shed]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[games]]></category>
		<category><![CDATA[greasemonkey]]></category>
		<category><![CDATA[identities]]></category>
		<category><![CDATA[machine tags]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[People Australia]]></category>
		<category><![CDATA[semantic web]]></category>
		<category><![CDATA[userscripts]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=843</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=%28a+not+so%29+Quick+catch+up&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2010-05-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/a-not-so-quick-catch-up&amp;rft.language=English"></span>

The trained guinea pigs in the Wragge Labs bunker have been churning out all sorts of stuff in the last few months, and I&#8217;m way behind in my attempts to document their activities. So this is a bit of a catch-up post to try and commit a few pertinent details to the collective memory bank [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=%28a+not+so%29+Quick+catch+up&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2010-05-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/a-not-so-quick-catch-up&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=843"><!-- &nbsp; --></abbr>
<p>The trained guinea pigs in the Wragge Labs bunker have been churning out all sorts of stuff in the last few months, and I&#8217;m way behind in my attempts to document their activities. So this is a bit of a catch-up post to try and commit a few pertinent details to the collective memory bank before they are lost forever in the sleep-deprived fog of day-to-day existence.</p>
<h3>Identity upgrades</h3>
<p>There have been a number of major improvements to <a href="http://wraggelabs.com/identities/">Wragge&#8217;s Identity Browser</a>. Regular viewers will recall that the Identity Browser is built on top of the <a href="http://www.nla.gov.au/apps/srw/search/peopleaustralia">People Australia SRU interface</a>. You might not realise, however, that People Australia contains details of many organisations as well as people. We can only be thankful that it wasn&#8217;t called Entity Australia.</p>
<p>The first version of my Identity Browser only searched for people, but now all your corporate-entity-identification needs are also met, with only a few minor changes to the interface so-beloved by numerous generations of identity seekers. To be specific, through the wonders of drop-down technology you can choose whether you want to search for a person or an organisation. Or not. You can also just ignore that and search for everything and get back sensible results anyway. It&#8217;s your choice. Or not.</p>
<div id="attachment_864" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/identities/"><img class="size-medium wp-image-864" title="identities" src="http://discontents.com.au/wp-content/uploads/2010/05/identities-300x77.jpg" alt="" width="300" height="77" /></a><p class="wp-caption-text">Gaze in awe at the power of my dropdown</p></div>
<p>Ah pattern matching&#8230; there are few phrases so redolent of warm summer days, hidden pleasures, and the subtle delights of wildcard characters. The People Australia SRU interface was sadly lacking in the pattern matching department, but this has now been rectified. So now you can mix your stems and asterisks with wild abandon. Searching for &#8216;Curtin, J*&#8217; will now retrieve all those Curtins whose first names begin with &#8216;J&#8217;. Amazing isn&#8217;t it?</p>
<p>Astonishing too is the fact that the accompanying &#8216;Identify me!&#8217; bookmarklet continues to function with nary a murmur of protest. There is, however, a little bit of cleverness built in to enhance your bookmarklet experience. If the text that you highlight has a comma in it, the Identity Browser will conclude that you&#8217;re feeding it the name of a person – ie Surname, Firstname – and will treat the Firstname as a stem. So if you highlight &#8216;Whitlam, G&#8217; and click on the bookmarklet, the Identity Browser will be kick-started into life, searching for everything that matches surname equals &#8216;Whitlam&#8217; and firstname is like &#8216;G*&#8217;. If there&#8217;s no comma – ie firstname secondname – then it heads off to look for either a person whose surname equals &#8216;secondname&#8217; and whose firstname is like &#8216;firstname*&#8217;, or an organisation whose name includes both &#8216;firstname&#8217; and &#8216;secondname&#8217;. Got all that?</p>
<p>Basically the idea was to try and provide some sensible defaults so you really don&#8217;t have to think about it too much.</p>
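<p>For the curious, the comma rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the Identity Browser's own code: the function name, the tuple layout and the handling of single words or names with more than two parts are all assumptions.</p>

```python
# Hypothetical sketch of the comma rule; names and return format are assumptions.

def build_queries(selected_text):
    """Turn highlighted text into a list of (kind, criteria) search descriptions."""
    text = selected_text.strip()
    if "," in text:
        # 'Surname, Firstname': exact surname, first name treated as a stem.
        surname, firstname = [part.strip() for part in text.split(",", 1)]
        return [("person", {"surname": surname, "firstname": firstname + "*"})]
    words = text.split()
    if len(words) < 2:
        # Assumption: a single word gives nothing to split on, so search everything.
        return [("any", {"name": text})]
    # Assumption: the last word is the surname, everything before it the first name.
    firstname, surname = " ".join(words[:-1]), words[-1]
    return [
        ("person", {"surname": surname, "firstname": firstname + "*"}),
        ("organisation", {"name_words": words}),
    ]
```

<p>Calling <code>build_queries('Whitlam, G')</code> yields a single person search, while text without a comma yields both a person search and an organisation search, mirroring the two cases above.</p>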
<p>I have it in my head to prepare a long and rapturous homage to the wonders of machine tags. With their sly semantic ways and easy-going nature, they offer some exciting possibilities not just for user-generated content, but user-generated meanings and user-generated relationships. But for the full, ripe pleasure of that post you will have to wait another day. For now I shall simply say that as well as RDFa, the Identity Browser provides automagically-generated machine tags.</p>
<p>Where might you use them? Flickr&#8217;s a good place to start. Try identifying the subjects and creators of Flickr photos. At the NSW Reference and Information Services Group Seminar the other day I challenged those in attendance to go forth and machine tag. Already more than 100 machine tags have been added to Flickr using my Identity Browser. Expect to hear more about the Great Flickr Machine Tag Challenge soon&#8230;</p>
<p>One more thing&#8230; try adding &#8216;.rdf&#8217; on to the end of an identity record – eg <a href="http://wraggelabs.com/identities/person/612109.rdf">http://wraggelabs.com/identities/person/612109.rdf</a>. Just an experiment at the moment&#8230;</p>
<h3>More machine tag love</h3>
<p>One night on Twitter, <a href="http://twitter.com/lifeasdaddy">@lifeasdaddy</a> pointed out that someone had started using fragments of URLs from the <a href="http://trove.nla.gov.au/newspaper">NLA newspapers site</a> as tags in the <a href="http://www.powerhousemuseum.com/collection/database/?irn=244414">Powerhouse Museum&#8217;s collection database</a>. In the conversation that ensued with <a href="http://twitter.com/sebchan">@sebchan</a> and others, I suggested that the PHM could encourage this sort of rich tagging by supporting machine tags, with all their wonderful juicy semantic goodness. The guinea pigs got excited as well, and before I knew it, they&#8217;d constructed a little <a href="http://semweb-helper.appspot.com/">Semweb Helper app</a>.</p>
<p>The Semweb Helper comes with its very own custom-tailored bookmarklet. If you find an article on the NLA newspapers site that you&#8217;d like to point to, just click on the bookmarklet and marvel as a range of useful machine tags are automagically generated. Then you just pick the appropriate tag, copy and paste et voila – instant semantic gratification.</p>
<div id="attachment_861" class="wp-caption aligncenter" style="width: 310px"><a href="http://semweb-helper.appspot.com/"><img class="size-medium wp-image-861" title="semweb-helper" src="http://discontents.com.au/wp-content/uploads/2010/05/semweb-helper-300x147.jpg" alt="Screenshot" width="300" height="147" /></a><p class="wp-caption-text">Try out the Semweb Helper</p></div>
<p>It&#8217;s a very simple little app, and really just a demonstration of how semantic web technologies might be made available to the masses. It was also the first time the guinea pigs had been allowed to play with the Google App Engine.</p>
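<p>A machine tag is just a tag with three parts, in the form <code>namespace:predicate=value</code>. The sketch below builds and unpicks tags of that shape. The namespaces, predicates and example URL in it are purely illustrative, since the exact tags the Semweb Helper generates are not listed here, and the function names are invented for the example.</p>

```python
import re

# namespace:predicate=value, with simple alphanumeric namespace and predicate.
MACHINE_TAG = re.compile(r'^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.+)$')

def make_machine_tag(namespace, predicate, value):
    """Join the three parts, quoting the value if it contains whitespace."""
    value = str(value)
    if re.search(r"\s", value):
        value = '"%s"' % value
    return "%s:%s=%s" % (namespace, predicate, value)

def parse_machine_tag(tag):
    """Return (namespace, predicate, value), or None for an ordinary tag."""
    match = MACHINE_TAG.match(tag)
    if match is None:
        return None
    namespace, predicate, value = match.groups()
    return namespace, predicate, value.strip('"')

# Illustrative only: a made-up article URL and a made-up choice of predicate.
tag = make_machine_tag("dc", "identifier", "http://example.org/article/123")
```

<p>Because the value is everything after the first equals sign, a whole URL can ride along inside a single tag, which is what makes this sort of linking possible.</p>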
<h3>Who am I?</h3>
<p>This short catch-up post has become something quite long and rambling. Did I mention that I&#8217;m sleep-deprived? Anyway, a recent addition to the Wragge Labs range of lifestyle accessories is <a href="http://wraggelabs.com/whoami/">&#8216;Who am I?&#8217;</a> – a simple little game that is something like a cross between hangman and Wheel of Fortune. Choosing a person at random from People Australia and the <em>Australian Dictionary of Biography</em>, &#8216;Who am I?&#8217; tests your powers of logic, stamina and historical guesstimation.</p>
<p>Your challenge is to figure out the surname of the mystery historical personage. To help you there are a series of clues, such as their birthplace and known associates. With each guess you also see a little bit more of their portrait. But beware! For ten wrong guesses are all that are permitted to any so brave as to enter upon this quest. Not eleven or twelve, but ten and ten only. To ignore this limit is to invite ridicule and disdain – do so at your peril.</p>
<div id="attachment_858" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/whoami/"><img class="size-medium wp-image-858" title="whoami" src="http://discontents.com.au/wp-content/uploads/2010/05/whoami-300x137.jpg" alt="Who am I screenshot" width="300" height="137" /></a><p class="wp-caption-text">Play Who am I?</p></div>
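<p>The rules boil down to a small piece of state: the hidden surname, the guesses so far, and a count of wrong guesses capped at ten. Here is a minimal sketch, assuming that guesses are single letters as in hangman. The class and method names are invented for illustration and the clues and portrait reveal are left out.</p>

```python
MAX_WRONG_GUESSES = 10  # ten and ten only

class WhoAmI(object):
    """Hypothetical game state for guessing a surname letter by letter."""

    def __init__(self, surname):
        self.surname = surname.lower()
        self.guessed = set()
        self.wrong = 0

    def guess(self, letter):
        letter = letter.lower()
        # Ignore repeats, and anything tried after the game is over.
        if self.finished() or letter in self.guessed:
            return
        self.guessed.add(letter)
        if letter not in self.surname:
            self.wrong += 1

    def masked(self):
        # Show guessed letters and punctuation, hide everything else.
        return "".join(c if c in self.guessed or not c.isalpha() else "_"
                       for c in self.surname)

    def won(self):
        return all(c in self.guessed for c in self.surname if c.isalpha())

    def lost(self):
        return self.wrong >= MAX_WRONG_GUESSES

    def finished(self):
        return self.won() or self.lost()
```

<p>Once <code>lost()</code> is true, further guesses are simply ignored, which is one polite way of enforcing the limit.</p>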
<p>&#8216;Who am I?&#8217; builds upon some work I&#8217;ve been doing for the National Museum of Australia – looking at ways of mashing together various types of date-identified data. As part of that project I&#8217;ve built a series of APIs and have scraped, pummelled and munged data from a variety of sources.</p>
<p>What&#8217;s the point? I wonder this myself sometimes, particularly after I fling such things off into the aethernet and hear naught but a rare retweet. I am, after all, only in it for the glory, oh and the money of course. (Hmmm, I must look again at that business plan.) The point is twofold: first to highlight possibilities for the re-use and remixing of cultural data; second, to play with game-based models for discovery and exploration of cultural resources; and&#8230; err&#8230; thirdly just to try building something a little different.</p>
<p>Of course, if you like &#8216;Who am I?&#8217; you will probably also want to try <a href="http://wraggelabs.com/newsroulette/">Headline Roulette</a>&#8230;</p>
<h3>Headline Roulette Reprieve</h3>
<p>At the end of <a href="http://discontents.com.au/shed/experiments/headline-roulette">our last instalment</a>, the future of <a href="http://wraggelabs.com/newsroulette/">Headline Roulette</a> seemed in dire peril. Changes to the National Library of Australia web site threatened its very existence. Did it have a future? Could it survive? And did anybody care?</p>
<p>As we pick up the story oblivion looms. The feared changes are confirmed, but just as all seems lost&#8230; is it? Could it be? Yes, an advanced search facility is added to the newspapers site within Trove. Sensing this may be their only opportunity, the guinea pigs leap into action, building <a href="http://bitbucket.org/wragge/nla-newspapers-scraper">a new screen-scraper</a>, saving Headline Roulette from doom, and setting the world upon the path to a safer, happier future.</p>
<p>In short, Headline Roulette will live on&#8230; so enjoy.</p>
<h3>Handing out some presents</h3>
<p>My head is easily turned by flattery and praise. Yes, I really am so shallow and so vain. But this means that if people say nice things to me, I&#8217;m inclined to give them presents.</p>
<p>As well as doing exciting things in the web 2.0 realm for the PROV, <a href="http://twitter.com/asaletourneau">@asaletourneau</a> leaves nice comments on this blog. So he earned himself a present. It&#8217;s not much, but I <a href="http://userscripts.org/scripts/show/71421">built a userscript</a> that displays photos from the PROV site in a neat little slideshow (it&#8217;s the non-3D JavaScript version of CoolIris). Install Greasemonkey, get the userscript and <a href="http://proarchives.imagineering.com.au/index_search.asp?searchid=41">try it out</a> (just do a search, then click on the &#8216;Browse as slideshow&#8217; button).</p>
<div id="attachment_852" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2010/05/prov-slideshow.jpg"><img class="size-medium wp-image-852" title="prov-slideshow" src="http://discontents.com.au/wp-content/uploads/2010/05/prov-slideshow-300x187.jpg" alt="Screen capture of slideshow" width="300" height="187" /></a><p class="wp-caption-text">PROV transport photos in a pretty slideshow</p></div>
<p>The State Library of NSW, or more specifically <a href="http://www.twitter.com/ellenforsyth">@ellenforsyth</a>, also earned my favour by inviting me to rave on about Linked Data at the afore-mentioned NSW RISG seminar. As a result, I added support for the SLNSW photo collections to my <a href="http://discontents.com.au/shoebox/archives-shoebox/harvesting-context-1">Flickr Context Harvester</a> userscript. Well&#8230; it&#8217;s the thought that counts, right? Once again – install Greasemonkey, <a href="http://userscripts.org/scripts/show/56135">get the userscript</a> and then <a href="http://acms.sl.nsw.gov.au/item/itemDetailPaged.aspx?itemID=447435">try it out</a>.</p>
<div id="attachment_855" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2010/05/slnsw-flickr.jpg"><img class="size-medium wp-image-855" title="slnsw-flickr" src="http://discontents.com.au/wp-content/uploads/2010/05/slnsw-flickr-300x181.jpg" alt="Flickr Context Harvester screenshot" width="300" height="181" /></a><p class="wp-caption-text">The Flickr Context Harvester in action</p></div>
<h3>And coming up&#8230;</h3>
<p>Stay tuned for more on the Great Flickr Machine Tag Challenge, screencasts demonstrating my Identity Browser, some playing with relationships, and much much more. But right now the squirming baby on my lap needs a nappy change&#8230;</p>
<p>Did I mention that I&#8217;m sleep deprived?</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/a-not-so-quick-catch-up/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cloudy biographies and portrait walls</title>
		<link>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls</link>
		<comments>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls#comments</comments>
		<pubDate>Sat, 24 Jan 2009 08:26:12 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[ADB Online]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Cooliris]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[word clouds]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=409</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>

With a bit of time to play over Christmas I had a go at applying some of the techniques described at ProgrammingHistorian to the ADB Online.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=409"><!-- &nbsp; --></abbr>
<p>With a bit of time to play over Christmas I had a go at applying some of the techniques described at <a href="http://niche.uwo.ca/programming-historian/index.php"><em>ProgrammingHistorian</em></a> to the <a href="http://www.adb.online.anu.edu.au/adbonline.htm">ADB Online</a>.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had to offer as a way of improving access to the articles.</p>
<p>So I set about learning Python and was soon downloading and scraping the more than 10,000 articles that make up the ADB online.</p>
<p>My first tests revealed that the most frequent words in ADB articles were&#8230;</p>
<p style="text-align: center;"><strong>born</strong> and <strong>died</strong></p>
<p style="text-align: left;">Who&#8217;d have thought it? In a biographical dictionary?</p>
<p style="text-align: left;">After further refining the stopwords list I started to generate some useful clouds. Finally after 147 minutes of processing time, I had a <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html">word cloud</a> representing the content of all 16 volumes of the <em>Australian Dictionary of Biography</em>.</p>
<div id="attachment_559" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html"><img class="size-medium wp-image-559" title="adb-cloud-complete" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-complete-300x195.jpg" alt="The complete ADB word cloud" width="300" height="195" /></a><p class="wp-caption-text">The complete ADB word cloud</p></div>
<p style="text-align: left;"><span id="more-409"></span>The words in the cloud are linked back to the ADB&#8217;s own search engine, allowing the cloud to be used as a way of exploring the articles themselves.</p>
<p style="text-align: left;">It shows the top 200 words, but if you want to see the rest you can download the <a href="http://discontents.com.au/shed/adb/clouds/wordfreqs.txt">raw word frequency file</a> (&gt;1mb txt file).</p>
<p style="text-align: left;">What can you see? Amongst other things, John is obviously the most popular name, Sydney just edges out Melbourne as the most popular place, and burial beats cremation as the most common mode of dispatch. It&#8217;s fun to explore.</p>
<p style="text-align: left;">But of course this then set me wondering about how these frequencies might change with the development of the ADB and changes in its subjects. So I generated word clouds for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html">each volume</a> and for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html">each chronological series</a>.</p>
<div id="attachment_563" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html"><img class="size-medium wp-image-563" title="adb-cloud-volumes" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-volumes-300x195.jpg" alt="Word clouds by volume" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by volume</p></div>
<div id="attachment_564" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html"><img class="size-medium wp-image-564" title="adb-cloud-series" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-series-300x195.jpg" alt="Word clouds by series" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by series</p></div>
<p>I even added some simple JavaScript slideshows so you could watch the clouds evolve.</p>
<p>One of the most obvious features in the series clouds is the gradual disappearance of &#8216;land&#8217;. It&#8217;s one of the most prominent words in the first series, but gradually fades until it disappears completely in the last.</p>
<p>After this successful foray into the world of word clouds, I began to think about other ways of visualising the ADB&#8217;s content. Many of the articles have portrait images. Wouldn&#8217;t it be interesting to use the images themselves as the entry point to the biographical articles?</p>
<p>I&#8217;d already been <a href="http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d">playing with CoolIris</a>, so I decided to harvest all the portrait references and use them to create a 3D wall. The <a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html">result is pretty spectacular</a>.</p>
<div id="attachment_569" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html"><img class="size-medium wp-image-569" title="gallery" src="http://discontents.com.au/wp-content/uploads/2009/01/gallery-300x66.jpg" alt="ADB portrait browser" width="300" height="66" /></a><p class="wp-caption-text">ADB portrait browser</p></div>
<p>Some technical details about the clouds and the portrait browser follow, for those interested in such things&#8230;</p>
<h3>Gathering your words</h3>
<p>Conveniently for me, <em>ProgrammingHistorian</em> uses the <em>Dictionary of Canadian Biography</em> as its main example, so there was much code that I could <span style="text-decoration: line-through;">just cut and paste</span> carefully examine and utilise.  As the examples show, it&#8217;s easy to grab a webpage and analyse its content on the fly. But I wanted to process more than 10,000 pages and I knew that I was unlikely to get it working the first time round, so I decided to download the files first and then work on them locally. PH provided a basic example, to which I added some error-handling and the necessary loops to cycle through the ADB files. Because I had a bit of inside knowledge I cheated and hard-coded the numbers of articles in each volume. If I hadn&#8217;t known this I would have had to scrape all the browse pages, pulling out the links and creating a list of individual ids – not hard, but a bit tedious. Anyway this is how it ended up:</p>
<pre><pre class="brush: python">
# download_adb.py

import urllib2, time, os, sys
import dh
items = (565, 575, 607, 526, 614, 533, 543, 723, 737, 742, 737, 759, 755, 721, 703, 714, 694, 126)
if os.path.exists(&#039;adb&#039;) == 0: os.mkdir(&#039;adb&#039;)

for v in range(0,18):
    for i in range (1,(items[v]+1)):
        if v == 0:
            filename = &#039;AS1%04db.htm&#039; % i
        else:
            filename = &#039;A%02d%04db.htm&#039; % (v, i)
        if os.path.isfile(&#039;adb/&#039; + filename) == 0:
            print &#039;Processing: &#039; + filename
            url = &#039;http://adbonline.anu.edu.au/biogs/&#039; + filename
            try:
                response = urllib2.urlopen(url)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                html = response.read()
                f = open(&#039;adb/&#039; + filename, &#039;w&#039;)
                f.write(html)
                f.close()
                time.sleep(2)
        else:
            print &quot;File already downloaded&quot;
        sys.stdout.flush()</pre></pre>
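<p>The filename scheme is the only fiddly part of that loop, so it can help to pull it out and check it without touching the network. The sketch below reuses the counts and format strings from the script above; the function name is invented for the example, and it is written to run unchanged on current Python as well.</p>

```python
# Article counts copied from download_adb.py; the first entry uses the 'AS1' prefix.
ITEMS = (565, 575, 607, 526, 614, 533, 543, 723, 737, 742,
         737, 759, 755, 721, 703, 714, 694, 126)

def adb_filenames(items=ITEMS):
    """Yield every article filename the download loop would request, in order."""
    for volume, count in enumerate(items):
        for number in range(1, count + 1):
            if volume == 0:
                yield "AS1%04db.htm" % number
            else:
                yield "A%02d%04db.htm" % (volume, number)

names = list(adb_filenames())
```

<p>The list starts at <code>AS10001b.htm</code> and ends at <code>A170126b.htm</code>, and its length is simply the sum of the counts: a little over 11,000 files.</p>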
<h3>Learning to count</h3>
<p>Before too long I had a directory full of about 11,000 little html files just waiting for me to begin my evil experiments. First I had to slice them up and pull out all the interesting bits. By examining the code of the pages I could see that the main content was inside a div with the id of &#8216;content&#8217;. Using the Beautiful Soup Python library, I was easily able to extract this div. But the content div also usually included a portrait image and a bibliography. Once again I dipped into Beautiful Soup to discard all the unwanted bits. The slicing and dicing went something like this:</p>
<pre><pre class="brush: python">
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)</pre></pre>
<p>Now I had the text of the article to play with. Following the PH examples it wasn&#8217;t long before I could extract word-frequency tables from a few files at a time. However, when I tried to process all the articles from a particular volume it took a verrry long time. I fiddled a bit with the code and amazed myself by dramatically improving the performance. I replaced the <em>wordListToFreqDict</em> function provided by PH with my own modified version:</p>
<pre><pre class="brush: python">
def wordListToFreqDict2(wordlist):
    worddict = dict.fromkeys(wordlist)
    wordfreq = [wordlist.count(p) for p in worddict.keys()]
    return dict(zip(worddict,wordfreq))
</pre></pre>
<p>The <code>worddict = dict.fromkeys(wordlist)</code> line made all the difference: it builds a dictionary whose keys are the unique words, so each distinct word is counted against the full word list only once.  With this hack in place I was able to process a complete volume in a few minutes.</p>
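<p>Even with that change, every unique word still triggers a full pass over the word list. An alternative, not the one used here, is to count everything in a single pass with <code>collections.Counter</code>, which needs Python 2.7 or later. The sketch below sets the two side by side; the function names are made up for the comparison.</p>

```python
from collections import Counter

def word_list_to_freq_dict_counter(wordlist):
    """Single pass over the list: each word bumps its own tally."""
    return dict(Counter(wordlist))

def word_list_to_freq_dict_fromkeys(wordlist):
    """The approach shown above: one list.count() call per unique word."""
    worddict = dict.fromkeys(wordlist)
    wordfreq = [wordlist.count(p) for p in worddict.keys()]
    return dict(zip(worddict, wordfreq))

sample = ["land", "sydney", "land", "john", "sydney", "land"]
freqs = word_list_to_freq_dict_counter(sample)
```

<p>Both functions return the same dictionary, so one can stand in for the other; the difference only shows up in running time on long word lists.</p>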
<p>I was already using a list of stopwords provided by PH to exclude things such as &#8216;such&#8217;, &#8216;as&#8217; and &#8216;and&#8217;, but obviously a few additions were necessary. To the list of stopwords I added:</p>
<pre><pre class="brush: python">
stopwords += [&#039;january&#039;, &#039;february&#039;, &#039;march&#039;, &#039;april&#039;, &#039;may&#039;, &#039;june&#039;, &#039;july&#039;, &#039;august&#039;, &#039;september&#039;, &#039;october&#039;, &#039;november&#039;, &#039;december&#039;]
stopwords += [&#039;new&#039;, &#039;south&#039;, &#039;wales&#039;, &#039;australia&#039;, &#039;australian&#039;, &#039;victoria&#039;, &#039;south&#039;, &#039;western&#039;, &#039;queensland&#039;, &#039;tasmania&#039;]
#stopwords += [&#039;sydney&#039;, &#039;melbourne&#039;, &#039;brisbane&#039;, &#039;adelaide&#039;, &#039;perth&#039;, &#039;hobart&#039;]
stopwords += [&#039;died&#039;, &#039;born&#039;, &#039;life&#039;, &#039;lived&#039;, &#039;married&#039;, &#039;father&#039;, &#039;wife&#039;, &#039;children&#039;, &#039;son&#039;, &#039;sons&#039;, &#039;daughter&#039;, &#039;daughters&#039;, &#039;brother&#039;, &#039;brothers&#039;]
stopwords += [&#039;street&#039;, &#039;st&#039;, &#039;year&#039;, &#039;years&#039;, &#039;months&#039;, &#039;acre&#039;, &#039;acres&#039;, &#039;ha&#039;]
stopwords += [&#039;e&#039;, &#039;m&#039;, &#039;b&#039;, &#039;c&#039;, &#039;w&#039;, &#039;j&#039;, &#039;d&#039;, &#039;n&#039;, &#039;f&#039;, &#039;g&#039;, &#039;h&#039;, &#039;i&#039;, &#039;ii&#039;, &#039;l&#039;, &#039;o&#039;, &#039;p&#039;, &#039;th&#039;, &#039;r&#039;, &#039;t&#039;, &#039;u&#039;, &#039;r&#039;, &#039;nd&#039;]
</pre></pre>
<p>The first two lines should be pretty obvious. As you can see, I originally excluded names of the capital cities, but then realised that you could watch Sydney and Melbourne battle it out for pre-eminence, so I excluded the exclusion. Also out were family relations and various other words that turned up in almost every article. Cleaning out all the non-alphabetical characters from the text had left a lot of orphaned letters that had once been things like £ signs, so I had to dispose of them as well.</p>
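<p>To see where the orphaned letters come from, here is a rough stand-in for the cleaning steps. The two functions are simplified guesses at what stripping non-alphabetical characters and removing stopwords involve; they are not the PH versions, and the sample sentence and short stopword list are invented.</p>

```python
import re

def strip_non_alpha(text):
    """Lower-case the text and split it on anything that is not a-z."""
    return [word for word in re.split(r"[^a-z]+", text.lower()) if word]

def remove_stopwords(wordlist, stopwords):
    """Drop every word that appears in the stopword list."""
    stopset = set(stopwords)  # set membership is quick for long lists
    return [word for word in wordlist if word not in stopset]

# Invented example: the initial 'J.' is reduced to a lone 'j'.
words = strip_non_alpha("J. Curtin was born in 1885 at Creswick")
kept = remove_stopwords(words, ["was", "in", "at", "born", "j"])
```

<p>Splitting on non-letters turns initials and abbreviations into single characters, which is why so many one- and two-letter entries ended up in the stopword list.</p>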
<p>The modules for actually generating the clouds were mostly just copied from PH with a few minor changes. My complete script is here:</p>
<pre><pre class="brush: python">
# adb-text-count.py

import urllib2
import dh, os, sys, time
from BeautifulSoup import BeautifulSoup
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
filelist = dh.getFileNames(dir)

f = open(&#039;wordlist.txt&#039;, &#039;w&#039;)
for file in filelist:
    print &#039;Processing &#039; + file
    sys.stdout.flush()
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)
    text = dh.stripTags(str(content[0]))
    fullwordlist = dh.stripNonAlpha(text.lower())
    wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
    f.write(&quot; &quot;.join(wordlist) + &quot; &quot;)  # trailing space keeps words from adjacent files apart
f.close()
f = open(&#039;wordlist.txt&#039;)
words = f.read()
f.close()
wordlist = words.split()
dictionary = dh.wordListToFreqDict2(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
f = open(&#039;wordfreqs.txt&#039;, &#039;w&#039;)
for s in sorteddict: f.write(str(s)+&quot;\n&quot;)
f.close()
print &#039;Dictionary created&#039;
sys.stdout.flush()
# create tag cloud and open in Firefox
cloudsize = 200
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize][0]
freqrange = maxfreq - minfreq
outstring = &#039;&#039;
resorteddict = dh.reSortFreqDictAlpha(sorteddict[:cloudsize])
print &#039;Creating cloud&#039;
sys.stdout.flush()
for k in resorteddict:
    kfreq = k[0]
    klabel = k[1]
    klabel = dh.undecoratedHyperlink(&#039;http://adbonline.anu.edu.au/scripts/adbp-ent_search.php?ranktext=&#039; + k[1] + &#039;&amp;amp;search=Go!&#039;, k[1])
    scalingfactor = (kfreq - minfreq) / float(freqrange)
    outstring += &#039; &#039; + dh.scaledFontSizeSpan(klabel, scalingfactor) + &#039; &#039;
dh.wrapStringInHTML(&quot;html-to-tag-cloud&quot;, dh.defaultCSSDiv(outstring), &quot;Complete&quot;)
finish = time.time()
print &quot;Finished at: &quot;, time.asctime(time.localtime(finish))
print &quot;Total time: &quot;, finish - start
</pre></pre>
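<p>The scaling step at the end of the script boils down to mapping each word&#8217;s frequency onto a range from 0 to 1. Here is a minimal sketch of that idea in Python 3, using <code>collections.Counter</code> in place of the <code>dh</code> helpers. The name <code>cloud_entries</code> is hypothetical, and unlike the script it takes the minimum frequency from the last word inside the cloud rather than the first word outside it.</p>

```python
from collections import Counter


def cloud_entries(words, cloudsize=3):
    """Return (word, scaling factor) pairs for the most common words, sorted alphabetically."""
    top = Counter(words).most_common(cloudsize)
    maxfreq = top[0][1]
    minfreq = top[-1][1]
    # Guard against a zero range when all the top words are equally common.
    freqrange = float(maxfreq - minfreq) or 1.0
    return sorted((word, (freq - minfreq) / freqrange) for word, freq in top)


# Made-up word list: 'church' falls outside a cloud of three words.
words = ["sydney"] * 6 + ["melbourne"] * 4 + ["school"] * 2 + ["church"]
for word, factor in cloud_entries(words):
    print(word, factor)
```

<p>A factor of 1.0 gets the largest font and 0.0 the smallest; how the factor is turned into an actual font size is up to the function that writes the span.</p>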
<h3>Biographies in 3D</h3>
<p>To display all the portrait images in CoolIris I had to harvest all the image details and then write them to a Media RSS file for CoolIris to read.</p>
<p>Extracting the details of all the thumbnail versions of the portraits in the ADB was easy using Beautiful Soup. But I also needed the paths to the larger versions of the portraits stored on the sites of the repositories that hold the originals. All of these sites present the images differently, so a different scraper was needed for each of them. As yet I&#8217;ve only included major libraries and archives – I may add some more if I get the time.</p>
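<p>One way to organise a scraper per repository is a lookup table that pairs a fragment of the URL with a function that knows that site. The sketch below is Python 3 and only an assumption about how this could be arranged, not how my script does it. The two rules shown (appending <code>-v</code> for the National Library, building a <code>ShowImage.php</code> URL from the barcode for the National Archives) are the ones used in the script; the function names, the table and the sample URLs are hypothetical.</p>

```python
def nla_image(link, html):
    # The larger NLA image is the page link with '-v' appended.
    return link + "-v"


def naa_image(link, html):
    # The barcode is whatever follows the last '=' in the page link.
    barcode = link[link.rfind("=") + 1:]
    return "http://naa16.naa.gov.au/rs_images/ShowImage.php?B=%s&T=P" % barcode


# (URL fragment, repository name, extractor) - hypothetical table layout.
SCRAPERS = [
    ("nla.gov", "National Library of Australia", nla_image),
    ("naa.gov", "National Archives of Australia", naa_image),
]


def find_image(link, html=""):
    """Return (repository, image URL), or (None, None) if no scraper matches."""
    for fragment, repository, extract in SCRAPERS:
        if fragment in link:
            return repository, extract(link, html)
    return None, None


# Sample links are invented for the example.
print(find_image("http://nla.gov.au/nla.pic-an12345"))
print(find_image("http://example.naa.gov.au/item?Number=123456"))
print(find_image("http://example.org/portrait"))
```

<p>Adding another repository then means writing one more function and one more row in the table, and anything that doesn&#8217;t match is simply skipped.</p>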
<p>Once the paths to the thumbnails and large versions had been harvested, it was just a matter of writing the RSS feed. Actually, I created a series of RSS files, one for each volume, linked using &#8216;rel=previous&#8217; and &#8216;rel=next&#8217; attributes. This helped speed up the loading of the gallery. For what it&#8217;s worth, the complete code is here:</p>
<pre><pre class="brush: python">
# adb-portraits.py

import socket, urllib2, urllib
import dh, os, sys, time, re
from BeautifulSoup import BeautifulSoup
# timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
for i in range(8,18):
    if (i == 17): vol = &quot;AS1&quot;
    else: vol = &quot;A%02d&quot; % i
    filelist = dh.getFileNamesByVol2(dir, vol)
    f = open(&#039;adb-portraits-%s.rss&#039; % i, &#039;w&#039;)
    f.write(&quot;&lt;?xml version=&#039;1.0&#039; encoding=&#039;utf-8&#039; standalone=&#039;yes&#039;?&gt;\n&quot;)
    f.write(&quot;&lt;rss version=&#039;2.0&#039; xmlns:media=&#039;http://search.yahoo.com/mrss/&#039; xmlns:atom=&#039;http://www.w3.org/2005/Atom&#039;&gt;\n&quot;)
    f.write(&quot;&lt;channel&gt;\n&quot;)
    f.write(&quot;\n&quot;)
    f.write(&quot;&lt;description&gt;Portraits of individuals included in the Australian Dictionary of Biography&lt;/description&gt;\n&quot;)
    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au&lt;/link&gt;\n&quot;)
    if (i &gt; 1):
        f.write (&quot;&lt;atom:link rel=&#039;previous&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i-1))
    if (i &lt; 17):
        f.write (&quot;&lt;atom:link rel=&#039;next&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i+1))
    for file in filelist:
        print str(file)
        sys.stdout.flush()
        g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
        html = g.read()
        g.close()
        #print html
        sys.stdout.flush()
        soup = BeautifulSoup(html)
        imagediv = soup.findAll(id=&quot;imagebox&quot;)
        if len(imagediv) &gt; 0 :
            print &quot;Found an image&quot;
            sys.stdout.flush()
            links = imagediv[0].findAll(&#039;a&#039;)
            if len(links) &gt; 1:
                link = urllib.unquote(links[(len(links)-1)][&#039;href&#039;][31:])
            else:
                link = urllib.unquote(links[0][&#039;href&#039;][31:])
            print link
            sys.stdout.flush()
            try:
                response = urllib2.urlopen(link)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                id = str(file)[:7]
                thumbnail = &#039;http://www.adb.online.anu.edu.au&#039; + imagediv[0].img[&#039;src&#039;].lstrip(&#039;.&#039;)
                # print thumbnail
                title = imagediv[0].p.contents[0].split(&#039;,&#039;)[0].strip().replace(&#039; - &#039;, &#039;-&#039;)
                title = title.encode(&#039;utf-8&#039;)
                print &quot;Processing: &quot; + title
                sys.stdout.flush()
                html = response.read()
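                imgsoup = BeautifulSoup(html)
                # start each entry with no image so pages that match no repository are skipped cleanly
                img = &quot;&quot;
                repository = &quot;&quot;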
                if (link.find(&#039;sl.nsw&#039;) &gt; -1):
                    if (link.find(&#039;ebindshow.pl&#039;) == -1): # Not thumbnail pages - see John Bingle
                        if (html.find(&#039;Higher quality image&#039;) != -1):
                            img = imgsoup.findAll(alt=&quot;Higher quality image&quot;)[0].parent[&#039;href&#039;].split(&#039;?&#039;)[1]
                            #img = imgsoup.td.a[&#039;href&#039;].split(&#039;?&#039;)[1]
                        else:
                            img = imgsoup.table.findAll(&#039;tr&#039;)[2].img[&#039;src&#039;]
                        repository = &quot;State Library of NSW&quot;
                elif (link.find(&#039;slv.vic&#039;) &gt; -1):
                    img = imgsoup.findAll(id=&#039;ImageDisplay&#039;)[0].img[&#039;src&#039;]
                    repository = &quot;State Library of Victoria&quot;
                elif (link.find(&#039;slsa.sa&#039;) &gt; -1):
                    img = imgsoup.findAll(&#039;td&#039;)[1].img[&#039;src&#039;]
                    img = link[:link.rfind(&#039;/&#039;)+1] + img
                    repository = &quot;State Library of SA&quot;
                elif (link.find(&#039;nla.gov&#039;) &gt; -1):
                    img = link + &#039;-v&#039;
                    repository = &quot;National Library of Australia&quot;
                elif (link.find(&#039;naa.gov&#039;) &gt; -1):
                    barcode = link[link.rfind(&#039;=&#039;)+1:]
                    img = &quot;http://naa16.naa.gov.au/rs_images/ShowImage.php?B=%s&amp;#038;T=P&quot; % barcode
                    repository = &quot;National Archives of Australia&quot;
                elif (link.find(&#039;territorystories.nt.gov&#039;) &gt; -1):
                    img = imgsoup.table.img[&#039;src&#039;]
                    repository = &quot;Northern Territory Library&quot;
                elif (link.find(&#039;statelibrary.tas.gov&#039;) &gt; -1):
                    if (html.find(&#039;No matches were found&#039;) == -1):
                        img = imgsoup.blockquote.img[&#039;src&#039;]
                        repository = &quot;State Library of Tasmania&quot;
                elif (link.find(&#039;slq.qld.gov&#039;) &gt; -1):
                    img = imgsoup.findAll(attrs={&quot;class&quot;:&quot;pictureback&quot;})[0].a[&#039;onclick&#039;]
                    #img = img[img.find(&#039;http&#039;):img.find(]
                    # raw string, with the dot before &#039;jpg&#039; escaped so it only matches a literal full stop
                    img = re.search(r&#039;http://[\w\d/.]*\.jpg&#039;, img).group()
                    repository = &quot;State Library of Queensland&quot;
                if (len(img) &gt; 0):
                    f.write(&quot;&lt;item&gt;\n&quot;)
                    f.write(&quot;&lt;guid isPermaLink=&#039;false&#039;&gt;%s&lt;/guid&gt;\n&quot; % id)
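                    # the markup in this line was lost from the listing; a &lt;title&gt; element is assumed here
                    f.write(&quot;&lt;title&gt;%s (%s)&lt;/title&gt;\n&quot; % (title, repository))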
                    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au/biogs/%sb.htm&lt;/link&gt;\n&quot; % id)
                    f.write(&quot;&lt;media:thumbnail url=&#039;%s&#039; /&gt;\n&quot; % thumbnail.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;media:content url=&#039;%s&#039; type=&#039;image/jpeg&#039; /&gt;\n&quot; % img.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;/item&gt;\n&quot;)
                    f.flush()
                    print &quot;Success!&quot;
                    sys.stdout.flush()
                img = &quot;&quot;
    f.write(&quot;&lt;/channel&gt;\n&quot;)
    f.write(&quot;&lt;/rss&gt;\n&quot;)
    f.close()
</pre></pre>
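<p>The script writes the feed by hand and swaps ampersands with <code>replace()</code>. A slightly safer way to build each item might be to let the standard library do the escaping. This is a sketch in Python 3 under my own assumptions: <code>media_rss_item</code> and <code>media_rss_feed</code> are hypothetical names, the <code>title</code> element is assumed, and the sample values are invented.</p>

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

MEDIA_NS = "http://search.yahoo.com/mrss/"


def media_rss_item(guid, title, page_url, thumbnail, image):
    """Build one Media RSS <item>, escaping text and attribute values."""
    return "\n".join([
        "<item>",
        "<guid isPermaLink='false'>%s</guid>" % escape(guid),
        "<title>%s</title>" % escape(title),
        "<link>%s</link>" % escape(page_url),
        # quoteattr adds the surrounding quotes and escapes '&' in the URL.
        "<media:thumbnail url=%s />" % quoteattr(thumbnail),
        "<media:content url=%s type='image/jpeg' />" % quoteattr(image),
        "</item>",
    ])


def media_rss_feed(items):
    """Wrap the items in a minimal channel that declares the media namespace."""
    return (
        "<rss version='2.0' xmlns:media='%s'>\n<channel>\n%s\n</channel>\n</rss>\n"
        % (MEDIA_NS, "\n".join(items))
    )


item = media_rss_item(
    "A080001",
    "Example Person (National Archives of Australia)",
    "http://example.org/biogs/A080001b.htm",
    "http://example.org/thumbs/A080001.jpg",
    "http://example.org/ShowImage.php?B=123&T=P",
)
feed = media_rss_feed([item])

# Parsing the result is a quick check that the feed is well-formed XML.
root = ET.fromstring(feed)
print(root.find("channel/item/title").text)
print(root.find("channel/item/{%s}content" % MEDIA_NS).get("url"))
```

<p>Parsing the feed straight after building it catches any stray ampersand before CoolIris does.</p>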
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
