<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>discontents &#187; python</title>
	<atom:link href="http://discontents.com.au/tag/python/feed" rel="self" type="application/rss+xml" />
	<link>http://discontents.com.au</link>
	<description>working for the triumph of content over form, ideas over control, people over systems</description>
	<lastBuildDate>Tue, 24 Jan 2012 20:57:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>the real face of white australia</title>
		<link>http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia</link>
		<comments>http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia#comments</comments>
		<pubDate>Tue, 20 Sep 2011 14:42:16 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[archives]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[facial detection]]></category>
		<category><![CDATA[invisibleaustralians]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=1323</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=the+real+face+of+white+australia&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-09-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia&amp;rft.language=English"></span>
In many of the presentations I&#8217;ve given in recent times I&#8217;ve managed to include a question raised by Tim Hitchcock in his chapter in The Virtual Representation of the Past. Tim asks: What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=the+real+face+of+white+australia&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2011-09-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=1323"><!-- &nbsp; --></abbr>
<p>In many of the presentations I&#8217;ve given in recent times I&#8217;ve managed to include a question raised by Tim Hitchcock in his chapter in <em>The Virtual Representation of the Past</em>. Tim asks:</p>
<blockquote><p>What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?</p></blockquote>
<p>The idea of turning archival systems on their head to expose the people rather than the bureaucracy is what motivates Kate Bagnall and me in our attempts to make the <a href="http://invisibleaustralians.org">Invisible Australians</a> project into a reality.</p>
<p><em>Invisible Australians</em> aims to liberate the lives of those who suffered under the restrictions of the White Australia Policy from the rich archival holdings of the National Archives of Australia and elsewhere.</p>
<p>We always knew that the portrait photographs, included on a range of government documents, would provide a compelling perspective on these lives, but we weren&#8217;t quite sure how we were going to extract them. Up until last weekend, I&#8217;d assumed that we&#8217;d develop a crowdsourcing tool that contributors would use to mark up the photos.</p>
<p>Now I&#8217;m not so sure.</p>
<p>In the space of a couple of days I&#8217;ve extracted over 7,000 photographs and built an application to browse them &#8212; here is <a href="http://invisibleaustralians.org/faces/">the real face of White Australia</a>&#8230;</p>
<p><a href="http://invisibleaustralians.org/faces/"><img src="http://discontents.com.au/wp-content/uploads/2011/09/real_face-250x182.jpg" alt="" title="real_face" width="250" height="182" class="aligncenter size-medium wp-image-1325" /></a></p>
<p>How did I do it? Paul Hagon, at the National Library of Australia, <a href="http://www.paulhagon.com/blog/2010/03/11/everything-i-know-about-cataloguing-i-learned-from-watching-james-bond/">gave a presentation</a> last year in which he explored the possibilities of facial detection in developing access to photographic collections. The idea lodged in my brain somewhere and a few days ago I started to poke around looking to see how practical it might be for <em>Invisible Australians</em>.</p>
<p>It didn&#8217;t take long to find <a href="http://creatingwithcode.com/howto/face-detection-in-static-images-with-python/">a python script</a> that used the <a href="http://sourceforge.net/projects/opencvlibrary/">OpenCV library</a> to detect faces in photographs. I tried the script on a few of the NAA documents and was impressed &#8212; there were a few false positives, but the faces were being found!</p>
<p>So then the excitement kicked in. I modified the script so that instead of just finding the coordinates of faces it would enlarge the selected area by 50px on each side and then crop the image. This did a great job of extracting the portraits. I tweaked a few of the settings as well to try and reduce the number of false positives. Eventually, I developed a two-pass system that repeated the detection process after the image had been cropped and its contrast adjusted. This seemed to weed out a few more errors. You can <a href="https://github.com/wragge/Facial-detection">find the code</a> on GitHub.</p>
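<p>To make the enlarge-and-crop step concrete, here is a minimal sketch of the arithmetic. It is not the code from the GitHub repository: the function name <code>expand_box</code>, the (x, y, width, height) rectangle format and the clamping to the image edges are all assumptions made for illustration.</p>

```python
def expand_box(x, y, w, h, img_w, img_h, margin=50):
    """Grow a detected face rectangle by `margin` pixels on each side.

    Assumption: the detector reports (x, y, width, height) and the crop
    should never extend past the edges of the page image.
    """
    left = max(x - margin, 0)
    top = max(y - margin, 0)
    right = min(x + w + margin, img_w)
    bottom = min(y + h + margin, img_h)
    return left, top, right, bottom


# Hypothetical detection: a 100px square face near the left edge
# of a 600x800 page image.
print(expand_box(30, 200, 100, 100, 600, 800))
```

<p>The resulting (left, top, right, bottom) tuple is the shape most imaging libraries expect for a crop, so it could be handed straight to whatever crop call the script uses.</p>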
<p>Once the script was working I had to assemble the documents. I already had a basic harvester that would retrieve both the file metadata and digitised images for any series in the NAA database. Acting on Kate&#8217;s advice, I pointed it at series <a href="http://www.naa.gov.au/cgi-bin/Search?Number=ST84/1">ST84/1</a> and downloaded 12,502 page images.</p>
<p>All I then had to do was loop the facial detection script over the images. Simple! The only problem was that my 3-year-old laptop wasn&#8217;t quite up to the task. As its CPU temperature rose and rose, I was forced to employ a special high-tech cooling system.</p>
<div id="attachment_1329" class="wp-caption aligncenter" style="width: 260px"><a href="http://discontents.com.au/wp-content/uploads/2011/09/cooling.jpg"><img src="http://discontents.com.au/wp-content/uploads/2011/09/cooling-250x186.jpg" alt="" title="cooling" width="250" height="186" class="size-medium wp-image-1329" /></a><p class="wp-caption-text">Keeping my laptop alive...</p></div>
<p>But after running for several hours, my faithful old laptop finally worked its way through all the documents. The result was a directory full of 11,170 cropped images.</p>
<div id="attachment_1332" class="wp-caption aligncenter" style="width: 260px"><a href="http://discontents.com.au/wp-content/uploads/2011/09/faces_dir.jpg"><img src="http://discontents.com.au/wp-content/uploads/2011/09/faces_dir-250x147.jpg" alt="" title="faces_dir" width="250" height="147" class="size-medium wp-image-1332" /></a><p class="wp-caption-text">The results</p></div>
<p>There were still quite a lot of false positives and so I simply worked my way through the files, manually deleting the errors. I ended up with 7,247 photos of people. That&#8217;s a strike rate of nearly 65% which seems pretty good. The classifier, which does the actual facial detection, was probably trained on conventional photographs rather than on the mixed-format documents I was feeding it.</p>
<p>Then it was just a matter of building a web app to display the portraits. I used Django for the backend work of managing the metadata and delivering the content, while the interface was built using a combination of <a href="http://isotope.metafizzy.co/index.html">Isotope</a>, <a href="http://www.infinite-scroll.com/">Infinite Scroll</a> and <a href="http://fancybox.net/">FancyBox</a>.</p>
<p>It&#8217;s important to note that the portraits provide a way of exploring the records themselves. If you click on a face you see a copy of the document from which the photo was extracted. A link is provided to examine the full context of the image in RecordSearch. This is not just an exhibition, it&#8217;s a finding aid.</p>
<p>What next? There are many more of these documents to be harvested and processed (and many more still yet to be digitised). I will be adding more series as I can (though I might have to wait until I can afford a new computer!). I&#8217;d also like to explore the possibilities of facial or object detection a bit more. Could I train my own classifier? Could I detect handprints, or even classify the type of form?</p>
<p>In the meantime, I think our experimental browser helps us to understand why the <em>Invisible Australians</em> project is so important &#8212; you look at their faces and you simply want to know more. Who are they? What were their lives like?</p>
<p>UPDATE: For more on the photos and the issues they raise, see <a href="http://chineseaustralia.org/?cat=62">Kate Bagnall&#8217;s posts</a> over at the <a href="http://chineseaustralia.org/">Tiger&#8217;s Mouth</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shoebox/archives-shoebox/the-real-face-of-white-australia/feed</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Headline roulette</title>
		<link>http://discontents.com.au/shed/experiments/headline-roulette</link>
		<comments>http://discontents.com.au/shed/experiments/headline-roulette#comments</comments>
		<pubDate>Tue, 23 Mar 2010 12:26:29 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[games]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[NLA]]></category>
		<category><![CDATA[Piston]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[screen scraping]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=834</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Headline+roulette&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2010-03-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/headline-roulette&amp;rft.language=English"></span>
I&#8217;ve been doing a fair bit of coding in recent weeks and I thought I&#8217;d better write a few details down before I forget about them. As previously noted, I&#8217;ve been gathering together various historical data sets for a project at the National Museum of Australia. One resource that I was keen on including was [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Headline+roulette&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2010-03-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/headline-roulette&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=834"><!-- &nbsp; --></abbr>
<p>I&#8217;ve been doing a fair bit of coding in recent weeks and I thought I&#8217;d better write a few details down before I forget about them.</p>
<p>As previously noted, I&#8217;ve been gathering together various historical data sets for a project at the National Museum of Australia. One resource that I was keen on including was the fantastic <a href="http://newspapers.nla.gov.au/ndp/del/home">Australian Newspapers</a> project at the National Library of Australia. What I had in mind was being able to give a sense of context to any historical event by calling up the headlines for that particular time.</p>
<p>Unfortunately there&#8217;s no API for the newspapers project (or Trove in general), though apparently it&#8217;s in the works. So I had to reverse engineer the advanced search page to work out the various query options, and then build a screen scraper to harvest the results. I played around with the search options a bit to fine tune the results, finally deciding to limit them to &#8216;news&#8217; articles with more than 1000 words. Annoyingly, only 10 results are returned at a time.</p>
<p>I had hoped to parse the results as XML, but a rogue &lt;br&gt; tag broke the XHTML, so I fell back on <a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> – a Python module that makes screen scraping considerably easier by tidying up HTML structures. After that it was pretty straightforward. Soon I had <a href="http://bitbucket.org/wragge/nla-newspapers/">my own Python module</a> to query the newspapers database and process the results.</p>
<p>The next step was to use the module to build a simple API that would let us quickly grab a set of headlines for a particular date and place. <a href="http://www.djangoproject.com/">Django</a> and <a href="http://bitbucket.org/jespern/django-piston/wiki/Home">Piston</a> made this easy. To see headlines from New South Wales on 1 January 1901, for example:</p>
<p><a href="http://wraggelabs.com/api/newspapers/1901-01-01/nsw/">http://wraggelabs.com/api/newspapers/1901-01-01/nsw/</a></p>
<p>That was pretty cool and it started me thinking about what else I might do with the data. At first I was planning some sort of browser, like my <a href="http://wraggelabs.com/abs/">Population Browser</a>, but that seemed a bit boring. So I decided to create a simple game that grabbed a random headline and asked you to try and guess the date. After further refinement I decided to impose a limit of 10 guesses, with &#8216;higher&#8217; or &#8216;lower&#8217; prompts to get you moving in the right direction. Yes, basically it was a rip-off of The Price is Right – but an interesting, ironic and historically engaged rip-off&#8230;</p>
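<p>The guessing loop is simple enough to sketch in a few lines. This is only an illustration, not the game&#8217;s actual code: it assumes guesses are years, that &#8216;higher&#8217; means the real date is later than the guess, and the names <code>check_guess</code> and <code>play</code> are made up.</p>

```python
def check_guess(guess_year, actual_year):
    """Return the prompt shown to the player for a single guess."""
    if guess_year == actual_year:
        return 'correct'
    # Assumption: 'higher' means the headline is from a later year.
    return 'higher' if actual_year > guess_year else 'lower'


def play(actual_year, guesses, max_guesses=10):
    """Return the prompts seen for a sequence of guesses, up to the limit."""
    prompts = []
    for guess in guesses[:max_guesses]:
        prompt = check_guess(guess, actual_year)
        prompts.append(prompt)
        if prompt == 'correct':
            break
    return prompts


print(play(1901, [1950, 1880, 1901]))
```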
<p>This required me to make a change to the API and Python module so that I could retrieve a random headline. Basically it just meant generating a query based on random values for the day, month, year and state. For the interface I once again delved into JQuery&#8217;s box of tricks. With all the kerfuffle about ChatRoulette in the media, the name seemed obvious – <a href="http://wraggelabs.com/newsroulette/">Wragge&#8217;s Headline Roulette</a> was born.</p>
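<p>Generating the random values might look something like the sketch below. The list of states, the year range and the function name <code>random_query</code> are assumptions for illustration; the output simply mirrors the date and state portion of the API URL shown earlier.</p>

```python
import calendar
import random

# Hypothetical values: adjust to whatever the newspapers database covers.
STATES = ['nsw', 'vic', 'qld', 'sa', 'wa', 'tas']
FIRST_YEAR, LAST_YEAR = 1803, 1954


def random_query(rng=random):
    """Pick a random valid date and state, formatted like the API path."""
    year = rng.randint(FIRST_YEAR, LAST_YEAR)
    month = rng.randint(1, 12)
    # monthrange returns (weekday of first day, number of days in month),
    # so the day chosen is always valid for that month and year.
    days_in_month = calendar.monthrange(year, month)[1]
    day = rng.randint(1, days_in_month)
    state = rng.choice(STATES)
    return '%04d-%02d-%02d/%s/' % (year, month, day, state)


print(random_query())
```

<p>Choosing the day only after the month and year are known avoids ever asking for 31 February.</p>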
<div id="attachment_839" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/newsroulette/"><img class="size-medium wp-image-839" title="headline-roulette" src="http://discontents.com.au/wp-content/uploads/2010/03/headline-roulette-300x151.jpg" alt="Headline roulette screen capture" width="300" height="151" /></a><p class="wp-caption-text">Test your historical nous with Headline Roulette!</p></div>
<p>It&#8217;s a very simple little app, but a number of people have said how much fun it is. The bad news is that imminent changes to the NLA newspapers site are probably going to break it (at least in its current form). So enjoy it while you can. When the NLA makes an API available I might work on something a little more sophisticated.</p>
<p>Of course, the broader point is that there are a whole range of cultural materials out there waiting to be remixed and re-used in various forms. Get hacking&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/headline-roulette/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Out of the cube</title>
		<link>http://discontents.com.au/shed/experiments/out-of-the-cube</link>
		<comments>http://discontents.com.au/shed/experiments/out-of-the-cube#comments</comments>
		<pubDate>Fri, 26 Feb 2010 05:57:44 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[APIs]]></category>
		<category><![CDATA[datacubes]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Piston]]></category>
		<category><![CDATA[population]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[spreadsheets]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=823</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Out+of+the+cube&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2010-02-26&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/out-of-the-cube&amp;rft.language=English"></span>
For a project that I&#8217;m working on at the National Museum of Australia, I&#8217;ve started collecting various sources of date-identified data. Most recently I had a go at extracting historical population data from the Australian Bureau of Statistics. The data can all be downloaded as .xls files, but they&#8217;re not simple, flat spreadsheets – they&#8217;re [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Out+of+the+cube&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2010-02-26&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/out-of-the-cube&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=823"><!-- &nbsp; --></abbr>
<p>For a project that I&#8217;m working on at the National Museum of Australia, I&#8217;ve started collecting various sources of date-identified data. Most recently I had a go at extracting <a href="http://www.abs.gov.au/AUSSTATS/abs@.nsf/mf/3105.0.65.001">historical population data</a> from the Australian Bureau of Statistics.</p>
<p>The data can all be downloaded as .xls files, but they&#8217;re not simple, flat spreadsheets – they&#8217;re data cubes. As the name suggests, data cubes are organised along a number of dimensions. In the case of the population data it&#8217;s year, state and gender.</p>
<p>This means that you can&#8217;t just export the data to CSV and suck it into your database – first you&#8217;ve got to flatten the cube. No doubt there are other ways to do this, but I just wrote a simple python script. It uses <a href="http://pypi.python.org/pypi/xlrd">xlrd</a> to read from the spreadsheet, does a bit of reorganisation, then writes the output to a CSV file. The code, for what it&#8217;s worth, is <a href="http://bitbucket.org/wragge/abs-data-cube-processor/">available at Bitbucket</a>.</p>
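<p>The idea of flattening can be sketched without the spreadsheet at all. The example below is not the Bitbucket script: it assumes the cube has already been read into nested dictionaries keyed by year, state and gender, and the figures are made-up placeholders rather than ABS data.</p>

```python
import csv
import io

# Hypothetical cube: cube[year][state][gender] -> value (placeholder numbers).
cube = {
    1901: {'nsw': {'females': 12, 'males': 10},
           'vic': {'females': 9, 'males': 8}},
    1902: {'nsw': {'females': 13, 'males': 11}},
}


def flatten_cube(cube):
    """Yield one flat (year, state, gender, value) row per cell of the cube."""
    for year in sorted(cube):
        for state in sorted(cube[year]):
            for gender in sorted(cube[year][state]):
                yield (year, state, gender, cube[year][state][gender])


def write_csv(rows, handle):
    """Write a header line followed by the flattened rows."""
    writer = csv.writer(handle)
    writer.writerow(['year', 'state', 'gender', 'value'])
    writer.writerows(rows)


output = io.StringIO()
write_csv(flatten_cube(cube), output)
print(output.getvalue())
```

<p>Each dimension of the cube becomes a column, which is exactly the shape a database import wants.</p>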
<p>Once I had the CSV file I just imported it into MySQL and used Django and <a href="http://bitbucket.org/jespern/django-piston/wiki/Home">Piston</a> to build a basic API. So if you want to know the population of NSW in 1856, you just go to:</p>
<p><a href="http://wraggelabs.com/api/json/population/nsw/1856/">http://wraggelabs.com/api/json/population/nsw/1856/</a></p>
<p>The number of infant deaths in Tasmania in 1932:</p>
<p><a href="http://wraggelabs.com/api/json/infantdeaths/tas/1932/">http://wraggelabs.com/api/json/infantdeaths/tas/1932/</a></p>
<p>The number of female births in Australia in 1959:</p>
<p><a href="http://wraggelabs.com/api/json/births/australia/females/1959/">http://wraggelabs.com/api/json/births/australia/females/1959/</a></p>
<p>I&#8217;m sure you get the picture. You can change the &#8216;json&#8217; to &#8216;xml&#8217; if you&#8217;d like another flavour of data.</p>
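<p>If you&#8217;d rather build these URLs in code, something like the following would do. The path pattern is only inferred from the three examples above, and the helper name <code>api_url</code> is made up, so treat it as a sketch rather than a description of the API.</p>

```python
BASE = 'http://wraggelabs.com/api'


def api_url(dataset, region, year, gender=None, fmt='json'):
    """Assemble a URL following the pattern of the examples above.

    Assumption: the optional gender segment sits between region and year.
    """
    parts = [BASE, fmt, dataset, region]
    if gender:
        parts.append(gender)
    parts.append(str(year))
    return '/'.join(parts) + '/'


print(api_url('population', 'nsw', 1856))
print(api_url('births', 'australia', 1959, gender='females'))
print(api_url('infantdeaths', 'tas', 1932, fmt='xml'))
```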
<div id="attachment_830" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/abs/"><img class="size-medium wp-image-830" title="pop_browser" src="http://discontents.com.au/wp-content/uploads/2010/02/pop_browser-300x140.png" alt="Screenshot of population browser" width="300" height="140" /></a><p class="wp-caption-text">The API in action - a simple population browser</p></div>
<p>With an API delivering JSON you can start playing around with all sorts of fun AJAX-y stuff. To demonstrate I built a <a href="http://wraggelabs.com/abs/">simple population browser</a> using JQuery. Just drag the slider!</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/out-of-the-cube/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudy biographies and portrait walls</title>
		<link>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls</link>
		<comments>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls#comments</comments>
		<pubDate>Sat, 24 Jan 2009 08:26:12 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[ADB Online]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Cooliris]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[word clouds]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=409</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
With a bit of time to play over Christmas I had a go at applying some of the techniques described at ProgrammingHistorian to the ADB Online.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=409"><!-- &nbsp; --></abbr>
<p>With a bit of time to play over Christmas I had a go at applying some of the techniques described at <a href="http://niche.uwo.ca/programming-historian/index.php"><em>ProgrammingHistorian</em></a> to the <a href="http://www.adb.online.anu.edu.au/adbonline.htm">ADB Online</a>.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had to offer as a way of improving access to the articles.</p>
<p>So I set about learning Python and was soon downloading and scraping the more than 10,000 articles that make up the ADB online.</p>
<p>My first tests revealed that the most frequent words in ADB articles were&#8230;</p>
<p style="text-align: center;"><strong>born</strong> and <strong>died</strong></p>
<p style="text-align: left;">Who&#8217;d have thought it? In a biographical dictionary?</p>
<p style="text-align: left;">After further refining the stopwords list I started to generate some useful clouds. Finally after 147 minutes of processing time, I had a <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html">word cloud</a> representing the content of all 16 volumes of the <em>Australian Dictionary of Biography</em>.</p>
<p style="text-align: left;">
<div id="attachment_559" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html"><img class="size-medium wp-image-559" title="adb-cloud-complete" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-complete-300x195.jpg" alt="The complete ADB word cloud" width="300" height="195" /></a><p class="wp-caption-text">The complete ADB word cloud</p></div>
<p style="text-align: left;"><span id="more-409"></span>The words in the cloud are linked back to the ADB&#8217;s own search engine, allowing the cloud to be used as a way of exploring the articles themselves.</p>
<p style="text-align: left;">It shows the top 200 words, but if you want to see the rest you can download the <a href="http://discontents.com.au/shed/adb/clouds/wordfreqs.txt">raw word frequency file</a> (&gt;1mb txt file).</p>
<p style="text-align: left;">What can you see? Amongst other things, John is obviously the most popular name, Sydney just edges out Melbourne as the most popular place, and burial beats cremation as the most common mode of dispatch. It&#8217;s fun to explore.</p>
<p style="text-align: left;">But of course this then set me wondering about how these frequencies might change with the development of the ADB and changes in its subjects. So I generated word clouds for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html">each volume</a> and for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html">each chronological series</a>.</p>
<p style="text-align: left;">
<div id="attachment_563" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html"><img class="size-medium wp-image-563" title="adb-cloud-volumes" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-volumes-300x195.jpg" alt="Word clouds by volume" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by volume</p></div>
<div id="attachment_564" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html"><img class="size-medium wp-image-564" title="adb-cloud-series" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-series-300x195.jpg" alt="Word clouds by series" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by series</p></div>
<p style="text-align: left;">
<p>I even added some simple Javascript slideshows so you could watch the clouds evolve.</p>
<p>One of the most obvious features in the series clouds is the gradual disappearance of &#8216;land&#8217;. It&#8217;s one of the most prominent words in the first series, but gradually fades until it disappears completely in the last.</p>
<p>After this successful foray into the world of word clouds, I began to think about other ways of visualising the ADB&#8217;s content. Many of the articles have portrait images. Wouldn&#8217;t it be interesting to use the images themselves as the entry point to the biographical articles?</p>
<p>I&#8217;d already been <a href="http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d">playing with CoolIris</a>, so I decided to harvest all the portrait references and use them to create a 3D wall. The <a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html">result is pretty spectacular</a>.</p>
<div id="attachment_569" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html"><img class="size-medium wp-image-569" title="gallery" src="http://discontents.com.au/wp-content/uploads/2009/01/gallery-300x66.jpg" alt="ADB portrait browser" width="300" height="66" /></a><p class="wp-caption-text">ADB portrait browser</p></div>
<p>Some technical details about the clouds and the portrait browser follow, for those interested in such things&#8230;</p>
<h3>Gathering your words</h3>
<p>Conveniently for me, <em>ProgrammingHistorian</em> uses the <em>Dictionary of Canadian Biography</em> as its main example, so there was much code that I could <span style="text-decoration: line-through;">just cut and paste</span> carefully examine and utilise.  As the examples show, it&#8217;s easy to grab a webpage and analyse its content on the fly. But I wanted to process more than 10,000 pages and I knew that I was unlikely to get it working the first time round, so I decided to download the files first and then work on them locally. PH provided a basic example, to which I added some error-handling and the necessary loops to cycle through the ADB files. Because I had a bit of inside knowledge I cheated and hard-coded the numbers of articles in each volume. If I hadn&#8217;t known this I would have had to scrape all the browse pages, pulling out the links and creating a list of individual ids – not hard, but a bit tedious. Anyway this is how it ended up:</p>
<pre><pre class="brush: python">
# download_adb.py

import urllib2, time, os, sys
import dh
items = (565, 575, 607, 526, 614, 533, 543, 723, 737, 742, 737, 759, 755, 721, 703, 714, 694, 126)
if os.path.exists(&#039;adb&#039;) == 0: os.mkdir(&#039;adb&#039;)

for v in range(0,18):
    for i in range (1,(items[v]+1)):
        if v == 0:
            filename = &#039;AS1%04db.htm&#039; % i
        else:
            filename = &#039;A%02d%04db.htm&#039; % (v, i)
        if os.path.isfile(&#039;adb/&#039; + filename) == 0:
            print &#039;Processing: &#039; + filename
            url = &#039;http://adbonline.anu.edu.au/biogs/&#039; + filename
            try:
                response = urllib2.urlopen(url)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                html = response.read()
                f = open(&#039;adb/&#039; + filename, &#039;w&#039;)
                f.write(html)
                f.close()
                time.sleep(2)
        else:
            print &quot;File already downloaded&quot;
        sys.stdout.flush()</pre></pre>
<h3>Learning to count</h3>
<p>Before too long I had a directory full of about 11,000 little html files just waiting for me to begin my evil experiments. First I had to slice them up and pull out all the interesting bits. By examining the code of the pages I could see that the main content was inside a div with the id of &#8216;content&#8217;. Using the Beautiful Soup Python library, I was easily able to extract this div. But the content div also usually included a portrait image and a bibliography. Once again I dipped into Beautiful Soup to discard all the unwanted bits. The slicing and dicing went something like this:</p>
<pre><pre class="brush: python">
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)</pre></pre>
<p>Now I had the text of the article to play with. Following the PH examples it wasn&#8217;t long before I could extract word-frequency tables from a few files at a time. However, when I tried to process all the articles from a particular volume it took a verrry long time. I fiddled a bit with the code and amazed myself by dramatically improving the performance. I replaced the <em>wordListToFreqDict</em> function provided by PH with my own modified version:</p>
<pre><pre class="brush: python">
def wordListToFreqDict2(wordlist):
    worddict = dict.fromkeys(wordlist)
    wordfreq = [wordlist.count(p) for p in worddict.keys()]
    return dict(zip(worddict,wordfreq))
</pre></pre>
<p>The <code>worddict = dict.fromkeys(wordlist)</code> line made all the difference: it builds a dictionary whose keys are the unique words, so each word only has to be counted once against the full word list. With this hack in place I was able to process a complete volume in a few minutes.</p>
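<p>For comparison, here is a minimal sketch of the same count using <code>collections.Counter</code> from the standard library, which tallies every word in a single pass instead of calling <code>count()</code> once per unique word. It is written for Python 3 rather than the Python 2 used elsewhere in this post, and the function name <code>word_list_to_freq_dict3</code> is made up for illustration; it is not part of the PH code.</p>

```python
from collections import Counter


def word_list_to_freq_dict3(wordlist):
    """Return a dict mapping each word to the number of times it occurs."""
    # Counter walks the list once, so the cost grows with the number of
    # words rather than with (unique words x total words).
    return dict(Counter(wordlist))


words = "the cat sat on the mat the end".split()
freqs = word_list_to_freq_dict3(words)
print(freqs["the"])
print(freqs["cat"])
```

<p>The result has the same shape as the dictionary returned by <code>wordListToFreqDict2</code>, so it should slot into the same sorting step.</p>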
<p>I was already using a list of stopwords provided by PH to exclude things such as &#8216;such&#8217;, &#8216;as&#8217; and &#8216;and&#8217;, but obviously a few additions were necessary. To the list of stopwords I added:</p>
<pre><pre class="brush: python">
stopwords += [&#039;january&#039;, &#039;february&#039;, &#039;march&#039;, &#039;april&#039;, &#039;may&#039;, &#039;june&#039;, &#039;july&#039;, &#039;august&#039;, &#039;september&#039;, &#039;october&#039;, &#039;november&#039;, &#039;december&#039;]
stopwords += [&#039;new&#039;, &#039;south&#039;, &#039;wales&#039;, &#039;australia&#039;, &#039;australian&#039;, &#039;victoria&#039;, &#039;south&#039;, &#039;western&#039;, &#039;queensland&#039;, &#039;tasmania&#039;]
#stopwords += [&#039;sydney&#039;, &#039;melbourne&#039;, &#039;brisbane&#039;, &#039;adelaide&#039;, &#039;perth&#039;, &#039;hobart&#039;]
stopwords += [&#039;died&#039;, &#039;born&#039;, &#039;life&#039;, &#039;lived&#039;, &#039;married&#039;, &#039;father&#039;, &#039;wife&#039;, &#039;children&#039;, &#039;son&#039;, &#039;sons&#039;, &#039;daughter&#039;, &#039;daughters&#039;, &#039;brother&#039;, &#039;brothers&#039;]
stopwords += [&#039;street&#039;, &#039;st&#039;, &#039;year&#039;, &#039;years&#039;, &#039;months&#039;, &#039;acre&#039;, &#039;acres&#039;, &#039;ha&#039;]
stopwords += [&#039;e&#039;, &#039;m&#039;, &#039;b&#039;, &#039;c&#039;, &#039;w&#039;, &#039;j&#039;, &#039;d&#039;, &#039;n&#039;, &#039;f&#039;, &#039;g&#039;, &#039;h&#039;, &#039;i&#039;, &#039;ii&#039;, &#039;l&#039;, &#039;o&#039;, &#039;p&#039;, &#039;th&#039;, &#039;r&#039;, &#039;t&#039;, &#039;u&#039;, &#039;r&#039;, &#039;nd&#039;]
</pre></pre>
<p>The first two lines should be pretty obvious. As you can see, I originally excluded names of the capital cities, but then realised that you could watch Sydney and Melbourne battle it out for pre-eminence, so I excluded the exclusion. Also out were family relations and various other words that turned up in almost every article. Cleaning out all the non-alphabetical characters from the text had left a lot of orphaned letters that had once been things like £ signs, so I had to dispose of them as well.</p>
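<p>To see where the orphaned letters come from, here is a small sketch in Python 3. The two helper functions are stand-ins written for this example; they only approximate what the <code>dh</code> module does, and the sample sentence is invented.</p>

```python
import re

# A few of the extra stopwords listed above.
extra_stopwords = {"b", "d", "acre", "acres", "year", "years"}


def strip_non_alpha(text):
    """Split text on anything that is not a-z and drop empty strings."""
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]


def remove_stopwords(wordlist, stopwords):
    """Keep only the words that are not in the stopword collection."""
    return [w for w in wordlist if w not in stopwords]


tokens = strip_non_alpha("John Smith (b. 1820, d. 1890) owned 40 acres")
print(tokens)
print(remove_stopwords(tokens, extra_stopwords))
```

<p>The abbreviations for &#8216;born&#8217; and &#8216;died&#8217; survive the clean-up as the single letters &#8216;b&#8217; and &#8216;d&#8217;, which is why they need to go on the stopword list.</p>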
<p>The modules for actually generating the clouds were mostly just copied from PH with a few minor changes. My complete script is here:</p>
<pre><pre class="brush: python">
# adb-text-count.py

import urllib2
import dh, os, sys, time
from BeautifulSoup import BeautifulSoup
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
filelist = dh.getFileNames(dir)

f = open(&#039;wordlist.txt&#039;, &#039;w&#039;)
for file in filelist:
    print &#039;Processing &#039; + file
    sys.stdout.flush()
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)
    text = dh.stripTags(str(content[0]))
    fullwordlist = dh.stripNonAlpha(text.lower())
    wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
    f.write(&quot; &quot;.join(wordlist) + &quot; &quot;)  # trailing space keeps words from adjacent files apart
f.close()
f = open(&#039;wordlist.txt&#039;)
words = f.read()
f.close()
wordlist = words.split()
dictionary = dh.wordListToFreqDict2(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
f = open(&#039;wordfreqs.txt&#039;, &#039;w&#039;)
for s in sorteddict: f.write(str(s)+&quot;\n&quot;)
f.close()
print &#039;Dictionary created&#039;
sys.stdout.flush()
# create tag cloud and open in Firefox
cloudsize = 200
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize][0]
freqrange = maxfreq - minfreq
outstring = &#039;&#039;
resorteddict = dh.reSortFreqDictAlpha(sorteddict[:cloudsize])
print &#039;Creating cloud&#039;
sys.stdout.flush()
for k in resorteddict:
    kfreq = k[0]
    klabel = k[1]
    klabel = dh.undecoratedHyperlink(&#039;http://adbonline.anu.edu.au/scripts/adbp-ent_search.php?ranktext=&#039; + k[1] + &#039;&amp;amp;search=Go!&#039;, k[1])
    scalingfactor = (kfreq - minfreq) / float(freqrange)
    outstring += &#039; &#039; + dh.scaledFontSizeSpan(klabel, scalingfactor) + &#039; &#039;
dh.wrapStringInHTML(&quot;html-to-tag-cloud&quot;, dh.defaultCSSDiv(outstring), &quot;Complete&quot;)
finish = time.time()
print &quot;Finished at: &quot;, time.asctime(time.localtime(finish))
print &quot;Total time: &quot;, finish - start
</pre></pre>
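<p>The scaling step in the script above maps each word&#8217;s frequency linearly onto a range from 0 to 1. Here is a minimal Python 3 sketch of that calculation, with a guard for the case where every word has the same frequency. The <code>font_size</code> helper and its pixel range are assumptions for illustration; they are not taken from the <code>dh</code> module.</p>

```python
def scaling_factor(freq, minfreq, maxfreq):
    """Map freq linearly onto 0.0-1.0 between minfreq and maxfreq."""
    freqrange = maxfreq - minfreq
    if freqrange == 0:
        # Every word has the same frequency; avoid dividing by zero.
        return 0.0
    return (freq - minfreq) / float(freqrange)


def font_size(factor, min_px=10, max_px=40):
    """Turn a 0.0-1.0 factor into a pixel size (range chosen arbitrarily)."""
    return int(round(min_px + factor * (max_px - min_px)))


print(font_size(scaling_factor(50, 10, 90)))
```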
<h3>Biographies in 3D</h3>
<p>To display all the portrait images in CoolIris I had to harvest all the image details and then write them to a Media RSS file for CoolIris to read.</p>
<p>Extracting the details of all the thumbnail versions of the portraits in the ADB was easy using Beautiful Soup. But I also needed the paths to the larger versions of the portraits stored on the sites of the repositories that hold the originals. All of these sites present the images differently, so a different scraper was needed for each of them. As yet I&#8217;ve only included major libraries and archives – I may add some more if I get the time.</p>
<p>Once the paths to the thumbnails and large versions had been harvested, it was just a matter of writing the RSS feed. Actually, I created a series of RSS files, one for each volume, linked using &#8216;rel=previous&#8217; and &#8216;rel=next&#8217; attributes. This helped speed up the loading of the gallery. For what it&#8217;s worth, the complete code is here:</p>
<pre><pre class="brush: python">
# adb-portraits.py

import socket, urllib2, urllib
import dh, os, sys, time, re
from BeautifulSoup import BeautifulSoup
# timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
for i in range(8,18):
    if (i == 17): vol = &quot;AS1&quot;
    else: vol = &quot;A%02d&quot; % i
    filelist = dh.getFileNamesByVol2(dir, vol)
    f = open(&#039;adb-portraits-%s.rss&#039; % i, &#039;w&#039;)
    f.write(&quot;&lt;?xml version=&#039;1.0&#039; encoding=&#039;utf-8&#039; standalone=&#039;yes&#039;?&gt;\n&quot;)
    f.write(&quot;&lt;rss version=&#039;2.0&#039; xmlns:media=&#039;http://search.yahoo.com/mrss/&#039; xmlns:atom=&#039;http://www.w3.org/2005/Atom&#039;&gt;\n&quot;)
    f.write(&quot;&lt;channel&gt;\n&quot;)
    f.write(&quot;\n&quot;)
    f.write(&quot;&lt;description&gt;Portraits of individuals included in the Australian Dictionary of Biography&lt;/description&gt;\n&quot;)
    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au&lt;/link&gt;\n&quot;)
    if (i &gt; 1):
        f.write (&quot;&lt;atom:link rel=&#039;previous&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i-1))
    if (i &lt; 17):
        f.write (&quot;&lt;atom:link rel=&#039;next&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i+1))
    for file in filelist:
        print str(file)
        sys.stdout.flush()
        g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
        html = g.read()
        g.close()
        #print html
        sys.stdout.flush()
        soup = BeautifulSoup(html)
        imagediv = soup.findAll(id=&quot;imagebox&quot;)
        if len(imagediv) &gt; 0 :
            print &quot;Found an image&quot;
            sys.stdout.flush()
            links = imagediv[0].findAll(&#039;a&#039;)
            if len(links) &gt; 1:
                link = urllib.unquote(links[(len(links)-1)][&#039;href&#039;][31:])
            else:
                link = urllib.unquote(links[0][&#039;href&#039;][31:])
            print link
            sys.stdout.flush()
            try:
                response = urllib2.urlopen(link)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                id = str(file)[:7]
                thumbnail = &#039;http://www.adb.online.anu.edu.au&#039; + imagediv[0].img[&#039;src&#039;].lstrip(&#039;.&#039;)
                # print thumbnail
                title = imagediv[0].p.contents[0].split(&#039;,&#039;)[0].strip().replace(&#039; - &#039;, &#039;-&#039;)
                title = title.encode(&#039;utf-8&#039;)
                print &quot;Processing: &quot; + title
                sys.stdout.flush()
                html = response.read()
                imgsoup = BeautifulSoup(html)
                img = &quot;&quot;  # default, so the length check below is safe if no repository matches
                if (link.find(&#039;sl.nsw&#039;) &gt; -1):
                    if (link.find(&#039;ebindshow.pl&#039;) == -1): # Not thumbnail pages - see John Bingle
                        if (html.find(&#039;Higher quality image&#039;) != -1):
                            img = imgsoup.findAll(alt=&quot;Higher quality image&quot;)[0].parent[&#039;href&#039;].split(&#039;?&#039;)[1]
                            #img = imgsoup.td.a[&#039;href&#039;].split(&#039;?&#039;)[1]
                        else:
                            img = imgsoup.table.findAll(&#039;tr&#039;)[2].img[&#039;src&#039;]
                        repository = &quot;State Library of NSW&quot;
                elif (link.find(&#039;slv.vic&#039;) &gt; -1):
                    img = imgsoup.findAll(id=&#039;ImageDisplay&#039;)[0].img[&#039;src&#039;]
                    repository = &quot;State Library of Victoria&quot;
                elif (link.find(&#039;slsa.sa&#039;) &gt; -1):
                    img = imgsoup.findAll(&#039;td&#039;)[1].img[&#039;src&#039;]
                    img = link[:link.rfind(&#039;/&#039;)+1] + img
                    repository = &quot;State Library of SA&quot;
                elif (link.find(&#039;nla.gov&#039;) &gt; -1):
                    img = link + &#039;-v&#039;
                    repository = &quot;National Library of Australia&quot;
                elif (link.find(&#039;naa.gov&#039;) &gt; -1):
                    barcode = link[link.rfind(&#039;=&#039;)+1:]
                    img = &quot;http://naa16.naa.gov.au/rs_images/ShowImage.php?B=%s&amp;#038;T=P&quot; % barcode
                    repository = &quot;National Archives of Australia&quot;
                elif (link.find(&#039;territorystories.nt.gov&#039;) &gt; -1):
                    img = imgsoup.table.img[&#039;src&#039;]
                    repository = &quot;Northern Territory Library&quot;
                elif (link.find(&#039;statelibrary.tas.gov&#039;) &gt; -1):
                    if (html.find(&#039;No matches were found&#039;) == -1):
                        img =imgsoup.blockquote.img[&#039;src&#039;]
                        repository = &quot;State Library of Tasmania&quot;
                elif (link.find(&#039;slq.qld.gov&#039;) &gt; -1):
                    img = imgsoup.findAll(attrs={&quot;class&quot;:&quot;pictureback&quot;})[0].a[&#039;onclick&#039;]
                    #img = img[img.find(&#039;http&#039;):img.find(]
                    img = re.search(&#039;http://[\w\d\/\.]*.jpg&#039;, img).group()
                    repository = &quot;State Library of Queensland&quot;
                if (len(img) &gt; 0):
                    f.write(&quot;&lt;item&gt;\n&quot;)
                    f.write(&quot;&lt;guid isPermaLink=&#039;false&#039;&gt;%s&lt;/guid&gt;\n&quot; % id)
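                    # the markup on this line was lost in publishing; a title element is assumed
                    f.write(&quot;&lt;title&gt;%s (%s)&lt;/title&gt;\n&quot; % (title, repository))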
                    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au/biogs/%sb.htm&lt;/link&gt;\n&quot; % id)
                    f.write(&quot;&lt;media:thumbnail url=&#039;%s&#039; /&gt;\n&quot; % thumbnail.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;media:content url=&#039;%s&#039; type=&#039;image/jpeg&#039; /&gt;\n&quot; % img.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;/item&gt;\n&quot;)
                    f.flush()
                    print &quot;Success!&quot;
                    sys.stdout.flush()
                img = &quot;&quot;
    f.write(&quot;&lt;/channel&gt;\n&quot;)
    f.write(&quot;&lt;/rss&gt;\n&quot;)
    f.close()
</pre></pre>
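<p>The paging scheme described above, one feed per volume linked by &#8216;rel=previous&#8217; and &#8216;rel=next&#8217;, could also be sketched with <code>xml.etree.ElementTree</code>, which escapes ampersands in URLs automatically instead of relying on string replacement. This is a Python 3 sketch, not the script I actually ran: the <code>build_feed</code> function, the shape of the item dictionaries and the sample URLs are all invented for illustration.</p>

```python
import xml.etree.ElementTree as ET

MEDIA_NS = "http://search.yahoo.com/mrss/"
ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("media", MEDIA_NS)
ET.register_namespace("atom", ATOM_NS)


def build_feed(vol, first_vol, last_vol, items):
    """Build one volume's feed; items is a list of dicts (hypothetical shape)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Link neighbouring volumes so the gallery can page between feeds.
    if vol > first_vol:
        ET.SubElement(channel, "{%s}link" % ATOM_NS, rel="previous",
                      href="adb-portraits-%s.rss" % (vol - 1))
    if vol < last_vol:
        ET.SubElement(channel, "{%s}link" % ATOM_NS, rel="next",
                      href="adb-portraits-%s.rss" % (vol + 1))
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "{%s}thumbnail" % MEDIA_NS, url=item["thumbnail"])
        ET.SubElement(node, "{%s}content" % MEDIA_NS, url=item["image"],
                      type="image/jpeg")
    return ET.tostring(rss, encoding="unicode")


# Invented sample data; the image URL contains an ampersand on purpose.
sample = [{"title": "Example Person",
           "thumbnail": "http://example.org/thumb.jpg",
           "image": "http://example.org/show.php?B=1&T=P"}]
xml_text = build_feed(2, 1, 17, sample)
print(xml_text)
```

<p>The first and last volumes only get one of the two links, which matches the checks on <code>i</code> in the script above.</p>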
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

