<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>discontents &#187; biographies</title>
	<atom:link href="http://discontents.com.au/tag/biographies/feed" rel="self" type="application/rss+xml" />
	<link>http://discontents.com.au</link>
	<description>working for the triumph of content over form, ideas over control, people over systems</description>
	<lastBuildDate>Tue, 24 Jan 2012 20:57:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Liberating lives: invisible Australians and biographical networks</title>
		<link>http://discontents.com.au/shoebox/archives-shoebox/liberating-lives</link>
		<comments>http://discontents.com.au/shoebox/archives-shoebox/liberating-lives#comments</comments>
		<pubDate>Tue, 28 Sep 2010 12:58:38 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[archives]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[invisibleaustralians]]></category>
		<category><![CDATA[linked data]]></category>
		<category><![CDATA[White Australia]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=972</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Liberating+lives%3A+invisible+Australians+and+biographical+networks&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.source=discontents&amp;rft.date=2010-09-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/liberating-lives&amp;rft.language=English"></span>
Presented at the Life of Information Symposium, 24 September 2010. Slides are available on Slideshare. This palm print belongs to a 12-year-old boy called Charlie Allen. Charlie was born in Sydney in 1896. His mother was Frances Allen (sometime sweet shop owner and brothel keeper), his father Charlie Gum (a buyer for Wing On company). [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Liberating+lives%3A+invisible+Australians+and+biographical+networks&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=archives&amp;rft.source=discontents&amp;rft.date=2010-09-28&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shoebox/archives-shoebox/liberating-lives&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=972"><!-- &nbsp; --></abbr>
<p><strong>Presented at the <a href="http://ncb.anu.edu.au/Life_of_Information">Life of Information Symposium</a>, 24 September 2010.<br />
Slides are <a href="http://www.slideshare.net/wragge/liberating-lives-invisible-australians-and-biographical-networks">available on Slideshare</a>.</strong></p>
<p><a href="http://discontents.com.au/wp-content/uploads/2010/09/Book22_no46_CharlesAllenGum_Transparent.png"><img class="alignright size-thumbnail wp-image-976" title="Charlie-Allen-palmprint" src="http://discontents.com.au/wp-content/uploads/2010/09/Book22_no46_CharlesAllenGum_Transparent-100x150.png" alt="Charlie Allen's palm print" width="100" height="150" /></a><br />
This palm print belongs to a 12-year-old boy called Charlie Allen.</p>
<p>Charlie was born in Sydney in 1896.</p>
<p>His mother was Frances Allen (sometime sweet shop owner and brothel keeper), his father Charlie Gum (a buyer for Wing On company).</p>
<p>Charlie was raised by his mother, but in 1909, at the age of 13, he was taken to China by his father.</p>
<p>His father returned to Sydney, leaving Charlie in China. He lived with relatives in the town of Shekki (inland from Hong Kong) for 6 years.</p>
<p>Charlie was homesick, but had no means of getting back to Australia. His mother attempted to enlist government help but to no avail. Charlie finally returned in 1915.</p>
<p>The following year he enlisted in the First AIF (well, actually he enlisted three times, and was discharged as medically unfit each time).</p>
<p>Charlie married in Sydney in 1917 and had two daughters soon after. He returned to China in 1922 for 7 months.</p>
<p>Charlie Allen died in 1938 as the result of an industrial accident. He was 41.</p>
<p>How do we know all this about Charlie Allen?</p>
<p>We know this because there are fragments of Charlie&#8217;s life scattered throughout the holdings of the National Archives of Australia.</p>
<p>The CEDT from 1909 when he left Australia with his father:<br />
<div id="attachment_981" class="wp-caption aligncenter" style="width: 202px"><a href="http://discontents.com.au/wp-content/uploads/2010/09/Charles-Allen-1909-CEDT-front.jpg"><img src="http://discontents.com.au/wp-content/uploads/2010/09/Charles-Allen-1909-CEDT-front-192x300.jpg" alt="Charles Allen 1909 - CEDT front" title="Charles Allen 1909 - CEDT front" width="192" height="300" class="size-medium wp-image-981" /></a><a href="http://discontents.com.au/wp-content/uploads/2010/09/Charles-Allen-1909-CEDT-back.jpg"><img src="http://discontents.com.au/wp-content/uploads/2010/09/Charles-Allen-1909-CEDT-back-190x300.jpg" alt="" title="Charles Allen 1909 - CEDT back" width="190" height="300" class="size-medium wp-image-987" /></a><p class="wp-caption-text">NAA: ST84/1, 1909/22/41-50</p></div><br />
A letter from his mother to Prime Minister Billy Hughes, seeking help to return Charlie to Australia:<br />
<div id="attachment_990" class="wp-caption aligncenter" style="width: 199px"><a href="http://discontents.com.au/wp-content/uploads/2010/09/gum-letter1.jpg"><img src="http://discontents.com.au/wp-content/uploads/2010/09/gum-letter1-189x300.jpg" alt="Letter to Billy Hughes from Charlie&#039;s mother." title="Letter to Billy Hughes from Charlie&#039;s mother." width="189" height="300" class="size-medium wp-image-990" /></a><p class="wp-caption-text">NAA: A1, 1911/13854</p></div><br />
His WWI service record:<br />
<div id="attachment_991" class="wp-caption aligncenter" style="width: 201px"><a href="http://discontents.com.au/wp-content/uploads/2010/09/gum_ww1.jpg"><img src="http://discontents.com.au/wp-content/uploads/2010/09/gum_ww1-191x300.jpg" alt="Charles Allen&#039;s WWI attestation form" title="Charles Allen&#039;s WWI attestation form" width="191" height="300" class="size-medium wp-image-991" /></a><p class="wp-caption-text">NAA: B2455, ALLEN C A</p></div><br />
An identity form relating to his trip to China in 1922:<br />
<div id="attachment_992" class="wp-caption aligncenter" style="width: 200px"><a href="http://discontents.com.au/wp-content/uploads/2010/09/Charles-Allen-1922-form.jpg"><img src="http://discontents.com.au/wp-content/uploads/2010/09/Charles-Allen-1922-form-190x300.jpg" alt="" title="Charles Allen 1922 - form" width="190" height="300" class="size-medium wp-image-992" /></a><p class="wp-caption-text">NAA: SP42/1, C1922/4449</p></div><br />
But of course Charlie is not alone in the archives.</p>
<p>Charlie&#8217;s father was Chinese, so Charlie was categorised as a &#8216;half-caste&#8217;, as someone who was not white, and he fell under the restrictions imposed by the White Australia Policy.</p>
<p>The certificate from 1909 granted Charlie an exemption to the Dictation Test. Without it, he may not have been allowed back into the country.</p>
<p>Every time one of many thousands of non-Europeans resident in Australia sought to travel overseas and return home again they needed one of these certificates.</p>
<p>We&#8217;re all of course familiar with the general outlines of the White Australia Policy, and the way it underpinned conceptions of Australia as a nation in the first half of the 20th century.</p>
<p>But what we sometimes forget is that it was also a massive bureaucratic exercise.</p>
<p>Forms and certificates were printed, issued, used and filed. Regulations were modified, guidelines were distributed and administering officers were managed and advised. Individual cases were reviewed, policy was changed and new forms and certificates were printed, issued, used and filed&#8230;</p>
<p>For example, between 1901 and 1911, 400 circulars were issued to port officers about immigration restriction. The confidential manual on immigration restriction grew from one page in 1902 to more than 200 in 1912.</p>
<p>Much of this system is now preserved in the National Archives.</p>
<p>For the years between 1902 and 1948 there remain:</p>
<ul>
<li>More than 50,000 CEDTs</li>
<li>90 shelf metres of records</li>
<li>15,000 case files</li>
</ul>
<p>And within those many thousands of files are the scattered fragments of lives such as Charlie&#8217;s &#8212; lives that were controlled, monitored and documented in a vain attempt to make Australia &#8216;white&#8217;.</p>
<p>We&#8217;ve already seen today some wonderful examples of how these fragments, these slivers of existence, can be found, extracted, aggregated and displayed. But I think it&#8217;s worth considering for a moment what happens when we do this.</p>
<p>The historian Tim Hitchcock, behind projects such as the <a href="http://www.oldbaileyonline.org/">Old Bailey Online</a> and <a href="http://www.londonlives.org/">London Lives</a>, has reflected on the impact of digitisation on our access to archives. Archives, he notes, tend to reflect the assumptions and practices of the institutions that created them.</p>
<p>But by providing new ways into these records systems, technology can undermine the power relations that persist within their structures.</p>
<p>‘What changes’, asks Tim Hitchcock, ‘when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?’</p>
<p>I don&#8217;t know, but I think we should find out, don&#8217;t you?</p>
<p>**********</p>
<p>I hope you&#8217;ve all collected a <a href="http://twitpic.com/2ovirk">mini card</a>. These themselves provide a little glimpse at the real face of White Australia and I&#8217;d invite you all to head over to the <a href="http://www.naa.gov.au">National Archives website</a>, do battle with the monster that is <a href="http://www.naa.gov.au/collection/recordsearch/index.aspx">RecordSearch</a>, and look up the file references that are on each card.</p>
<p>The cards are part of a project that <a href="http://chineseaustralia.org/?page_id=2">Kate Bagnall</a> and I are trying to develop &#8212; <a href="http://invisibleaustralians.org">Invisible Australians</a>.</p>
<p>I should note too that the cards, and most of the examples I&#8217;m showing you here today are the product of Kate&#8217;s <a href="http://trove.nla.gov.au/work/3892554">long and detailed research into Chinese-Australian families</a>. In modern project management parlance, Kate is the domain expert, while I am merely the technical resource.</p>
<p>If we look again at one of the CEDTs, we can see that there&#8217;s a lot of useful structured data:</p>
<ul>
<li>name</li>
<li>place of birth</li>
<li>age</li>
<li>height</li>
<li>destination</li>
<li>date of departure</li>
<li>name of ship</li>
</ul>
<p><em>Invisible Australians</em> has the modest aim of extracting this data from the 50,000+ forms in the National Archives. But of course that&#8217;s just the start, because each person might have used a number of certificates &#8212; so then it&#8217;s a matter of matching these identities.</p>
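<p>To make the idea concrete, here is a minimal sketch of how a transcribed certificate and a first, naive pass at identity matching might look. The class, field names and matching rule are all assumptions made for illustration, not the project&#8217;s actual design. The two file references come from Charlie&#8217;s records above; the second stands in for a related form, and its odd capitalisation is invented to show the normalisation.</p>

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class CertificateRecord:
    """One transcribed form. Field names are illustrative assumptions."""
    file_reference: str
    name: str
    place_of_birth: Optional[str] = None
    age: Optional[int] = None
    height: Optional[str] = None
    destination: Optional[str] = None
    date_of_departure: Optional[str] = None
    name_of_ship: Optional[str] = None


def match_key(record):
    """Naive identity key: normalised name plus place of birth."""
    name = " ".join(record.name.lower().split())
    place = (record.place_of_birth or "").lower().strip()
    return (name, place)


def group_by_person(records):
    """Group forms that share the same naive key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return dict(groups)


records = [
    CertificateRecord("NAA: ST84/1, 1909/22/41-50", "Charles Allen",
                      place_of_birth="Sydney", destination="China",
                      date_of_departure="1909"),
    # Capitalisation and spacing deliberately varied for the example.
    CertificateRecord("NAA: SP42/1, C1922/4449", "Charles  ALLEN",
                      place_of_birth="Sydney", destination="China",
                      date_of_departure="1922"),
]
groups = group_by_person(records)
for key, forms in groups.items():
    print(key, [form.file_reference for form in forms])
```

<p>A key this crude would wrongly merge different people who share a name and birthplace, and would miss variant spellings, which is one reason human judgement would still be needed.</p>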
<div id="attachment_1015" class="wp-caption aligncenter" style="width: 310px"><a href="http://invisibleaustralians.org"><img src="http://discontents.com.au/wp-content/uploads/2010/09/invis_aus_1-300x224.jpg" alt="Invisible Australians" title="Invisible Australians" width="300" height="224" class="size-medium wp-image-1015" /></a><p class="wp-caption-text">http://invisibleaustralians.org</p></div>
<p>And then there are a range of other related forms, not to mention case files, alien registration documents, naturalisation applications&#8230;</p>
<p>Obviously we can&#8217;t do it alone. We&#8217;ll be creating a crowdsourcing tool to extract and link the data.</p>
<p>It&#8217;s ridiculously ambitious, totally unfunded and is likely to take over our lives.</p>
<p>Is it worth it?</p>
<p>Imagine being able to navigate the network of lives, families and relationships. To follow their journeys, to share their tragedies, to celebrate their small victories against a repressive system.</p>
<p>Imagine being able to watch them age.</p>
<div style="width:425px" id="__ss_5306053"><strong style="display:block;margin:12px 0 4px">Pauline Ah Hee and Shadee Khan</strong><object id="__sse5306053" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=lifeofinfo-photo-aging-100928075124-phpapp01&#038;stripped_title=life-of-info-photo-aging&#038;userName=wragge" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse5306053" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=lifeofinfo-photo-aging-100928075124-phpapp01&#038;stripped_title=life-of-info-photo-aging&#038;userName=wragge" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object></div>
<p>Is it worth it? We think so.</p>
<p>**********</p>
<p>For Tim Hitchcock technology opens up the possibility of writing a new history from below, exploring how the poor, the marginalised and the powerless navigated the institutions of the modern state. But it&#8217;s not just about search engines and databases. He talks about making &#8216;best use of the technology of emotions and representation &#8212; how you use words and pictures and a story to impact, not just on what people think, but what they see in their mind&#8217;s eye&#8217;.</p>
<p>In this project, the photos matter. I hope the irony in our project title is obvious.</p>
<p><a href="http://discontents.com.au/wp-content/uploads/2010/09/moo_cards.jpg"><img src="http://discontents.com.au/wp-content/uploads/2010/09/moo_cards-300x225.jpg" alt="Some of the faces of Invisible Australia" title="moo_cards" width="300" height="225" class="aligncenter size-medium wp-image-1005" /></a></p>
<p>This is the real face of White Australia.</p>
<p>The photos remind us that the project is not just about shifting data around &#8212; these are lives, these are people.</p>
<p>But this brings its own challenge, for if we are seeking to liberate these lives from the fragmentation and obscurity of bureaucratic systems then we should be asking what are we liberating them into?</p>
<p>A database?</p>
<p>This is not just an exercise in data creation and management. We also have to think carefully and creatively about issues of representation, access and discovery.</p>
<p>We have to give these lives back their freedom to associate, to have relationships, to make connections.</p>
<p>We need to embed these lives in a variety of contexts and combinations. To make room for serendipity, celebration, sadness, and yes, even play.</p>
<p>We need to bring these lives into a rich and ongoing conversation with the world.</p>
<p>But how?</p>
<p>**********</p>
<p>I&#8217;ve been working on a little experiment for the National Museum of Australia called <em><a href="http://defining.net.au/wall/">The History Wall</a></em>. What the History Wall does is quite simple: it pulls together data on the fly from a variety of sources including <a href="https://wiki.nla.gov.au/display/peau/Home">People Australia</a>, the <a href="http://adbonline.anu.edu.au/adbonline.htm">Australian Dictionary of Biography</a>, the <a href="http://trove.nla.gov.au/newspaper">National Library&#8217;s newspapers project</a>, <a href="http://www.abs.gov.au/AUSSTATS/abs@.nsf/mf/3105.0.65.001">historical population data</a> from the Bureau of Statistics, photos from the Flickr accounts of the Powerhouse Museum and the National Archives, and the <a href="http://www.nma.gov.au/collections-search/basic">collection database</a> of the National Museum itself. It chooses randomly from all this stuff, throws the results up into the air and then displays them however they happen to fall. No two views are ever quite the same.</p>
<div id="attachment_1006" class="wp-caption aligncenter" style="width: 160px"><a href="http://defining.net.au/wall/"><img src="http://discontents.com.au/wp-content/uploads/2010/09/wall-150x300.jpg" alt="The History Wall" title="The History Wall" width="150" height="300" class="size-medium wp-image-1006" /></a><p class="wp-caption-text">http://defining.net.au/wall/</p></div>
<p>It&#8217;s something more than a timeline. To me it&#8217;s more like a celebration of context and serendipity. There&#8217;s a richness to it, a sense of discovery and fun, but there&#8217;s also fragility &#8212; next time you look it might be gone.</p>
<p>It&#8217;s a bit like history itself.</p>
<p>It&#8217;s a bit like the world.</p>
<p>How do we create spaces for our data to merge and mingle? How do we encourage the development of new contexts and connections?</p>
<p>I think the first thing we have to do is stop thinking about databases and dictionaries, registers and encyclopaedias. Don&#8217;t get me wrong, I&#8217;m not being critical of the wonderful projects we&#8217;ve seen today. I just think we can use all this work better if we stop thinking about individual resources and start developing on a web scale, on a global scale.</p>
<p>Yes, we have the technology. Time today has spared you from a detailed discourse on the Semantic Web, but I do want to focus on one aspect.</p>
<p>You may have heard of Linked Data. It&#8217;s <a href="http://www.w3.org/DesignIssues/LinkedData.html">a set of guidelines</a> to help you publish your data to the Semantic Web. There are only four basic principles and I&#8217;m only going to talk about one of them. It&#8217;s one of those deceptively simple things. You look at it and think, &#8216;yeah, ok&#8217;, but before too long it&#8217;s starting to turn your brain inside out.</p>
<blockquote><p><strong>Use URLs to identify things in the real world.</strong></p></blockquote>
<p>Yeah, ok&#8230;</p>
<p>You know what URLs are, web addresses, the things you type in your browser&#8217;s location field.</p>
<p>And hopefully you know what things in the real world are: people, places, objects, events, ideas&#8230;</p>
<p>Now you may have detected a problem here, because no matter how many times you click the refresh button, your web browser is not going to be able to use such a URL to magically deliver you the real world thing.</p>
<p>Well, unless you&#8217;re on eBay.</p>
<p>Fortunately, the Linked Data guidelines provide for a bit of technical trickery that allows your browser to retrieve not the real world thing, but some information about that thing &#8212; perhaps in the form of a web page.</p>
<p>Why bother?</p>
<p>Names are powerful.</p>
<p>We share and use names to talk about things. Computers are the same. If we use URLs to identify things in the real world, then computers can start talking about them.</p>
<p>We can define and explore real-world relationships in an online environment. We can create rich, meaningful linkages across databases, across disciplines, across the world.</p>
<p>We can start building and thinking on a web scale.</p>
<p>**********</p>
<p>Thanks to the People Australia project, I can confidently claim that this is me:</p>
<p><a href="http://nla.gov.au/nla.party-479364#foaf:Person">http://nla.gov.au/nla.party-479364#foaf:Person</a></p>
<p>I keep meaning to get it on a t-shirt.</p>
<p>The most exciting thing about People Australia is not the EAC records or the aggregation of resources &#8212; it&#8217;s the identifiers, because they enable us to say things about people anywhere on the web that computers can understand and relate back to a specific real world entity &#8212; a person.</p>
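<p>One common form of the technical trickery mentioned earlier is the hash URI: everything after the # names the thing, while the part before it is the address of a document about the thing. A browser drops the fragment before it makes its request, so it only ever fetches the document. The sketch below assumes my People Australia identifier follows that pattern and simply splits it with Python&#8217;s standard library.</p>

```python
from urllib.parse import urldefrag

# The identifier quoted above; the part after '#' is never sent to the server.
identifier = "http://nla.gov.au/nla.party-479364#foaf:Person"

document_url, fragment = urldefrag(identifier)

print("Names the thing:    ", identifier)
print("Fetched by a browser:", document_url)
print("Fragment:           ", fragment)
```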
<p>You can start doing it now with <a href="http://wraggelabs.com/identities">Wragge&#8217;s Identity Browser</a>.</p>
<div id="attachment_1009" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/identities/"><img src="http://discontents.com.au/wp-content/uploads/2010/09/id_browser-300x218.jpg" alt="Wragge&#039;s Identity Browser" title="Wragge&#039;s Identity Browser" width="300" height="218" class="size-medium wp-image-1009" /></a><p class="wp-caption-text">http://wraggelabs.com/identities/</p></div>
<p>This is a little tool I built using the People Australia API. It makes it easy to find identifiers for people and organisations, and it supplies you with some code that you can drop into a blog post or web page that will tell a computer that a name relates to a thing called a &#8216;person&#8217;, that this person&#8217;s name has a certain standard form, and that this person can be uniquely identified by People Australia.</p>
<p>Even if you don&#8217;t publish a website or a blog, you can use People Australia identifiers to build semantic linkages. Wragge&#8217;s Identity Browser also creates machine tags for you. Machine tags are like normal tags but with built in semantics. When coupled with identifiers they enable you to do some pretty powerful things.</p>
<p>You could for example use machine tags in Flickr to tell computers that a certain photo depicts a person uniquely identified by People Australia. In fact, people have been doing just that.</p>
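<p>A machine tag has three parts, written as namespace:predicate=value. The sketch below builds one and takes it apart again. The namespace and predicate used here (foaf and depicts) are assumptions chosen for illustration and may not be exactly what the Identity Browser generates.</p>

```python
def make_machine_tag(namespace, predicate, value):
    """Assemble a machine tag in namespace:predicate=value form."""
    return f"{namespace}:{predicate}={value}"


def parse_machine_tag(tag):
    """Split a machine tag into (namespace, predicate, value).

    The value may itself contain ':' or '=' (URLs usually do), so we
    split on the first '=' and then on the first ':' before it.
    """
    prefix, _, value = tag.partition("=")
    namespace, _, predicate = prefix.partition(":")
    return namespace, predicate, value


# Hypothetical tag saying that a photo depicts an identified person.
tag = make_machine_tag("foaf", "depicts", "http://nla.gov.au/nla.party-479364")
print(tag)
print(parse_machine_tag(tag))
```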
<div id="attachment_1010" class="wp-caption aligncenter" style="width: 276px"><a href="http://wraggelabs.com/fmtc/"><img src="http://discontents.com.au/wp-content/uploads/2010/09/fmtc-266x300.jpg" alt="Flickr Machine Tag Challenge" title="Flickr Machine Tag Challenge" width="266" height="300" class="size-medium wp-image-1010" /></a><p class="wp-caption-text">http://wraggelabs.com/fmtc/</p></div>
<p>The <a href="http://wraggelabs.com/fmtc/">Flickr Machine Tag Challenge</a> is a sort of scoreboard that I built to encourage people to start adding People Australia enriched machine tags to photos. More than 1200 tags have been added to over 1000 photos. Feel free to join in!</p>
<p>The point is that the technologies already exist to enable us to build web scale biographical resources. Not dictionaries or databases as we know them, but networks capable of constant expansion, elaboration, and cooperation.</p>
<p>What we need are more tools to make it simple, recipes to make it obvious, examples and applications to make it popular, and leadership to make it all seem possible.</p>
<p>**********</p>
<p>Of course most of the lives we hope to liberate through Invisible Australians will not be represented in People Australia.</p>
<p>Not yet.</p>
<p>But Invisible Australians will offer a point of aggregation and disambiguation that will enable our people to find their way from the bureaucratic recesses of the White Australia Policy to a place on the national stage.</p>
<p>And we will encourage others to do likewise. Basil can&#8217;t do all the work. The centralised system has to be fed through centres of aggregation and collaboration.</p>
<p>Similarly, there are many great resources already out there relating to Chinese-Australians. There are hordes of family and local historians compiling and publishing biographical data. We want to identify people in these resources and link to them.</p>
<p>We want to publish up to People Australia and link down to a single headstone in a lonely country cemetery.</p>
<p>But to do this we need to help people make their resources linkable. To help them create persistent, re-usable URLs, and expose their data in standard formats. To create Linked Data, even if they have no particular interest in the Semantic Web.</p>
<div id="attachment_1013" class="wp-caption aligncenter" style="width: 310px"><a href="http://invisibleaustralians.org/"><img src="http://discontents.com.au/wp-content/uploads/2010/09/invis_aus_2-300x225.jpg" alt="Invisible Australians" title="Invisible Australians" width="300" height="225" class="size-medium wp-image-1013" /></a><p class="wp-caption-text">http://invisibleaustralians.org/</p></div>
<p>Invisible Australians is not just about extracting data from archives. It&#8217;s also about working with others to build capacities and demonstrate possibilities.</p>
<p>It&#8217;s ridiculously ambitious, totally unfunded and is likely to take over our lives.</p>
<p>Is it worth it?</p>
<p>We think so.</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shoebox/archives-shoebox/liberating-lives/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>(a not so) Quick catch up</title>
		<link>http://discontents.com.au/shed/a-not-so-quick-catch-up</link>
		<comments>http://discontents.com.au/shed/a-not-so-quick-catch-up#comments</comments>
		<pubDate>Fri, 07 May 2010 15:37:13 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[the shed]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Flickr]]></category>
		<category><![CDATA[games]]></category>
		<category><![CDATA[greasemonkey]]></category>
		<category><![CDATA[identities]]></category>
		<category><![CDATA[machine tags]]></category>
		<category><![CDATA[newspapers]]></category>
		<category><![CDATA[People Australia]]></category>
		<category><![CDATA[semantic web]]></category>
		<category><![CDATA[userscripts]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=843</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=%28a+not+so%29+Quick+catch+up&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2010-05-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/a-not-so-quick-catch-up&amp;rft.language=English"></span>
The trained guinea pigs in the Wragge Labs bunker have been churning out all sorts of stuff in the last few months, and I&#8217;m way behind in my attempts to document their activities. So this is a bit of a catch-up post to try and commit a few pertinent details to the collective memory bank [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=%28a+not+so%29+Quick+catch+up&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.subject=the+shed&amp;rft.source=discontents&amp;rft.date=2010-05-08&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/a-not-so-quick-catch-up&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=843"><!-- &nbsp; --></abbr>
<p>The trained guinea pigs in the Wragge Labs bunker have been churning out all sorts of stuff in the last few months, and I&#8217;m way behind in my attempts to document their activities. So this is a bit of a catch-up post to try and commit a few pertinent details to the collective memory bank before they are lost forever in the sleep-deprived fog of day-to-day existence.</p>
<h3>Identity upgrades</h3>
<p>There have been a number of major improvements to <a href="http://wraggelabs.com/identities/">Wragge&#8217;s Identity Browser</a>. Regular viewers will recall that the Identity Browser is built on top of the <a href="http://www.nla.gov.au/apps/srw/search/peopleaustralia">People Australia SRU interface</a>. You might not realise, however, that People Australia contains details of many organisations as well as people. We can only be thankful that it wasn&#8217;t called Entity Australia.</p>
<p>The first version of my Identity Browser only searched for people, but now all your corporate-entity-identification needs are also met, with only a few minor changes to the interface so beloved by numerous generations of identity seekers. To be specific, through the wonders of drop-down technology you can choose whether you want to search for a person or an organisation. Or not. You can also just ignore that and search for everything and get back sensible results anyway. It&#8217;s your choice. Or not.</p>
<div id="attachment_864" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/identities/"><img class="size-medium wp-image-864" title="identities" src="http://discontents.com.au/wp-content/uploads/2010/05/identities-300x77.jpg" alt="" width="300" height="77" /></a><p class="wp-caption-text">Gaze in awe at the power of my dropdown</p></div>
<p>Ah, pattern matching&#8230; there are few phrases so redolent of warm summer days, hidden pleasures, and the subtle delights of wildcard characters. The People Australia SRU interface was sadly lacking in the pattern matching department, but this has now been rectified. So now you can mix your stems and asterisks with wild abandon. Searching for &#8216;Curtin, J*&#8217; will now retrieve all those Curtins whose names begin with &#8216;J&#8217;. Amazing isn&#8217;t it?</p>
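<p>If you want to see what a stem plus an asterisk does without going near the SRU interface, the following sketch applies the same kind of pattern to a small, made-up list of names using Python&#8217;s standard library. It only illustrates wildcard matching in general and assumes nothing about how People Australia implements it.</p>

```python
from fnmatch import fnmatchcase

# Sample names invented for the example, in 'Surname, Firstname' form.
names = ["Curtin, John", "Curtin, Elsie", "Curtis, James"]

# '*' stands for any run of characters after the stem 'J'.
pattern = "Curtin, J*"
matches = [name for name in names if fnmatchcase(name, pattern)]

print(matches)
```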
<p>Astonishing too is the fact that the accompanying &#8216;Identify me!&#8217; bookmarklet continues to function with nary a murmur of protest. There is, however, a little bit of cleverness built in to enhance your bookmarklet experience. If the text that you highlight has a comma in it, the Identity Browser will conclude that you&#8217;re feeding it the name of a person – ie Surname, Firstname – and will treat the Firstname as a stem. So if you highlight &#8216;Whitlam, G&#8217; and click on the bookmarklet, the Identity Browser will be kick-started into life, searching for everything that matches surname equals &#8216;Whitlam&#8217; and firstname is like &#8216;G*&#8217;. If there&#8217;s no comma – ie firstname secondname – then it heads off to look for either a person whose surname equals &#8216;secondname&#8217; and whose firstname is like &#8216;firstname*&#8217;, or an organisation whose name includes both &#8216;firstname&#8217; and &#8216;secondname&#8217;. Got all that?</p>
<p>Basically the idea was to try and provide some sensible defaults so you really don&#8217;t have to think about it too much.</p>
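<p>For the curious, the defaults described above can be sketched in a few lines. This is a reconstruction from the description in this post, not the Identity Browser&#8217;s actual code. The function name and the dictionary keys are hypothetical, and the fallback for anything other than two words is an assumption, since only the two-word case is described here.</p>

```python
def build_queries(highlighted):
    """Turn highlighted text into a list of search descriptions."""
    text = " ".join(highlighted.split())  # collapse stray whitespace

    if "," in text:
        # 'Surname, Firstname': treat the first name as a stem.
        surname, _, firstname = text.partition(",")
        return [{"type": "person",
                 "surname": surname.strip(),
                 "firstname": firstname.strip() + "*"}]

    words = text.split(" ")
    if len(words) != 2:
        # Assumption: anything else falls back to a plain search.
        return [{"type": "any", "name": text}]

    first, second = words
    return [
        # Either a person: surname equals the second word, first name is a stem...
        {"type": "person", "surname": second, "firstname": first + "*"},
        # ...or an organisation whose name includes both words.
        {"type": "organisation", "name_includes": [first, second]},
    ]


print(build_queries("Whitlam, G"))
print(build_queries("Gough Whitlam"))
```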
<p>I have it in my head to prepare a long and rapturous homage to the wonders of machine tags. With their sly semantic ways and easy-going nature, they offer some exciting possibilities not just for user-generated content, but user-generated meanings and user-generated relationships. But for the full, ripe pleasure of that post you will have to wait another day. For now I shall simply say that as well as RDFa, the Identity Browser provides automagically-generated machine tags.</p>
<p>Where might you use them? Flickr&#8217;s a good place to start. Try identifying the subjects and creators of Flickr photos. At the NSW Reference and Information Services Group Seminar the other day I challenged those in attendance to go forth and machine tag. Already more than 100 machine tags have been added to Flickr using my Identity Browser. Expect to hear more about the Great Flickr Machine Tag Challenge soon&#8230;</p>
<p>One more thing&#8230; try adding &#8216;.rdf&#8217; on to the end of an identity record – eg <a href="http://wraggelabs.com/identities/person/612109.rdf">http://wraggelabs.com/identities/person/612109.rdf</a>. Just an experiment at the moment&#8230;</p>
<h3>More machine tag love</h3>
<p>One night on Twitter, <a href="http://twitter.com/lifeasdaddy">@lifeasdaddy</a> pointed out that someone had started using fragments of URLs from the <a href="http://trove.nla.gov.au/newspaper">NLA newspapers site</a> as tags in the <a href="http://www.powerhousemuseum.com/collection/database/?irn=244414">Powerhouse Museum&#8217;s collection database</a>. In the conversation that ensued with <a href="http://twitter.com/sebchan">@sebchan</a> and others, I suggested that the PHM could encourage this sort of rich tagging by supporting machine tags, with all their wonderful juicy semantic goodness. The guinea pigs got excited as well, and before I knew it, they&#8217;d constructed a little <a href="http://semweb-helper.appspot.com/">Semweb Helper app</a>.</p>
<p>The Semweb Helper comes with its very own custom-tailored bookmarklet. If you find an article on the NLA newspapers site that you&#8217;d like to point to, just click on the bookmarklet and marvel as a range of useful machine tags are automagically generated. Then you just pick the appropriate tag, copy and paste et voila – instant semantic gratification.</p>
<div id="attachment_861" class="wp-caption aligncenter" style="width: 310px"><a href="http://semweb-helper.appspot.com/"><img class="size-medium wp-image-861" title="semweb-helper" src="http://discontents.com.au/wp-content/uploads/2010/05/semweb-helper-300x147.jpg" alt="Screenshot" width="300" height="147" /></a><p class="wp-caption-text">Try out the Semweb Helper</p></div>
<p>It&#8217;s a very simple little app, and really just a demonstration of how semantic web technologies might be made available to the masses. It was also the first time the guinea pigs had been allowed to play with Google App Engine.</p>
<h3>Who am I?</h3>
<p>This short catch-up post has become something quite long and rambling. Did I mention that I&#8217;m sleep-deprived? Anyway, a recent addition to the Wragge Labs range of lifestyle accessories is <a href="http://wraggelabs.com/whoami/">&#8216;Who am I?&#8217; </a>– a simple little game that is something like a cross between hangman and Wheel of Fortune. Choosing a person at random from People Australia and the <em>Australian Dictionary of Biography</em>, &#8216;Who am I?&#8217; tests your powers of logic, stamina and historical guesstimation.</p>
<p>Your challenge is to figure out the surname of the mystery historical personage. To help you there are a series of clues, such as their birthplace and known associates. With each guess you also see a little bit more of their portrait. But beware! For ten wrong guesses are all that are permitted to any so brave as to enter upon this quest. Not eleven or twelve, but ten and ten only. To ignore this limit is to invite ridicule and disdain – do so at your peril.</p>
<div id="attachment_858" class="wp-caption aligncenter" style="width: 310px"><a href="http://wraggelabs.com/whoami/"><img class="size-medium wp-image-858" title="whoami" src="http://discontents.com.au/wp-content/uploads/2010/05/whoami-300x137.jpg" alt="Who am I screenshot" width="300" height="137" /></a><p class="wp-caption-text">Play Who am I?</p></div>
<p>&#8216;Who am I?&#8217; builds upon some work I&#8217;ve been doing for the National Museum of Australia – looking at ways of mashing together various types of date-identified data. As part of that project I&#8217;ve built a series of APIs and have scraped, pummelled and munged data from a variety of sources.</p>
<p>What&#8217;s the point? I wonder this myself sometimes, particularly after I fling such things off into the aethernet and hear naught but a rare retweet. I am, after all, only in it for the glory, oh and the money of course. (Hmmm, I must look again at that business plan.) The point is twofold: first to highlight possibilities for the re-use and remixing of cultural data; second, to play with game-based models for discovery and exploration of cultural resources; and&#8230; err&#8230; thirdly just to try building something a little different.</p>
<p>Of course, if you like &#8216;Who am I?&#8217; you will probably also want to try <a href="http://wraggelabs.com/newsroulette/">Headline Roulette</a>&#8230;</p>
<h3>Headline Roulette Reprieve</h3>
<p>At the end of <a href="http://discontents.com.au/shed/experiments/headline-roulette">our last instalment</a>, the future of <a href="http://wraggelabs.com/newsroulette/">Headline Roulette</a> seemed in dire peril. Changes to the National Library of Australia web site threatened its very existence. Did it have a future? Could it survive? And did anybody care?</p>
<p>As we pick up the story oblivion looms. The feared changes are confirmed, but just as all seems lost&#8230; is it? Could it be? Yes, an advanced search facility is added to the newspapers site within Trove. Sensing this may be their only opportunity, the guinea pigs leap into action, building <a href="http://bitbucket.org/wragge/nla-newspapers-scraper">a new screen-scraper</a>, saving Headline Roulette from doom, and setting the world upon the path to a safer, happier future.</p>
<p>In short, Headline Roulette will live on&#8230; so enjoy.</p>
<h3>Handing out some presents</h3>
<p>My head is easily turned by flattery and praise. Yes, I really am so shallow and so vain. But this means that if people say nice things to me, I&#8217;m inclined to give them presents.</p>
<p>As well as doing exciting things in the web 2.0 realm for the PROV, <a href="http://twitter.com/asaletourneau">@asaletourneau</a> leaves nice comments on this blog. So he earned himself a present. It&#8217;s not much, but I <a href="http://userscripts.org/scripts/show/71421">built a userscript</a> that displays photos from the PROV site in a neat little slideshow (it&#8217;s the non-3D javascript version of CoolIris). Install Greasemonkey, get the userscript and <a href="http://proarchives.imagineering.com.au/index_search.asp?searchid=41">try it out</a> (just do a search, then click on the &#8216;Browse as slideshow&#8217; button).</p>
<div id="attachment_852" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2010/05/prov-slideshow.jpg"><img class="size-medium wp-image-852" title="prov-slideshow" src="http://discontents.com.au/wp-content/uploads/2010/05/prov-slideshow-300x187.jpg" alt="Screen capture of slideshow" width="300" height="187" /></a><p class="wp-caption-text">PROV transport photos in a pretty slideshow</p></div>
<p>The State Library of NSW, or more specifically <a href="http://www.twitter.com/ellenforsyth">@ellenforsyth</a>, also earned my favour by inviting me to rave on about Linked Data at the afore-mentioned NSW RISG seminar. As a result, I added support for the SLNSW photo collections to my <a href="http://discontents.com.au/shoebox/archives-shoebox/harvesting-context-1">Flickr Context Harvester</a> userscript. Well&#8230; it&#8217;s the thought that counts, right? Once again – install Greasemonkey, <a href="http://userscripts.org/scripts/show/56135">get the userscript</a> and then <a href="http://acms.sl.nsw.gov.au/item/itemDetailPaged.aspx?itemID=447435">try it out</a>.</p>
<div id="attachment_855" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/wp-content/uploads/2010/05/slnsw-flickr.jpg"><img class="size-medium wp-image-855" title="slnsw-flickr" src="http://discontents.com.au/wp-content/uploads/2010/05/slnsw-flickr-300x181.jpg" alt="Flickr Context Harvester screenshot" width="300" height="181" /></a><p class="wp-caption-text">The Flickr Context Harvester in action</p></div>
<h3>And coming up&#8230;</h3>
<p>Stay tuned for more on the Great Flickr Machine Tag Challenge, screencasts demonstrating my Identity Browser, some playing with relationships, and much much more. But right now the squirming baby on my lap needs a nappy change&#8230;</p>
<p>Did I mention that I&#8217;m sleep deprived?</p>
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/a-not-so-quick-catch-up/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cloudy biographies and portrait walls</title>
		<link>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls</link>
		<comments>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls#comments</comments>
		<pubDate>Sat, 24 Jan 2009 08:26:12 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[ADB Online]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Cooliris]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[word clouds]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=409</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
With a bit of time to play over Christmas I had a go at applying some of the techniques described at ProgrammingHistorian to the ADB Online.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=409"><!-- &nbsp; --></abbr>
<p>With a bit of time to play over Christmas I had a go at applying some of the techniques described at <a href="http://niche.uwo.ca/programming-historian/index.php"><em>ProgrammingHistorian</em></a> to the <a href="http://www.adb.online.anu.edu.au/adbonline.htm">ADB Online</a>.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had to offer as a way of improving access to the articles.</p>
<p>So I set about learning Python and was soon downloading and scraping the more than 10,000 articles that make up the ADB online.</p>
<p>My first tests revealed that the most frequent words in ADB articles were&#8230;</p>
<p style="text-align: center;"><strong>born</strong> and <strong>died</strong></p>
<p style="text-align: left;">Who&#8217;d have thought it? In a biographical dictionary?</p>
<p style="text-align: left;">After further refining the stopwords list I started to generate some useful clouds. Finally, after 147 minutes of processing time, I had a <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html">word cloud</a> representing the content of all 16 volumes of the <em>Australian Dictionary of Biography</em>.</p>
<div id="attachment_559" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html"><img class="size-medium wp-image-559" title="adb-cloud-complete" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-complete-300x195.jpg" alt="The complete ADB word cloud" width="300" height="195" /></a><p class="wp-caption-text">The complete ADB word cloud</p></div>
<p style="text-align: left;"><span id="more-409"></span>The words in the cloud are linked back to the ADB&#8217;s own search engine, allowing the cloud to be used as a way of exploring the articles themselves.</p>
<p style="text-align: left;">It shows the top 200 words, but if you want to see the rest you can download the <a href="http://discontents.com.au/shed/adb/clouds/wordfreqs.txt">raw word frequency file</a> (&gt;1mb txt file).</p>
<p style="text-align: left;">What can you see? Amongst other things, John is obviously the most popular name, Sydney just edges out Melbourne as the most popular place, and burial beats cremation as the most common mode of dispatch. It&#8217;s fun to explore.</p>
<p style="text-align: left;">But of course this then set me wondering about how these frequencies might change with the development of the ADB and changes in its subjects. So I generated word clouds for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html">each volume</a> and for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html">each chronological series</a>.</p>
<div id="attachment_563" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html"><img class="size-medium wp-image-563" title="adb-cloud-volumes" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-volumes-300x195.jpg" alt="Word clouds by volume" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by volume</p></div>
<div id="attachment_564" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html"><img class="size-medium wp-image-564" title="adb-cloud-series" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-series-300x195.jpg" alt="Word clouds by series" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by series</p></div>
<p>I even added some simple Javascript slideshows so you could watch the clouds evolve.</p>
<p>One of the most obvious features in the series clouds is the gradual disappearance of &#8216;land&#8217;. It&#8217;s one of the most prominent words in the first series, but gradually fades until it disappears completely in the last.</p>
<p>After this successful foray into the world of word clouds, I began to think about other ways of visualising the ADB&#8217;s content. Many of the articles have portrait images. Wouldn&#8217;t it be interesting to use the images themselves as the entry point to the biographical articles?</p>
<p>I&#8217;d already been <a href="http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d">playing with CoolIris</a>, so I decided to harvest all the portrait references and use them to create a 3D wall. The <a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html">result is pretty spectacular</a>.</p>
<div id="attachment_569" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html"><img class="size-medium wp-image-569" title="gallery" src="http://discontents.com.au/wp-content/uploads/2009/01/gallery-300x66.jpg" alt="ADB portrait browser" width="300" height="66" /></a><p class="wp-caption-text">ADB portrait browser</p></div>
<p>Some technical details about the clouds and the portrait browser follow, for those interested in such things&#8230;</p>
<h3>Gathering your words</h3>
<p>Conveniently for me, <em>ProgrammingHistorian</em> uses the <em>Dictionary of Canadian Biography</em> as its main example, so there was much code that I could <span style="text-decoration: line-through;">just cut and paste</span> carefully examine and utilise.  As the examples show, it&#8217;s easy to grab a webpage and analyse its content on the fly. But I wanted to process more than 10,000 pages and I knew that I was unlikely to get it working the first time round, so I decided to download the files first and then work on them locally. PH provided a basic example, to which I added some error-handling and the necessary loops to cycle through the ADB files. Because I had a bit of inside knowledge I cheated and hard-coded the numbers of articles in each volume. If I hadn&#8217;t known this I would have had to scrape all the browse pages, pulling out the links and creating a list of individual ids – not hard, but a bit tedious. Anyway this is how it ended up:</p>
<pre><pre class="brush: python">
# download_adb.py

import urllib2, time, os, sys
import dh
items = (565, 575, 607, 526, 614, 533, 543, 723, 737, 742, 737, 759, 755, 721, 703, 714, 694, 126)
if os.path.exists(&#039;adb&#039;) == 0: os.mkdir(&#039;adb&#039;)

for v in range(0,18):
    for i in range (1,(items[v]+1)):
        if v == 0:
            filename = &#039;AS1%04db.htm&#039; % i
        else:
            filename = &#039;A%02d%04db.htm&#039; % (v, i)
        if os.path.isfile(&#039;adb/&#039; + filename) == 0:
            print &#039;Processing: &#039; + filename
            url = &#039;http://adbonline.anu.edu.au/biogs/&#039; + filename
            try:
                response = urllib2.urlopen(url)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                html = response.read()
                f = open(&#039;adb/&#039; + filename, &#039;w&#039;)
                f.write(html)
                f.close()
                time.sleep(2)
        else:
            print &quot;File already downloaded&quot;
        sys.stdout.flush()</pre></pre>
<h3>Learning to count</h3>
<p>Before too long I had a directory full of about 11,000 little html files just waiting for me to begin my evil experiments. First I had to slice them up and pull out all the interesting bits. By examining the code of the pages I could see that the main content was inside a div with the id of &#8216;content&#8217;. Using the Beautiful Soup Python library, I was easily able to extract this div. But the content div also usually included a portrait image and a bibliography. Once again I dipped into Beautiful Soup to discard all the unwanted bits. The slicing and dicing went something like this:</p>
<pre><pre class="brush: python">
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)</pre></pre>
<p>Now I had the text of the article to play with. Following the PH examples it wasn&#8217;t long before I could extract word-frequency tables from a few files at a time. However, when I tried to process all the articles from a particular volume it took a verrry long time. I fiddled a bit with the code and amazed myself by dramatically improving the performance. I replaced the <em>wordListToFreqDict</em> function provided by PH with my own modified version:</p>
<pre><pre class="brush: python">
def wordListToFreqDict2(wordlist):
    worddict = dict.fromkeys(wordlist)
    wordfreq = [wordlist.count(p) for p in worddict.keys()]
    return dict(zip(worddict,wordfreq))
</pre></pre>
<p>The <code>worddict = dict.fromkeys(wordlist)</code> line made all the difference, creating a dictionary keyed on the unique words, so that each unique word is checked against the full word list just once.  With this hack in place I was able to process a complete volume in a few minutes.</p>
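<p>There is room to go further: <code>wordlist.count()</code> still scans the whole list once per unique word. On Python 2.7 or later, <code>collections.Counter</code> does the tally in a single pass. A minimal sketch follows; the name <code>wordListToFreqDict3</code> and the sample words are made up for the example and are not part of PH or the scripts in this post.</p>

```python
from collections import Counter

def wordListToFreqDict3(wordlist):
    """Count every word in one pass instead of calling list.count() per unique word."""
    return dict(Counter(wordlist))

words = "land born sydney land died land sydney".split()
freqs = wordListToFreqDict3(words)

# Sort into (count, word) pairs, most frequent first
ranked = sorted(((count, word) for word, count in freqs.items()), reverse=True)
print(ranked)
```

<p>The result is the same kind of word-to-count dictionary, so it could be sorted and fed into a cloud in the same way.</p>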
<p>I was already using a list of stopwords provided by PH to exclude things such as &#8216;such&#8217; , &#8216;as&#8217; and &#8216;and&#8217;, but obviously a few additions were necessary. To the list of stopwords I added:</p>
<pre><pre class="brush: python">
stopwords += [&#039;january&#039;, &#039;february&#039;, &#039;march&#039;, &#039;april&#039;, &#039;may&#039;, &#039;june&#039;, &#039;july&#039;, &#039;august&#039;, &#039;september&#039;, &#039;october&#039;, &#039;november&#039;, &#039;december&#039;]
stopwords += [&#039;new&#039;, &#039;south&#039;, &#039;wales&#039;, &#039;australia&#039;, &#039;australian&#039;, &#039;victoria&#039;, &#039;south&#039;, &#039;western&#039;, &#039;queensland&#039;, &#039;tasmania&#039;]
#stopwords += [&#039;sydney&#039;, &#039;melbourne&#039;, &#039;brisbane&#039;, &#039;adelaide&#039;, &#039;perth&#039;, &#039;hobart&#039;]
stopwords += [&#039;died&#039;, &#039;born&#039;, &#039;life&#039;, &#039;lived&#039;, &#039;married&#039;, &#039;father&#039;, &#039;wife&#039;, &#039;children&#039;, &#039;son&#039;, &#039;sons&#039;, &#039;daughter&#039;, &#039;daughters&#039;, &#039;brother&#039;, &#039;brothers&#039;]
stopwords += [&#039;street&#039;, &#039;st&#039;, &#039;year&#039;, &#039;years&#039;, &#039;months&#039;, &#039;acre&#039;, &#039;acres&#039;, &#039;ha&#039;]
stopwords += [&#039;e&#039;, &#039;m&#039;, &#039;b&#039;, &#039;c&#039;, &#039;w&#039;, &#039;j&#039;, &#039;d&#039;, &#039;n&#039;, &#039;f&#039;, &#039;g&#039;, &#039;h&#039;, &#039;i&#039;, &#039;ii&#039;, &#039;l&#039;, &#039;o&#039;, &#039;p&#039;, &#039;th&#039;, &#039;r&#039;, &#039;t&#039;, &#039;u&#039;, &#039;r&#039;, &#039;nd&#039;]
</pre></pre>
<p>The first two lines should be pretty obvious. As you can see, I originally excluded names of the capital cities, but then realised that you could watch Sydney and Melbourne battle it out for pre-eminence, so I excluded the exclusion. Also out were family relations and various other words that turned up in almost every article. Cleaning out all the non-alphabetical characters from the text had left a lot of orphaned letters that had once been things like £ signs, so I had to dispose of them as well.</p>
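<p>To see where the orphaned letters come from, here is a small sketch of the clean-up step. It assumes the text is simply split on anything that isn&#8217;t a letter; the helper names and the sample sentence are invented for the illustration, and the real <code>dh</code> functions may well work differently.</p>

```python
import re

def strip_non_alpha(text):
    """Split on anything that is not a letter (assumed behaviour, for illustration)."""
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]

def remove_stopwords(words, stopwords):
    """Drop every word that appears in the stopword list."""
    stopset = set(stopwords)  # set membership is faster than scanning a list
    return [w for w in words if w not in stopset]

sample = "J. B. Smith paid 10s. 6d. for 40 acres in Sydney"
words = strip_non_alpha(sample)
print(words)  # initials and currency abbreviations survive as single letters
print(remove_stopwords(words, ["j", "b", "s", "d", "for", "in", "acres"]))
```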
<p>The modules for actually generating the clouds were mostly just copied from PH with a few minor changes. My complete script is here:</p>
<pre><pre class="brush: python">
# adb-text-count.py

import urllib2
import dh, os, sys, time
from BeautifulSoup import BeautifulSoup
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
filelist = dh.getFileNames(dir)

f = open(&#039;wordlist.txt&#039;, &#039;w&#039;)
for file in filelist:
    print &#039;Processing &#039; + file
    sys.stdout.flush()
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)
    text = dh.stripTags(str(content[0]))
    fullwordlist = dh.stripNonAlpha(text.lower())
    wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
    # trailing space stops the last word of one article running into the first word of the next
    f.write(&quot; &quot;.join(wordlist) + &quot; &quot;)
f.close()
f = open(&#039;wordlist.txt&#039;)
words = f.read()
f.close()
wordlist = words.split()
dictionary = dh.wordListToFreqDict2(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
f = open(&#039;wordfreqs.txt&#039;, &#039;w&#039;)
for s in sorteddict: f.write(str(s)+&quot;\n&quot;)
f.close()
print &#039;Dictionary created&#039;
sys.stdout.flush()
# create tag cloud and open in Firefox
cloudsize = 200
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize][0]
freqrange = maxfreq - minfreq
outstring = &#039;&#039;
resorteddict = dh.reSortFreqDictAlpha(sorteddict[:cloudsize])
print &#039;Creating cloud&#039;
sys.stdout.flush()
for k in resorteddict:
    kfreq = k[0]
    klabel = k[1]
    klabel = dh.undecoratedHyperlink(&#039;http://adbonline.anu.edu.au/scripts/adbp-ent_search.php?ranktext=&#039; + k[1] + &#039;&amp;amp;search=Go!&#039;, k[1])
    scalingfactor = (kfreq - minfreq) / float(freqrange)
    outstring += &#039; &#039; + dh.scaledFontSizeSpan(klabel, scalingfactor) + &#039; &#039;
dh.wrapStringInHTML(&quot;html-to-tag-cloud&quot;, dh.defaultCSSDiv(outstring), &quot;Complete&quot;)
finish = time.time()
print &quot;Finished at: &quot;, time.asctime(time.localtime(finish))
print &quot;Total time: &quot;, finish - start
</pre></pre>
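<p>The scaling step in the script above maps each word&#8217;s frequency onto a range from 0 to 1. How <code>dh.scaledFontSizeSpan</code> turns that into a font size isn&#8217;t shown here, so the sketch below assumes a simple linear mapping; the pixel sizes, the function names and the sample frequencies are all hypothetical.</p>

```python
def scaling_factor(freq, minfreq, maxfreq):
    """Map a frequency onto the range 0.0 to 1.0, as in the script above."""
    freqrange = maxfreq - minfreq
    if freqrange == 0:
        return 1.0  # every word is equally common; avoid dividing by zero
    return (freq - minfreq) / float(freqrange)

def font_size(factor, smallest=10, largest=60):
    """Hypothetical linear mapping from scaling factor to a pixel size."""
    return int(round(smallest + factor * (largest - smallest)))

# Made-up frequencies, just to show the spread of sizes
for word, freq in [("john", 900), ("sydney", 500), ("land", 100)]:
    print("%s: %dpx" % (word, font_size(scaling_factor(freq, 100, 900))))
```

<p>The guard against a zero range only matters for very small clouds, where every word in the cloud might have the same count.</p>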
<h3>Biographies in 3D</h3>
<p>To display all the portrait images in CoolIris I had to harvest all the image details and then write them to a Media RSS file for CoolIris to read.</p>
<p>Extracting the details of all the thumbnail versions of the portraits in the ADB was easy using Beautiful Soup. But I also needed the paths to the larger versions of the portraits stored on the sites of the repositories that hold the originals. All of these sites present the images differently, so a different scraper was needed for each of them. As yet I&#8217;ve only included major libraries and archives – I may add some more if I get the time.</p>
<p>Once the paths to the thumbnails and large versions had been harvested, it was just a matter of writing the RSS feed. Actually, I created a series of RSS files, one for each volume, linked using &#8216;rel=previous&#8217; and &#8216;rel=next&#8217; attributes. This helped speed up the loading of the gallery. For what it&#8217;s worth, the complete code is here:</p>
<pre><pre class="brush: python">
# adb-portraits.py

import socket, urllib2, urllib
import dh, os, sys, time, re
from BeautifulSoup import BeautifulSoup
# timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
for i in range(8,18):
    if (i == 17): vol = &quot;AS1&quot;
    else: vol = &quot;A%02d&quot; % i
    filelist = dh.getFileNamesByVol2(dir, vol)
    f = open(&#039;adb-portraits-%s.rss&#039; % i, &#039;w&#039;)
    f.write(&quot;&lt;?xml version=&#039;1.0&#039; encoding=&#039;utf-8&#039; standalone=&#039;yes&#039;?&gt;\n&quot;)
    f.write(&quot;&lt;rss version=&#039;2.0&#039; xmlns:media=&#039;http://search.yahoo.com/mrss/&#039; xmlns:atom=&#039;http://www.w3.org/2005/Atom&#039;&gt;\n&quot;)
    f.write(&quot;&lt;channel&gt;\n&quot;)
    f.write(&quot;\n&quot;)
    f.write(&quot;&lt;description&gt;Portraits of individuals included in the Australian Dictionary of Biography&lt;/description&gt;\n&quot;)
    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au&lt;/link&gt;\n&quot;)
    if (i &gt; 1):
        f.write (&quot;&lt;atom:link rel=&#039;previous&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i-1))
    if (i &lt; 17):
        f.write (&quot;&lt;atom:link rel=&#039;next&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i+1))
    for file in filelist:
        print str(file)
        sys.stdout.flush()
        g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
        html = g.read()
        g.close()
        #print html
        sys.stdout.flush()
        soup = BeautifulSoup(html)
        imagediv = soup.findAll(id=&quot;imagebox&quot;)
        if len(imagediv) &gt; 0 :
            print &quot;Found an image&quot;
            sys.stdout.flush()
            links = imagediv[0].findAll(&#039;a&#039;)
            if len(links) &gt; 1:
                link = urllib.unquote(links[(len(links)-1)][&#039;href&#039;][31:])
            else:
                link = urllib.unquote(links[0][&#039;href&#039;][31:])
            print link
            sys.stdout.flush()
            try:
                response = urllib2.urlopen(link)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                id = str(file)[:7]
                thumbnail = &#039;http://www.adb.online.anu.edu.au&#039; + imagediv[0].img[&#039;src&#039;].lstrip(&#039;.&#039;)
                # print thumbnail
                title = imagediv[0].p.contents[0].split(&#039;,&#039;)[0].strip().replace(&#039; - &#039;, &#039;-&#039;)
                title = title.encode(&#039;utf-8&#039;)
                print &quot;Processing: &quot; + title
                sys.stdout.flush()
                html = response.read()
                imgsoup = BeautifulSoup(html)
                img = &quot;&quot;  # default, so the length check below is safe when no repository matches
                if (link.find(&#039;sl.nsw&#039;) &gt; -1):
                    if (link.find(&#039;ebindshow.pl&#039;) == -1): # Not thumbnail pages - see John Bingle
                        if (html.find(&#039;Higher quality image&#039;) != -1):
                            img = imgsoup.findAll(alt=&quot;Higher quality image&quot;)[0].parent[&#039;href&#039;].split(&#039;?&#039;)[1]
                            #img = imgsoup.td.a[&#039;href&#039;].split(&#039;?&#039;)[1]
                        else:
                            img = imgsoup.table.findAll(&#039;tr&#039;)[2].img[&#039;src&#039;]
                        repository = &quot;State Library of NSW&quot;
                elif (link.find(&#039;slv.vic&#039;) &gt; -1):
                    img = imgsoup.findAll(id=&#039;ImageDisplay&#039;)[0].img[&#039;src&#039;]
                    repository = &quot;State Library of Victoria&quot;
                elif (link.find(&#039;slsa.sa&#039;) &gt; -1):
                    img = imgsoup.findAll(&#039;td&#039;)[1].img[&#039;src&#039;]
                    img = link[:link.rfind(&#039;/&#039;)+1] + img
                    repository = &quot;State Library of SA&quot;
                elif (link.find(&#039;nla.gov&#039;) &gt; -1):
                    img = link + &#039;-v&#039;
                    repository = &quot;National Library of Australia&quot;
                elif (link.find(&#039;naa.gov&#039;) &gt; -1):
                    barcode = link[link.rfind(&#039;=&#039;)+1:]
                    img = &quot;http://naa16.naa.gov.au/rs_images/ShowImage.php?B=%s&amp;#038;T=P&quot; % barcode
                    repository = &quot;National Archives of Australia&quot;
                elif (link.find(&#039;territorystories.nt.gov&#039;) &gt; -1):
                    img = imgsoup.table.img[&#039;src&#039;]
                    repository = &quot;Northern Territory Library&quot;
                elif (link.find(&#039;statelibrary.tas.gov&#039;) &gt; -1):
                    if (html.find(&#039;No matches were found&#039;) == -1):
                        img =imgsoup.blockquote.img[&#039;src&#039;]
                        repository = &quot;State Library of Tasmania&quot;
                elif (link.find(&#039;slq.qld.gov&#039;) &gt; -1):
                    img = imgsoup.findAll(attrs={&quot;class&quot;:&quot;pictureback&quot;})[0].a[&#039;onclick&#039;]
                    #img = img[img.find(&#039;http&#039;):img.find(]
                    img = re.search(&#039;http://[\w\d\/\.]*.jpg&#039;, img).group()
                    repository = &quot;State Library of Queensland&quot;
                if (len(img) &gt; 0):
                    f.write(&quot;&lt;item&gt;\n&quot;)
                    f.write(&quot;&lt;guid isPermaLink=&#039;false&#039;&gt;%s&lt;/guid&gt;\n&quot; % id)
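                    # NOTE: the markup on this line was lost when the post was published; a title element is a reconstruction
                    f.write(&quot;&lt;title&gt;%s (%s)&lt;/title&gt;\n&quot; % (title, repository))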
                    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au/biogs/%sb.htm&lt;/link&gt;\n&quot; % id)
                    f.write(&quot;&lt;media:thumbnail url=&#039;%s&#039; /&gt;\n&quot; % thumbnail.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;media:content url=&#039;%s&#039; type=&#039;image/jpeg&#039; /&gt;\n&quot; % img.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;/item&gt;\n&quot;)
                    f.flush()
                    print &quot;Success!&quot;
                    sys.stdout.flush()
                img = &quot;&quot;
    f.write(&quot;&lt;/channel&gt;\n&quot;)
    f.write(&quot;&lt;/rss&gt;\n&quot;)
    f.close()
</pre></pre>
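<p>Writing the feed with <code>f.write</code> means escaping ampersands by hand. An alternative is to let an XML library do the escaping. The sketch below uses <code>xml.etree.ElementTree</code> to build one volume&#8217;s feed with the same previous/next links. It is written for Python 3, unlike the script above, and <code>build_feed</code>, its parameters and the sample values are invented for the example.</p>

```python
import xml.etree.ElementTree as ET

MEDIA = "http://search.yahoo.com/mrss/"
ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("media", MEDIA)
ET.register_namespace("atom", ATOM)

def build_feed(vol, items, first=1, last=17):
    """Build one volume's feed; items are (ident, title, thumbnail, image) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "link").text = "http://www.adb.online.anu.edu.au"
    # Chain the per-volume files together so the gallery loads them in turn
    if vol != first:
        ET.SubElement(channel, "{%s}link" % ATOM, rel="previous",
                      href="adb-portraits-%s.rss" % (vol - 1))
    if vol != last:
        ET.SubElement(channel, "{%s}link" % ATOM, rel="next",
                      href="adb-portraits-%s.rss" % (vol + 1))
    for ident, title, thumbnail, image in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "guid", isPermaLink="false").text = ident
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "{%s}thumbnail" % MEDIA, url=thumbnail)
        ET.SubElement(item, "{%s}content" % MEDIA, url=image, type="image/jpeg")
    return ET.tostring(rss, encoding="unicode")

# Sample values only; the ampersand in the image URL is escaped automatically
sample = [("A090001", "Sample Person", "http://example.org/thumb.jpg",
           "http://example.org/show.php?B=1&T=P")]
print(build_feed(9, sample))
```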
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

