<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>discontents &#187; ADB Online</title>
	<atom:link href="http://discontents.com.au/tag/adb-online/feed" rel="self" type="application/rss+xml" />
	<link>http://discontents.com.au</link>
	<description>working for the triumph of content over form, ideas over control, people over systems</description>
	<lastBuildDate>Tue, 24 Jan 2012 20:57:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>ADB DIY RSS</title>
		<link>http://discontents.com.au/shed/hacks/adb-diy-rss</link>
		<comments>http://discontents.com.au/shed/hacks/adb-diy-rss#comments</comments>
		<pubDate>Wed, 04 Feb 2009 06:34:24 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[hacks]]></category>
		<category><![CDATA[ADB Online]]></category>
		<category><![CDATA[birthdays]]></category>
		<category><![CDATA[rss]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=653</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=ADB+DIY+RSS&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2009-02-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/adb-diy-rss&amp;rft.language=English"></span>
So I was thinking, wouldn&#8217;t it be nice if the Australian Dictionary of Biography&#8217;s &#8216;born on this day&#8217; feature could be made available as an RSS feed. Every morning you&#8217;d get a new list of biographies delivered direct to your feed reader. And so&#8230; [sounds of xpath wrangling and PHP coding] here it is. It&#8217;s [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=ADB+DIY+RSS&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=hacks&amp;rft.source=discontents&amp;rft.date=2009-02-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/hacks/adb-diy-rss&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=653"><!-- &nbsp; --></abbr>
<p>So I was thinking, wouldn&#8217;t it be nice if the <em>Australian Dictionary of Biography</em>&#8217;s &#8216;<a href="http://www.adb.online.anu.edu.au/scripts/adbp-births-deaths.php">born on this day</a>&#8217; feature could be made available as an RSS feed. Every morning you&#8217;d get a new list of biographies delivered direct to your feed reader. And so&#8230;</p>
<p>[sounds of xpath wrangling and PHP coding]</p>
<p><a href="http://discontents.com.au/shed/adb/born-rss.php">here it is</a>.</p>
<p>It&#8217;s pretty simple – it harvests all the links of people born on the current day, then loops through the links to gather the first paragraph of each biography. Then it&#8217;s just a matter of writing everything to an RSS file.<span id="more-653"></span></p>
<p>In case you missed it, I also created a <a href="http://discontents.com.au/shed/adb/portraits/adb-portraits-1.rss">Media RSS feed</a> for portrait images used in the ADB. This enables them to be <a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html">viewed in CoolIris</a>.</p>
<p>Code follows&#8230;</p>
<pre><pre class="brush: php">
&lt;?php
function getPage($url, $ch) {
	curl_setopt($ch, CURLOPT_URL,$url);
	$html= curl_exec($ch);
	if (!$html) {
		echo &quot;cURL error number:&quot; .curl_errno($ch);
		echo &quot;cURL error:&quot; . curl_error($ch);
		exit;
	}
	return $html;
}
$url = &quot;http://www.adb.online.anu.edu.au/scripts/adbp-births-deaths.php&quot;;
$userAgent = &#039;Googlebot/2.1 (http://www.googlebot.com/bot.html)&#039;;

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = getPage($url, $ch);

$dom = new DOMDocument();
@$dom-&gt;loadHTML($html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath-&gt;evaluate(&quot;//ul[@class=&#039;pb-results&#039;][1]/li/a&quot;);
$titles = $xpath-&gt;evaluate(&quot;//ul[@class=&#039;pb-results&#039;][1]/li/a/text()&quot;);

echo &quot;&lt;?xml version=&#039;1.0&#039;?&gt;\n&quot;;
echo &quot;&lt;rss version=&#039;2.0&#039;&gt;\n&quot;;
echo &quot;&lt;channel&gt;\n&quot;;
echo &quot;\n&quot;;
echo &quot;&lt;link&gt;http://www.adb.online.anu.edu.au/scripts/adbp-births-deaths.php&lt;/link&gt;\n&quot;;
echo &quot;&lt;description&gt;A list of all those people in the Australian Dictionary of Biography who were born on this day.&lt;/description&gt;\n&quot;;
for ($i = 0; $i &lt; $hrefs-&gt;length; $i++) {
	$href = $hrefs-&gt;item($i);
	$title = $href-&gt;nodeValue;
	$bio = &quot;&quot;;
	$url = &quot;http://www.adb.online.anu.edu.au&quot; . substr($href-&gt;getAttribute(&#039;href&#039;),2);
	$html = getPage($url, $ch);
	$dom = new DOMDocument();
	@$dom-&gt;loadHTML($html);
	$xpath = new DOMXPath($dom);
	$paras = $xpath-&gt;evaluate(&quot;//div[@id=&#039;content&#039;]/p[1]/text()&quot;);
	foreach ($paras as $para) {
		$bio .= $para-&gt;nodeValue;
	}
	$bio .= &quot;...&quot;;
	$bio = htmlspecialchars($bio, ENT_QUOTES);
	// double quotes are needed here: in single quotes PHP treats \n as a literal backslash and n
	$bio = str_replace(&quot;\n&quot;, &#039;&#039;, $bio);
	echo &quot;&lt;item&gt;\n&quot;;
	echo &quot;&lt;title&gt;&quot; . htmlspecialchars($title, ENT_QUOTES) . &quot;&lt;/title&gt;\n&quot;;
	echo &quot;&lt;link&gt;$url&lt;/link&gt;\n&quot;;
	echo &quot;&lt;description&gt;$bio&lt;/description&gt;\n&quot;;
	echo &quot;&lt;/item&gt;\n&quot;;
}
echo &quot;&lt;/channel&gt;\n&quot;;
echo &quot;&lt;/rss&gt;\n&quot;;
?&gt;
</pre></pre>
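<p>Writing the feed with echo statements works, but every value has to be escaped by hand. As a rough alternative, here is a minimal Python 3 sketch that builds the same kind of RSS 2.0 document with the standard library, which does the escaping when it serialises. It is not the code behind the feed above; <code>build_feed</code> and its arguments are names made up for this illustration.</p>

```python
# Python 3 sketch; build_feed and its arguments are hypothetical names.
import xml.etree.ElementTree as ET


def build_feed(channel_title, channel_link, channel_description, entries):
    """Return an RSS 2.0 document as a string.

    entries is a list of (title, link, description) tuples.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for title, link, description in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "description").text = description
    # ElementTree escapes ampersands and angle brackets in text nodes
    # when it serialises the tree.
    return ET.tostring(rss, encoding="unicode")
```

<p>Because the tree is closed automatically, there is no way to forget a closing tag at the end of the document.</p>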
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/hacks/adb-diy-rss/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudy biographies and portrait walls</title>
		<link>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls</link>
		<comments>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls#comments</comments>
		<pubDate>Sat, 24 Jan 2009 08:26:12 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[ADB Online]]></category>
		<category><![CDATA[biographies]]></category>
		<category><![CDATA[Cooliris]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[word clouds]]></category>

		<guid isPermaLink="false">http://discontents.com.au/?p=409</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
With a bit of time to play over Christmas I had a go at applying some of the techniques described at ProgrammingHistorian to the ADB Online.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Cloudy+biographies+and+portrait+walls&amp;rft.aulast=Sherratt&amp;rft.aufirst=Tim&amp;rft.subject=experiments&amp;rft.source=discontents&amp;rft.date=2009-01-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls&amp;rft.language=English"></span>
<abbr class="unapi-id" title="http://discontents.com.au/?p=409"><!-- &nbsp; --></abbr>
<p>With a bit of time to play over Christmas I had a go at applying some of the techniques described at <a href="http://niche.uwo.ca/programming-historian/index.php"><em>ProgrammingHistorian</em></a> to the <a href="http://www.adb.online.anu.edu.au/adbonline.htm">ADB Online</a>.  I thought it might be interesting to create some word clouds, both for what they could reveal about the content of the ADB, and to see what they had to offer as a way of improving access to the articles.</p>
<p>So I set about learning Python and was soon downloading and scraping the more than 10,000 articles that make up the ADB online.</p>
<p>My first tests revealed that the most frequent words in ADB articles were&#8230;</p>
<p style="text-align: center;"><strong>born</strong> and <strong>died</strong></p>
<p style="text-align: left;">Who&#8217;d have thought it? In a biographical dictionary?</p>
<p style="text-align: left;">After further refining the stopwords list I started to generate some useful clouds. Finally after 147 minutes of processing time, I had a <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html">word cloud</a> representing the content of all 16 volumes of the <em>Australian Dictionary of Biography</em>.</p>
<p style="text-align: left;">
<div id="attachment_559" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-complete.html"><img class="size-medium wp-image-559" title="adb-cloud-complete" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-complete-300x195.jpg" alt="The complete ADB word cloud" width="300" height="195" /></a><p class="wp-caption-text">The complete ADB word cloud</p></div>
<p style="text-align: left;"><span id="more-409"></span>The words in the cloud are linked back to the ADB&#8217;s own search engine, allowing the cloud to be used as a way of exploring the articles themselves.</p>
<p style="text-align: left;">It shows the top 200 words, but if you want to see the rest you can download the <a href="http://discontents.com.au/shed/adb/clouds/wordfreqs.txt">raw word frequency file</a> (&gt;1 MB text file).</p>
<p style="text-align: left;">What can you see? Amongst other things, John is obviously the most popular name, Sydney just edges out Melbourne as the most popular place, and burial beats cremation as the most common mode of dispatch. It&#8217;s fun to explore.</p>
<p style="text-align: left;">But of course this then set me wondering about how these frequencies might change with the development of the ADB and changes in its subjects. So I generated word clouds for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html">each volume</a> and for <a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html">each chronological series</a>.</p>
<p style="text-align: left;">
<div id="attachment_563" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-vols.html"><img class="size-medium wp-image-563" title="adb-cloud-volumes" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-volumes-300x195.jpg" alt="Word clouds by volume" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by volume</p></div>
<div id="attachment_564" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/clouds/adb-word-clouds-series.html"><img class="size-medium wp-image-564" title="adb-cloud-series" src="http://discontents.com.au/wp-content/uploads/2009/01/adb-cloud-series-300x195.jpg" alt="Word clouds by series" width="300" height="195" /></a><p class="wp-caption-text">Word clouds by series</p></div>
<p style="text-align: left;">
<p>I even added some simple Javascript slideshows so you could watch the clouds evolve.</p>
<p>One of the most obvious features in the series clouds is the gradual disappearance of &#8216;land&#8217;. It&#8217;s one of the most prominent words in the first series, but gradually fades until it disappears completely in the last.</p>
<p>After this successful foray into the world of word clouds, I began to think about other ways of visualising the ADB&#8217;s content. Many of the articles have portrait images; wouldn&#8217;t it be interesting to use the images themselves as the entry point to the biographical articles?</p>
<p>I&#8217;d already been <a href="http://discontents.com.au/shoebox/archives-shoebox/archives-in-3d">playing with CoolIris</a>, so I decided to harvest all the portrait references and use them to create a 3D wall. The <a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html">result is pretty spectacular</a>.</p>
<div id="attachment_569" class="wp-caption aligncenter" style="width: 310px"><a href="http://discontents.com.au/shed/adb/portraits/adb-portrait-browser.html"><img class="size-medium wp-image-569" title="gallery" src="http://discontents.com.au/wp-content/uploads/2009/01/gallery-300x66.jpg" alt="ADB portrait browser" width="300" height="66" /></a><p class="wp-caption-text">ADB portrait browser</p></div>
<p>Some technical details about the clouds and the portrait browser follow, for those interested in such things&#8230;</p>
<h3>Gathering your words</h3>
<p>Conveniently for me, <em>ProgrammingHistorian</em> uses the <em>Dictionary of Canadian Biography</em> as its main example, so there was much code that I could <span style="text-decoration: line-through;">just cut and paste</span> carefully examine and utilise.  As the examples show, it&#8217;s easy to grab a webpage and analyse its content on the fly. But I wanted to process more than 10,000 pages and I knew that I was unlikely to get it working the first time round, so I decided to download the files first and then work on them locally. PH provided a basic example, to which I added some error-handling and the necessary loops to cycle through the ADB files. Because I had a bit of inside knowledge I cheated and hard-coded the number of articles in each volume. If I hadn&#8217;t known this I would have had to scrape all the browse pages, pulling out the links and creating a list of individual ids – not hard, but a bit tedious. Anyway this is how it ended up:</p>
<pre><pre class="brush: python">
# download_adb.py

import urllib2, time, os, sys
import dh
items = (565, 575, 607, 526, 614, 533, 543, 723, 737, 742, 737, 759, 755, 721, 703, 714, 694, 126)
if os.path.exists(&#039;adb&#039;) == 0: os.mkdir(&#039;adb&#039;)

for v in range(0,18):
    for i in range (1,(items[v]+1)):
        if v == 0:
            filename = &#039;AS1%04db.htm&#039; % i
        else:
            filename = &#039;A%02d%04db.htm&#039; % (v, i)
        if os.path.isfile(&#039;adb/&#039; + filename) == 0:
            print &#039;Processing: &#039; + filename
            url = &#039;http://adbonline.anu.edu.au/biogs/&#039; + filename
            try:
                response = urllib2.urlopen(url)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                html = response.read()
                f = open(&#039;adb/&#039; + filename, &#039;w&#039;)
                f.write(html)
                f.close()
                time.sleep(2)
        else:
            print &quot;File already downloaded&quot;
        sys.stdout.flush()</pre></pre>
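<p>The file naming is the only fiddly part of the download loop, so it may help to see it on its own. This is a small Python 3 sketch (the script above is Python 2) that mirrors the same naming scheme; <code>adb_filenames</code> is a name invented for this example, and the counts in the usage line are toy values rather than the real volume sizes.</p>

```python
def adb_filenames(items):
    """Yield ADB article filenames, volume by volume.

    items[0] is the article count for the supplementary volume (prefix AS1);
    items[v] is the count for volume v (prefix A01, A02, ...).
    """
    for v, count in enumerate(items):
        for i in range(1, count + 1):
            if v == 0:
                yield "AS1%04db.htm" % i
            else:
                yield "A%02d%04db.htm" % (v, i)


# Toy counts for illustration only.
for name in adb_filenames((2, 1)):
    print(name)
```

<p>Keeping the naming in one function means the download loop only has to decide whether a file already exists and, if not, fetch it.</p>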
<h3>Learning to count</h3>
<p>Before too long I had a directory full of about 11,000 little html files just waiting for me to begin my evil experiments. First I had to slice them up and pull out all the interesting bits. By examining the code of the pages I could see that the main content was inside a div with the id of &#8216;content&#8217;. Using the Beautiful Soup Python library, I was easily able to extract this div. But the content div also usually included a portrait image and a bibliography. Once again I dipped into Beautiful Soup to discard all the unwanted bits. The slicing and dicing went something like this:</p>
<pre><pre class="brush: python">
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)</pre></pre>
<p>Now I had the text of the article to play with. Following the PH examples it wasn&#8217;t long before I could extract word-frequency tables from a few files at a time. However, when I tried to process all the articles from a particular volume it took a verrry long time. I fiddled a bit with the code and amazed myself by dramatically improving the performance. I replaced the <em>wordListToFreqDict</em> function provided by PH with my own modified version:</p>
<pre><pre class="brush: python">
def wordListToFreqDict2(wordlist):
    worddict = dict.fromkeys(wordlist)
    wordfreq = [wordlist.count(p) for p in worddict.keys()]
    return dict(zip(worddict,wordfreq))
</pre></pre>
<p>The <code>worddict = dict.fromkeys(wordlist)</code> line made all the difference, creating a dictionary keyed by the unique words, each of which is then counted once against the full word list.  With this hack in place I was able to process a complete volume in a few minutes.</p>
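<p>Counting each unique word with <code>wordlist.count()</code> still scans the whole list once per unique word. A different technique, not the one used for these clouds, is to tally everything in a single pass with <code>collections.Counter</code> from the standard library. A minimal Python 3 sketch, with <code>word_list_to_freq_dict</code> as a made-up name:</p>

```python
from collections import Counter


def word_list_to_freq_dict(wordlist):
    """Count every word in a single pass over the list."""
    return dict(Counter(wordlist))


freqs = word_list_to_freq_dict(["born", "died", "born"])
print(freqs["born"], freqs["died"])
```

<p>For a word list the size of the whole ADB the single pass should matter a good deal, though I have not timed it against the version above.</p>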
<p>I was already using a list of stopwords provided by PH to exclude things such as &#8216;such&#8217; , &#8216;as&#8217; and &#8216;and&#8217;, but obviously a few additions were necessary. To the list of stopwords I added:</p>
<pre><pre class="brush: python">
stopwords += [&#039;january&#039;, &#039;february&#039;, &#039;march&#039;, &#039;april&#039;, &#039;may&#039;, &#039;june&#039;, &#039;july&#039;, &#039;august&#039;, &#039;september&#039;, &#039;october&#039;, &#039;november&#039;, &#039;december&#039;]
stopwords += [&#039;new&#039;, &#039;south&#039;, &#039;wales&#039;, &#039;australia&#039;, &#039;australian&#039;, &#039;victoria&#039;, &#039;south&#039;, &#039;western&#039;, &#039;queensland&#039;, &#039;tasmania&#039;]
#stopwords += [&#039;sydney&#039;, &#039;melbourne&#039;, &#039;brisbane&#039;, &#039;adelaide&#039;, &#039;perth&#039;, &#039;hobart&#039;]
stopwords += [&#039;died&#039;, &#039;born&#039;, &#039;life&#039;, &#039;lived&#039;, &#039;married&#039;, &#039;father&#039;, &#039;wife&#039;, &#039;children&#039;, &#039;son&#039;, &#039;sons&#039;, &#039;daughter&#039;, &#039;daughters&#039;, &#039;brother&#039;, &#039;brothers&#039;]
stopwords += [&#039;street&#039;, &#039;st&#039;, &#039;year&#039;, &#039;years&#039;, &#039;months&#039;, &#039;acre&#039;, &#039;acres&#039;, &#039;ha&#039;]
stopwords += [&#039;e&#039;, &#039;m&#039;, &#039;b&#039;, &#039;c&#039;, &#039;w&#039;, &#039;j&#039;, &#039;d&#039;, &#039;n&#039;, &#039;f&#039;, &#039;g&#039;, &#039;h&#039;, &#039;i&#039;, &#039;ii&#039;, &#039;l&#039;, &#039;o&#039;, &#039;p&#039;, &#039;th&#039;, &#039;r&#039;, &#039;t&#039;, &#039;u&#039;, &#039;r&#039;, &#039;nd&#039;]
</pre></pre>
<p>The first two lines should be pretty obvious. As you can see, I originally excluded names of the capital cities, but then realised that you could watch Sydney and Melbourne battle it out for pre-eminence, so I excluded the exclusion. Also out were family relations and various other words that turned up in almost every article. Cleaning out all the non-alphabetical characters from the text had left a lot of orphaned letters that had once been things like £ signs, so I had to dispose of them as well.</p>
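<p>As the stopword list grows, it is worth checking words against a set rather than a list. Here is a minimal Python 3 sketch of that idea; <code>remove_stopwords</code> is a stand-in written for this example and is not the PH function of a similar name.</p>

```python
def remove_stopwords(wordlist, stopwords):
    """Return the words that are not stopwords, preserving their order."""
    # A set makes each membership test fast and ignores duplicate stopwords.
    stopset = set(stopwords)
    return [w for w in wordlist if w not in stopset]


print(remove_stopwords(["born", "in", "sydney", "st"], ["born", "st", "st"]))
```

<p>Duplicates in the stopword list, such as a word added twice by accident, make no difference once it has been turned into a set.</p>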
<p>The modules for actually generating the clouds were mostly just copied from PH with a few minor changes. My complete script is here:</p>
<pre><pre class="brush: python">
# adb-text-count.py

import urllib2
import dh, os, sys, time
from BeautifulSoup import BeautifulSoup
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
filelist = dh.getFileNames(dir)

f = open(&#039;wordlist.txt&#039;, &#039;w&#039;)
for file in filelist:
    print &#039;Processing &#039; + file
    sys.stdout.flush()
    g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
    html = g.read()
    g.close()
    soup = BeautifulSoup(html)
    imagediv = soup.findAll(id=&quot;imagebox&quot;)
    if len(imagediv) &gt; 0 :
        imagediv[0].extract()
    heading = soup.findAll(&#039;h4&#039;)
    if len(heading) &gt; 0:
        heading[0].extract()
    footer = soup.findAll(id=&quot;selectbib&quot;)
    paras = footer[0].findNextSiblings(&#039;p&#039;)
    for para in paras:
        para.extract()
    footer[0].extract()
    content = soup.findAll(id=&quot;content&quot;)
    text = dh.stripTags(str(content[0]))
    fullwordlist = dh.stripNonAlpha(text.lower())
    wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
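    # the trailing space keeps the last word of one article apart from the first word of the next
    f.write(&quot; &quot;.join(wordlist) + &quot; &quot;)
f.close()
f = open(&#039;wordlist.txt&#039;)
words = f.read()
f.close()
# split() with no argument ignores the trailing space
wordlist = words.split()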
dictionary = dh.wordListToFreqDict2(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
f = open(&#039;wordfreqs.txt&#039;, &#039;w&#039;)
for s in sorteddict: f.write(str(s)+&quot;\n&quot;)
f.close()
print &#039;Dictionary created&#039;
sys.stdout.flush()
# create tag cloud and open in Firefox
cloudsize = 200
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize][0]
freqrange = maxfreq - minfreq
outstring = &#039;&#039;
resorteddict = dh.reSortFreqDictAlpha(sorteddict[:cloudsize])
print &#039;Creating cloud&#039;
sys.stdout.flush()
for k in resorteddict:
    kfreq = k[0]
    klabel = k[1]
    klabel = dh.undecoratedHyperlink(&#039;http://adbonline.anu.edu.au/scripts/adbp-ent_search.php?ranktext=&#039; + k[1] + &#039;&amp;amp;search=Go!&#039;, k[1])
    scalingfactor = (kfreq - minfreq) / float(freqrange)
    outstring += &#039; &#039; + dh.scaledFontSizeSpan(klabel, scalingfactor) + &#039; &#039;
dh.wrapStringInHTML(&quot;html-to-tag-cloud&quot;, dh.defaultCSSDiv(outstring), &quot;Complete&quot;)
finish = time.time()
print &quot;Finished at: &quot;, time.asctime(time.localtime(finish))
print &quot;Total time: &quot;, finish - start
</pre></pre>
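<p>The scaling step in the loop above is what turns a raw count into a font size. A minimal Python 3 sketch of that arithmetic follows. The function names and the 100 to 400 per cent size range are assumptions made for illustration; they are not taken from the PH module.</p>

```python
def scaling_factor(freq, minfreq, maxfreq):
    """Map a frequency onto the range 0.0 to 1.0 relative to the cloud."""
    freqrange = maxfreq - minfreq
    if freqrange == 0:
        # Every word has the same frequency; avoid dividing by zero.
        return 0.0
    return (freq - minfreq) / float(freqrange)


def font_size_percent(factor, smallest=100, largest=400):
    """Turn a 0.0 to 1.0 factor into a CSS font size in per cent (assumed range)."""
    return int(round(smallest + factor * (largest - smallest)))


print(font_size_percent(scaling_factor(50, 0, 100)))
```

<p>The guard for a zero range matters for small clouds, where the most and least frequent words can share the same count.</p>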
<h3>Biographies in 3D</h3>
<p>To display all the portrait images in CoolIris I had to harvest all the image details and then write them to a Media RSS file for CoolIris to read.</p>
<p>Extracting the details of all the thumbnail versions of the portraits in the ADB was easy using Beautiful Soup. But I also needed the paths to the larger versions of the portraits stored on the sites of the repositories that hold the originals. All of these sites present the images differently, so a different scraper was needed for each of them. As yet I&#8217;ve only included major libraries and archives – I may add some more if I get the time.</p>
<p>Once the paths to the thumbnails and large versions had been harvested, it was just a matter of writing the RSS feed. Actually, I created a series of RSS files, one for each volume, linked using &#8216;rel=previous&#8217; and &#8216;rel=next&#8217; attributes. This helped speed up the loading of the gallery. For what it&#8217;s worth, the complete code is here:</p>
<pre><pre class="brush: python">
# adb-portraits.py

import socket, urllib2, urllib
import dh, os, sys, time, re
from BeautifulSoup import BeautifulSoup
# timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)
start = time.time()
print &quot;Started at: &quot;, time.asctime(time.localtime(start))
dir = &#039;adb&#039;
for i in range(8,18):
    if (i == 17): vol = &quot;AS1&quot;
    else: vol = &quot;A%02d&quot; % i
    filelist = dh.getFileNamesByVol2(dir, vol)
    f = open(&#039;adb-portraits-%s.rss&#039; % i, &#039;w&#039;)
    f.write(&quot;&lt;?xml version=&#039;1.0&#039; encoding=&#039;utf-8&#039; standalone=&#039;yes&#039;?&gt;\n&quot;)
    f.write(&quot;&lt;rss version=&#039;2.0&#039; xmlns:media=&#039;http://search.yahoo.com/mrss/&#039; xmlns:atom=&#039;http://www.w3.org/2005/Atom&#039;&gt;\n&quot;)
    f.write(&quot;&lt;channel&gt;\n&quot;)
    f.write(&quot;\n&quot;)
    f.write(&quot;&lt;description&gt;Portraits of individuals included in the Australian Dictionary of Biography&lt;/description&gt;\n&quot;)
    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au&lt;/link&gt;\n&quot;)
    if (i &gt; 1):
        f.write (&quot;&lt;atom:link rel=&#039;previous&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i-1))
    if (i &lt; 17):
        f.write (&quot;&lt;atom:link rel=&#039;next&#039; href=&#039;adb-portraits-%s.rss&#039; /&gt;&quot; % (i+1))
    for file in filelist:
        print str(file)
        sys.stdout.flush()
        g = open(dir + &#039;/&#039; + file, &#039;r&#039;)
        html = g.read()
        g.close()
        #print html
        sys.stdout.flush()
        soup = BeautifulSoup(html)
        imagediv = soup.findAll(id=&quot;imagebox&quot;)
        if len(imagediv) &gt; 0 :
            print &quot;Found an image&quot;
            sys.stdout.flush()
            links = imagediv[0].findAll(&#039;a&#039;)
            if len(links) &gt; 1:
                link = urllib.unquote(links[(len(links)-1)][&#039;href&#039;][31:])
            else:
                link = urllib.unquote(links[0][&#039;href&#039;][31:])
            print link
            sys.stdout.flush()
            try:
                response = urllib2.urlopen(link)
            except IOError, e:
                if hasattr(e, &#039;reason&#039;):
                    print &#039;We failed to reach a server.&#039;
                    print &#039;Reason: &#039;, e.reason
                elif hasattr(e, &#039;code&#039;):
                    print &#039;The server couldn\&#039;t fulfill the request.&#039;
                    print &#039;Error code: &#039;, e.code
            else:
                id = str(file)[:7]
                thumbnail = &#039;http://www.adb.online.anu.edu.au&#039; + imagediv[0].img[&#039;src&#039;].lstrip(&#039;.&#039;)
                # print thumbnail
                title = imagediv[0].p.contents[0].split(&#039;,&#039;)[0].strip().replace(&#039; - &#039;, &#039;-&#039;)
                title = title.encode(&#039;utf-8&#039;)
                print &quot;Processing: &quot; + title
                sys.stdout.flush()
                html = response.read()
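                imgsoup = BeautifulSoup(html)
                # start each file with empty values so that an unmatched repository is skipped rather than raising a NameError
                img = &quot;&quot;
                repository = &quot;&quot;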
                if (link.find(&#039;sl.nsw&#039;) &gt; -1):
                    if (link.find(&#039;ebindshow.pl&#039;) == -1): # Not thumbnail pages - see John Bingle
                        if (html.find(&#039;Higher quality image&#039;) != -1):
                            img = imgsoup.findAll(alt=&quot;Higher quality image&quot;)[0].parent[&#039;href&#039;].split(&#039;?&#039;)[1]
                            #img = imgsoup.td.a[&#039;href&#039;].split(&#039;?&#039;)[1]
                        else:
                            img = imgsoup.table.findAll(&#039;tr&#039;)[2].img[&#039;src&#039;]
                        repository = &quot;State Library of NSW&quot;
                elif (link.find(&#039;slv.vic&#039;) &gt; -1):
                    img = imgsoup.findAll(id=&#039;ImageDisplay&#039;)[0].img[&#039;src&#039;]
                    repository = &quot;State Library of Victoria&quot;
                elif (link.find(&#039;slsa.sa&#039;) &gt; -1):
                    img = imgsoup.findAll(&#039;td&#039;)[1].img[&#039;src&#039;]
                    img = link[:link.rfind(&#039;/&#039;)+1] + img
                    repository = &quot;State Library of SA&quot;
                elif (link.find(&#039;nla.gov&#039;) &gt; -1):
                    img = link + &#039;-v&#039;
                    repository = &quot;National Library of Australia&quot;
                elif (link.find(&#039;naa.gov&#039;) &gt; -1):
                    barcode = link[link.rfind(&#039;=&#039;)+1:]
                    img = &quot;http://naa16.naa.gov.au/rs_images/ShowImage.php?B=%s&amp;#038;T=P&quot; % barcode
                    repository = &quot;National Archives of Australia&quot;
                elif (link.find(&#039;territorystories.nt.gov&#039;) &gt; -1):
                    img = imgsoup.table.img[&#039;src&#039;]
                    repository = &quot;Northern Territory Library&quot;
                elif (link.find(&#039;statelibrary.tas.gov&#039;) &gt; -1):
                    if (html.find(&#039;No matches were found&#039;) == -1):
                        img =imgsoup.blockquote.img[&#039;src&#039;]
                        repository = &quot;State Library of Tasmania&quot;
                elif (link.find(&#039;slq.qld.gov&#039;) &gt; -1):
                    img = imgsoup.findAll(attrs={&quot;class&quot;:&quot;pictureback&quot;})[0].a[&#039;onclick&#039;]
                    #img = img[img.find(&#039;http&#039;):img.find(]
                    img = re.search(&#039;http://[\w\d\/\.]*.jpg&#039;, img).group()
                    repository = &quot;State Library of Queensland&quot;
                if (len(img) &gt; 0):
                    f.write(&quot;&lt;item&gt;\n&quot;)
                    f.write(&quot;&lt;guid isPermaLink=&#039;false&#039;&gt;%s&lt;/guid&gt;\n&quot; % id)
                    f.write(&quot;&lt;title&gt;%s (%s)&lt;/title&gt;\n&quot; % (title, repository))
                    f.write(&quot;&lt;link&gt;http://www.adb.online.anu.edu.au/biogs/%sb.htm&lt;/link&gt;\n&quot; % id)
                    f.write(&quot;&lt;media:thumbnail url=&#039;%s&#039; /&gt;\n&quot; % thumbnail.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;media:content url=&#039;%s&#039; type=&#039;image/jpeg&#039; /&gt;\n&quot; % img.replace(&#039;&amp;#038;&#039;,&#039;&amp;amp;&#039;))
                    f.write(&quot;&lt;/item&gt;\n&quot;)
                    f.flush()
                    print &quot;Success!&quot;
                    sys.stdout.flush()
                img = &quot;&quot;
    f.write(&quot;&lt;/channel&gt;\n&quot;)
    f.write(&quot;&lt;/rss&gt;\n&quot;)
    f.close()
</pre></pre>
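<p>The previous and next links between the per-volume feeds are easy to get wrong at either end of the series. A minimal Python 3 sketch of that logic on its own is below; <code>paging_links</code> and its <code>first</code> and <code>last</code> arguments are names invented for this example, and the filename pattern is the one used in the script above.</p>

```python
def paging_links(i, first, last, pattern="adb-portraits-%s.rss"):
    """Return (rel, href) pairs linking feed number i to its neighbours."""
    links = []
    if i > first:
        links.append(("previous", pattern % (i - 1)))
    if i < last:
        links.append(("next", pattern % (i + 1)))
    return links


for rel, href in paging_links(8, 1, 17):
    print(rel, href)
```

<p>Passing the first and last feed numbers explicitly keeps the end points in one place, so the links cannot point at a feed that was never written.</p>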
]]></content:encoded>
			<wfw:commentRss>http://discontents.com.au/shed/experiments/cloudy-biographies-and-portrait-walls/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

