Life on the outside: Collections, contexts, and the wild, wild web

Keynote presented at the Annual Conference of the Japanese Association for the Digital Humanities, 20 September 2014, Tsukuba.

The full set of slides is available on SlideShare.

Cross-published on Medium.


This is Tatsuzo Nakata. In 1913 he was living on Thursday Island in the Torres Strait, just off the northern tip of Australia.

From the late 19th century there was a substantial Japanese population on Thursday Island, mostly associated with the development of the pearling industry.

I’ll admit that I know very little about Tatsuzo, and I’ve selected him more or less at random from a large body of records held by the National Archives of Australia.

I present him here out of context and in too little detail, simply as an example. Working backwards from this photograph I want to restore some layers of context and reveal to you a complex and shameful history.

This photograph was attached to an official government form called a ‘Certificate Exempting From Dictation Test’.

From the form we learn that the 32 year-old Tatsuzo was born in Wakayama. He had a scar over his right eye.

Tatsuzo carried a copy of this form with him when he departed for Japan aboard the Yawata Maru in May 1913. When he returned the following year the form was collected and compared with a duplicate held by port officials. The forms matched, and Tatsuzo was allowed to disembark.

To help confirm his identity, the form carried on its reverse side an impression of Tatsuzo’s hand.

You might think that this was a travel document — an early form of visa perhaps. But at the top of the form you’ll notice a reference to the Immigration Restriction Act, a piece of legislation introduced by the newly-federated Australian nation in 1901. The Immigration Restriction Act and the complex bureaucratic procedures that supported its administration came to be known more generally as the White Australia Policy.

If Tatsuzo had tried to return to Australia without one of these forms, he would have been subjected to the Dictation Test, and he would have failed. Despite its benign-sounding name, the Dictation Test was a form of racial exclusion aimed at anyone deemed non-white. No-one was meant to pass. If he hadn’t carried this form exempting him from the Dictation Test, Tatsuzo would most likely have been denied re-entry.

This certificate is drawn from one of more than 14,000 files in Series J2483 in the National Archives of Australia. This series is solely concerned with the administration of the White Australia Policy. There are many other series from other ports and other time periods full of documents like this. The National Archives holds many, many thousands of these certificates documenting the lives and movements of people considered out of place in a White Australia.

Photographs, forms, files, series, legislation — this small shard of Tatsuzo’s life is preserved as part of a racist system of exclusion and control. But what happens when we extract the photos from their context within the recordkeeping system and simply present them as people?

I’ve created a site where you can explore some of the records relating to Japanese people held in Series J2483. Instead of navigating lists of files, you can start with faces — with the people, not the system.

I’m starting today with Tatsuzo and this wall of faces because what I want to explore are some of the complexities of context.

Shark Attack!

After a series of fatal shark attacks in Australian waters, the community of Port Hacking, in southern Sydney, began to wonder if they too were at risk.

In January 2014 the local newspaper published an article under the heading ‘Shark “cover up” in Port Hacking’ alleging that research into the dangers had been suppressed.

Ten days later the newspaper followed up with details of the area’s only recorded fatal shark attack in 1927. A local government member, it reported, had ‘unearthed the article on Trove’.

‘It’s long been a story that a boy was killed by a shark at Grays Point many years ago’, he said, ‘I knew about it 30 to 40 years ago but if you talk to people around here, nobody knows about it’.

‘A lot of people say there are no sharks in Port Hacking but this is rubbish’, he added.

Let me reassure anyone thinking about coming to DH2015 in Sydney next year that shark attacks are extremely rare.

What interested me about these articles was not the risk of gruesome death, but the relationship between past and present. The question of whether shark attacks were possible could be answered — simply by searching Trove.


For those who don’t know, Trove is a discovery service developed and maintained by the National Library of Australia. Like Europeana, the Digital Public Library of America, and DigitalNZ, it aggregates resources from the cultural heritage sector, and beyond.

It also provides access to more than 130 million newspaper articles from 1803 onwards. The articles are drawn from over 600 different titles, large and small, rural and metropolitan, with more being added all the time.

Search for just about anything and you’re likely to find a match of some sort amongst the digitised newspapers. So of course I searched for Tsukuba.

Trove is also a community. Users correct the OCR’d text of newspaper articles. They also add thousands of tags and comments to resources across Trove.

  • 138,000 users
  • 3,000,000 tags
  • 139,000,000 corrections
  • 58,000 lists

Perhaps my favourite example of user-generated content on Trove is Lists. Lists are pretty much what they sound like: collections of resources. They make it easy for you to save and share your research. But more than tags or comments they expose people’s interests and passions. They give some insight into the many acts of meaning-making that occur in and around Trove.

Lists are also exposed through Trove’s Application Programming Interface (API) in a form fit for machine consumption. So with just a dash of code I can harvest the titles of all public lists and do some very basic word frequency analysis courtesy of Voyant Tools.
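The word-counting half of that process is simple enough to sketch. The Python below assumes the list titles have already been harvested from the API into a plain list of strings. The function name, the tiny stopword list, and the sample titles are all made up for illustration, and Voyant Tools does considerably more than this.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def title_word_frequencies(titles, stopwords=STOPWORDS):
    """Count how often each word appears across a collection of list titles."""
    counts = Counter()
    for title in titles:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts

# Hypothetical titles standing in for the harvested data.
sample_titles = [
    "Smith family history",
    "History of the Smith family in Ballarat",
    "Shark attacks",
]
frequencies = title_word_frequencies(sample_titles)
print(frequencies.most_common(3))
```

Even in a toy sample like this, the family history words float to the top while the one-off interests sit in the tail.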

There’s nothing too surprising here — we know that family historians are our largest user group. But we can also see the long tail in action — the way that huge collections like Trove can support very focused, specific interests.

Which leads me back to shark attacks.

Old Speak

The Port Hacking article made me wonder how many other web pages there might be out on the wider web that cited Trove newspapers in a discussion of shark attacks. The answer was many. But what was most interesting wasn’t the volume of references, it was the variety of contexts — in blog posts, on Facebook, in fishing forums.

‘Ahh, old time newspapers are fascinating things aren’t they?’, notes one post in a weather forum, citing details of a shark attack in Sydney from 1952.

On a fishing site, a thread on bull shark attacks in Western Australia’s Swan River begins: ‘I found a great website to view really old newspapers in perth. Just found a few swan river shark storys [sic]…’.

The author follows up with a direct link to the Trove search page, prompting the exchange:

Redfin 4 Life: ‘Haha you would never know there had been that many incedents in the swan without seeing these…’

Goodz: ‘Oh how newspapers have changed the way the write… love the old speak!’

Alan James: ‘That’s right Goodz, and more often than not I’m sure they actually reported the truth.’

So a discussion of shark attacks turns to a consideration of the changing style of newspaper reporting.

Perhaps even more interesting is the way that digitised newspapers are used to test a hypothesis, challenge an interpretation, or argue a case. As in the Port Hacking case, questions about the history of shark attacks can be explored without needing to turn to experts, history books, or official statistics.

So when a local politician is quoted as saying ‘there have not been any serious or fatal shark attacks at Coogee Beach since records commenced in the 1800s’, a reader can respond with two Trove newspaper citations and the comment: ‘No previous shark attacks? Or are they only searching for fatalities?’

When a media outlet asks its Facebook followers whether the export of live sheep from Western Australia might be increasing the number of shark attacks off the coast, one follower can simply share a Trove link to a newspaper article from 1950 and ask ‘Did they have live sheep export in 1950?’

I don’t want to argue that these interactions are particularly profound or remarkable. In fact I’d suggest that they’re interesting because they’re not remarkable. 130 million digitised newspaper articles chronicling 150 years of Australian history are just another resource woven into the fabric of online experience. The past can be mobilised, shared and embedded in our daily interactions as easily as pictures of cats.


And it’s not just shark attacks. To explore the variety of contexts in which Trove newspaper articles are used and shared, I started mining backlinks.

Backlinks, as the name suggests, are just links out there on the wild, wild web that point back to your site. You can find them in your referrer logs, in Google’s webmaster tools, or simply by searching. I started with a ‘try before you buy’ sample of backlinks from an SEO service.

From there I wrote a script to harvest the linking pages, remove duplicates, extract the newspaper references, retrieve the article details from the Trove API, and save everything to a database for easy exploration. You can play with the results online.
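To give a feel for the "extract the newspaper references" step, here is a minimal sketch rather than the actual harvesting script. It assumes article links take one of two URL shapes (the longer `trove.nla.gov.au/ndp/del/article/` form and the shorter `nla.gov.au/nla.news-article` form), and the article ids in the sample page are invented.

```python
import re

# Two URL shapes assumed for Trove newspaper articles; both patterns
# should be checked against real harvested pages before relying on them.
ARTICLE_URL = re.compile(
    r"https?://(?:trove\.nla\.gov\.au/ndp/del/article/"
    r"|nla\.gov\.au/nla\.news-article)(\d+)"
)

def extract_article_ids(html):
    """Return the unique article ids linked from a page, in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(ARTICLE_URL.findall(html)))

# A made-up linking page: two links to one article, one link to another.
page = """
<a href="http://trove.nla.gov.au/ndp/del/article/1234567">A shark story</a>
<a href="http://nla.gov.au/nla.news-article1234567">The same article again</a>
<a href="http://nla.gov.au/nla.news-article7654321">Another article</a>
"""
article_ids = extract_article_ids(page)
print(article_ids)
```

Reducing each link to a bare id means the two URL forms collapse into a single reference, which is what makes the later API lookup and the per-article counts possible.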

I ended up harvesting 3116 pages from 1780 domains containing 13,389 links to 11,242 articles in Trove. Remember that’s just a sample of all the links to Trove newspapers out there on the web.

What was more surprising than the raw numbers was the diversity of content across those pages. I knew that family and local historians were busily blogging about their Trove discoveries, but I didn’t know that Trove newspapers were being cited in discussions about politics, science, war, sport, music — just about any topic you could imagine.

Nor are these discussions just about Australia. A little quick and dirty analysis suggests that more than 30 languages are represented across those 3000 pages.

This is a work in progress. I hope to expand my hunt for traces — crawling sites for additional references, mining referrals, and inviting the public to nominate pages for inclusion. By adding a simple API I could make it possible for Trove to include links back to relevant pages, like trackbacks on a blog. I also want to understand more about the scope of the content and the motivations of its authors. What is going on here?

Undoubtedly some of these pages constitute link spam or attempts to game search engines, but most do not. Browsing the database you find many examples of interpretation, persistence, and passion. People around the world have something they want to say, something they want to share, and Trove’s millions of newspaper articles provide them with a readily-accessible source of inspiration and evidence.

It’s clear that those many small acts of meaning-making we can observe in Trove’s activity statistics extend beyond a single site — to a much much wider (and wilder) world.


One day earlier this year, Trove received more than three times its usual number of visitors.

The culprit was the WTF subreddit — a popular place for sharing the weirdities of the web. Someone posted a link to a Trove newspaper article describing the unfortunate demise of a poodle called Cachi, whose fall from a thirteenth-story balcony in Buenos Aires resulted in the deaths of three passers-by.

As well as causing a dramatic spike in Trove’s visitor stats, the post received more than 3000 votes and attracted 677 comments on reddit. Cachi was a hit.

Trove articles pop up regularly on reddit. The traffic spikes they bring are reminders that however proud we might be of our stats, we are but a tiny corner of the web. There’s something much bigger out there.

Michael Peter Edson has long sought to alert cultural heritage organisations to the challenges of scale. In a recent essay he described the web’s ‘dark matter’:

There’s just an enormous, humongous, gigantic audience out there connected to the Internet that is starving for authenticity, ideas, and meaning. We’re so accustomed to the scale of attention that we get from visitation to bricks-and-mortar buildings that it’s difficult to understand how big the Internet is—and how much attention, curiosity, and creativity a couple of billion people can have.

Libraries, archives and museums, he argues, need to meet the public where they are, to recognise that vigorous sites of meaning-making are scattered across the vast terrain of the web. Trove newspaper traces and reddit spikes are mere glimpses of the ‘dark matter’ of cultural activity that lurks beneath the apps, the stats, and the corporate hype.

People are already using our digital stuff in ways we don’t expect. The question is whether libraries, archives and museums see this hunger for connection as an invitation or a threat. Do we join the party, or call the police to complain about the noise?


There’s something fundamentally human about sharing. Yes, it’s easy to mock the shallowness of a Facebook ‘Like’; to see our obsession with followers, friends and retweets as evidence of our dwindling capacity for attention — reducing engagement and understanding to a single click. But haven’t we always shared — through stories, gossip, jokes, performances, and rituals? Rather than being measured against a threshold of meaning, surely each act of sharing exists on a continuum from the flippant to the philosophical. Just because the act of sharing has been commodified by large social media services seeking to mine our preferences for profit, doesn’t mean it lacks deeper human significance.

A retweet can represent a fleeting interest, a brief moment of distraction. But it can also mark the start of a journey.

Cultural heritage institutions around the world have begun to recognise that sharing is not just a marketing strategy, it’s a mission. As Merete Sanderhoff notes in her foreword to the anthology Sharing is Caring:

When cultural heritage is digital, open and shareable, it becomes common property, something that is right at hand every day. It becomes a part of us.

Aggregation services, like Trove, the Digital Public Library of America, Europeana, and DigitalNZ, bring resources together to share them more easily with the world. Aggregation is only worthwhile if it serves discovery and reuse — it’s a process of mobilisation, rather than collection. As Europeana argues in their 2020 strategy:

We believe culture is a catalyst for social and economic change. But that’s only possible if it’s readily usable and easily accessible for people to build with, build on and share.

Of course the hard part is understanding what makes something ‘readily usable and easily accessible’. What balance do we need between push and pull? Between ease-of-use and technical power? Between licensing and liberty? Between context and creativity?

Busy Bots

The Mechanical Curator was born in the British Library Labs as part of their innovative digital scholarship program. In September 2013, she started posting to Tumblr random images automatically extracted from a collection of 65,000 digitised 19th century books.

It was, Ben O’Steen explained, an experiment in ‘providing undirected engagement with the British Library’s digital content’. The book illustrations moved from inside to outside, opening opportunities for discovery beyond the covers.

But that was just the beginning. A few months later the Mechanical Curator dramatically expanded its labours, uploading more than a million public domain images to Flickr.

What followed was something of a cultural feeding frenzy as people from all over the world started sharing, tagging, collecting, and creating with this rich assortment of 19th century illustrations. Since then the images have been mashed up into new works, added and organised in the Wikimedia Commons, and featured in an installation at the Burning Man festival in Nevada.

Having been locked away within books for more than a hundred years, the illustrations were given new life online as works in their own right. Opportunities for innovation and expression were created by a rupture in context.

Meanwhile on Twitter, a growing army of bots was liberating items from cultural collections around the world. Inspired by the bot-making genius of Mark Sample, I created @TroveNewsBot in June 2013 to tweet newspaper articles from Trove.

He was joined by @DPLABot, @EuropeanaBot, @Kasparbot, @CurtinLibBot, @museumbot, @cooperhewittbot, @bklynmuseumbot, and no doubt others, all sharing random collection items. Of course @MechCuratorBot soon joined the fray from the British Library, and I eventually added @Trovebot to tweet material from all the non-newspapery sections of Trove.

The possibilities of serendipitous discovery are receiving increasing attention within the digital humanities. At DH2014, Kim Martin and Anabel Quan-Haase critically examined four DH tools, including @TroveNewsBot, in the light of existing models of serendipity. Their discussion noted that randomness is not the same as serendipity, and outlined how serendipity could be understood as a type of encounter with information. I do wonder though if what makes the bots interesting is not randomness as such, but the way randomness can play around with our assumptions about context.

Steve Lubar observes that the random offerings of collection bots can also expose the choices that are made in the creation and display of cultural collections. Randomness can challenge our expectations. Describing the genesis of the Mechanical Curator, James Baker notes:

And so as what at first seemed simple descends into complexity the Mechanical Curator achieves her peculiar aim: giving knowledge with one hand, carpet bombing the foundations of that knowledge with the other.

The Trove bots I created do more than tweet random offerings, they also allow you to interact with Trove without ever leaving Twitter. Send a few keywords their way and they’ll do your searching for you, tweeting back the most relevant result. You can modify their default behaviour by adding a series of hashtags — #luckydip, for example, will spice your result with a touch of randomness.
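The keywords-plus-hashtags idea can be sketched in a few lines of Python. This is an illustration of the general approach, not the bots' actual code: the function names are hypothetical, and treating #luckydip as "pick any result at random" is an assumption about what a touch of randomness might mean.

```python
import random

def parse_request(tweet_text, bot_name="@TroveNewsBot"):
    """Split a tweet into a keyword query and a set of lowercase hashtag options."""
    keywords, options = [], set()
    for token in tweet_text.split():
        if token.lower() == bot_name.lower():
            continue  # drop the mention of the bot itself
        if token.startswith("#") and len(token) > 1:
            options.add(token[1:].lower())
        else:
            keywords.append(token)
    return " ".join(keywords), options

def choose_result(results, options, rng=random):
    """Pick the top result, or a random one if #luckydip was requested (assumed behaviour)."""
    if not results:
        return None
    if "luckydip" in options:
        return rng.choice(results)
    return results[0]

query, options = parse_request("@TroveNewsBot shark attack #luckydip")
print(query, options)
```

Keeping the parsing separate from the choice of result means new hashtags can be added without touching the search itself.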

More interestingly, perhaps, you can tweet a url at them and they’ll extract keywords from the web page and use them to construct the search. This means that @TroveNewsBot can offer commentary on current events.

Several times a day he retrieves the latest headlines from a news site and searches for something similar amidst Trove’s 130 million historical newspaper articles. What emerges is a strange conversation between past and present.

These bots do not simply present collection items outside of the familiar context of discovery interfaces or online exhibitions, they move the encounter itself into a wholly new space. Just as the Mechanical Curator liberates illustrations from the printed page, the Twitter bots loosen the institutional context of collections to allow them to participate in a space where people already congregate. They send collection items out into the wilds of the web, to find new meanings, new connections and perhaps even new love.

Broken & Repaired

But letting go can be scary. A 2008 survey of libraries, archives and museums revealed that one of the main factors inhibiting the opening up of online collections was the desire to avoid misrepresentation, mislabeling or misuse of cultural objects. Easy sharing brings the risk that our carefully curated content will be shorn of context and bounced around the web — adrift and abused.

Earlier this year Sarah Werner took aim at Twitter feeds that pump out streams of ‘historical’ photos — unattributed and often wrongly captioned. But it wasn’t simply the lack of attribution that angered her:

These accounts capitalize on a notion that history is nothing more than superficial glimpses of some vaguely defined time before ours, one that exists for us to look at and exclaim over and move on from without worrying about what it means and whether it happened.

I have to admit that the excitement of seeing Trove’s visitor numbers suddenly soar thanks to reddit is frequently tempered by the realisation that what is being shared is yet another story of gruesome death, violence, or misfortune. 150 years of Australian history is reduced to clickbait by our tabloid sensibilities. Most of those who arrive from reddit read the article and click away: the bounce rate is around 97%. This is not ‘engagement’.

And yet, I can’t help but wonder about the 3% who don’t immediately leave, who pause and look around. Three percent of a lot is still a lot — a lot of people who might have been exposed to Trove and Australian history for the very first time. Similarly while the viral pics industry is frustrating and exploitative, it might yet offer opportunities to learn.

One of my favourite Twitter accounts is @PicsPedant. It monitors many of the viral pics feeds, researches the images, and tweets the results — providing a steady stream of attributions, corrections, critiques, and context. Not only do you find out about the images, you pick up research tips, and learn about the cannibalistic tendencies of the pic bots themselves — constantly recycling content from their kin.

@AhistoricalPics offers a different form of education, satirising the whole viral pics genre with its fabricated captions, and pricking at our own inclination to believe.

Freeing collections opens them to misuse, but it also exposes that misuse to analysis and critique. Contexts can be rediscovered as well as lost, restored as well as broken.

Generous signposts

It’s wonderful to see many Trove newspaper articles shared on Twitter. Unfortunately a significant proportion of these come from climate change deniers, who mine the newspapers for freak weather events and past climatic theories, imagining that such reports undermine current research. This is bad science and bad history. Their efforts are also well-represented in my database of web page citations, along with expressions of hatred and prejudice that I’d prefer to stay submerged. It’s depressing, but it seems inevitable that people will do bad things with your stuff.

In a recent post about the DPLA’s metadata licensing arrangements, Dan Cohen suggested we should look beyond technical and legal controls around online use towards social and ethical guidelines:

The cynics, of course, will say that bad actors will do bad things with all that open data. But here’s the thing about the open web: bad actors will do bad things, regardless… The flip side of worries about bad actors is that we underestimate the number of good actors doing the right thing.

Bad people will do bad things, but by asserting a social and ethical framework for the use of digital cultural collections we strengthen the resolve and commitment of those who want to do right.

Already there are examples in the work of the Local Contexts project which is developing a series of licenses and labels to guide use of traditional knowledge and cultural materials. Similarly, Creative Commons Aotearoa New Zealand have been developing an Indigenous Knowledge Notice to educate the public about what constitutes appropriate use.

We should remember too that footnotes have always been at the heart of an ethical pact. The Australian historian Tom Griffiths has described footnotes as ‘honest expressions of vulnerability’ — ‘generous signposts to anyone who wants to retrace the path and test the insights’. This ‘professional paraphernalia’ has, he argues, grown out of a series of ethical questions:

To whom are we responsible – to the people in our stories, to our sources, to our informants, to our readers and audiences, to the integrity of the past itself? How do we pay our respects, allow for dissent, accommodate complexity, distinguish between our voice and those of our characters? [1]

Such questions remain crucial as we consider the relationship between cultural collections and their online users. If we expect people to erect ‘generous signposts’ we have to make our stuff easy to find and share. If we want them to consider their responsibility to the past we should focus on providing trust, confidence, and support, not permission.


If my wall of faces seems familiar, it might be because a few years ago I created something similar called The Real Face of White Australia.

The two walls use different sets of records, but they were constructed in much the same way: I reverse-engineered the National Archives’ online database, downloaded images of digitised files, and used a facial detection script to identify and extract faces.

The Real Face of White Australia was an experiment, built over the course of a weekend. But its discomfiting power was immediately evident. Where there had been records, there were people — looking at us, challenging us.

My partner Kate Bagnall is a historian of Chinese-Australia and we were working together on a project called Invisible Australians, aimed at liberating the lives of these people from the bureaucracy of the White Australia Policy.

The project was motivated by a strong sense of responsibility — not to the National Archives, not to the records, but to the people themselves.

We often talk about preserving context as if it’s an end in itself; as if context is just a set of attributes to be catalogued and controlled. The exciting, terrifying, wonderful thing about the wild, wild web is how it upsets our notions of relevance and meaning. Historic newspapers can find their way into contemporary debates. Century-old illustrations can be remade as art. Twitter bots can inspire conversations with collections. The people buried inside a recordkeeping system can be brought at last to the surface. Contexts are unstable, shifting. And through that instability we can glimpse other worlds, we can imagine alternatives, we can build something new.

What’s important is not training users to understand the context of our collections, but helping them explore and understand their responsibilities to the pasts those collections represent.

Let’s remove technical barriers, minimise legal restrictions, and trust in the good will of our audiences. Instead of building shrines to our descriptive methodologies, let’s create systems that provide stable shareable anchors, that connect, but don’t constrain.

Contexts will flow and mingle, some will fade and some will burn. Contexts will survive not because we demand it in our terms of service, or embed them in our interfaces, but because they capture something that matters.

The ways we find and use cultural collections will continue to change, but questions about responsibility, value, and meaning will remain.


  1. Tom Griffiths, ‘History and the creative imagination’, History Australia, Vol. 6, No. 3, 2009.

On seams and edges

Recently I submitted the abstract below for ALIA Information Online 2015. I haven’t heard yet whether it’s been accepted, but I thought I’d post it here anyway because, even if I don’t get to talk about it at the conference, I want to think about the topic some more. If nothing else, this is an extended NTS…

Many thanks to @edsu and @nowviskie for pointing me towards ideas of ‘repair’ and ‘broken world thinking’, which I reckon will help me develop the arguments I was gesturing towards earlier this year in a talk on The Future of Trove. In that talk I drew on some of my old research on the nature of progress to describe a future for Trove that avoided visions of technological power and sophistication:

The future of Trove shouldn’t be envisaged in terms of slick interfaces and fast search (though I’d like some more of that).

The future of Trove will be messy, and it will be complicated, because life is just like that, and while Trove is built of metadata, it’s powered by the people that contribute, use, share and annotate that metadata.

Life can also be disappointing, painful and disturbing, and all of that too must figure in the future of Trove.

It’s important to try and see Trove as a series of accommodations, agreements, and annotations, rather than as a big aggregation machine. There’s a fragility in the connections that we make that needs to be understood. There’s no inevitability here, but many acts of goodwill, generosity, and repair.

More to come on this, I hope… (I’m also collecting some relevant bits and pieces in Zotero.)

On seams and edges — dreams of aggregation, access & discovery in a broken world

Visions of technological utopia often portray an increasingly ‘seamless’ world, where technology integrates experience across space and time. Edges are blurred as we move easily between devices and contexts, between the digital and the physical.

But Mark Weiser, one of the pioneers of ubiquitous computing, questioned the idea of seamlessness, arguing instead for ‘beautiful seams’ — exposed edges that encouraged questions and the exploration of connections and meanings.

With discovery services and software vendors still promoting ‘seamless discovery’ as one of their major selling points, it seems the value of seams and edges requires further discussion. As we imagine the future of a service such as Trove, how do we balance the benefits of consistency, coordination and centralisation against the reality of a fragmented, unequal, and fundamentally broken world?

This paper will examine the rhetoric of ‘seamlessness’ in the world of discovery services, focusing in particular on the possibilities and problems facing Trove. By analysing both the literature around discovery, and the data about user behaviours currently available through Trove, I intend to expose the edges of meaning-making and explore the role of technology in both inhibiting and enriching experience.

How does our dream of comprehensiveness mask the biases in our collections? How do new tools for visualisation reinforce the invisibility of the missing and excluded? How do the assumptions of ‘access’ direct attention away from practical barriers to participation?

How does the very idea of systems and services, of complex and powerful ‘machines’ ready to do our bidding, discourage us from seeing the many, fragile acts of collaboration, connection, interpretation, and repair that hold these systems together?

Trove is an aggregator and a community; a collection of metadata and a platform for engagement. But as we imagine its future, how do we avoid the rhetoric of technological power, and expose its seams and edges to scrutiny?


Eyes on the past

Faces offer an instant connection to history, reminding us that the past is full of people. People like us, but different. People with their own lives and stories. People we might only know through a picture, a few documentary fragments, or a newspaper article.

Eyes on the Past is an experimental interface, built in a weekend. I’m exploring whether faces can provide a way to explore more than 120 million newspaper articles available on Trove.

This collection of tweets tells the story of its development.


There are some details about the software used on the site’s about page. You can view the harvest/detection and the website code on GitHub.

Easter eggsperiments

No, nothing to do with Easter or eggs, but it’s Easter Sunday and who can resist a good opportunity for a bad pun?

This is another catch-up post, pulling together some recent experiments. If nothing else, it’ll help me keep track of things I’m otherwise likely to forget.

WWI Faces

In our last instalment I was playing around with some WWI data from the State Library of South Australia. I’m really pleased to report that SLSA staff have used my experiments to help them add Trove links to more than 6,000 of their Heroes of the Great War records. Here’s an example: note the ‘article’ link, which goes straight to a digitised newspaper article in Trove. With some good data and a bit of API wrangling we’ve now established rich linkages between an important WWI resource and Trove. Win!

I’ve also continued my fiddling with articles from the Adelaide Chronicle as I start to think about how Trove’s newspapers might be used in a WWI exhibition being developed by the National Library. At the end of my last post I’d created a list of articles from the Chronicle that were likely to include biographical details of WWI personnel. I knew that many of these included portrait photos, so I filtered them on Trove’s built-in ‘illustrated’ facet and saved the page images for the remaining articles. You can browse the resulting collection of pages on Dropbox. As you can see there are indeed many portraits of service people.

So the next step was to try and extract the portraits from the pages. This was rather familiar territory, as I’d already used a facial detection script to create The Real Face of White Australia. But I wasn’t sure how the pattern recognition software would cope with the lower quality newspaper images. After getting all the necessary libraries installed (the hardest bit of the whole process), I pointed the script at the page images and… it worked!

A small sample of the faces extracted from the Chronicle.

From 141 pages I extracted 1,738 images, and most of them were faces. You can browse all 1,738, but be warned, I’ve just dumped them onto a single page and added a bit of Isotope magic — so they’ll take a fair while to load and your browser might object. You’ll also notice that I haven’t tried to filter out photos of non-service people, I just wanted to see if it worked. And it does. Even in this rough form you can sense some of the emotive power. What’s really amazing is the way that even small images of faces in group photographs were identified. All I was aiming for at this stage was a proof of concept — yes, I can extract photos of WWI service people from newspapers. Hmmm…

Trove in space

All the faces above were from one South Australian newspaper. Several years ago I worked on a project to map places of birth and enlistment of WWI service people, and while I have no interest in the national mythologies surrounding WWI, I do still wonder about the local impact of war — all those small communities sending off their sons and daughters…

So I’m wondering whether we might be able to use the digitised newspapers in Trove to navigate from place to face. To choose a town anywhere in Australia, and present photographs of service personnel published in nearby newspapers.

I now know I can extract the photos, but how can we navigate Trove newspapers by location? Time for a new experiment…

The Trove API provides a complete list of digitised newspaper titles. You’ll notice that some of the titles include a place name as part of the summary information in brackets, while many others will include place names in their titles, for example:

  • Illawarra Daily Mercury (Wollongong, NSW : 1950 – 1954)
  • Hawkesbury Herald (Windsor, NSW : 1902 – 1945)
  • Kiama Examiner (NSW : 1858 – 1859)
  • Narromine News and Trangie Advocate (NSW : 1898 – 1955)
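The bracketed summaries in titles like these follow a regular enough pattern that a regular expression can pull them apart. The following is only a sketch based on the four examples above, not the code used for the experiment; the pattern and the `parse_summary` name are assumptions, and real titles may well include variations it misses.

```python
import re

# Matches a trailing summary such as "(Wollongong, NSW : 1950 – 1954)"
# or "(NSW : 1858 – 1859)". The place part is optional.
SUMMARY_PATTERN = re.compile(
    r"\((?:(?P<place>[^,()]+),\s*)?(?P<state>[^:,()]+?)\s*:\s*(?P<years>[^()]+)\)\s*$"
)


def parse_summary(title):
    """Return the place, state and years from a title's bracketed summary.

    Returns None if the title has no summary in the expected form.
    """
    match = SUMMARY_PATTERN.search(title)
    if match is None:
        return None
    return {
        "place": match.group("place"),  # None when only a state is given
        "state": match.group("state").strip(),
        "years": match.group("years").strip(),
    }


with_place = parse_summary("Illawarra Daily Mercury (Wollongong, NSW : 1950 – 1954)")
without_place = parse_summary("Kiama Examiner (NSW : 1858 – 1859)")
```

For the Kiama Examiner the summary gives only a state, so the place has to be found in the body of the title instead.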

I haven’t had much luck getting automated named entity extraction tools to work on short text strings like this, so I decided to roll my own using Geoscience Australia’s Gazetteer of Australia 2012. I opened up the GML file containing all Australian places and saved the populated locations to my own Mongo database. This gave me a handy place name database, complete with geo-locations.

Next I went to work on the newspaper titles. Extracting the places from the summary information was easy because they followed a regular pattern, but finding them in the body of the title was trickier. First I had to exclude those words that were obviously not place names. Aside from the usual stopwords (‘and’ and ‘the’), there are many words that commonly occur in newspaper titles — ‘Herald’, ‘Star’, ‘Chronicle’ etc. To find these words I pulled apart all the titles and calculated the frequency of every word. You can explore the raw results — ‘Advertiser’ (116) wins by a large margin, with ‘Times’ (67) in second place. From these results I could create a list of words that I knew were not places and could safely be ignored.
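As a rough illustration of the word-counting step, something like the following would do the job. It is a sketch rather than the original script: the function name is made up, and the sample is just the four titles quoted earlier, so every word appears only once.

```python
import re
from collections import Counter


def title_word_frequencies(titles):
    """Count how often each word appears across a list of newspaper titles."""
    counts = Counter()
    for title in titles:
        # Drop the trailing bracketed summary, e.g. "(NSW : 1858 – 1859)"
        body = re.sub(r"\s*\([^()]*\)\s*$", "", title)
        counts.update(re.findall(r"[a-z']+", body.lower()))
    return counts


sample_titles = [
    "Illawarra Daily Mercury (Wollongong, NSW : 1950 – 1954)",
    "Hawkesbury Herald (Windsor, NSW : 1902 – 1945)",
    "Kiama Examiner (NSW : 1858 – 1859)",
    "Narromine News and Trangie Advocate (NSW : 1898 – 1955)",
]

frequencies = title_word_frequencies(sample_titles)

# The most frequent words are the candidates for a list of non-place words
for word, count in frequencies.most_common(5):
    print(word, count)
```

Run across the complete list of titles, the top of this ranking is where words like ‘Advertiser’ and ‘Times’ show up.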

Then it was just a matter of tokenising the titles (breaking them up into individual words), removing all the stopwords (the standard list and my special list), and then looking up the words in my place name database. I did this in two passes, first as bigrams (pairs of words), and then as single words, which allowed me to find compound place names like ‘North Melbourne’. The Trove API gives you a ‘state’ value for each title, so I could use this in the query to increase accuracy.
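The two-pass lookup can be sketched as follows. This is an illustration under several assumptions: a small dictionary stands in for the place name database, the coordinates are approximate and for demonstration only, the state codes are a guess at how states might be represented, and ‘North Melbourne Advertiser’ is used purely as an example title.

```python
import re

# Words that are known not to be places (standard stopwords plus title words)
NON_PLACE_WORDS = {
    "and", "the", "of", "herald", "star", "chronicle", "advertiser",
    "times", "news", "daily", "mercury", "examiner", "advocate",
}

# Hypothetical stand-in for the place name database:
# (place name, state) -> approximate (latitude, longitude)
PLACES = {
    ("north melbourne", "VIC"): (-37.80, 144.95),
    ("melbourne", "VIC"): (-37.81, 144.96),
    ("narromine", "NSW"): (-32.23, 148.24),
    ("trangie", "NSW"): (-32.03, 147.98),
}


def find_places(title, state):
    """Look for place names in a title: bigrams first, then single words."""
    body = re.sub(r"\s*\([^()]*\)\s*$", "", title)  # drop the summary
    tokens = [
        token for token in re.findall(r"[a-z']+", body.lower())
        if token not in NON_PLACE_WORDS
    ]
    found = []
    used = set()  # positions already claimed by a two-word place name

    # First pass: pairs of adjacent words, to catch compound names
    for i in range(len(tokens) - 1):
        if i in used:
            continue
        name = tokens[i] + " " + tokens[i + 1]
        if (name, state) in PLACES:
            found.append(name)
            used.update({i, i + 1})

    # Second pass: single words that were not part of a matched pair
    for i, token in enumerate(tokens):
        if i not in used and (token, state) in PLACES:
            found.append(token)

    return found


compound = find_places("North Melbourne Advertiser", "VIC")
singles = find_places("Narromine News and Trangie Advocate (NSW : 1898 – 1955)", "NSW")
```

Checking pairs first matters: without it, ‘North Melbourne’ would be matched as plain ‘Melbourne’.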

If I found a place name, I added the place details, including the latitude and longitude, to the title record from the API and included it in my own newspaper title database.

So I ended up with two databases: one with geolocated places, and another with geolocated newspapers. That meant I could build a simple interface to find newspaper titles by place. It’s nothing fancy (just another proof of concept), but it works pretty well. Just type in a place name and select a state, and a query is run against the place name database. If the place is found then its latitude and longitude are fed to the titles database to find the closest newspapers. After removing some duplicates, the 10 nearest newspapers are displayed.
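The ‘closest newspapers’ step can be approximated in plain Python with a great-circle distance calculation. This sketch is a stand-in for whatever geospatial query the titles database actually runs; the function names, the record structure, and the approximate coordinates are all assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_titles(lat, lon, titles, limit=10):
    """Return up to `limit` titles, closest first, with duplicates removed."""
    seen = set()
    unique = []
    for record in titles:
        if record["title"] not in seen:
            seen.add(record["title"])
            unique.append(record)
    return sorted(
        unique,
        key=lambda record: haversine_km(lat, lon, record["lat"], record["lon"]),
    )[:limit]


# Approximate coordinates, for illustration only
sample_titles = [
    {"title": "Hawkesbury Herald", "lat": -33.61, "lon": 150.81},
    {"title": "Kiama Examiner", "lat": -34.67, "lon": 150.85},
    {"title": "Illawarra Daily Mercury", "lat": -34.42, "lon": 150.89},
    {"title": "Kiama Examiner", "lat": -34.67, "lon": 150.85},
]

# Newspapers nearest to (roughly) Kiama
closest = nearest_titles(-34.67, 150.85, sample_titles, limit=2)
```

Sorting every record like this is fine for a small list of titles; a database with a geospatial index would be the better choice at scale.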

Find Trove newspapers by place

Building some sort of map interface on top of this is pretty trivial. What’s more important is to do some analysis of my place matching to see what I might have missed. But so far so good!

Trove is…

Trove is more than newspapers. This is a message the Trove team tries to emphasise at every opportunity. The digitised newspapers are an incredible resource of course, but there’s so much other interesting stuff to explore.

To try and give a quick and easy introduction to this richness, I created a simple dashboard-type view of Trove, imaginatively titled Trove is…

What is Trove?

Trove is… gives a basic status report on each of the 10 Trove zones, with statistics updated daily (except for the archived websites as there’s no API access at the moment). The BIG NUMBERS are counter-balanced by a single randomly-selected example from each zone. It’s a summary, an overview, a portal and a snapshot. Reload the page and the zones will be reordered and the examples will change.

It’s pretty simple, but I think it works quite well, and thanks to Twitter Bootstrap it looks really nice on my phone! But while the idea was simple, the implementation was pretty tricky, particularly the balance between randomness and performance. If all the examples were truly random, drawn from the complete holdings of Trove on every page reload, you’d spend a lot of time watching spinning arrows waiting for content to appear. I tried a number of different approaches and finally settled on a system where random selections of 100 resources per zone are made every hour by background processes and cached. When you load the page, this cache is queried and an item selected. So if you keep hitting reload you’ll probably notice that some examples reappear. It’s random, but at any moment the pool of possibilities is quite limited. Come back later in the day and everything will be different.
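The trade-off between randomness and speed can be sketched with a small cached pool. This is a simplification, not the site’s actual code: the class and parameter names are invented, and the pool here is refreshed lazily when it goes stale rather than by a separate background process.

```python
import random
import time


class ExamplePool:
    """Keep a small pool of pre-selected items and pick from it at random."""

    def __init__(self, fetch_sample, pool_size=100, max_age=3600, clock=time.time):
        # fetch_sample(n) is a hypothetical function that does the slow work
        # of drawing n random resources from a zone
        self._fetch_sample = fetch_sample
        self._pool_size = pool_size
        self._max_age = max_age  # seconds before the pool is rebuilt
        self._clock = clock
        self._pool = []
        self._refreshed_at = None

    def refresh(self):
        """Rebuild the pool (the expensive step)."""
        self._pool = list(self._fetch_sample(self._pool_size))
        self._refreshed_at = self._clock()

    def pick(self):
        """Return a random item, rebuilding the pool only if it is stale."""
        stale = (self._refreshed_at is None
                 or self._clock() - self._refreshed_at >= self._max_age)
        if stale:
            self.refresh()
        return random.choice(self._pool)
```

Each page load then costs a single cheap `pick()`, which is also why the same examples can reappear within the hour.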

Anyway, if anyone asks you what Trove is, you now know where to point them…

Who listens to the radio?

After a lot of hard work, the Trove team was excited to announce recently that more than 200,000 records from 54 ABC Radio National programs were available through Trove.

To make it a bit easier to explore this wonderful new content, I created a simple search interface. All it really does is help you build a query using the RN program titles, and then send the query off to Trove. Not fancy, but useful (my family motto).

Of course, I couldn’t leave my Twitter bot family out of the action. @TroveBot has been Radio National enabled. Just tweet the tag #abcrn at him to receive a randomly-selected Radio National story. To search for something amidst the RN records, just tweet a keyword or two and add the #abcrn tag to limit the results. Consult the TroveBot manual for complete operating instructions.

In a word…

But the Radio National content is not just findable through the Trove web interface — all that lovely data is freely accessible through the Trove API. That includes just about every segment of every edition of the ABC’s flagship current affairs programs, AM, PM, and The World Today from 1999 onwards. What sort of questions could you ask of this data?

I’ll be writing something soon on the Trove blog about accessing these riches, but I couldn’t resist having a play. So I harvested all the RN data via the API and built a new thing…

What’s in a word?

It’s called In a word: Currents in Australian affairs, 2003–2013, and for once it’s quite well documented, so I won’t go into details here. I’ll just say that it’s one of my favourite creations, and I hope you find it interesting.

Addendum (21 April) — The Tung Wah Newspaper Index

See, I told you I forget things…

I recently finished resurrecting the Tung Wah Newspaper Index. Kate has described the original project on her blog, and there’s a fair bit of contextual information on the site, so I won’t go into details here. Suffice it to say it’s an important resource for Chinese Australian history that had succumbed to technological decay.

The original FileMaker database has been MySqld, Solrised, and Bootstrapped to get it all working nicely. I also took the opportunity to introduce a bit of LOD love, with plenty of machine-readable data built-in.

The whole site follows an ‘interface as API’ pattern. So if you want a resource as JSON-LD, you just change the file extension to .json. To help you out, there are links at the bottom of each page to the various serialisations, and of course you can also use content negotiation to get what you’re after. There are some examples of all this in the GitHub repository, as well as a CSV dump of the whole database.


Enriching WWI data with the Trove API

I can’t resist a challenge, particularly when it involves lots of new historical data and an excuse to muck around with the Trove API. So when Katie Hannan from the State Library of South Australia asked me about putting the API to work to enrich one of their World War I datasets, I had to dive in and have a play.

The dataset consists of references to South Australian WWI service personnel published in the Adelaide Chronicle between 1914 and 1919. In a massive effort starting back in 2000, SLSA staff manually extracted more than 13,000 references and grouped them under 9709 headings, mostly names. You can explore the data in the SLSA catalogue as part of the Heroes of the Great War collection.

It’s great data, but it would be even better if there was a direct link to each article in Trove — hence Katie’s interest in the possibilities of the API!

Chronicle (Adelaide, SA : 1895 – 1954) 14 Sep 1918, p. 24,

Katie sent me a spreadsheet containing the data. Each row corresponds to an individual entry and includes an identifier, a name, a year, and a list of references separated by semicolons. My plan was simple: for each row I’d construct a search based on the name, then loop through the search results to try to find an article that matched the date and page number of each reference. This might seem a bit cumbersome, but currently there’s no way of searching Trove for newspaper articles published on a particular day.

You’ll find all of the code on GitHub. I’ve tried to include plenty of comments to make it easy to follow along.

Let’s look at the entry for Lieutenant Frank Rosevear. It includes the following references: ‘Chronicle, 7 September 1918, p. 27, col. c;Chronicle, 14 September 1918, p. 24, col. a p.38, col. d’. If you look closely, you’ll see that there are two page numbers for 14 September 1918, so there are actually three references included in this string. The first thing I had to do was to pull out all the references and format them in a standard way.
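A minimal sketch of that clean-up step, using the Rosevear string as input, might look like this. The patterns and the `parse_references` name are assumptions based on this one example rather than the code in the repository, and month names are parsed assuming an English locale.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"(\d{1,2} [A-Z][a-z]+ \d{4})")
PAGE_PATTERN = re.compile(r"p\.\s*(\d+)")


def parse_references(raw):
    """Split a reference string into a list of (date, page number) pairs."""
    references = []
    for chunk in raw.split(";"):
        date_match = DATE_PATTERN.search(chunk)
        if date_match is None:
            continue  # nothing usable in this chunk
        day = datetime.strptime(date_match.group(1), "%d %B %Y").date()
        # One date can be followed by several page numbers
        for page in PAGE_PATTERN.findall(chunk):
            references.append((day, int(page)))
    return references


rosevear = ("Chronicle, 7 September 1918, p. 27, col. c;"
            "Chronicle, 14 September 1918, p. 24, col. a p.38, col. d")
references = parse_references(rosevear)
```

Here `references` holds three pairs: page 27 on 7 September 1918, and pages 24 and 38 on 14 September 1918.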

Assuming that the last name was the surname, I then constructed a query that searched for an exact match of the surname together with at least one of the other names. In Lieutenant Rosevear’s case the query would’ve been ‘fulltext:"Rosevear" AND ("Lieutenant" OR "Frank")’. Note the use of the ‘fulltext’ modifier to indicate an exact match. To this query I added a date filter to limit the search to the specified year and an ‘l-title’ value to search only the Adelaide Chronicle.
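Building the query string from a name is simple enough to sketch. The `build_query` function below is illustrative only; it covers just the query text, leaving out the date filter and ‘l-title’ value, and its `surname_only` option is an added convenience for running a broader search.

```python
def build_query(name, surname_only=False):
    """Build a Trove search query from a name, treating the last word as the surname."""
    parts = name.split()
    if not parts:
        raise ValueError("name must contain at least one word")
    surname, others = parts[-1], parts[:-1]
    query = 'fulltext:"%s"' % surname
    if others and not surname_only:
        # At least one of the other names must also appear
        query += " AND (%s)" % " OR ".join('"%s"' % part for part in others)
    return query


query = build_query("Lieutenant Frank Rosevear")
broad_query = build_query("Lieutenant Frank Rosevear", surname_only=True)
```

Treating the last word as the surname is a blunt assumption, so names with suffixes or unusual ordering would need special handling.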

You can see the results for this query in my Trove API console. Try modifying the query string to see what difference it makes.

Once the results came back from the API I compared them to the references, looking for matches on both the date and page number. You might notice that the second result from the API query, dated 7 September 1918, is a match for one of our references. Yay! This gets saved to a list of strong matches. But what about the other references?

Just in case there’s been a transcription error, or the page numbering differed across editions, I relax the rules a bit in a second pass and accept matches on the date, but not the page. These are saved to a list of close matches.
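The strong and close matching rules can be sketched as a single function. The article dictionaries here are a simplified, hypothetical stand-in for the API results (the ids are made up), and the function is an illustration rather than the script in the repository.

```python
from datetime import date


def match_references(references, articles):
    """Sort references into strong (date and page) and close (date only) matches."""
    strong, close = [], []
    for ref_date, ref_page in references:
        same_day = [a for a in articles if a["date"] == ref_date]
        same_page = [a for a in same_day if a["page"] == ref_page]
        if same_page:
            strong.append(((ref_date, ref_page), same_page[0]))
        elif same_day:
            # Right date, wrong page: possibly a transcription error
            # or a different edition
            close.append(((ref_date, ref_page), same_day[0]))
    return strong, close


references = [
    (date(1918, 9, 7), 27),
    (date(1918, 9, 14), 24),
    (date(1918, 9, 14), 38),
]
articles = [
    {"id": "a1", "date": date(1918, 9, 7), "page": 27},
    {"id": "a2", "date": date(1918, 9, 14), "page": 25},
]
strong, close = match_references(references, articles)
```

With this sample data the first reference is a strong match, while the other two only match on date and so end up in the close list.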

This second pass doesn’t help much with Lieutenant Rosevear’s missing references, so we have to broaden our search query a bit. This time we search on the surname only. Bingo! The first result is a match and points us to one of the ‘Heroes of the Great War’ series.

‘HEROES OF THE GREAT WAR: THEY GAVE THEIR LIVES FOR KING AND COUNTRY.’, Chronicle (Adelaide, SA : 1895 – 1954) 14 Sep 1918, p. 24,

It took a while, but my script eventually worked its way through all 9709 entries like this, writing the results out to CSV files containing the strong and close matches. It also created a summary for each entry, listing the original number of references alongside the number of strong and close matches.

Ever since I read Trevor Munoz’s post on using Pandas with data from the NYPL’s What’s on the Menu? project, I’ve wanted to have a play with it. So I decided to use Pandas to produce some quick stats from my results file.

>>> import pandas as pd
>>> df = pd.read_csv('data/slsa_results.csv')
# How many entries?
>>> len(df)
# How many references?
>>> df['references'].sum()
# How many entries had strong matches?
>>> len(df[df.strong > 0])
# As a percentage thank you...
>>> 100 * len(df[df.strong > 0]) / len(df)
# In how many entries did the number of refs = the number of strong matches
>>> len(df[df.references == df.strong])
# As a percentage thank you...
>>> 100 * len(df[df.references == df.strong]) / len(df)
# How many entries had at least one strong or close match?
# (assumes the close-match counts are in a column named 'close')
>>> len(df[(df['strong'] > 0) | (df['close'] > 0)])
# As a percentage thank you...
>>> 100 * len(df[(df['strong'] > 0) | (df['close'] > 0)]) / len(df)

Not bad. The number of strong matches equalled the number of references in 45% of cases, and overall 66% of entries had at least one strong match. I might be able to get those numbers up by tweaking the search query a bit, but of course the main limiting factor is the quality of the OCR. If the article text isn’t good enough we’re never going to find the names we’re after.

Katie tells me that the State Library intends to point volunteer text correctors towards identified articles. As the correctors start to clean things up, we should be able to find more matches simply by re-running this script at regular intervals.

But what articles should they point the volunteers to? Many of them included the title ‘Heroes of the Great War’, so they’re easy to find, but there were others as well. By analysing the matches we’ve already found we can pull out the most frequent titles and build a list of likely candidates. Something like this:

title_phrases = [
'heroes of the great war they gave their lives for king and country',
'australian soldiers died for their country',
'casualty list south australia killed in action',
'on active service',
'honoring soldiers',
'military honors australians honored',
'casualty lists south australian losses list killed in action',
'australian soldiers died for his country',
'died for their country',
'australian soldiers died for the country',
'australian soldiers died for their country photographs of soldiers',
'quality lists south australian losses list killed in action',
'list killed in action',
'answered the call enlistments',
'gallant south australians how they won their honors',
'casualty list south australia died of wounds',
]

Now we can feed these phrases into a series of API queries and automatically generate a list of articles that are likely to contain details of WWI service people. This list should provide a useful starting point for keen text correctors.

I might not have completely solved Katie’s problem, but I think I’ve shown that the Trove API can be usefully called into action for these sorts of projects. Taking this approach should certainly save a lot of manual searching, clicking, cutting and pasting. And while I’ve focused on the South Australian data, there’s no reason why similar approaches couldn’t be applied to other WWI projects.

An addition to the family

What’s the collective noun for a group of Twitter bots? Inspiration is failing me at the moment, so let’s just say that the Trove bot family recently welcomed a new member — @TroveBot.

Proof cover for Rogue Robot, Thrills Incorporated pulp series, 1951. By Belli Luigi.

@TroveBot is a sibling of @TroveNewsBot, who’s been tweeting away since June last year. But while @TroveNewsBot draws his inspiration from 120+ million historical newspaper articles, @TroveBot digs away in the millions of books, theses, pictures, articles, maps and archives that make up the rest of Trove. Trove, as we always like to remind people, is not just newspapers.

Like @TroveNewsBot, the newcomer tweets random finds at regular intervals during the day. But both bots also respond to queries. Just tweet a few keywords at them and they’ll have a poke around Trove and reply with something that seems relevant. There are various ways of modifying this basic search, as explained on the GitHub pages of TroveNewsBot and TroveBot.

@TroveBot’s behaviour is a little more complex because of Trove’s zone structure. The zones bring together resources with a similar format. If you want to get a more detailed idea of what’s in them, you can play around with my Zone Explorer (an experiment for every occasion!). If you just tweet some keywords at @TroveBot he’ll search for them across all the zones, then choose one of the matching zones at random. If you want to limit your search to a particular zone or format, just add one of the format tags listed on the GitHub site.

Let’s say you want a book about robots, just tweet: ‘robots #book’. Or for photos of pelicans, try ‘pelican #photo’. It’s really that easy.

But you don’t just have to use keywords, you can also feed @TroveBot a url. Perhaps you want to find a thesis in Trove that is related to a Wikipedia page — just tweet the url together with the tag ‘#thesis’. Yes, really.

Behind the scenes @TroveBot makes use of AlchemyAPI to extract keywords from the url you supply. These keywords are then bundled up and shipped off to Trove for a response.

You probably know that @TroveNewsBot is similarly url-enabled. This allows him to do things like respond to the tweets of his friends @DigitalNZBot and @DPLABot, and offer regular commentary on the ABC News headlines.

So what would happen if @TroveBot and @TroveNewsBot started exchanging links? Would they ever stop? Would they break the internet?

With the Australian Tennis Open coming to a climax over the last few days I thought it was a suitable time to set up a game of bot tennis. To avoid the possibility of internet implosion, I decided to act as intermediary, forwarding the urls from one to the other. The rules were simple — the point was lost if a bot failed to respond or repeated a link from earlier in the match.

As a sporting spectacle the inaugural game wasn’t exactly scintillating. In fact, the first game is still locked at 40-40. But it’s been interesting to see the connections that were made. It’s also been a useful opportunity for me to find bugs and tweak their behaviours. You can catch up with the full, gripping coverage via Storify.

After a few more experiments I’m thinking I might try and set up a permanent, endless conversation between them. It would be fascinating to see where they’d end up over time — the links they’d find, the leaps they’d make.

Hmm, collective nouns… What about a serendipity of bots?

Have collection, will travel

A few years ago it seemed fashionable for cultural institutions to create a ‘My [insert institution name]’ space on their website where visitors could create their own online exhibits from collection materials. It bothered me at the time because it seemed to be a case of creating silos within silos. What could people do with their collections once they’d assembled them?

I was reminded of this recently as I undertook my Christmas-break mini project to think about Trove integration with Zotero. Some years ago I created a Zotero translator for the Trove newspaper zone, but I’d much rather we just exposed metadata within Trove pages that Zotero (and other tools like Mendeley) could pick up without any special code. More about that soon…

However, embedded metadata only addresses part of the problem; there are also questions around tools and workflows. Trove includes a number of simple tools that enable users to annotate and organise resources: tags, comments, and lists. Tags and comments… well you know what they are. Lists are just collections of resources, and like tags, they can be public or private.

They may not be terribly exciting tools, but they are very heavily used. More than 2 million tags have been added by Trove users, but it’s lists that have shown the most growth in recent times. There are currently more than 47,000 lists and 30,000 of those are public. That’s a pretty impressive exercise in collection building. What are the lists about? A few months ago I harvested the titles of all public lists and threw them into Voyant for a quick word frequency check.

Word frequencies in the titles of Trove lists

Given what we know about Trove users, it wasn’t surprising to see that many of the lists related to family history, but there are a few wonderful oddities buried in there as well. I love to be surprised by the passions of Trove users.

I suspect these tools are popular because they’re simple and open-ended. There are few constraints on what you can use for a tag or add to a list. Following some threads about game design recently I came upon a discussion of ‘underspecified’ tools — ‘the use of which you can never fully predict’. By underspecifying we leave open possibilities for innovation, experimentation and play. It seems like a pretty good design approach for the cultural heritage sector.

But wait a minute, you might be wondering, what sort of Trove Manager magic did I have to weave in order to extract those thousands of list titles? None, none at all. You could do exactly the same thing.

I’ve been talking a lot in recent months about Trove as a platform rather than a website — something to build on. One of our main construction tools is, of course, the Trove API. I suppose a good cultural heritage API is also underspecified — focused enough to be useful, but fuzzy enough to encourage a bit of screwing around. What you may not know about the Trove API is that as well as giving you access to around 300 million resources, it lets you access user comments, tags and lists.

I’m looking forward to researchers using the API to explore the various modes of meaning-making that occur around resources in Trove. But right now one thing it offers is portability — the collections people make can be moved. And that brings us back to Zotero.

Why should a Trove user have to decide up front whether they want to use Zotero or create a Trove list? Pursuing Europeana’s exciting vision of putting collections in our workflows we need to recognise that workflows change, projects grow, and new uses emerge. We should support and encourage this by making it as easy as possible for people to move their stuff around.

So of course I had to build something.

My Christmas project has resulted in some Python code that lets you export a Trove list or tag to a Zotero collection — API to API. Again it’s a simple idea, but I think it opens up some interesting possibilities for things like collaborative tagging projects — with a few lines of code hundreds of tagged items could be saved to Zotero for further organisation or annotation.

Along the way I ended up starting a more general Trove-Python library — it’s very incomplete, but it might be useful to someone. It’s all on GitHub — shared, as usual, not because I think the code is very good, but because I think it’s really important to share examples of what’s possible. Hopefully someone will find a slender spark of inspiration in my crappy code and build something better. Needless to say, this isn’t an official Trove product.

So what do you do if you want to export a list or tag?

First of all get yourself Python 2.7 and set up a virtualenv where you can play without messing anything up. Then install my library…

git clone
cd trove-python
python setup.py install

You’ll also need to install PyZotero. Once that’s done you can fire up Python and export a list from the command line like this…

from pyzotero import zotero
from trove_python.trove_core import trove
from trove_python.trove_zotero import export

zotero_api = zotero.Zotero('[Your Zotero user id]', 'user', '[Your Zotero API key]')
trove_api = trove.Trove('[Your Trove API key]')

export.export_list(list_id='[Your Trove list id]', zotero_api=zotero_api, trove_api=trove_api)

Obviously you’ll also need to get yourself a Trove API key, and a Zotero key for your user account.

Exporting items with a particular tag is just as easy…

from pyzotero import zotero
from trove_python.trove_core import trove
from trove_python.trove_zotero import export

zotero_api = zotero.Zotero('[Your Zotero user id]', 'user', '[Your Zotero API key]')
trove_api = trove.Trove('[Your Trove API key]')
exporter = export.TagExporter(trove_api, zotero_api)

exporter.export('[Your tag]')

What do you end up with? Here’s my test list on Trove and the resulting Zotero collection. Here’s a set of resources tagged with ‘inigo’, and here’s the collection I created from them. You’ll notice that I added a few little extras, like attaching pdf copies where they’re available.

Sorry, no GUIs, no support, and not much documentation. Just a bit of rough code and some ideas to play with.

8 months on

This has been a rather lean year on the blogging front. So as 2013 nears its end, I thought I should at least try to list a few recent talks and experiments.

Things changed a bit this year. No more am I the freelance troublemaker, coding in lonely seclusion, contemplating the mysteries of cashflow. Reader, I got a job.

And not just any old job. In May I started work at the National Library of Australia as the Manager of Trove.

Trove, of course, has featured prominently here. I’ve screen-scraped, harvested, graphed and analysed it — I even built an ‘unofficial’ API. Last year the NLA rewarded my tinkering with a Harold White Fellowship. This year they gave me the keys and let me sit behind the wheel. Now Trove is not only my obsession, it’s my responsibility.

Trove is a team effort, and soon you’ll be meeting more of the people that keep it running through our new blog. I manage the awesome Trove Support Team. We’re the frontline troops — working with users, content partners and developers, and generally keeping an eye on things.

And so my working hours are consumed by matters managerial — attending meetings, writing reports, planning plans and answering emails. But, when exhaustion allows, I return to the old WraggeLabs shed on weekends and evenings and the tinkering continues…


TroveNewsBot is a Twitter bot whose birth is chronicled in TroveNewsBot: The story so far. Several times a day he posts a recently-updated newspaper article from Trove. But he also responds to your queries — just tweet some keywords at him and he’ll reply with the closest match. You can read the docs for hints on modifying your query.

TroveNewsBot also offers comment on web pages. Tweet him a url and he’ll analyse its content and search for something relevant amidst his database of more than 100 million newspaper articles. Every few hours he automatically checks the ABC News Just In page for the latest headlines and offers a historical counterpoint.

In Conversations between collections you can read the disturbing story of how TroveNewsBot began to converse with his fellow collection bots, DPLABot and now DigitalNZBot. The rise of the bots has begun…

I should say something more serious here about the importance of mobilising our collections — of taking them into the spaces where people already are. But I think that might have to wait for another day.

Build-a-bot workshop

You can never have too many bots. Trove includes the collections of many individual libraries, archives and museums — conveniently aggregated for your searching pleasure. So why shouldn’t each of these collections have its own bot?

It didn’t take much work to clean up TroveNewsBot’s code and package it up as the Build-a-bot workshop. There any Trove contributor can find instructions for creating their own code-creature, tweeting their resources to the world.

So far Kasparbot (National Museum of Australia) and CurtinLibBot (Curtin University Library) have joined the march of the bots. Hopefully more will follow!

TroveNewsBot Selects

Inspired by the British Library’s Mechanical Curator, TroveNewsBot decided to widen his field of operations to include Tumblr. There at TroveNewsBot Selects he posts a new random newspaper illustration every few hours.


Unfortunately being newly-employed meant that I had to give up my place at One Week | One Tool. The team created Serendip-o-matic, a web tool for serendipitous searching that used the DPLA, Europeana and Flickr APIs. But while I missed all the fun, I could at least jump in with a little code. Within a day of its launch, Serendip-o-matic was also searching Trove.

Research Trends

This was a quick hack for my presentation at eResearch2013 — I basically just took the QueryPic code and rewired it to search across Australian theses in Trove. What I ended up with was a simple way of exploring research trends in Australia from the 1950s.

‘history AND identity’ vs ‘history AND class’

Some of the thesis metadata is a bit dodgy (we’re looking into it!) so I wouldn’t want to draw any serious conclusions, but I think it does suggest some interesting possibilities.

Trove API Console

As a Trove API user I’ve always been a bit frustrated about the inability to share live examples because of the need for a unique, private key. Europeana has a great API Console that lets you explore the output of API requests, so I thought I’d create something similar.

My Trove API Console is very simple at the moment. You just feed it API requests (no key required) and it will display nicely-formatted responses. You can also pass the API request as a query parameter to the console, which means you can create easily shareable examples. Here’s a request for wragge AND weather in the newspapers zone.
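To make the idea of a shareable example concrete, here is a minimal sketch in Python of how such a link could be put together. It is not the console's actual code: the console address, the `url` parameter name, and the shape of the Trove request are all assumptions made for illustration. The sketch removes the private key from an API request and then wraps what is left in a single query parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical console address and parameter name, for illustration only.
CONSOLE_URL = "https://console.example.org/"
REQUEST_PARAM = "url"


def strip_key(api_request):
    """Return the API request with any 'key' parameter removed."""
    parts = urlsplit(api_request)
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "key"
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


def shareable_link(api_request):
    """Wrap a key-free API request in a console link that is safe to share."""
    return CONSOLE_URL + "?" + urlencode({REQUEST_PARAM: strip_key(api_request)})
```

Because the key is dropped before the link is built, a console of this kind would have to add its own key on the server side when it forwards the request.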

This is also my first app hosted on Heroku. Building and deploying with Flask and Heroku was intoxicatingly easy.

Trove Zone Explorer

Yep, I finally got around to playing with d3. Nothing fancy, but once I’d figured out how to transform the faceted format data from Trove into the structure used by many of the d3 examples I could easily create a basic treemap and sunburst.
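The transformation step might look something like the following Python sketch. It assumes that each facet term is a dictionary with a `display` label, a `count` held as a string, and an optional nested `term` list; that shape is an assumption here rather than something taken from the Trove documentation, and the sample values are invented. The output follows the name/children/size convention used by many d3 hierarchy examples.

```python
def facet_to_hierarchy(name, terms):
    """Convert nested facet terms into a d3-style tree.

    Branches become {"name", "children"}; leaves become {"name", "size"}.
    """
    children = []
    for term in terms:
        sub_terms = term.get("term", [])
        if sub_terms:
            # A term with nested terms becomes a branch; its own count is dropped.
            children.append(facet_to_hierarchy(term["display"], sub_terms))
        else:
            children.append({"name": term["display"], "size": int(term["count"])})
    return {"name": name, "children": children}


# Invented sample data in the assumed facet shape.
sample_terms = [
    {"display": "Book", "count": "30", "term": [
        {"display": "Illustrated", "count": "10"},
        {"display": "Braille", "count": "2"},
    ]},
    {"display": "Map", "count": "5"},
]
tree = facet_to_hierarchy("format", sample_terms)
```

One design choice to be aware of: only leaves carry a size, so any items counted against a parent term but not against one of its sub-terms disappear from the visualisation.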


The sunburst visualisation was pretty nice and I thought it might make a useful tool for exploring the contents of Trove’s various zones. After a bit more fiddling I created a zoomable version that automatically loads a random sample of resources whenever you click on one of the outer leaves — the Trove Zone Explorer was born.

Trove Collection Profiler

As mentioned above, Trove is made up of collections from many different contributors. For my talk at the Libraries Australia Forum I thought I’d make a tool that let you explore these collections as they appear within Trove.

The Trove Collection Profiler does that, and a bit more. Using filters you define a collection by specifying contributors, keywords, or a date range. You can then explore how that collection is distributed across the Trove zones — viewing the results over time as a graph, or drilling down through format types using another zoomable sunburst visualisation. As a bonus you get shareable urls to pass around your profiles.

The latest sunburst-enabled version is fresh out of the shed and badly in need of documentation. I’m thinking of creating embeddable versions, so that institutions can create visualisations of their own collections and include them in their sites.


Somewhere in amongst the managering and the tinkering I gave a few presentations:

Conversations with collections

Notes from a talk I gave at the Digital Treasures Symposium, 21 June 2013, University of Canberra.

Over the last couple of weekends I’ve been building a bot. Let me introduce you to the TroveNewsBot.


TroveNewsBot is just a simple script that periodically checks for messages from Twitter, uses those messages to create queries in Trove’s newspaper database, and tweets back the results.
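The middle step, turning a message into a newspaper query, can be sketched in a few lines of Python. This is an illustration rather than TroveNewsBot's actual code. The endpoint and the `key`, `zone`, `q`, `encoding` and `n` parameters are assumptions based on version 2 of the Trove API and should be checked against the current documentation. Fetching mentions from Twitter and posting the reply are left out.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint; verify against the current Trove API documentation.
TROVE_API = "https://api.trove.nla.gov.au/v2/result"


def tweet_to_query(tweet_text):
    """Drop @mentions and collapse whitespace, leaving the search terms."""
    without_mentions = re.sub(r"@\w+", " ", tweet_text)
    return " ".join(without_mentions.split())


def build_search_url(tweet_text, api_key):
    """Build a newspaper search URL from the text of a tweet."""
    params = {
        "key": api_key,           # the private API key
        "zone": "newspaper",      # assumed zone name
        "q": tweet_to_query(tweet_text),
        "encoding": "json",
        "n": 1,                   # one result is enough for one reply
    }
    return TROVE_API + "?" + urlencode(params)
```

A bot built this way would call `build_search_url` for each new mention, fetch the result, and tweet the article's title and link back to the sender.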

TroveNewsBot’s birth was, however, not without some pain. I ran into difficulty with Twitter’s automated spam police. At one stage every time my bot tweeted, its Twitter account was suspended.

Twitter’s bots didn’t like my bot. [:(]

The problem has since been resolved — I think I must have done something when I was testing that upset the spam bots — but it did lead me to read in detail Twitter’s policies on spam and automation. This sentence in particular caused me to reflect:

The @reply and Mention functions are intended to make communication between users easier, and automating these processes in order to reach many users is considered an abuse of the feature.

So what is a user and what is communication? I read this sentence as suggesting that communications between individual human users were somehow more real, more authentic than automatically generated replies. But is a script tweeting someone a link to a newspaper article that they might be interested in really less authentic than a lot of the human-generated traffic on the net?

Amongst the messages I received when I revealed TroveNewsBot to the world earlier this week was this:

And later from the same person:


Even as we live an increasing amount of our lives ‘connected’, still there remains a tendency to assume that experiences mediated through online technologies are somehow less authentic than those that take place in this space that we often refer to as ‘the real world’.

In the realm of cultural heritage, digitisation is frequently assumed to be a process of loss. We create surrogates, or derivatives — useful, but somehow inferior representations of ‘the real thing’.

Now let’s just all admit that, yes, we like the smell of old books, and that we can’t read in the bath with our iPad, and move beyond the sort of fetishism that often accompanies these sorts of discussions.

Yes, of course, digital and physical manifestations are different. The point is whether we get anywhere by arguing that one is necessarily inferior to the other.

A recent article in the Times Literary Supplement expressed concern at the money being spent on the manuscript digitisation programs that the author argued were ‘proceeding unchecked and unfocused, deflecting students into a virtual world and leaving them unequipped to deal responsibly with real rare materials’. Yes, there may be aspects of a physical page that a digital copy cannot represent, but as Alistair Dunning pointed out in response to the article, there’s no simple binary opposition:

The digital does not replace the analogue, but augments it, sometimes in dramatic and sometimes in subtle ways.

In his keynote address to the ‘Digital Transformers’ conference, Jim Mussell similarly argued against a simplistic understanding of digital deficiencies.

The key is to reconceive loss as difference and use the way the transformed object differs to reimagine what it actually was. Critical encounters with digitized objects make us rethink what we thought we knew.

I’m very pleased to be an adjunct here at the University of Canberra, but I’ve always felt a bit of a fraud around people like Mitchell when it comes to talking about visualisation. I’m actually much more comfortable with words than pictures. So why am I here talking to you today?

I think it’s because what we’re discussing today, what the Digital Treasures Program is about, is not just visualisation. It’s about transformation. It’s about taking cultural heritage collections and changing them. Changing what we can do with them. Changing how we see them. Changing how we think about them.

It’s about creating spaces within which we can have ‘critical encounters with digitized objects’ that ‘make us rethink what we thought we knew’.

And that to me is very exciting.

What might these transformations look like? Who knows? This is research, it should take us places we don’t expect and can’t predict.

However, for the sake of convenience today I’ve tried to define a few possible categories — most, admittedly, based on my own work. But I do so in the hope that the achievements of the Digital Treasures program will soon make my categories look ridiculously inadequate.


When we have stuff in digital form — and by stuff I mean both collection metadata and digital objects — we can isolate particular characteristics and add them up, compare them, graph them. We can start to see patterns that we couldn’t see before.


Putting a lot of similar things together in a way that enables us to see them differently.


Putting different things together in a way that enables us to find connections or similarities.


Displaying something unexpected or random.


Putting things in new contexts, new conceptual spaces, new physical spaces, new geospatial spaces. Creating interventions and explorations.

This is a very limited catalogue of possibilities, but meagre as it is I think it’s enough to demonstrate that the overwhelming feature of digital cultural collections is not loss or deficiency, but opportunity and inspiration.

In fact I’m less worried about the deficiencies of digital representation than I am about the possibility that we might end up doing too much — that we might become so skilled in design and transformation that we end up overdetermining the experience of our users, that we end up doing too much of the thinking for them.

It seems to me that when it comes to digital cultural collections an important part of the transformation process is knowing where to leave the gaps and spaces that invite feeling, reflection and critique. We have to find ways of representing what is missing, of acknowledging absence and exclusion. We have to be able to expose our arguments and assumptions, to be honest about our failures and limitations. We have to be prepared to leave a few raw edges, some loose threads that encourage users to unravel our carefully-woven tapestries.

As I was developing the TroveNewsBot I realised I needed some sort of avatar. So of course I started searching in the Trove newspaper database for robots — there I found Mr George Robot.

The Courier-Mail, 7 November 1935, page 21

George Robot, ‘described as the greatest electro-mechanical achievement of the age’, toured Australia in 1935 and 1936. As one newspaper described, he:

rises and sits as requested, talks, sings, delivers an address on the most abstruse topics, gnashes his electric teeth in rage or derision, while he accentuates his remarks by the most natural movements of arms and hands

But it wasn’t just George’s technical sophistication that inspired comment. Articles also appeared that described George’s love for Mae West and his admiration for Hitler.

Robots provide us with an opportunity not just to marvel at their technological wizardry, but also to think about what it really is to be human.

In the same way, as we start to have new types of conversations with online collections, to explore their many-faceted personalities, we will of course be exploring ourselves.

The digital transformation of cultural collections is not about showcasing technology but about creating new online spaces in which we can simply be human.