discontents

The site has been archived…

Tim Sherratt — Mon, 13 Sep 2021 04:25:43 +0000

This blog has been archived and will no longer be updated.

Go to my updates feed to find out what I’m up to.

2017 — the making and the talking

Tim Sherratt — Sat, 30 Dec 2017 06:16:32 +0000

Time to take stock of the year that was…

2017 certainly had its ups and downs. In the middle of the year, my university’s management decided that they could do without undergraduate teaching in cultural heritage. We managed to hold on, but it’s the sort of thing that makes you wonder why you bother.

On the flipside, the highlight of the year was working with my undergraduate class in Exploring Digital Heritage to create and promote a site for the transcription of records used in the administration of the White Australia Policy. It’s something Kate and I have been talking about for many years, so it was a rather emotional experience to see it finally happen. But it wasn’t just about the website — it was about the way my students responded to the records (many of them knew little about the White Australia Policy before we started); it was about seeing the records on the ABC news; and it was about seeing the faces of the people who lived under the weight of the White Australia Policy projected onto the walls of Old Parliament House. That’s why we bother.

Also in 2017 (and somewhat related to the above) I decided to adjust my working arrangements in an attempt to find some balance. You can now find me over on Patreon.

The making

Closed Access 2017 update! Complete dataset of records held by the National Archives of Australia that had the access status of ‘closed’ (withheld from public access) on 9 January 2017.
Wrestled with RecordSearch to extract summaries for all series. Along the way I found out some interesting stuff about the use of functions. Putting the two together, I tried to build a picture of what NAA holdings look like through the perspective of government functions.
Played around with Twarc to create a collection of #invasionday tweets. Later in the year I captured and shared tweets with the tag #australianvalues.
Have you ever wondered how the cut and thrust of parliaments past might translate to the world of social media? Wonder no longer, for here you can explore interjections in the Australian parliament from 1901 to 1980, reimagined as tweets. You might even find some emoji…
Code and work-in-progress website exploring the language of Hansard.
Went back to basics with LODBook — now developing a series of components using Jekyll and javascript.
The Redaction Zoo — this collection of creatures was discovered amidst thousands of ASIO surveillance files held by the National Archives of Australia. While the practice of redaction is intended to withhold information from public view, an unknown archivist has used redactions to add an artistic flourish to the files. They are reminders that the processes that limit our access to information are human in their operation and design. There is nothing magical about the ‘secrets’ preserved in government archives. Video first exhibited as part of ‘Beauties and Beasts’, Belconnen Arts Centre, 6-28 May 2017.
Here’s a repository of images in JPG and SVG format drawn from a collection of #redactionart discovered in ASIO surveillance files held by the National Archives of Australia. Use them to create your own #redactionart projects!
I finally added a decent full-text search facility to Historic Hansard. Yay! Search for speeches or bills. Filter by date, house, and speaker. There’s even an option to save your complete result set as a CSV file for further analysis and exploration.
The Tribune was a newspaper published by the Communist Party of Australia. The State Library of NSW holds more than 60,000 negatives and photos from the Tribune which document a wide range of political events and social issues from 1964 to 1991. This is a work-in-progress site documenting my exploration of the negatives as a DXLab Digital Drop-in.
Help transcribe records that document the lives of ordinary people living under the restrictions of the White Australia Policy. The Real Face of White Australia site was built using the Scribe Framework with the help of students from my Exploring Digital Heritage class.
Explore regular updates of data generated through the Real Face of White Australia project. Includes CSV files with transcribed fields, as well as photos and handprints.
There’s lots of exciting new digitised content being added to Trove’s journals zone, but it’s not always easy to find and search. I made an app that lists journals that have been digitised by the NLA and have searchable records for individual articles. This means you can search inside the journal, just like you do in the newspapers zone. The code for the harvester and web app is on GitHub.
Demonstrated how to harvest Parliamentary press releases from Trove. The repository includes a sample dataset of politicians talking about refugees.
Kicked off my new 101 Digital Heritage Hacks site with a userscript to add a ‘copy permalink’ button to Trove.
I made a little bot to post stories from the Real Face of White Australia data to the @invisibleaus Twitter account.
Created two new DIY Trove bots using Glitch. The Trove Tag Bot tweets items with a specific tag, while the Trove List Bot tweets items from one or more lists. Both bots come with detailed instructions and, because they’re hosted on Glitch, it’s easy for anyone to modify the code and launch their own bots.
Added a battle mode to Headline Roulette. It’s really just a way of generating a URL to a specific newspaper article in Trove. This means groups can compete on a level playing field, but it could have other uses. You could create a curated set of challenges for students, or use them in something like a scavenger hunt.

The talking

13 February 2017 – Digital humanities workshop at ALIAOnline.
24 February 2017 – Keynote presentation at HDR Summer School, Deakin University, The practice of play, <https://doi.org/10.6084/m9.figshare.4696258>
1-3 March 2017, attended ‘Always Already Computational: Library Collections as Data’ national forum in Santa Barbara. Read the position papers and the final statement.
20 April 2017 – Public lecture and workshop, University of Wollongong.
4 May 2017 – Presentation with Professor David Lowe, ‘The Political Language of the 1970s in Australia’, at The 1970s: Australian and Indian Perspectives on a Decade of Transition, Deakin University.
23 June 2017 – Workshop, Random acts of meaning: Digital skills for a post-truth world, NLS8.
4 August 2017, ‘The struggle for access’, presentation at DHPathways 2017, Canberra.
10 August 2017 — ‘Doing DH’ presentation for ACT teacher librarians.
13 October 2017 — ‘Trove tips and tricks’, at the Deniliquin Family History Expo.
8 November 2017 — ‘Making this happen’, presentation at Collaborating around Collections, ANU.
20 November 2017 – ‘“The badge of the outsider”: Open Access and Closed Boundaries’, invited presentation at Sharing is Caring 2017, Aarhus. Also put together a last-minute ‘Hack the world’ workshop.
4 December 2017 — Digital methods workshop for the UTS Legal History group

‘The badge of the outsider’: open access and closed boundaries

Tim Sherratt — Fri, 24 Nov 2017 02:53:33 +0000

Presented at Sharing is Caring 2017, 20 November 2017, in Aarhus, Denmark.
You can also watch the video.

In 1946, Britain decided that Australia would be the perfect place to test missiles. The Australian government, keen to play its part in the defence of the Empire, readily agreed. Ignoring, yet again, the presence of Australia’s Indigenous peoples, defence planners thought Australia was attractive because it was ‘empty’, flat, and far from ‘prying eyes’.

The town of Woomera was built in the South Australian desert to house scientists, workers, and military personnel. It was a town where no housewife could go to the shop without her security pass; where curiosity was ‘the badge of the outsider’.

But while Australia’s land seemed ideal for secret military operations, its people remained suspect. Britain’s plans were threatened by concerns about Communist infiltration of the Australian government. Under pressure from the UK and USA, Australia sought to lift its spy game through the establishment of a new agency to monitor such threats — the Australian Security Intelligence Organisation, or ASIO.

Legislation defined ASIO’s functions in very broad terms, ‘to obtain, correlate and evaluate intelligence relevant to security’. From the 1950s to the 1970s, this was used to justify surveillance of a wide range of potential ‘subversives’ — not just known Communists, but writers, artists, academics, scientists, Indigenous activists, and more. Many thousands of files were created to document their beliefs, activities, connections, and personal lives. Recordkeeping was critical to the practice of state surveillance.

We don’t know how many files were kept on ordinary Australians because ASIO is exempt from many of the key provisions of the government’s archives legislation. Unlike other agencies, ASIO does not routinely transfer records or indexes to the National Archives of Australia. Researchers have to go on a fishing expedition, asking the National Archives to ask ASIO whether they might have a file relating to a particular person or organisation. If ASIO admits it has a relevant file in the open period (more than twenty years old), the file goes through an ‘access examination’ process to determine whether it contains information that should be withheld for reasons of national security, or individual privacy. If anything is left, it is finally opened to public access.

Despite these hurdles, more than 12,000 ASIO surveillance files have been made public, though most include redactions — black boxes obscure words too sensitive to be read.

The files have been used in biographies, family histories, and studies of Australia’s literary community. One recent book invited the subjects of ASIO surveillance to reflect on the contents of their own files — to see their lives through a different set of eyes; to explore the intrusions and innuendo that passed for ‘intelligence’.

I’m currently working with a set of 60,000 photographs held by the State Library of New South Wales. These photos were taken for The Tribune, a Communist Party newspaper published in Sydney, and document protest and political activity in Australia from the 1960s to the 1990s. One of the things I’m interested in is finding overlaps between the Tribune photographs and ASIO surveillance files. For example, in February 1972 there was a demonstration on Indigenous rights held outside Parliament House. Because both sets of records have been digitised and made public, we can compare perspectives — spies versus ‘subversives’.

This is a reminder that the impact of digitisation is not simply easier, more immediate, access. We can also see the same things differently. We can interrogate the meaning of access itself.

RecordSearch, the National Archives of Australia’s online database, provides access to about 64,000 series descriptions, 11 million item descriptions, and 1.8 million digitised pages. There’s currently no API, or downloadable datasets at item level, so I make my own.

For six or seven years now I’ve relied on my own little library of screen scrapers to get data out of RecordSearch. They’re slow and they break easily, but they do the job.

Late last year I embarked upon what was probably my most ambitious data harvest. I gathered information about every series listed on RecordSearch and calculated, for each, the quantity of records (in linear metres), the number of individual items described, and the number of items digitised. I then aggregated the series by the top-level functions of agencies associated with them. Basically I grouped them by subject — defence, security, education etc.
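For readers who want to try a similar aggregation on their own harvested data, here is a minimal sketch in Python. It is not the harvesting code described in this talk: the field names (`functions`, `metres`, `items_described`, `items_digitised`), the series identifiers, and all the figures are invented for illustration, and it assumes that a series linked to more than one function is counted under each of them.

```python
from collections import defaultdict

# Hypothetical series summaries; identifiers, field names and numbers are invented.
series = [
    {"id": "S1", "functions": ["defence"], "metres": 120.5,
     "items_described": 4000, "items_digitised": 3500},
    {"id": "S2", "functions": ["security"], "metres": 80.0,
     "items_described": 1200, "items_digitised": 300},
    {"id": "S3", "functions": ["defence", "education"], "metres": 10.0,
     "items_described": 500, "items_digitised": 0},
]

def aggregate_by_function(series_list):
    """Sum quantity, described items and digitised items for each function.

    Assumption: a series linked to several functions counts towards each one.
    """
    totals = defaultdict(
        lambda: {"metres": 0.0, "items_described": 0, "items_digitised": 0}
    )
    for s in series_list:
        for function in s["functions"]:
            for key in ("metres", "items_described", "items_digitised"):
                totals[function][key] += s[key]
    return dict(totals)

totals = aggregate_by_function(series)
for function, summary in sorted(totals.items()):
    print(function, summary)
```

Comparing the `metres` totals with the `items_digitised` totals for each function is what makes the gap between holdings and digitised holdings visible.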

Why?

Because digitisation shapes our perceptions of reality. The more we have in digital form, the easier cultural heritage collections are to find and use, the more likely we are to assume that everything (or at least everything important) is online. Ease of access bears an ontological weight — if we can’t find it online, does it exist?

Now that might not be a problem if what was digitised somehow provided a representative sample of the whole. But we all know how such decisions are shaped by political priorities, funding opportunities, user demand, public events, and happy accidents. There’s nothing necessarily wrong with that, it’s just the environment within which we work. There are never enough resources. We have to do what we can, when we can.

The problem is, we rarely expose the impact of these decisions to the users of our digital collections. We rarely give them the chance to reflect on how our decisions shape their assumptions.

The National Archives of Australia documents the workings of our democracy. It offers one important perspective on who we are as a nation. If we look at the quantity of records associated with each top-level function, we see a fairly even distribution. Nothing stands out.

By quantity (linear metres)

But what happens when we view the activities of government through the number of files digitised in each subject area?

By number of items digitised

The prominence of defence is really no surprise. Service records are heavily used by family historians, and in 2007 the Australian government funded the digitisation of all 375,000 World War I service records in what was branded as ‘A Gift to the Nation’.

The National Archives is not alone. I often show people this graph of the number of digitised newspaper articles in Trove, pointing out the fairly dramatic peak around 1914. Did something happen in 1914? Were there more articles published, more newspapers? No, there’s just more money. In the lead up to the centenary of WWI, funding was directed towards the digitisation of newspapers from the wartime period.

Again, there’s nothing wrong with this. It’s just that these biases are not obvious to someone typing queries into a search box. In the context of Australian history, these decisions around digitisation help to reinforce the long-held belief that Australian national identity was somehow forged on the battlefields of WWI. It helps to put war at the centre of our history, at the centre of who we are.

But of course while digitisation can shape our assumptions, it also gives us new opportunities to critique them. I could only analyse the holdings of the National Archives because their collection data is online. We don’t have to take just what the search box delivers — we can ask our own questions. But this is only possible if people have the skills, the tools, and the confidence to poke around in the data. This too is access. Institutions should invite the public not to swoon at their digital delights, but to hack away at difficult questions — not to see collections, but to see them in unexpected and challenging ways.

I mentioned that ASIO files go through a process known as ‘access examination’ before they’re released to the public. This is the case for all records more than twenty years old, not just the super secret ones. The vast majority of files are simply opened without restriction. Some, including most of the ASIO files, are opened ‘with exceptions’ — pages can be withheld, and text redacted. A few are withheld from the public completely. They have entries in RecordSearch, but you can’t see them — their access status is officially ‘closed’.

But because the metadata about access decisions is available online, we can start to build a picture of what we’re not allowed to see.

At the start of 2016, I harvested the details of all files in the National Archives of Australia with the access status of ‘closed’. I’ve aggregated and sliced the data in a number of different ways, so you can explore the age of the files, what series they came from, and when decisions were made about their access status. At any point you can drill down to a list of the files you cannot see — making it perhaps the most frustrating search interface ever devised.

Reasons why files are closed

You can also examine the reasons why files have been withheld. Many of these exceptions are defined by the legislation that established the National Archives. Clause 33(1)(a), for example, relates to national security, 33(1)(g) is concerned with individual privacy. But the metadata reveals that files are withheld for a number of other reasons, such as ‘Pre Access Recorder’ and ‘Withheld Pending Advice’. There’s also, you might note, a category entitled ‘MAKE YOUR SELECTION’ — which reveals something about the limits of the data entry interface.

By poking around in the data you can make some guesses as to how these additional categories are used. ‘Pre Access Recorder’ is used as a catch-all for records that were withheld from public access before the archives legislation was passed. ‘Withheld Pending Advice’ is used to label files that have been sent off to other government agencies for their assessment — they’re not yet finally closed, but as this process can take years, they’re sort of closed. Indeed, my interface shows that 1,467 files have been waiting more than three years for advice.
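A count like the one above can be reproduced from harvested metadata with a few lines of Python. This is only a sketch under stated assumptions: the record structure (`title`, `reasons`, `decision_date`), the sample files, and the harvest date are all hypothetical, and three years is approximated as 3 × 365 days.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical closed-file records; field names and values are invented.
closed_files = [
    {"title": "File A", "reasons": ["Withheld Pending Advice"],
     "decision_date": date(2012, 3, 1)},
    {"title": "File B", "reasons": ["33(1)(a)", "33(1)(g)"],
     "decision_date": date(2010, 6, 15)},
    {"title": "File C", "reasons": ["Withheld Pending Advice"],
     "decision_date": date(2015, 11, 20)},
]

def waiting_longer_than(files, harvest_date, years=3):
    """Return files marked 'Withheld Pending Advice' for more than `years` years."""
    cutoff = harvest_date - timedelta(days=365 * years)  # approximate years
    return [
        f for f in files
        if "Withheld Pending Advice" in f["reasons"]
        and f["decision_date"] < cutoff
    ]

# Tally how often each reason is cited across all closed files.
reason_counts = Counter(r for f in closed_files for r in f["reasons"])

harvest_date = date(2016, 1, 1)  # assumed harvest date, for illustration only
overdue = waiting_longer_than(closed_files, harvest_date)
print(reason_counts.most_common())
print([f["title"] for f in overdue])  # prints ['File A']
```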

The point of this is not to embarrass the National Archives, nor the Department of Foreign Affairs and Trade which holds the most files in limbo. The point is to examine the ways in which access itself is constructed. Legislation defines an ideal, but the reality is more messy and human. By tracking patterns in the way access decisions are made we can explore the historical processes at work. Access is not allowed, it is made.

Remember those 12,000 ASIO files publicly available through the National Archives? You might not be surprised to know that I’ve harvested them all — both the metadata and the 300,000 digitised pages. There’s about 70 GB of images.

Using these files we can dig a little deeper into the nature of access. I wrote a computer vision script to find redactions. It took a lot of trial and error, and I’m about to start work on a smarter version that incorporates machine learning, but it did the job. From one series of ASIO files, about 230,000 pages, I extracted 239,000 redactions — lots and lots of little black boxes. You’ll be pleased to know that not only can you download the complete set of redactions from the research repository Figshare, you can browse them. All of them! Hours of fun for all the family!
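To give a sense of what ‘finding redactions’ can involve, here is a toy sketch in plain Python. It is not the script described above, and makes no claim about how that script works. It assumes a page has already been loaded as a grid of greyscale values, and uses a simple flood fill to collect dark regions, keeping only those that are reasonably large and almost completely fill their bounding box. The function name, thresholds, and sample ‘page’ are all invented; a real version would work on scanned images using an image-processing library.

```python
from collections import deque

def find_dark_boxes(pixels, threshold=50, min_area=6, min_fill=0.9):
    """Return bounding boxes (top, left, bottom, right) of solid dark regions.

    pixels: 2D list of greyscale values (0 = black, 255 = white).
    A region is kept if it has at least `min_area` dark pixels and those
    pixels fill at least `min_fill` of its bounding box (it looks like a box).
    """
    height, width = len(pixels), len(pixels[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or pixels[y][x] > threshold:
                continue
            # Flood fill (4-connected) to collect one dark region.
            queue = deque([(y, x)])
            seen[y][x] = True
            region = []
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and pixels[ny][nx] <= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            top = min(p[0] for p in region)
            bottom = max(p[0] for p in region)
            left = min(p[1] for p in region)
            right = max(p[1] for p in region)
            box_area = (bottom - top + 1) * (right - left + 1)
            if len(region) >= min_area and len(region) / box_area >= min_fill:
                boxes.append((top, left, bottom, right))
    return boxes

# A tiny made-up "page": one solid 2x4 box and two stray dark pixels.
W, B = 255, 0
page = [
    [W, W, W, W, W, W, W, W],
    [W, B, B, B, B, W, W, W],
    [W, B, B, B, B, W, B, W],
    [W, W, W, W, W, W, W, B],
    [W, W, W, W, W, W, W, W],
]
boxes = find_dark_boxes(page)
print(boxes)  # prints [(1, 1, 2, 4)]
```

The fill ratio is what separates a solid black box from handwriting or a dark smudge of similar size. It is also the kind of rule that would let through something odd-shaped, which is why reviewing the results for false positives matters.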

redacted

The interesting thing about this interface is that if you click on a redaction you can view the page that it was extracted from. So it’s sort of an inside out discovery interface. Instead of the redactions being a brick wall or a dead end, they’re a starting point. A practice intended to remove information, to limit access, becomes a gateway for exploration. Indeed, the redactions themselves provide an identifiable data point — something that can be analysed to turn the gaze of government surveillance upon itself.

But something else was hiding in those ASIO files. As I was reviewing the collection of redactions for false positives I discovered that someone tasked with the removal of information had decided to add a little creative flair.

I discovered #redactionart.

I assure you that these creations really are sitting inside ASIO files held by the National Archives. But since I’ve discovered them, they’ve developed a life of their own. Not only can you browse through them online, you can wear them.

I gave away about 80 of these badges at an exhibition earlier this year. To create the badges, I simply traced around the original images and saved the results as SVG files. These files themselves are shared through GitHub for anyone wanting to create their own #redactionart.

This amazing #redactionart dress was made by Bonnie Wildie, a librarian in NSW. My SVG files have also been turned into a set of 3D printable cookie cutters, as well as a range of t-shirts and stickers on RedBubble.

This escape from the archives is not only creative and fun, it’s important.

It’s important because it emphasises that the practices through which government information is controlled and withheld are profoundly human. People make decisions and they leave their marks. There is nothing mysterious or otherworldly in the secret — it is an exercise of power.

Archives are not just made of documents — there are people inside.

‘Surveillance’ is not included in the National Archives’ official thesaurus of government functions, yet the movements and activities of individuals are recorded in many thousands of files across an assortment of agencies.

A simple query of my harvested data reveals that the phrase ‘alien registration’ appears in the titles of only 29 series. But these series contain more than half a million files. 4.7% of digitised files in the National Archives document the movements of so-called ‘aliens’. While these registration systems were created during wartime, they lingered beyond. And they were not the only means of keeping track of potential threats. Just as at Woomera, boundaries were drawn, and outsiders marked for attention.
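A query of this kind needs very little code. The sketch below shows the general shape, assuming the harvested series summaries are available as a list of records. The titles, field names (`items_described`, `items_digitised`), and numbers are invented, so the output bears no relation to the real figures quoted above.

```python
# Hypothetical series summaries; titles and counts are invented for illustration.
series = [
    {"title": "Alien registration forms, Victoria",
     "items_described": 1000, "items_digitised": 40},
    {"title": "Correspondence files, annual single number series",
     "items_described": 5000, "items_digitised": 900},
    {"title": "Alien Registration certificates",
     "items_described": 2000, "items_digitised": 60},
]

def matching_series(series_list, phrase):
    """Return series whose title contains the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return [s for s in series_list if phrase in s["title"].lower()]

matches = matching_series(series, "alien registration")
files_in_matches = sum(s["items_described"] for s in matches)
digitised_in_matches = sum(s["items_digitised"] for s in matches)
digitised_total = sum(s["items_digitised"] for s in series)
share = 100 * digitised_in_matches / digitised_total

print(len(matches), files_in_matches, f"{share:.1f}%")  # prints 2 3000 10.0%
```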

When I was last in this part of the world, I talked a bit about some work that Kate Bagnall and I had done with records of the White Australia Policy held by the National Archives.

A quick recap — when the Australian colonies federated in 1901, it was generally assumed that the new nation’s future could only be assured through strict racial homogeneity. A ‘white’ Australia was a strong Australia. Legislation was quickly passed to restrict immigration and set the foundations for what became known as the White Australia Policy.

However, in 1901 there were around 40,000 people living in Australia whose background was neither European nor Indigenous — they were Chinese, Japanese, Syrian, Indian, and Malay. Some had been born in Australia, or had lived there for many years — raising families, building businesses; just living their lives.

If any of these people wanted to travel overseas they had to carry special documents, or they might not be allowed to return home. Customs officials at Australian ports would ask anyone who seemed not to be ‘white’ for identification. The badge of the outsider was the colour of their skin.

An example of a Certificate Exempting from Dictation Test, NAA: ST84/1, 1909/21/91-100, p. 35-6

Many thousands of these documents, the remnants of a racist bureaucratic system, are preserved in the National Archives.

Back in 2011, I downloaded about 12,000 of these documents from RecordSearch and ran them through a facial detection script to create a seemingly endless scrolling wall of faces. We called it ‘The Real Face of White Australia’. It’s another inside-out interface — instead of showing the files, you see the people inside.

That was then, this is now! In the last few months, I’ve been working with a group of my digital heritage students to develop a website for the collaborative transcription of these same records. We want to put names to the faces. We want to chart their journeys. We want to document their lives.

Our project has no funding, and was only possible because Zooniverse and the New York Public Library created and shared Scribe, a framework for the transcription of structured documents — an easy way to get usable data out of forms, ledgers, and certificates.

The site was launched at a ‘transcribe-a-thon’ held at the Museum of Australian Democracy in Canberra, which just happens to be located in Australia’s first parliament house. The building didn’t exist when the Immigration Restriction Act was passed in 1901, but it was where the White Australia Policy was elaborated and maintained.

Busy transcribers at the Museum of Australian Democracy

Transcription continues. There’s still much work to do on the documents, but data is already flowing. I’m making regular dumps available for download through a GitHub repository.

But it was never just about the data. Many more people now know that these records, this history, exists. Through the process of transcription you are confronted by the disturbing reality of the records — you’re surprised, puzzled, shocked, and often moved. Creating a space for these sorts of experiences is important in itself.

The Museum of Australian Democracy not only gave us their building for a weekend, they let us play with their data projectors. In some ways, I would have been happy if all we had achieved was this — to put these faces in this space.

Once again the gaze of surveillance is reversed. In the home of Australian democracy, people whose lives were monitored under a racist system of exclusion and control were looking at us, asking questions of us.

Amongst those Tribune photos at the State Library of NSW, I recently found this. Believe it or not, I’m the spy on the right. This compelling piece of street theatre was performed at the gates of Pine Gap, a US electronic surveillance facility right in the centre of Australia. Pine Gap’s lease was due for renewal in 1987, so hundreds of protestors converged on the site, hoping that the Australian government might withdraw its support. Needless to say, it didn’t. Pine Gap remains, and in recent times has been implicated in US drone strikes.

I found another photo of myself amongst the Tribune archives. A group of us climbed over the outer perimeter fence in the middle of the night and took up positions on a rocky outcrop that overlooked the main gate. At a predetermined time, we leapt out of our hiding places and lit smoke flares. I was arrested soon after, charged with trespass, and fined $100.

Another group of Pine Gap protestors are currently on trial in Australia. They made it through the protective fences and dared to play music and pray. For this they have been charged under the Defence (Special Undertakings) Act which carries a maximum sentence of seven years in prison. This Act was passed in 1952 when Britain decided to expand its weapons testing program in Australia to include atomic bombs. It expanded upon earlier legislation that had been intended to protect Woomera from Communist interference. This is one of the very few times anyone has been charged under the Act, despite there being hundreds of arrests like mine in the past.

As security services gain new powers, and electronic surveillance expands, it’s hard not to see the Pine Gap proceedings as an attempt to discourage criticism of the government’s tough on terrorism stance.

At a recent symposium on collaboration between researchers and collecting institutions, Seb Chan described some of the advances that had taken place in opening up collections, but then asked ‘So what?’.

I suppose that’s the question we’re here to discuss. Why do we put all this effort into digitising collections, building interfaces, and sharing data? Easier access is great, beautiful interfaces are cool, but… so what? For me, as a historian, hacker, and sometime heritage professional, the answer is straightforward — it’s all about bringing the past into conversation with the present. It’s about mobilising our collections as critical resources in debates about who we are, what matters, and why we should care.

Transcribe-a-thon poster designed by Emily Fry

Those inky, black handprints on the White Australia records moved one of my students to reflect on her experience as a recent immigrant from Canada, required by the Australian government to supply a set of her fingerprints. She wrote a beautiful talk and presented it during the transcribe-a-thon in the original House of Representatives chamber at Old Parliament House. Another student noted in her final essay that the documents made non-white residents seem like criminals, pointing to parallels with the current treatment of refugees. On the flip-side, our efforts attracted the attention of a few racist trolls, one of whom referred to the White Australia Policy as ‘the good old days’.

Once again ‘outsiders’ are being targeted as threats to our security. Boundaries are being reinforced, and efforts being made to define who belongs. We know this. We’ve seen this before. Europeana’s new project on the history of migration is an important initiative — we need to tell our stories, share our resources, grapple with our difficult and painful pasts. I don’t think this is a time to reassert the authority of our cultural institutions as reservoirs of truth. We are implicated in all of this. Our collections are built upon systems of surveillance, on attempts to put humans into categories. They are products of power and privilege. We are not the guardians of enlightenment, we are the keepers of horrors.

Just like #redactionart, the value of our collections lies in their complexity and contradictions — in their very humanity and all the confusion that entails. Digital collections lend themselves to an exploration of complexity. We can shift scales and perspectives, we can manipulate contexts, we can set collections loose in public spaces, we can turn them inside out. We can see differently, but perhaps more importantly, we can feel differently.

When you think about it, ‘impact’ is a pretty violent sort of word. There are perhaps a few people around the world we’d like to ‘wallop’ with our digital collections. But I suspect most of the time we’re after something more subtle — to expand possibilities, to undermine assumed certainties, maybe even to expose a glitch in the Matrix.

Perhaps we can offer a glimpse of an alternative reality, where we recognise the outsider as us.

Maybe in the end we will be able to see the outsider as us. Hear hear! #ShareCare17 pic.twitter.com/TWWYqyvCDg

— Johanna Berg (@johannaberg) November 20, 2017

Finding a balance

Tim Sherratt — Thu, 02 Nov 2017 11:20:59 +0000

As you can see, my blog has been rather quiet this year. This is mainly because I’ve tried to scale back on talks and presentations. I’ve also been posting elsewhere — listing projects on my portfolio site, documenting workshops in my digital heritage handbook, and adding bibs and bobs to my research notebook. I’m glad that I’ve managed to say ‘no’ to a few things. With two academics in the house it can be a bit of a challenge coordinating schedules to ensure that there’s at least one adult at home with the kids.

But no matter how I try to adjust my working arrangements, I struggle to get the balance right. The thing I feel I should be doing is making things — useful things, weird things, political things; things that help people see our history and our cultural heritage collections in different ways. I really don’t care if it’s classed as research, as infrastructure, or as a ‘service’. I just think it’s important to help people get a broader sense of the possibilities brought by digital technologies — to be creators, not just consumers.

But how do I find the time? And earn a living?

In yet another attempt to shift the balance I’ve recently adjusted my contract position at the University of Canberra to work part-time, three days per week. I’m hoping to spend the other two days working on a variety of projects, both paid and unpaid — so yes, I’m available for hire. But I also want to explore other funding models. Why should digital research infrastructure always be big, national, centralised, and expensive? Are there ways of supporting the creation of useful tools that aren’t dependent on large grants and multi-institutional partners? Let’s find out.

A couple of weeks ago I set up a Patreon site. If you use any of the tools I’ve created, or think that what I do is interesting, perhaps you’d like to become a supporter. My first goal is to try and cover the hosting costs of my various online projects. My current job is not secure, and I don’t want to be in a position of having to turn things off because I can’t afford to keep paying the bills. Thanks to some wonderful and generous people, I’m already about 40% of the way. Beyond that, I’m hoping that more supporters will give me more time, more motivation, and more ideas.

I don’t know if I’ll find the balance I’m seeking, but it’s worth a try.

Perhaps not coincidentally, Kate just gave me a new keyring for my birthday…

The practice of play

Tim Sherratt — Sat, 25 Feb 2017 02:36:10 +0000

Keynote presentation at the Deakin University Faculty of Arts and Education HDR Summer School, Geelong, 24 February 2017.

Cite as: Tim Sherratt, ‘The Practice of Play’, presented at Deakin University Faculty of Arts and Education HDR Summer School, Geelong, 24 February 2017, online at <https://doi.org/10.6084/m9.figshare.4696258>.

I’m a historian. But in the past decade the nature of my research has changed quite profoundly. Instead of heading off to the archives, taking lots of notes, and writing up a book or an article, I now make things. Generally these things are online, and open to the public. I make things for people to use, to explore, to play, and to ponder.

I started down this track before I realised there was a name for what I do – practice-led research. The things that I make even have their own acronym – they’re NTROs, or Non Traditional Research Outputs.

But practice-led research is not just about making things. New knowledge is generated through cycles of creation and reflection. My aim in making is not to follow a blueprint, or check off a list of requirements, but to end up asking ‘What is this thing?’, ‘What does it do?’, ‘How does it do it?’.

In the past, I’ve tended to talk about my research practice as playing with data. I think there’s an important argument to be made for the role of play in research, particularly when confronted with large cultural datasets. But ‘play’ doesn’t quite capture what I do, nor does it look very convincing in a research proposal. So what do I really do?

Let’s play a game.

Headline Roulette is a very simple game. Presented with the title of a digitised newspaper article drawn at random from Trove’s collection of more than 200 million, you are challenged to guess the year in which the article was published. Sounds easy, but you only get ten guesses. It’s sort of like a cross between hangman and The Price is Right.

Despite its simplicity, I’ve known it to unleash the competitive instincts of a workshop full of historians. But for me, Headline Roulette is important because it provides an example of what becomes possible once we make cultural heritage collections available online. Our interactions are no longer limited to conventional modes of viewing or reading – we can play, and we can build.

I made the first version of Headline Roulette back in 2010. It was a game, but it was also an argument about access and possibilities.

Perhaps we should first take a step back. Who’s used Trove?

Trove is a fundamental part of Australia’s research infrastructure – and not just for those of us in the humanities or social sciences. Trove is a lot more than digitised newspapers, but access to more than 150 years worth of digitised newspapers has profoundly changed historical practice.

I say this not just because you have been spared the pain and suffering wrought by microfilm readers upon a generation of historians, but because the meaning of access itself has changed. Headline Roulette is just one simple and silly example of how once cultural heritage resources are in digital form we can use them differently. We can see them differently.

Imagine your search in Trove’s newspapers zone returns 10,000 or 100,000 results. How do you make sense of that? How do you get an understanding of the whole, when all you see is page after page of search results?

QueryPic extracts data from Trove to visualise your search as a single chart – showing you the number of articles per year that match your query. You can even compare the occurrence of particular words or phrases.

But that’s only the beginning, because once you think about web resources as data rather than just another type of publication you can aggregate and analyse – you can look for big, dramatic pictures as well as tiny, fragile fragments.

Trove Harvester is a tool that delivers historical newspaper articles in bulk – thousands, even millions of articles saved to your computer for offline exploration.

What might you do with a million newspaper articles?

Research using digital resources like Trove is not constrained to the window of your web browser. You can ask new types of questions.

But back in 2010–11 when I created the first versions of Headline Roulette, QueryPic and the Trove Harvester there was no easy way of getting data out of Trove. The thing is, web pages are good for delivering data to human beings, but not so good for computers. Computers are actually pretty dumb, and you need to be quite explicit in packaging up data for them. Nowadays Trove has a thing called an API (an Application Programming Interface) which delivers data in a carefully structured format that even computers can understand. You can use APIs to harvest data, or to build new tools or interfaces. APIs are cool.

Without an API, the first versions of my tools had to turn human-readable web pages into computer-readable data – a process known as screen scraping. They were, therefore, not only useful or interesting applications in their own right, they were arguments about why things like APIs matter. Why web pages aren’t enough. Why researchers need access to data.
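To make the idea of screen scraping a little more concrete, here is a minimal sketch in Python. It is not the code behind any of the tools described here: the HTML fragment, the `title` class, and the names `ResultScraper` and `scrape` are all invented for illustration, and a real results page would be a good deal messier.

```python
from html.parser import HTMLParser

# A made-up fragment of a search results page (hypothetical markup).
SAMPLE_PAGE = """
<ul>
  <li><a class="title" href="/article/1">FLOODS IN THE NORTH</a> <span>1921</span></li>
  <li><a class="title" href="/article/2">A NEW BRIDGE</a></li>
</ul>
"""


class ResultScraper(HTMLParser):
    """Collect the link and text of every <a class="title"> element."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # the result being built, if we are inside a title link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "title":
            self._current = {"url": attrs.get("href"), "title": ""}

    def handle_data(self, data):
        # Text can arrive in several chunks, so append rather than assign.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


def scrape(html):
    """Turn a human-readable page into a list of dictionaries."""
    scraper = ResultScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.results
```

Calling `scrape(SAMPLE_PAGE)` gives a list of dictionaries, one per result, each with a `url` and a `title`. The catch is that a scraper like this breaks as soon as the page layout changes, which is part of the argument for APIs.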

These are arguments we’re still making. Next week I’m heading to a workshop in California where we’ll be discussing how libraries and other cultural institutions can deliver their data in ways that support new forms of research.

But we don’t have to wait. By screen scraping web pages, by reverse engineering online databases, we can continue to develop the argument for access by extracting, sharing, and using data.

What could you do with 70GB of digitised surveillance files from the Australian Security Intelligence Organisation (ASIO)? If you’d like a copy I have them here on a USB drive.

Don’t worry – we’re not about to be raided by the security services. These are all files that have been carefully examined and released to the public through the National Archives of Australia. You can find them by searching the Archives’ online database – RecordSearch.

Who’s used RecordSearch? It’s not the most friendly system, but the collection it documents, and the metadata it provides, is rich and wondrous. I’ve spent a lot of time trying to get useful data out of RecordSearch – not just ASIO files, also records documenting the administration of the White Australia Policy, as well as higher-level data aimed at building my understanding of how the Archives, and its descriptive systems, actually work.

It is painful and frustrating work. But, I would argue, it is research. Terms like ‘data mining’ and ‘text mining’ fly around all the time, making it seem as if the accumulation of data is a mechanical process – as if we’re just digging it up. But the practice of screen scraping, or of liberating data from any cultural heritage source, is not simply extractive – it’s iterative and interpretative. It’s a process through which you begin to understand how the data is organised, what its limits and assumptions are, what its history is. What it means. We’re not just taking things out, we’re putting them back.

Frederick Gibbs and Trevor Owens argue that historical data need not be deployed solely as statistical evidence. ‘It can also help’, they suggest, ‘with discovering and framing research questions’ – questions, not answers; interpretation not calculation. Gibbs and Owens describe an ‘iterative interaction with data as part of the hermeneutic process’.

For me, RecordSearch is like an archaeological site. Excavating data from it involves digging through layers of technology, institutional history, and descriptive practice to try and understand why we have what we have.

Those of you undertaking projects using the collections of the National Archives will almost certainly come across the process of ‘access examination’. Under the Archives Act, government records more than twenty years old are expected to be opened to the public. However, the act also defines a number of exceptions to this rule – for example, records that endanger national security or infringe an individual’s privacy can be completely, or partially, withheld from scrutiny. The process of assessing records against this set of exemptions is called ‘access examination’.

The vast majority of records are opened without problem – they are, after all, more than 20 years old. But a significant number are not. While you can’t use these records, RecordSearch does provide some information about them. So I decided to see what we couldn’t see.

In January 2016 I fired up my screen scraper and harvested details of all the files in RecordSearch that have the access status of ‘closed’ – there were 14,370 of these files that had been through the process of access examination and withheld from public view. I then created my own interface that lets you explore this data from a variety of angles – such as the reasons why files were closed, when decisions were made about them, how old they are, and which government agencies created them.

It is perhaps the most frustrating search interface ever devised, given that you’re not allowed to see any of the files you find.

Those of you currently planning research projects might be interested to know where most of these files come from. It’s not defence or the intelligence agencies, but what is now the Department of Foreign Affairs and Trade (DFAT) – in January 2016, there were 1,747 closed files from just one DFAT series. But if you dig deeper you see that most of these files aren’t withheld for one of the reasons defined by the Archives Act, they are described as ‘closed pending advice’. The National Archives is still waiting to hear back from DFAT about them. Using my interface you can see that there were 54 files in this series where the Archives has been waiting for more than five years. So if you’re embarking on a project using the National Archives, make sure you get your access examination requests in early. Just in case.
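The kind of question asked above (why are files closed, and how long have some been waiting?) comes down to a couple of simple tallies once the data has been harvested. The following is only a sketch under stated assumptions: the records, the field names `series`, `reason` and `status_date`, and the function names are hypothetical, and the sample values are invented rather than taken from the harvested dataset.

```python
from collections import Counter
from datetime import date

# Invented sample records; field names and values are assumptions for this sketch.
SAMPLE_FILES = [
    {"series": "S1", "reason": "Pending advice", "status_date": date(2009, 3, 2)},
    {"series": "S1", "reason": "Pending advice", "status_date": date(2014, 6, 30)},
    {"series": "S2", "reason": "Privacy", "status_date": date(2001, 5, 17)},
    {"series": "S2", "reason": "National security", "status_date": date(2012, 11, 9)},
]


def count_by_reason(files):
    """Tally closed files by the reason recorded for their closure."""
    return Counter(f["reason"] for f in files)


def long_waits(files, years, as_of):
    """Return 'Pending advice' files whose status is more than `years` old.

    Note: `as_of.replace` would fail for 29 February, so pick another day.
    """
    cutoff = as_of.replace(year=as_of.year - years)
    return [
        f for f in files
        if f["reason"] == "Pending advice" and f["status_date"] < cutoff
    ]
```

With the sample above, `long_waits(SAMPLE_FILES, years=5, as_of=date(2016, 1, 1))` picks out the single file that has been pending since 2009.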

My aim in extracting and sharing this data is to better understand access examination itself as a historical process. It’s work that enables us to ask different types of questions, but it also makes a change in the process itself. My interface is public, offering a critical commentary on the ‘official’ system. As a result of my research, the Archives has made changes to the way it describes closed files. It’s both research and intervention, history and hack.

‘Hack’ has a number of definitions, both positive and negative. Mark Olson describes the ‘hacker ethos’ as:

‘a way of feeling your way forward through trial and error, up to and perhaps beyond the limits of your expertise, in order to make something, perhaps even something new. It is provisional, sometimes ludic, and involves a willingness to transgress boundaries, to practice where you don’t belong… Whether eloquent or a kludge, a hack gets things done.’1

Olson explores what hacking means in the context of the humanities, arguing not only that hacking has a legitimate place in humanities practice, but that the humanities itself needs to be hacked to foster the development of new skills and literacies.

At this point you’re probably thinking, ‘But I don’t do any of this wacky digital stuff, what has this got to do with me?’

Who’s heard of filter bubbles, or search personalisation? Who’s read one of the many reports recently about the way computer algorithms are shaping our online experience? Olson argues for a humanities practice that equips us to wrestle with complex techno-social systems.

And we’re not just talking about Google.

Last year Matthew Reidsma published an analysis of algorithmic bias in library discovery systems. He hacked a common commercial library product to show some of the biases underlying its recommendations system. The interfaces we use to access information are never neutral. The databases we search are products of selection and exclusion. Hacking enables us to interact with these systems as critics, and not just consumers.

Using the Trove API you can create a chart showing the number of digitised newspaper articles available per year from 1803 onwards. If you do this, you’ll notice two significant features. First, there is a dramatic drop-off in the number of articles after 1954. This is the ‘copyright cliff of death’. Few things are certain in our overly-complex copyright system, but 1954 provides a practical cut-off point. History stops in 1954.

You’ll also notice a substantial peak in the number of articles around 1914. Why might this be? Did something significant happen in 1914?

In fact, it’s all about money. In the lead up to the centenary of WWI it was decided to focus limited digitisation resources on newspapers from the WWI period. It was a perfectly reasonable decision, but the consequences are effectively invisible to any user of the web interface. You don’t know what you’re searching.
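Once you have the yearly totals, picking out the peak and the cliff takes only a few lines. This is a rough sketch, not a record of how the chart was made: the numbers in `SAMPLE_COUNTS` are invented to mimic the shape described above, and the function names are hypothetical.

```python
# Invented totals for a handful of years (not real Trove figures).
SAMPLE_COUNTS = {
    1912: 900, 1913: 1000, 1914: 1500, 1915: 1400, 1916: 1200,
    1952: 1000, 1953: 800, 1954: 780, 1955: 50, 1956: 40,
}


def peak_year(counts):
    """Return the year with the largest number of articles."""
    return max(counts, key=counts.get)


def steepest_drop(counts):
    """Return the pair of neighbouring years with the biggest fall in articles."""
    years = sorted(counts)
    neighbours = zip(years, years[1:])
    return max(neighbours, key=lambda pair: counts[pair[0]] - counts[pair[1]])
```

With these made-up numbers, `peak_year(SAMPLE_COUNTS)` points to 1914 and `steepest_drop(SAMPLE_COUNTS)` to the fall between 1954 and 1955. The code can tell you where the features are, but not why they are there. That still takes some digging.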

The power of Google encourages us to put a lot of faith in search interfaces. We trust that they will just work. And if we can’t find what we’re looking for, we often assume that it doesn’t exist.

Hansard, the recorded proceedings of the Australian parliament from 1901, can be searched using the ParlInfo database on the Australian Parliament House website. Perhaps you’ve used it – it’s a wonderfully rich resource. Powering the search results are a series of well-structured XML files, one for each sitting day, that identify individual debates and speeches.

Last year I reverse-engineered ParlInfo and harvested all those XML files. I thought they’d provide a great dataset for exploring changes in political speech, and so I created a repository containing all the files for the House of Representatives and the Senate from 1901 to 1980. Feel free to download and play.

But in the process of harvesting the files I noticed that some of the XML files were empty. After a bit more analysis I realised that about 100 sitting days were missing – they didn’t show up in search results on ParlInfo.

The ‘missing’ days were concentrated in the Senate between 1910 and 1920. So anyone relying on ParlInfo to research the WWI period would have missed significant amounts of content. This ‘black hole’ was effectively invisible to any user of the web interface. It was only through hacking that its shape and extent was revealed.
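Spotting the gaps is not complicated once the files are on your own computer. Here is a minimal sketch, assuming each sitting day has been saved as a piece of XML text keyed by a file name. The file names and element names are invented for illustration, and this is not the script that was actually used.

```python
import xml.etree.ElementTree as ET

# Invented examples: file names and element names are assumptions.
SAMPLE_DAYS = {
    "senate/1915-05-12.xml": "<hansard></hansard>",
    "senate/1915-05-13.xml": "<hansard><debate><speech>Order!</speech></debate></hansard>",
    "senate/1915-05-14.xml": "",
}


def is_empty_day(xml_text):
    """True if the text is blank, cannot be parsed, or has a root with no content."""
    if not xml_text.strip():
        return True
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Treat unparseable files as problems worth checking by hand.
        return True
    return len(root) == 0 and not (root.text or "").strip()


def find_empty_days(days):
    """Return the names of sitting days that have no usable content, sorted."""
    return sorted(name for name, text in days.items() if is_empty_day(text))
```

Running `find_empty_days(SAMPLE_DAYS)` returns the first and third file names, which is the list you would then compare against the parliamentary sitting calendar.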

Fortunately staff at the Parliamentary Library have investigated and fixed the problem. But it’s a good example of why we should, as researchers, start from the assumption that search interfaces lie. Processes of selection and description shape the ‘reality’ of online collections. We then explore them through complex technological systems that appear comprehensive, even when they are not. You can’t find what’s not there. Online collections hide as much as they reveal.

Of course this is true of all historical sources. We are trained to analyse both context and content, to make judgements about authenticity and accuracy. These same skills need to be applied to digital resources, to data. Indeed, Gibbs and Owens argue that ‘historians must treat data as text, which needs to be approached from multiple points of view and as openly as possible’. But how do we find multiple points of view when interfaces construct our experiences and limit our perspectives? How do we open data to new possibilities? How do we see data differently?

No doubt you’ve been encouraged to find a way of expressing your research questions succinctly, in a way that communicates with a non-specialist audience – yes, I mean the dreaded elevator pitch. You’re not the only one.

I’ve landed back in academia after a number of years working in cultural heritage institutions, and pursuing my own research interests with the support of the international digital humanities community.

Believe me when I say, Twitter changed my life. There I was, hacking away on cultural heritage data without any real assistance or encouragement, when I discovered, via Twitter, that there were people out there like me. Many of these people are now my friends, and I’ve been lucky to travel around the world to meet and work with them.

But coming back to academia I’ve found that my collection of projects, tools, experiments, and obsessions was not quite enough – my research needs a ‘narrative’.

So, like you, I’ve had to think about why I do what I do. What motivates my research? What matters?

For me it comes back to the nature of this thing we call ‘access’. Cultural heritage organisations talk about ‘access’ all the time, particularly in relation to online collections. But what does it actually mean? I want to overturn our assumptions about access – exploring it not as a process of opening things up, but as a system of controls and limits. It’s not a state of being, it’s a struggle for meaning and power.

My methodology, and I think I can call it that, is the multiplication of contexts. Context is, of course, critical to cultural heritage collections – it enables us to locate them within history and culture, to analyse their authenticity, to mobilise their value as evidence. But the descriptive systems we use to manage and explore collections represent only a privileged subset of possible contexts.

Now I’m still figuring this out, but I think what my work does is that it removes collections from these highly-controlled systems and lets them loose in a variety of new contexts. This allows unexpected features, or new uses, to emerge – we see them differently, and in that moment, the nature of access shifts, however slightly. It’s those moments I’m trying to catch and observe.

If you’ve ever tried to use Hansard through the ParlInfo database you’ll realise that it’s just really difficult to read. You’re presented with a series of nested fragments, so it’s hard to get a sense of the context and flow of the day’s proceedings. Having downloaded all those XML files, I thought I’d have a go at presenting Hansard in a form that privileged reading over search.

So I created Historic Hansard – dedicated to lovers of political speech. It does nothing very fancy, but I think it does it pretty well.

In the end, however, Hansard is still just text. What’s lost in the documentation process is the performance – the theatre of parliament. But not completely. As well as formal speeches, many interjections have been recorded and preserved.

A few weeks ago I extracted all those interjections from 1901 to 1980, about a million of them, and saved them to a new database. As I fiddled with different presentation methods, I started to see them as something akin to tweets – quick, pithy, and pointed. What would happen, I wondered, if we reimagined interjections from a century ago in an age of social media?

Like many of my projects, this, whatever it is, took me a couple of days to build. No research grants were harmed in its creation, no committees were needlessly formed. This is not because I’m a whizz-bang coder – I’m certainly not. It has to do with the nature of this work – it’s rapid, experimental, and sometimes even ephemeral. I don’t design websites, I make interventions – things that are not only of the world, but in the world. They do something.

Stephen Ramsay explores the hermeneutical possibilities of screwing around with technology and texts. The ‘screwmeneutical imperative’, he suggests, is based on the fact that:

‘a writerly, anarchic text… is more useful than the readerly, institutional text. Useful and practical not in spite of its anarchic nature, but as a natural consequence of the speed and scale that inhere in all anarchic systems.’

Digital technologies give us the opportunity to play with scale and speed. We can manipulate millions of newspaper articles, and we can build a new version of Hansard in a weekend. But this shift also applies to the way we communicate. Instead of waiting months or years for an article to appear in print, we can post it on a blog, or in a digital repository. It is fundamental to the work that I do that it is shared, it is public by default – not just the results, but the code, the data, the process, and yes the licensing. Access is not just what we take, it’s what we do.

The multiplication of contexts has some interesting precedents as a research methodology. In the literary world the Oulipo movement sought to play with the constraints of composition. Lisa Samuels and Jerome McGann suggested that the deliberate misreading of a text, what they termed ‘deformance’, could yield critical insights. More recently, Mark Sample has argued for a ‘deformed humanities’ where we learn about things by breaking them.

In history we have the counterfactual – a creative reimagining of a past that never was, aimed at revealing perspectives and possibilities too quickly closed and forgotten. As Sean Scalmer argues, ‘counterfactuals are fun’:

‘Conventions can be disregarded, or even mocked. Worlds might be remade, the tyrannical overthrown, and the humble elevated. New orders can be imagined.’2

But counterfactuals are not fiction. They work best when they sail close to an accepted version of the past; when they play with the constraints of documentary evidence rather than just ignore them. Just because an approach is playful, it doesn’t mean that there are no rules. As Ian Bogost has recently argued, the fun of play is ‘not doing what we want, but doing what we can with what is given’.3 Play is an investigation of limits.

While some of the ASIO files held by the National Archives are closed to the public, most are ‘open with exception’. This means that sensitive parts of the files have been removed. Whole pages can be withheld, or sections of text blacked out – a process known as redaction.

A redaction is, by definition, an absence of information, and yet the frequency, density, and placement of redactions across a large collection of documents could conceivably tell us something interesting. So last year I wrote a kludgy computer vision script that found and extracted redactions from digitised ASIO files. I now have a collection of 250,000 redactions which I’ve shared on Figshare – grab a copy now!

I’m continuing to explore the possibilities of these redactions as data points. But there was also something visually interesting about the redactions, particularly when they were assembled en masse.

Here you can browse all 250,000 redactions. But that’s not all, you can also use them as entry points to the documents they were intended to obscure.

Contexts here have been reversed, the files have been turned inside out – the limits remain, indeed the scale of redaction is emphasised, and yet within these limits, perhaps even because of these limits, we can experience the files quite differently. We are no longer simply the subjects of state surveillance, we can reverse the gaze, inspect the process, and ask new questions.

The manipulation of contexts is not mere invention. The limits of access offer both meaning and rules. We have skin in this game, its outcomes matter, what is at stake is our ability to see, and be seen, within the cultural record. Access changes who we can imagine ourselves to be.

In the first half of the twentieth century, if you were deemed not to be ‘white’ and wanted to travel overseas from your home here in Australia, you had to carry special documents. Without them, you’d probably be stopped from returning – from coming home. This was ‘extreme vetting’ White Australia style.

Many thousands of these documents are now held by the National Archives of Australia. In 2011, I used my screen scraper to harvest about 12,000 images like this from RecordSearch. I then ran them through a facial detection script and created The Real Face of White Australia.

There are about 7,000 faces in this seemingly endless scrolling wall. And that’s from just a small sample of the White Australia records. It’s powerful, compelling and discomfiting. But the power comes not from any technical magic, but from the faces themselves – from what we feel when we meet their gaze. Once again the records have been turned inside out – instead of seeing files, metadata, or a list of search results, we see the people inside.

Play can be serious. It can make you feel things you don’t expect. It can challenge your certainties and take you to the limits of what you know.

That sounds a lot like research to me.

1. M. J. Olson, ‘Hacking the humanities: Twenty-first-Century literacies and the “becoming other” of the humanities’, in Eleonora Belfiore and Anna Upchurch (eds), Humanities in the Twenty-first Century: Beyond utility and markets, Palgrave Macmillan, 2013, pp. 237–250.
2. Sean Scalmer, ‘Introduction’, in Stuart Macintyre and Sean Scalmer (eds), What if? Australian history as it might have been, Melbourne University Press, Melbourne, 2006, pp. 1–11.
3. Ian Bogost, Play Anything: The Pleasure of Limits, the Uses of Boredom, and the Secret of Games, Basic Books, New York, 2016, p. 236.

2016 — the making and the talking

Tim Sherratt — Wed, 21 Dec 2016 00:14:06 +0000

The image above is from Geoff Hinchcliffe’s awesome visualisation of more than 12,000 #fundTrove tweets.

This year I sadly left the wonderful team at Trove and took up a full-time academic post at the University of Canberra. But it was Trove that dominated the early part of the year, as the impact of continual funding cuts on the National Library of Australia became clear. Users of Trove shared their feelings on Twitter and Facebook, organisations posted statements of support, and numerous articles appeared in the media. In the lead up to the federal election, both the Greens and ALP made commitments to support Trove and our national cultural institutions.

In the last few days, we’ve learnt that the Government will provide $16.4 million over four years to the NLA ‘for digitisation of material and upgrade of critical infrastructure for its Trove digital information resource and to upgrade other critical infrastructure’. While we wait to hear exactly what this means for the future of Trove, it’s important to remember that it comes after many cuts and job losses across the cultural sector. The lesson of #fundTrove is that we cannot take the future of our collecting organisations for granted. We need to show why they matter and fight for the resources they need.

Access is important — both its politics and its practicalities. This year I’ve tried to be a bit more rigorous in the way I share information and document my projects. I created a Digital Heritage Handbook where I publish workshops, activities, and other bits and pieces. Much of it is in draft form, but I decided it was better just to push everything out in the hope that it might be useful. Similarly, I created an Open Research Notebook to share work in progress. The Handbook also includes details of the two undergraduate units I taught in second semester — Working with collections, and Exploring digital heritage. I think they went pretty well, but I’ve got a few improvements planned for 2017.

This year I accidentally built my own version of Historic Hansard, created an interface to National Archives files we’re not allowed to see, and mined ASIO surveillance files for redactions. As well as these major projects, there were lots of little hacks and harvests aimed at exploring the idea of ‘access’. You can follow my main research obsessions in my notebook:

Talking and making details follow…

2016 — the making:

Locating Trove newspapers
Updated code, data, and interface to geolocate and display Trove newspaper titles. Now with maps!
Headline Roulette
Much needed update for my old game. Now on its own domain and with better handling of Trove API errors.
DFAT Documents
Demonstration code to harvest the Department of Foreign Affairs and Trade’s collection of historical documents and extract some metadata. The harvested documents are available in Markdown format and can be explored through a simple website.
People of Australia
@people_aus is a Twitter bot sharing random names drawn from late 19th and early 20th century naturalisation records held by the National Archives of Australia. Many names. Many cultures. These are the people of Australia.
RecordSearch Series Harvests
Code to harvest the metadata and digitised images of all items in a series from the National Archives of Australia. Data from an assortment of harvested series are available as CSV files.
SRNSW indexes
Code for harvesting indexes from the State Records of NSW website. Data from 59 harvested indexes is available as CSV files.
Facial detection demo
Code and website to demonstrate the principles of facial detection using OpenCV.
Show Redactions userscript
Code for inserting details of redacted files into RecordSearch results.
ASIO Experiments
Code used for the extraction of redactions and other experiments with digitised ASIO files.
Redactions dataset
Redactions extracted from ASIO surveillance records in National Archives of Australia Series A6119, <https://dx.doi.org/10.6084/m9.figshare.4101765.v1>
Non redactions dataset
False positives (non-redactions) extracted from ASIO surveillance records in National Archives of Australia Series A6119, <https://dx.doi.org/10.6084/m9.figshare.4104651.v1>
Redacted
Web interface for exploring redactions extracted from digitised ASIO files. Includes a collection of redaction art.
Open with Exception browser
Code and website providing an experimental browser for digitised ASIO files from the National Archives of Australia.
Invisible Australians browser
Updated code and website providing an experimental browser for digitised records from the National Archives of Australia relating to the administration of the White Australia Policy. Now includes a landscape view for exploring records by their orientation.
Closed Access harvester
Updated code for harvesting and analysing records from the National Archives of Australia with the access status of ‘closed’.
Closed Access dataset
Complete dataset of records held by the National Archives of Australia that had the access status of ‘closed’ (withheld from public access) on 1 January 2016.
Closed Access website
Public web interface for the exploration, analysis, and visualisation of ‘closed’ records in the National Archives of Australia.
RecordSearch Functions
Code and documentation for analysing the performance of functions by Commonwealth government agencies over time, using data from the National Archives of Australia.
Commonwealth Hansard XML repository
A repository of the (almost) complete proceedings of the Commonwealth House of Representatives and Senate from 1901–1980. This comprises several gigabytes of XML-formatted files harvested from the ParlInfo database.
Historic Hansard
A public website that presents the proceedings of the Commonwealth House of Representatives and Senate from 1901–1980 in a form that is optimised for browsing and reading. It includes additional features such as indexes to people and legislation, and the integration of tools for text analysis and annotation. Documentation is also provided.
Trove Harvester
Code and documentation to support the creation of large datasets for research and analysis from Trove’s digitised newspapers.
Gadfly front pages
Code and documentation to demonstrate how to harvest page images from Trove’s digitised newspapers.
Trove Proxy
Code and active proxy service that generates links to download PDFs from Trove’s digitised newspapers, and provides an HTTPS wrapper around the Trove API.
DIY Headline Roulette
Code and documentation that makes it easy for anyone to create their own simple game using Trove’s digitised newspapers.
Radio National program data
Updated dataset of programs broadcast on Radio National from 2000–2016 harvested from Trove.
PMs Transcripts repository
Repository of more than 20,000 XML transcripts of speeches by Australian Prime Ministers harvested from the PMs Transcripts site.
UMA Ellis Photos
Repository of data and images from a collection of political photos by John Ellis held by the University of Melbourne Archives. Harvested using the Trove API.

2016 — the talking

8 November 2016 – Digital research seminar and workshop at Griffith University.
10 November 2016 – Presentation as part of the ‘Access and innovation’ panel at Digital Directions 2016, National Film and Sound Archive, ‘Caring about access’, <https://dx.doi.org/10.6084/m9.figshare.4229402.v1>.
Featured on dh+lib (digital humanities and libraries).
19 October 2016 – Keynote presentation at Forging Links, Australian Society of Archivists conference, Parramatta, ‘Turning the inside out’, <https://dx.doi.org/10.6084/m9.figshare.4055013.v2>.
8 September 2016 – Invited presentation to Data and libraries: harnessing the possibilities, ALIA URLs seminar, Canberra, ‘Slow data? Small data? Exploring human-sized alternatives in the big data deluge’.
26 August 2016 – Keynote presentation at Migrant (R)e-Collections, Lorenz Centre, Leiden, ‘A life reduced to data’.
Featured as an ‘Editors’ choice’ by Digital Humanities Now.
19 August 2016 – Keynote presentation at Working History, Professional Historians’ Association conference, Melbourne, ‘Telling stories with data’.
29–31 July 2016 – GovHack Heritage Node, University of Canberra
18 July 2016 – Keynote presentation at internal staff conference, State Library of Victoria.
15 July 2016 – Invited presentation at DigitalGLAM Symposium, University of Melbourne, ‘Hacking heritage: power and participation in digital cultural collections’.
Featured on dh+lib (digital humanities and libraries).
5 July 2016 – Interview on ABC 936 (Hobart) Evenings about Trove.
24 June 2016 – Interview on ABCRN afternoons, ‘The fight to save Trove’
22 June 2016 – Paper at DHA2016 Conference in Hobart, ‘Closed Access’
3 June 2016 – Invited contributor to the REMIX Sydney panel ‘Blurring the digital and the physical: How can we add new layers to history?’.
2 June 2016 – Invited presentation at the Digital Research Methodologies Forum, La Trobe University, ‘The revolution will not be digital’
31 May 2016 – Invited presentation on digital research methods to the Deakin Contemporary Histories Research Group.
10 May 2016 – Presented two workshops on ‘Digital Tools and Techniques for the Adventurous Historian’, organised by the History Council of South Australia for the SA History Festival 2016, Adelaide.
7 March 2016 – Interview on ABCRN Late Night Live, ‘Treasure Trove under threat’
6 March 2016 – Contributor to panel discussing the government’s Innovation Agenda, Electronic Visualisation and the Arts Australasia, University of Canberra.
1 March 2016 – Interview with ABC 666 (Canberra) Drive about #fundTrove.
1 March 2016 – Interview with RTRFM (Perth) on #fundTrove.
19 February 2016 – Invited presentation to symposium on Commonwealth Department of Immigration – Then and Now, La Trobe University, ‘Digital perspectives on the archives of immigration’
12 February 2016 – Keynote presentation to ANZREG 2016 (Ex Libris Australia and New Zealand Regional User’s Group), Melbourne, ‘Linked Open Data’.
10 February 2016 – Invited contributor to the Gale Cengage sponsored Digital Humanities Panel at VALA2016, Melbourne.

Caring about access

Tim Sherratt — Sat, 12 Nov 2016 00:08:13 +0000

Contribution to a panel discussion on ‘Access and Innovation’ at Digital Directions 2016, Canberra, 10 November 2016.
View the slides.

Please cite as: Tim Sherratt, ‘Caring about access’, presented at the Digital Directions, 2016. <https://dx.doi.org/10.6084/m9.figshare.4229402.v1>

You might have seen that the Department of Prime Minister and Cabinet has opened discussion on an Open Government National Action Plan. Last week I was watching the Twitter stream from a briefing event in Melbourne, when I saw this tweet from Asher Wolf quoting the PM&C spokesperson:

. @pmc_gov_au: we don’t want ppl to search for flaws or try to crack gvt datasets. #OGPAU

Now I’m not sure of the context of this statement, but it is a reminder that we can’t take the meaning of words like ‘open’ or ‘access’ for granted.

They are what we make of them.

I’m a historian and hacker. I don’t steal credit card details, I use digital tools to open cultural heritage collections – to see them differently, to feel differently.

Earlier this year I wrote a script to harvest many gigabytes of parliamentary proceedings – Hansard – from the ParlInfo database maintained by the Department of Parliamentary Services. I shared the files, covering the period from 1901 to 1980, through my GitHub repository. My plan was simply to make this rich source of political history easily available to anyone who wanted to explore new forms of digital analysis.

In the end I created my own version of Historic Hansard, one that privileges the experience of reading rather than searching. I also added a few nifty features such as indexes of legislation and people, and integration with text analysis and annotation tools.

But in the process of harvesting the data I noticed that around 100 sitting days were missing from ParlInfo. The files were empty – unable to be searched. Given an average of about 50 sitting days a year, that’s about two years’ worth that were simply invisible.

Thanks to the Parliamentary Library these problems have now been fixed. But this experience highlighted two issues. First, search interfaces lie. We have to develop our critical capacities to be more aware of what we are not being shown and why. And second, this problem only became visible because I was hacking their database, because I went beyond what was offered by their website, because I wanted more. Unlike the PM&C spokesperson I think one of the strongest reasons for opening data is to encourage people to find its flaws. This is particularly the case for cultural heritage data where the processes of selection, representation, normalisation, control, and preservation all have an impact on the types of stories we can tell about ourselves.

Access is not something that cultural institutions bestow on a grateful public. It’s a struggle for understanding and meaning. Expect to be criticised, expect problems to be found, expect your prejudices to be exposed. That’s the point.

If cultural institutions want to celebrate their website hits, celebrity visits, or their latest glossy magazines – well that’s just fabulous. But I’d like them to celebrate every flaw that’s found in their data, every gap identified in their collection – that’s engagement, that’s access. We need to get beyond defensive posturing and embrace the risky, exciting possibilities that come from critical engagement with collection data – recognising hacking as a way of knowing.

In this new post-truth world it’s going to be more important than ever to challenge what is given, what is ‘natural’, what is ‘inevitable’. Our cultural heritage will be a crucially important resource to be mobilised in defence of complexity, nuance, and doubt – the rich and glorious reality of simply being human.

A few years ago Merete Sanderhoff, from the National Gallery of Denmark, compiled a collection of essays on opening cultural heritage collections for reuse online. It was called Sharing is Caring. I think it’s worth reflecting on the dual meanings of ‘caring’ – both looking after, and giving a shit. Sharing helps us foster communities and preserve collections. But it also matters, it has an impact, it can change the world.

Bethany Nowviskie, the Director of the Digital Library Federation, recently gave a talk on ‘speculative collections’ asking how we shift the temporal orientation of our libraries and archives away from a closed and linear past towards an exploration of what might be.

Seeing differently. Feeling differently.

This is a visualisation created by Geoff Hinchcliffe showing more than 12,000 tweets since February this year using the hashtag #fundTrove. Trove has radically changed our access to the past. It’s not perfect, it’s not everything, but it changes lives. And yet the government seems to think it can be left to wither, and nobody will really care. I care. Do you? Then what do we do about it?

I also care deeply about the collections of the National Archives – a love expressed through many painful hours spent hacking RecordSearch, their online database. Last year I harvested details of more than 12,000 publicly available ASIO files – you can grab the results from my GitHub site – and downloaded about 300,000 page images. These are mostly surveillance files, documenting the lives of writers, academics, unionists, Indigenous activists and others – identified as potential threats to the nation.

This year I wrote a script to search those images for redactions – sections of text blacked out for security reasons. If you’d like to explore the 239,000 redactions I’ve extracted, you can – marvel at their blackness, their very lack of information. But here the redactions are not dead ends, they’re starting points, ways of exploring the workings of ASIO. As the power and reach of state surveillance continues to expand, we can find creative ways to reverse the panoptic gaze.

We don’t have to accept what we’re given. We can take collections and turn them inside out.

The National Archives also preserves remnants of Australia’s racist past. There are many thousands of records documenting the workings of the White Australia Policy. They include special certificates, with portrait photos and handprints, which were needed if you were deemed non-white, travelled overseas, and simply wanted to come home.

About five years ago I harvested about 12,000 images of these certificates from RecordSearch and ran them through a facial detection script. In a little over a weekend I’d created the Real Face of White Australia – a seemingly endless scrolling wall of more than 7,000 faces. It’s compelling. It’s uncomfortable. But this is who we are.

As I saw the One Nation senators cracking open the champagne outside Parliament House yesterday, I thought again of how important it is to tell these stories, to share these collections. To give a shit.

Innovation can be measured in more than shiny apps, or cool new visualisations. By struggling with access to our past we can imagine new futures.

Turning the inside out

Tim Sherratt — Mon, 24 Oct 2016 08:57:56 +0000

Keynote presented at Forging Links, the Annual Conference of the Australian Society of Archivists, Parramatta, 19 October 2016.

Please cite as: Tim Sherratt, ‘Turning the inside out’, presented at the Australian Society of Archivists Annual Conference, Parramatta, 2016, <https://dx.doi.org/10.6084/m9.figshare.4055013.v1>.

This is RecordSearch, but not as you know it.

This is my hacked version of RecordSearch, the online collection database of the National Archives of Australia. Unlike the regular version it displays the number of pages in each file. But more interestingly, if you search in Series ST84/1 you see more than metadata – you see the people inside.

As Barbara Reed noted in her article ‘Reinventing access’, ‘records are imbued with people’. Series ST84/1 goes by the fairly benign title, ‘Certificates of Domicile and Certificates of Exemption from Dictation Test, chronological series’. But of course the Dictation Test was the administrative backbone of a racist system designed to exclude people who did not fit the widely-accepted vision of ‘White Australia’. ST84/1 is full of people just trying to live their lives under the weight of the White Australia Policy.

The certificates in ST84/1 allowed people, born or resident in Australia, to return home after travelling overseas. If your ‘whiteness’ was suspect and you had no certificate, you would be subjected to the Dictation Test, and you would fail. The certificates usually include photographs and handprints – they are compelling and confronting documents. But you have to dig through layers of metadata in RecordSearch to see that. Or do you?

About five years ago, Kate Bagnall, a historian of Chinese Australia, and I were thinking about ways of drawing attention to these records. In a little over a weekend, I harvested about 12,000 page images from ST84/1 and ran them through a facial detection script. The result was ‘The Real Face of White Australia’.

You may have seen it before. It’s had a remarkable life, travelling around the world as an example of how we can use digital tools to see records differently. But of course the power is in the faces themselves – in the connections we make through time. We cannot escape their discomfiting gaze.

You may think that the certificates in ST84/1 are merely a form of identity document. But remember that in the early years of the 20th century, passports were still evolving, and the use of photographs and fingerprints for identification was generally confined to prisoners and criminals.

The sociologist Richard Jenkins talks about identity not as an essence, a noun, but as ‘something that we do, a process of identification’.1 But self-identification is constrained by broader systems of categorisation, or ‘social sorting’, that decide who belongs, who is a threat, who needs to be watched.

The records in ST84/1 were embedded within a system of surveillance that extended outwards from Australia’s ports to the offices of shipping companies around the world, and inwards to anyone who seemed out of place in White Australia. Technologies of identification and surveillance do not simply enforce boundaries, they create them. Their existence demonstrates why they are needed. These records did not document identity, they defined it according to a set of racial categories.

Modern parallels are not hard to find. Last year in a bungled operation that became known as ‘#borderfarce’, immigration officials planned to prowl the streets of Melbourne on the hunt for illegal immigrants. The focus of border surveillance once again turned inwards, to those who seemed out of place. Watching in horror as events unfolded on social media, I helpfully pointed people in Melbourne to a convenient source of identity documents.

In Melbourne this weekend? Get your Border Force identity papers here: http://t.co/H3dlXHZy6q #borderfarce #wap pic.twitter.com/j2TrkKEPkR

— Tim Sherratt (@wragge) August 28, 2015

The link I tweeted was not to RecordSearch, but to an experimental interface where Kate and I are continuing to think about ways of exposing the bureaucratic remnants of White Australia.

So far I’ve harvested metadata from more than 20,000 files and downloaded around 150,000 page images. Amongst other things, I’m working on an updated wall of faces.

Most recently I created a way of sorting and viewing pages by their orientation – by the ratio of height to width. Why? Kate wanted an easy way of finding birth certificates which, in this period, tended to be short and wide. It was a simple little hack, but it revealed the records in a very different way.
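The sorting idea can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the interface: the page identifiers and pixel dimensions are invented, and in practice the sizes would be read from the harvested image files.

```python
# Hypothetical page records: (identifier, width in pixels, height in pixels).
pages = [
    ("page-a", 1200, 1800),  # tall and narrow
    ("page-b", 2000, 900),   # short and wide, like a birth certificate
    ("page-c", 1500, 1500),  # square
]


def height_to_width(page):
    """Return the ratio of height to width for one page record."""
    _, width, height = page
    return height / width


# Smallest ratio first, so the shortest, widest pages come to the top.
by_orientation = sorted(pages, key=height_to_width)

# Pages with a ratio below 1 are wider than they are tall.
landscape = [page[0] for page in by_orientation if height_to_width(page) < 1]
```

Sorting by a single ratio keeps the hack simple: nothing needs to know what a birth certificate looks like, only that it is wider than it is tall.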

There’s a big bunch of birth certificates, but there’s also envelopes, photographs, and an assortment of slips and notes. It makes you think about how we see the written world through the frame of ‘portrait’ orientation. It’s perhaps also worth noting that RecordSearch displays thumbnails as cropped squares – shape is subsumed to the regularity of the grid.

The records in our landscape view are no more accessible than they were before, but they can be accessed differently. In his contribution to the ASA’s 30th anniversary symposium, Eric Ketelaar described the relationship between archival access and human rights – highlighting the importance of access not only to democratic accountability, but also to our rights as ‘victims’ of official surveillance: ‘As human beings, subjected to the panoptic sort of governments and private enterprise, we have the right to know’.2 Ketelaar concludes by stating that ‘access is not the actual use of archives, access enables use’. I’d like to extend this a bit. What my own work reveals is the complex relationship between access and use. Not only does access enable use, use changes what we mean by ‘access’.

My bio nowadays describes me as a ‘historian and hacker’. ‘Historian’ describes my orientation to the world – I see the past in the present. ‘Hacker’ refers to the tools I use to make connections through time. Hacking is creative and positive, despite what the mainstream media might say – it’s about finding solutions, exploring alternatives, and pushing the limits of what’s possible.

My RecordSearch hack is just a little piece of JavaScript code that you can install in your browser – it changes the pages as they load.

Userscripts, as they’re called, allow anyone to alter the way webpages look and behave within their own browser. They give users greater control over their online experience, but they also open opportunities for experimentation. A lot of my own research is guided by questions like: What would happen if I…? What would it look like? What would change? What would I feel? Userscripts are one way of playing with the complexities of access.

Surveillance too can be hacked. If you’re concerned about technologies such as facial detection you’ll be pleased to know that thanks to the work of artist Adam Harvey, not only can you confuse detection algorithms, you can make a dramatic fashion statement.

Gary Marx, a major figure in the field of surveillance studies, has catalogued the ways in which individuals can resist the growing encroachments of surveillance. Amongst possible tactics he identifies ‘discovery’ – the attempt to uncover the scope of surveillance.3 Access to records can empower such acts of everyday resistance, but in other ways surveillance and access are more alike than opposed. Both start from a place of concealment. Access cannot be given unless it is first restricted. Both depend on asymmetries of power. Decisions about what we can know are ultimately made by others. Access is as much a process of control, as it is an act of release.

This is not necessarily a bad thing. I think we can all agree there should be limits to access, particularly relating to individual privacy and cultural sensitivity. But just as ‘identity’ is defined through acts of ‘identification’, so access is elaborated through instances of deployment and use. Access is not a state of being, it’s a process to be negotiated. And so the question is, what can we know about what we can know?

On 1 January this year I harvested the metadata of all the files in RecordSearch with the access status of ‘closed’.

These are records that the access examination process has determined should be withheld from public scrutiny. While the files themselves can’t be seen, RecordSearch does tell us a fair bit about them, including when the decision was made and the reasons behind it. Unfortunately you can’t search or filter on this data, so it’s difficult to look for patterns within RecordSearch itself.

I’ve taken the data and loaded it into a new site where you can examine it from a number of different angles. You can explore the reasons why files were closed, the series they came from, the age of their contents, and the dates when decisions were made. It’s more of a workbench than a discovery interface, and it’s likely to change as I ask different questions of the data.

Of course, the outlines of the examination process, including the grounds for exemption, are defined by the Archives Act. So what more is there to know?

Section 33 of the Archives Act does indeed spell out 17 reasons why records can be withheld from public access. But the data from RecordSearch includes an additional 11 categories. Some, like ‘Parliament Class A’, relate to other definitions under the Act. Others, like ‘MAKE YOUR SELECTION’, tell us something about the RecordSearch interface. But two of the most heavily cited reasons – ‘Pre-access recorder’ and ‘Withheld pending advice’ are not defined under the Act or anywhere that I could find on the Archives’ website.

Being an archivally-educated audience you can probably guess what these labels refer to, but if you need a little help you can look at when the access decisions in these categories were made.

The majority of decisions on ‘pre access recorder’ files were made before the introduction of the Archives Act in 1983. I checked this with the Archives and they confirmed that these records were examined before the existence of the Act. They explained that ‘pre access recorder’ was used when the original exemptions couldn’t be mapped to those later defined under Section 33.

Conversely, most decisions on ‘Withheld pending advice’ files were recorded in the last five or six years. If you look at the series that contain the most files citing this reason you can see that almost half come from A1838 – DFAT’s main correspondence series. As I’m sure you’ve realised, these are files that have been referred back to agencies for advice. And DFAT has been particularly slow in responding. They’re listed as ‘Closed’ on RecordSearch even though their access status has not been finalised. They’re not, however, included in the count of ‘closed’ files that the Archives reports in its annual summary of access outcomes.

This is probably fair enough, but if you search the closed files you can see that 1,467 of these files have been waiting for more than three years for a final decision. They might not be officially closed, but for a PhD student wanting to see them they are effectively closed.
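Finding the files that have waited longest comes down to a date comparison. The sketch below assumes each harvested record carries a reason and a decision date. The field names and sample records are invented for illustration, and the cutoff is measured from the 1 January 2016 harvest date.

```python
from datetime import date

HARVEST_DATE = date(2016, 1, 1)  # date of the harvest described above
# Anything decided before this date has waited more than three years.
CUTOFF = HARVEST_DATE.replace(year=HARVEST_DATE.year - 3)

# Invented sample records; the field names are assumptions.
files = [
    {"id": "file-1", "reason": "Withheld pending advice", "decision_date": date(2011, 6, 30)},
    {"id": "file-2", "reason": "Withheld pending advice", "decision_date": date(2015, 3, 2)},
    {"id": "file-3", "reason": "Another reason", "decision_date": date(2009, 1, 15)},
]


def waiting_too_long(record):
    """True if a file is still pending advice more than three years on."""
    return (
        record["reason"] == "Withheld pending advice"
        and record["decision_date"] < CUTOFF
    )


long_waits = [record["id"] for record in files if waiting_too_long(record)]
```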

My point here is not to be critical of the National Archives, or even of DFAT. What I’m interested in is the inevitable gap between legislation and practice. Access examination is subject to a range of influences and constraints, resourcing amongst them, and needs to be understood not as the application of a set of rules, but as a process that is historically contingent. A human process.

Earlier this year I harvested several gigabytes of parliamentary proceedings from the Australian Parliament’s ParlInfo database and created my own version of Historic Hansard.

As you do.

In the process of harvesting the files I discovered that data was missing for about 100 sitting days – most of them from the Senate between 1912 and 1919.

There’s no conspiracy at work here, it’s just some sort of processing error. However, Parliament staff weren’t aware of the problem, and it’s unlikely that anyone would’ve noticed using the web interface. You can’t find what you can’t find. Fortunately, Parliament staff are now working on a fix, but if you’ve been relying on ParlInfo for access to debates relating to World War I, you might want to do some more checking.

These things happen. Systems go wrong. Mistakes are made. Again what interests me is not finding who’s to blame, but exploring the gap between design and outcome, between ideal and reality. This is the gap where access is made and experienced. A gap that can only be understood through the complexities and contradictions of use. Access does not exist until its limits are tested. It’s not a process of opening, it’s a constant ongoing struggle over the very meaning of ‘open’.

And that’s a good thing.

We’re here at this conference to explore the possibilities of ‘forging links’. But of course collaborations don’t have to be comfortable to be constructive. The struggle over access may sometimes be tense, frustrating, and annoying, but it is also productive. Users of archives do not just consume access, they create it.

About a year ago I fired up my RecordSearch harvester and downloaded the metadata and page images for most of the ASIO (Australian Security Intelligence Organisation) files publicly available through the National Archives. I ended up with about 70GB of images. These are mostly dossiers on individuals and organisations – odd collections of gossip, published articles, and records of surveillance.

I’ve made this data available for anyone who wants it. Some of the images were recently used in the GovHack open data competition to create the ‘Cute Commies’ site.

To be honest, I didn’t have a clear purpose in mind when I harvested the data. It was another one of those ‘What would happen if?’ moments. I was, however, thinking generally about possible points of comparison between the ASIO files and the archival remnants of the White Australia Policy. Both built systems of identification, classification, and surveillance in which recordkeeping was crucial.

Kate and other historians of Chinese Australia have noted that the administration of the White Australia Policy was not uniform or consistent. Similar cases could result in quite different outcomes depending on the location and those involved. Understanding this is important, not only for documenting the workings of the system, but for recovering the agency of those subjected to it. Non-white residents were not mere victims, they found ways of negotiating, and even manipulating, the state’s racist bureaucracy. In her work on colonial archives, Ann Laura Stoler identifies this ‘disjuncture between prescription and practice, between state mandates and the manoeuvres people made in response to them’ as part of the ‘ethnographic space’ of the archive.4

How do we explore this space? One of the things I’ve found interesting in working with the closed files is the way we can use available metadata to show us what we can’t see. It’s like creating a negative image of access. Kate and I have been thinking for a number of years now about how we might use digital tools to mine the White Australia records for traces, gaps, and shadows that together build a picture of the policy in action. Who knew who? Who was where and when? What records remain and why?

The workings of ASIO, on the other hand, are deliberately obscured. Many of the files in the Archives include a note explaining why details have been withheld. Some warn that the ‘public disclosure of information concerning the procedures and techniques used by ASIO’ would enable people of interest to formulate counter-measures ‘based on an analysis of ASIO modus operandi’. David Horner’s recent history of ASIO notes that he was required to remove ASIO file references from his footnotes ‘because of the nature of ASIO’s filing system, which itself is classified’.5 We don’t even know how many files ASIO has on people and organisations, although David McKnight suggests that it’s somewhere in the hundreds of thousands.6 My harvest includes about 12,000 files.

Just like systems of racial classification, intelligence services exist within a circle of self-justification. The fact they exist proves they need to exist. We are denied information that might enable us to imagine alternatives. And yet as limited as the provisions under the Archives Act are, we do have access.

How can we use this narrow, shuttered window to reverse the gaze of state surveillance and rebuild a context that has been deliberately erased? Just as with Closed Access and the White Australia records, can we give meaning to the gaps and the absences? Can we see what’s not there?

This is one of the questions being explored by Columbia University’s History Lab. They’ve created the Declassification Engine – a huge database of previously classified government documents that they’re using to analyse the nature of official secrecy. By identifying non-redacted copies of previously redacted documents, they’ve also been able to track the words, concepts and events most likely to be censored.

The History Lab’s collection of documents on foreign policy and world events is rather different to ASIO’s archive of the lives, habits and beliefs of ordinary Australians. But I’m hoping that they too can tell us something about the culture that created them.

I’d intended to have a wonderfully compelling suite of examples and arguments to demonstrate today, but time has run short. Instead I have a set of half-baked experiments which sort of look a bit interesting. But perhaps that’s better. It’s important to me to try and be open about my own processes. I share my code and data, and I’ve started documenting most of what I’m up to in an open research notebook. If access is a struggle, then we should be sharing our stories of loss and frustration, and not merely celebrating our victories.

All of these experiments are online in some sort of form. So please explore.

Experiment A is nothing more than a browse interface to all the digitised records I’ve harvested. It’s just a clone of my work with the White Australia records, but I think there’s real conceptual power in the ability to browse.

Experiment B started with a problem. From RecordSearch I could harvest data on access status and find out how many ASIO files were in each of the three categories – Open, Open with Exception, and Closed. But how much of the ‘Open with Exception’ files are actually open?

Most of the files include a summary which tells you how many pages have been completely or partially exempted. That’s great, but did I really want to open up 12,000 files and manually scan for summaries? By playing around with the Tesseract OCR engine I’ve created a simple filter that extracts text from the images and searches for words like ‘exemption’, ‘archives’, and ‘folio’. I now have a good sized collection of summaries awaiting data entry…
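The filtering step might look something like this. It is only a sketch: it assumes the text has already been extracted from each page image by an OCR engine such as Tesseract, the sample text is invented, and the threshold of two matching words is an arbitrary choice rather than the rule used in the experiment.

```python
import re

KEYWORDS = ("exemption", "archives", "folio")  # the words mentioned above


def keyword_hits(text, keywords=KEYWORDS):
    """Count how many of the keywords occur as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(1 for keyword in keywords if keyword in words)


def looks_like_summary(text, minimum=2):
    """Flag a page as a possible exemption summary (threshold is an assumption)."""
    return keyword_hits(text) >= minimum


# Invented OCR output keyed by a hypothetical page identifier.
ocr_pages = {
    "page-1": "National Archives of Australia. Exemption applied to folio 12.",
    "page-2": "Dear Sir, I refer to your letter of 3 March.",
    "page-3": "See folio 4 for details.",
}

candidates = [page for page, text in ocr_pages.items() if looks_like_summary(text)]
```

Exact word matching will miss pages where the OCR has mangled a keyword, so a filter like this is best treated as a way of narrowing the pile, not as a definitive count.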

Experiment C began as another attempt to quantify the scale of exemption. The summaries told me how many pages had redactions – bits of information like names and ids that are blacked out, or sometimes even cut out of the page. But if I could identify individual redactions I could both test the summaries and create a new measure of openness… or redactedness…

Looking for redactions in ASIO files — that looks hopeful… pic.twitter.com/heeWMAQiwe

— Tim Sherratt (@wragge) September 11, 2016

Through trial and error I developed a computer vision script that did a pretty good job of finding redactions – despite many variations in redaction style, paper colour, and print quality. It took a couple of days to work through the 300,000 page images, but in the end I had a collection of about 300,000 redactions. Unfortunately about 20 percent of these were false positives, so I spent a number of nights manually sorting the results.

But the false positives themselves are sort of wonderful… pic.twitter.com/Jn9l04bUaY

— Tim Sherratt (@wragge) September 29, 2016

My redaction finder still needs a lot of refinement, and plenty of errors have slipped through. But, within the files that are currently digitised, the scale of exemption seems about ten times greater than Margaret Kenna estimated when giving evidence to the Parliamentary Joint Committee on ASIO in 2000. She thought every file contained about 10 exemptions ‘be it a word or a folio or a paragraph’. I’m seeing an average of about 100 redactions per file.

I’ve started adding information about the size and position of the redactions to my database and aggregating this data by page. When I left Canberra, the script was still running, but you can explore the current standings in my top 50 lists of the most redacted files and pages.

Once the data processing is completed you’ll be able to filter files by the amount of area blacked out, or the total number of redactions. Many more opportunities to see what you can’t see.
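Aggregating by page is mostly bookkeeping. In this sketch each redaction is reduced to a file, a page, and the width and height of its bounding box. The records and identifiers are invented, and the database layout actually used may well differ.

```python
from collections import defaultdict

# Invented redaction records: (file id, page number, width, height) in pixels.
redactions = [
    ("file-1", 1, 200, 40),
    ("file-1", 1, 100, 30),
    ("file-1", 2, 50, 20),
    ("file-2", 1, 300, 60),
]

# Running totals for each (file, page) pair.
page_totals = defaultdict(lambda: {"count": 0, "area": 0})
for file_id, page, width, height in redactions:
    totals = page_totals[(file_id, page)]
    totals["count"] += 1
    totals["area"] += width * height

# Pages ordered by the total area blacked out, largest first.
most_redacted = sorted(
    page_totals.items(), key=lambda item: item[1]["area"], reverse=True
)
```

Summing box areas will overstate the blacked-out area wherever redactions overlap, so the totals are better read as a rough ranking than as a precise measurement.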

Experiment D was an attempt to build a composite image of all the redactions to visualise what parts of a page were most likely to be removed – something like a heatmap. It sort of worked, but by the time I’d added all the redactions I had nothing but a very large black blob.

My composite of 170,000 #ASIO #redactions has opened a portal to another universe… pic.twitter.com/39iQewjst4

— Tim Sherratt (@wragge) October 8, 2016

A rethink is required…

Experiment E had two aims. First to highlight the visual character of the redactions themselves – there’s a strange sort of beauty in a massed collection of blobs. Secondly, just as with the Real Face of White Australia, I wanted to turn the files inside out. Instead of being dead ends, I wanted the redactions to be discovery points, signposts, ways of exploring the files.

It’s online now so play. You can view a random sample of redactions, or browse page by page through the entire collection.

Talking about her own ASIO file in the book Dirty Secrets, the politician and academic Meredith Burgmann noted that the ‘blacking out process seems totally arbitrary and for the reader terribly frustrating, like reading a detective novel with the last page torn out’.7 But in hunting for redactions I found they could also bring moments of unexpected joy. It seems that someone got a bit bored and has left us with a glorious collection of redaction art.

So what’s to come? I need to rework my redaction finder to improve its accuracy.

It’s interesting, and perhaps ironic, that the removal of information has given me an identifiable data point that I can potentially track against other characteristics of the files. Can I identify patterns by time or topic?

Apparently ASIO assessments have become less conservative over the years – I can test this by looking at changes in redaction rates over time.

I also want to explore the context of redactions. By expanding the window around redactions and OCRing the result, I hope to identify the words that most commonly appear near redactions.
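The window expansion could look something like the following Python sketch. The box format, the margin, and the page size are assumptions for the example, and the OCR step itself is left out.

```python
def expand_box(box, page_size, margin=0.5):
    """Grow an (x, y, width, height) box by `margin` times its own size on
    every side, clamped to the page edges. All values are in pixels."""
    x, y, width, height = box
    page_width, page_height = page_size
    pad_x = int(width * margin)
    pad_y = int(height * margin)
    left = max(0, x - pad_x)
    top = max(0, y - pad_y)
    right = min(page_width, x + width + pad_x)
    bottom = min(page_height, y + height + pad_y)
    return left, top, right - left, bottom - top


# A hypothetical redaction near the left edge of a 1000 x 1400 pixel page.
window = expand_box((100, 100, 200, 40), (1000, 1400))
print(window)
```

The returned window is the region that would be cropped from the page image and passed to an OCR tool.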

Those of you coming to the workshop on Friday will hear more about some of the tools and technologies I’ve used in these experiments. But I wanted to give a brief overview today because this is access.

Digital tools and technologies give us the opportunity to use databases like RecordSearch as archaeological sites to sift through layers of metadata in search of new connections and meanings. This is access.

We can turn digitised collections inside out, revealing the people, the processes, the structures, the form. This is access.

We can reveal the processes through which records are controlled, concealed, and withheld. This is access.

Access is not a deliverable or a product. It’s a struggle for understanding and power – not just to see, but to see differently.

This is RecordSearch but not as you know it.

Experiment F is a userscript that puts the redactions back into RecordSearch. Access is an honest acknowledgement of its own limits, and an invitation to push beyond.

Richard Jenkins, ‘Identity, surveillance and modernity: Sorting out who’s who’, in Kirstie Ball, Kevin D. Haggerty and David Lyon (eds), Routledge Handbook of Surveillance Studies, Routledge, 2014, p. 159.
Eric Ketelaar, ‘Access, the democratic imperative’, Archives and Manuscripts, vol. 34, no. 2, November 2006, p. 73.
John Gilliom and Torin Monahan, ‘Everyday resistance’, in Kirstie Ball, Kevin D. Haggerty and David Lyon (eds), Routledge Handbook of Surveillance Studies, Routledge, 2014, p. 407.
Ann Laura Stoler, Along the Archival Grain: Epistemic Anxieties and Colonial Common Sense, Princeton University Press, Princeton, 2009, p. 32.
David Horner, The Spy Catchers: The Official History of ASIO, Allen & Unwin, Sydney, 2014, p. 581.
David McKnight, ‘How to read your ASIO file’, in Meredith Burgmann (ed.), Dirty Secrets: Our ASIO Files, NewSouth, Sydney, 2014, p. 38.
Meredith Burgmann, ‘The secret life of B/77/26 (and friends)’, in Meredith Burgmann (ed.), Dirty Secrets: Our ASIO Files, NewSouth, Sydney, 2014, p. 454.

A life reduced to data

Tim Sherratt — Thu, 25 Aug 2016 12:52:31 +0000

Keynote presentation to the Migrant (Re)Collections workshop, Leiden, 2016.

In 1861, the census for the colony of New South Wales (as it was back then) recorded just one Chinese woman living in Balmain in Sydney. The historian Eric Rolls, writing in 1992, commented that this ‘lone woman is exceptional and inexplicable’.

Inexplicable? My partner and collaborator Kate Bagnall is a historian of Chinese Australia and she recently investigated this case again, making use of digitised resources that were not available in the 1990s.

Her starting point was one tiny fragment in a digitised newspaper article on Trove. A report from the Water Police Court published in 1863 notes ‘the case of Ah Happ, a Chinese woman, who claimed the sum of £8 9s 6d. wages for her services as nurse in the employ of Cyril Cecil, of Snail’s Bay, Balmain’. The case was dismissed, but this brief, tantalising reference gave Kate enough information to trace the life of Ah Happ over the next 20 years or so, until she disappears from the records again. Kate now believes Ah Happ was the first Chinese mother in NSW. But was she the woman in the 1861 census?

The census has been big news in Australia recently, and not for the right reasons. I think the correct technical term for the handling of the 2016 census is omnishambles – there have been multiple failures both in communication and technology.

It all started to go wrong when the Australian Bureau of Statistics quietly announced that it would be keeping everybody’s names for longer than usual and using them to generate identifiers that would link the census with other government datasets. This might not seem so controversial on its own, but of course context is crucial.

Over the last few years there have been multiple reports of the misuse of personal information by a variety of government agencies. Nonetheless the amount of information being gathered has increased. In 2015, for example, new laws were passed for the retention of metadata documenting the communications of all Australians. The census, previously a trusted tool for government planning, suddenly seemed a further creeping encroachment on individual privacy.

As a historian with an interest in the politics of surveillance I had mixed feelings about it all. Concerns about data matching are well justified, but the census constitutes a vital historical resource – often documenting lives that are barely glimpsed through other sources. The controversy also overshadowed the fact that since 2001, a growing number of Australians had agreed that their name-identified census data could be preserved by the National Archives of Australia for release to researchers in 99 years. In 2011, more than 60% of Australians willingly added their details to the so-called Census Time Capsule. How much of that trust will have now been lost?

We’re here to talk about questions of identity; to find better ways of matching records about people across historical datasets. So I think it’s important to think about how these datasets came to be created. We are implicated in debates such as those that surrounded the recent Australian census. In some cases we are the beneficiaries of systems created for the surveillance and control of suspect populations. Time changes, but does not dispel, questions about our responsibilities to those we seek to identify.

Back in 2010 I wrote a blog post entitled ‘I link therefore I am’. The National Library of Australia had recently established a service called People Australia which brought together a range of biographical sources, disambiguated the names of people and organisations, and minted persistent identifiers for each new aggregated identity. The service still exists as part of Trove. People Australia was also, I think, the first of the Library’s online services to offer a public API.

This coupled with the development of Linked Open Data made me pretty excited. People Australia presented new opportunities to link resources across collections, but I was particularly interested in the possibilities for ordinary web users. After all, RDFa meant anyone with a web page could make Linked Data. There was also some activity at the time around machine tags – a sort of semantically-enriched tagging. Thanks to the vision of Aaron Straup Cope, machine tags were incorporated into Flickr in about 2009. They’re still there – just…

To try and bring some of these threads together I created a simple web service that took a name and returned a snippet of nicely formatted, RDFa enriched, HTML that you could drop into your blog post or web page. A bookmarklet made the markup process relatively seamless. Alternatively, you could ask for machine tags that could be cut and pasted into services like Flickr. Instant Linked Open Data for the masses.
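To make the two outputs concrete, here is a minimal Python sketch of what such a service might return. It is not the original service: the RDFa 1.1 style attributes, the FOAF vocabulary, the example.org identifier, and the `example:party` machine tag namespace are all assumptions chosen for illustration.

```python
from html import escape


def rdfa_snippet(name, identifier_uri):
    """Return an HTML fragment that marks up a person's name with RDFa."""
    return (
        '<span vocab="http://xmlns.com/foaf/0.1/" typeof="Person" '
        f'resource="{escape(identifier_uri, quote=True)}">'
        f'<span property="name">{escape(name)}</span></span>'
    )


def machine_tag(namespace, predicate, value):
    """Return a machine tag in the namespace:predicate=value form."""
    return f"{namespace}:{predicate}={value}"


# Hypothetical identifier; a real one would come from the identity service.
uri = "https://example.org/party/12345"
print(rdfa_snippet("Example Person", uri))
print(machine_tag("example", "party", "12345"))
```

The HTML fragment can be pasted into a blog post, while the machine tag is plain text that can be added to a photo like any other tag.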

I gave a talk about all this to a group of librarians and challenged them to use my identity finder thing (I’ve never been good at naming things) to add machine tags citing People Australia identifiers to photos on Flickr – to unambiguously identify either the subject or creators of the photos. To encourage them further I created the Flickr Machine Tag Challenge, an interface that enables you to explore photos tagged with National Library identifiers. This was, I think, a very early example of crowdsourcing Linked Data.

With all this swirling around it was perhaps inevitable that questions of identity would figure prominently when Kate and I started to think about what to do with the large quantities of records held by the National Archives of Australia documenting the operations of the White Australia Policy.

For the non-Australians here – yes, when the Australian colonies came together to form a nation, it was generally agreed that the nation would be white, and that this would be achieved through the control of immigration. Of course the substantial Indigenous populations were conveniently ignored in all this. The Immigration Restriction Act was passed by the first Australian parliament in 1901.

But what about the thousands of non-white people already resident in Australia – Chinese, Japanese, Malay, Syrian and more. They were allowed to stay, but their movements in and out of Australia were monitored – they had to carry special papers that would exempt them from exclusion. Many thousands of these certificates are preserved in the Archives, documenting in ironic detail the lives of people who weren’t supposed to be part of White Australia.

In 2010, Kate and I launched Invisible Australians – a project without any sort of funding or institutional support – aimed at drawing attention to these records. You might have seen one of our experiments – The Real Face of White Australia. It’s a simple scrolling wall of more than 7000 faces extracted from the Archives using a facial detection script. It was a weekend hack that has been cited around the world. But the power, of course, is in the faces themselves – they confront us with the reality of Australia’s racist past.

However, one of the main aims of Invisible Australians was to give names to those faces – to extract data from the archives that would enable us to link these tiny biographical fragments and follow people through time. I’m about to have another look at this based on recent developments in crowdsourcing software – something like the Zooniverse’s Measuring the ANZACs project would do a lot of what we wanted. But the project won’t be the same as we imagined it back then. I’ve grown increasingly uncomfortable with what it means to identify people.

In November last year, Mark Matienzo, the Director of Technology at the Digital Public Library of America, gave a paper in which he raised important questions about Linked Open Data and the ‘power to name’. Like Mark I think we have an obligation to consider the contexts in which we create, recover, or aggregate identities. There is power in the process and we need to understand where it comes from and the violence it can do.

The question of identity was critical to the operations of the White Australia Policy. You might think that the certificates carried by non-white residents were nothing more than identity papers – an early form of passport. But the point is, only non-white Australians had to prove who they were. Moreover, the technologies used to determine identity – portrait photographs and handprints – were strongly associated with the management of criminal populations. Indeed, in 1911 one Chinese businessman objected to being treated ‘just like a criminal’. The process of identification helped justify the racist underpinnings of the system – the management of this suspect group required special measures.

The taint of suspicion followed non-white residents through their daily lives. The Immigration Restriction Act created the category of ‘Prohibited Immigrant’ to describe those who were present in Australia illegally. Kate tells the story of one unfortunate cook in Melbourne who was arrested by Customs Officers and accused of being a prohibited immigrant, mostly because he seemed a bit too Chinese. He was forced to prove who he was and how and when he had come to Australia. This was something of a challenge as he always believed he was born in Australia, and had grown up in Chinatown after being orphaned at a young age. This story seemed all too relevant when Australia’s immigration officials, now called ‘Border Force’, announced last year they would patrol the streets of Melbourne in a crackdown on visa fraud. Instead of ‘prohibited immigrants’ we now seek to identify ‘illegal maritime arrivals’. And in the US, Donald Trump wants to introduce ‘extreme vetting’.

I can’t now look at our wall of faces without wondering about the uses of facial detection. There are easily available web APIs that will not only tell you if an image contains a face, but whether the face is smiling, its gender, and its race. Both Google and Facebook have claimed frightening levels of accuracy with their facial recognition technologies – not just finding faces, but matching them against a set of known identities. In Australia, a number of image databases are to be linked to create a new facial recognition service called – The Capability. Our faces are increasingly not our own – they are public signifiers to be captured by systems of identity management and surveillance.

This is the context in which we undertake our explorations of identity, in which we exercise our power to aggregate, and to name. We can of course turn these systems on themselves, in the way that the residents of East Germany claimed the Stasi archives as their own. There are a number of examples where archives of oppression have been reclaimed in the struggle for justice. But we have to make that decision and engage accordingly. There is no neutral position.

For me this means finding better ways of representing the uncertainties of identity. Technologies of surveillance construct identity as an aggregation of data points – matches, crossreferences, and hits. Linked Data tends to work the same way, mapping the points of connection across multiple datasets. We know the points do not make the person, but we use them to create a shell identity, and therein lies the challenge. How do we fill that shell with the complexities and contingencies of life, without losing Linked Data’s ability to make meaningful and reusable connections?

As I mentioned, my early experiments with Linked Data were aimed at building simple tools that would give creators of content the power to enrich their work with structured data. Nowadays we have tools like Pund.it and Hypothes.is that allow us to build layers of annotation and enrichment on top of existing web resources. We also have platforms like Scalar and the forthcoming Omeka-S that give us the ability to define relationships between resources within the context of interpretation. These sorts of tools help us bridge the gap between the data we collect and the stories we can tell.

But it should be easier. Over the years I’ve made a few attempts to combine historical narrative with Linked Open Data – all of them buggy and incomplete. But it’s a project I keep coming back to for a number of reasons.

Firstly, historians create Linked Data all the time, they just don’t realise it. In the process of their research they build complex entity-relationship models, linking people, places, events and resources. But when it comes time for ‘writing up’, the data gets squeezed out to fit with the conventions of linear narrative and print publication. We need new publishing paradigms that maintain the relationship between narrative and data and expose the full richness of historical practice.

Second, one of the things that has always attracted me to Linked Open Data is the idea that anyone can create it, anywhere. Embed some RDFa, reuse some identifiers, and hey presto – you’re publishing data about the world that can be aggregated and explored. At least in theory. In practice, developments in Linked Open Data seem to have been centralised around particular tools and institutions, or geared towards search engine optimisation. And yet, on the other hand, we continue to create specialised crowdsourcing platforms to foster public engagement with our cultural collections. Why not get better at sharing identifiers for collection items and support the development of simple Linked Open Data tools that can be used wherever content is created? Every blog post could become a collection portal.

And finally, because as much as I enjoy playing with data, stories are what really matters. Stories convey meaning and emotion in ways data cannot. They give us room to explore nuance and uncertainty. They make us human. I want to find better ways of enriching data with stories, and vice versa.

In 1908, James Minahan was declared a ‘prohibited immigrant’ and arrested. James had been born in Australia to a Chinese father and Irish-Australian mother, but when he was just five years old his father took him to live in China. When he returned 26 years later James spoke no English. How could someone born in Australia be a ‘prohibited immigrant’? The authorities argued that James’s connection to the country of his birth had been lost – culturally and racially, he could not be considered ‘Australian’. But the case was hardly clear cut and eventually ended up in the country’s highest court.

Kate has been researching the story of James Minahan for a number of years and has assembled a complex story of people, place, and law, drawn from the holdings of many cultural institutions. This time we’re not going to let the data be squeezed out. Currently a draft of the story sits in a demonstration system I built using JSON-LD and AngularJS.

It works well enough, but I’ve decided to throw out most of the code and start again. Why? I feel that I’ve been focusing so much on the interface that the simplicity of the system has been compromised. I’ve been playing by Angular’s rules and that makes me very uncomfortable. I’ve also been inspired by the work that’s been going on in DH around the idea of minimal computing – creating tools that are both simple and sustainable; that demand little technical infrastructure, but build capacity for innovative digital research.

Ed., developed by Alex Gil and his team, is a wonderful example – beautiful both in its aims and execution. Ed. is a theme for Jekyll, the static site generator, that makes it easy to publish scholarly editions of digital texts. I’ve decided to build a set of plugins and examples that will make it possible to add a Linked Open Data layer to a text in Ed. The result will just be HTML – easy to publish, easy to preserve. Once I’m happy with the basics, I’ll think again about adding some JavaScript trickery to the interface. At its simplest the data can be stored in a hand-edited YAML file. Not a triple-store in sight. There’s something strangely liberating about working with Jekyll.

There have been a lot of mentions of ‘sustainability’ in the past couple of days. For me, sustainability isn’t just about funding, or institutional support, or governance structures – it’s about building things that can be hacked, reused, shared, and fixed.

Perhaps it’s through systems such as this we can encourage and support the small-scale production of Linked Open Data, based not on machine learning or entity extraction, but on detailed research and individual expertise. The James Minahan story will not only explore the complexities of identity and belonging, it will map relationships between people, trace journeys through space, and provide a specialised subject guide to Australia’s cultural heritage collections. At least that’s the plan…

Perhaps we can find new ways of bringing together the microstories, that Marijke and others have mentioned, with the big pictures drawn from heritage data – of navigating changes in scale without losing sight of what matters.

And perhaps this mix of story and data can help us deal with the politics of identity more effectively – to share the power of naming, to provide space for uncertainty, to undermine the authority of those who seek to reduce us to a collection of data points.

I suppose you’re wondering about Ah Happ. Was she the woman in the 1861 census? Unfortunately the dates don’t quite match up, so it’s hard to be certain. But given what Kate now knows of Ah Happ’s history, the presence of the Chinese woman in Sydney in 1861 is hardly inexplicable. She was probably a domestic servant or a nursemaid. A data point left unnamed, but a person not unknown. Questions of identity are rarely that simple.

Telling stories with data

Tim Sherratt — Thu, 25 Aug 2016 12:22:58 +0000

Keynote presented at Working History, the Professional Historians’ Association Conference, 19 August 2016, Melbourne.

There’s a video of the talk here, and slides here.

Friends, I come bearing good news. For I have seen the future of history.

Indeed I saw it here in Melbourne in February. There it was in the exhibition hall of the VALA2016 conference…

THE FUTURE OF HISTORY

…is a very fancy book scanner.

It’s easy to get caught up in the hyperbole – to imagine that history has been swept up in a technological revolution.

I must admit, I’ve often argued that access to more than 200 million digitised newspaper articles through Trove has profoundly changed the practice of history. I still believe that. It’s not just about convenience – the fact that you can now do your research at home in your pyjamas. It’s also about making the fragile slivers of ordinary human experience accessible in a way that just was not possible before. It’s given us the ability to tell different types of stories.

But is it a revolution? As historians I think we have an obligation to be sceptical of the ‘R’ word – we all know that for every change there is a continuity. And that’s what interests me in the digital space, the interplay between historical practice and technology; between new possibilities and old critiques.

Fear not – you will not be rendered obsolete by an army of sentient book scanners. But perhaps there are different ways we can work, different questions we can ask.

Discovery against the grain

So, Trove.

I’m assuming in an audience such as this I don’t need to explain what Trove is. Of course Trove is much, much more than just digitised newspapers, but let’s just focus on those newspapers for a minute. What we’re talking about is:

More than 200 million newspaper articles from 1803 onwards
More than 1000 different newspaper titles – not just the metropolitan dailies; rural and regional papers; political, religious, and community papers; papers in a range of different languages.
Terabytes of text – all searchable

I think we tend to take this last point for granted – not only can you explore 200 million newspaper articles online, but you can search inside them. I think there is real revolutionary force in the combination of OCR (Optical Character Recognition) and keyword search. It shifts power away from the headline writers, the cataloguers, the archivists and the indexers, and opens an infinite number of pathways for discovery. I’m sure all of you have a story about some tiny fragment, a clue that was critical to your research, buried deep in a seemingly uninteresting newspaper article. This was always possible if you had enough time and physical access to newspapers, but technology has normalised this mode of exploration. We are no longer fossickers, scouring past workings in the hope of finding a gem. Whether we know it or not, we now all dig deep along unmapped seams of history and meaning.

But this is only one example of how discovery is moving beyond metadata – the information we record about resources – to mine the very content of those resources, to create new access points. Newspapers are easy, what about handwritten resources like letters and diaries? Technology has enabled new forms of collaboration between institutions and researchers, allowing them to create and share transcriptions of otherwise unsearchable documents.

As of last week, for example, volunteers had transcribed more than 16,000 manuscript pages from the papers of English philosopher and reformer Jeremy Bentham. What’s particularly interesting about the Transcribe Bentham project, is that the transcriptions are themselves being used to train new tools for automated handwriting recognition. Humans are teaching machines, who will in turn help humans to open resources to new forms of discovery. There’s still a way to go, but the possibilities are pretty exciting. What will happen when instead of consulting an index to a large collection of correspondence, we can search the complete content?

And it’s not just text.

Cultural institutions have been experimenting with colour as a way of navigating large image collections. The colour values of the images themselves are extracted, aggregated and normalised, allowing users to discover connections otherwise unknown or undocumented. Companies like Google are using artificial intelligence to generate descriptions of images. This technology might enable us to search millions of uncatalogued photographs held by cultural heritage institutions. Like switching on the lights in a darkened room, the application of computing power will help reveal stories lurking in the shadows of anonymity.

Assuming of course that they have been digitised.

The seeming omnipotence of Google has encouraged us to equate discoverability with existence – if we can’t find it, it doesn’t exist. But only a small proportion of our cultural heritage collections have actually been digitised. Even as we take advantage of these new technologies we have to think about what’s missing. Whose history has been digitised, described and processed, and why?

This graph shows the number of digitised newspaper articles in Trove by year. As experienced, professional historians you will of course know why there is a peak around 1914. More news…? More newspapers…? The answer is more money for digitisation. With the centenary of World War I approaching, it was decided to focus resources on the digitisation of newspapers from that era. This is completely understandable, but also largely invisible to users.

At the recent digital humanities conference in Hobart, Tim Hitchcock talked about the hidden histories that shape the online resources on which we now depend. We are working with collections of text, he argued, that are ‘inherently, and institutionally, Western centric, elitist and racist’. With dollars for digitisation becoming increasingly scarce we can expect more ‘project based’ funding responding to particular events or anniversaries. Will this simply reinforce the limits of our digital horizons?

We need to understand that we start our digital explorations from a point of privilege and exclusion. Decisions have already been made about what and who matters. But within these limits we can continue to work against the grain, to flip perspectives, to see things differently.

About five years ago I worked out how to do bulk downloads of digitised files from the National Archives of Australia’s online database RecordSearch. My partner Kate Bagnall, a historian of Chinese-Australia, and I were particularly interested in the many thousands of records they hold documenting the workings of the White Australia Policy. So I downloaded about 12,000 pages, many including portrait photographs, and ran them through a simple program to find faces. The result was a scrolling wall of about 7,000 portraits that we called ‘The Real Face of White Australia’. The remnants of a racist bureaucratic system were turned inside out – instead of files and metadata you could see the people inside.

It’s these sorts of possibilities for seeing against the grain that get me most excited about digital technologies. We no longer have to accept what we’re given by cultural institutions, we can build and share new perspectives, even new interfaces.

One of my current projects is exploring the records held by the National Archives of Australia that we’re not allowed to see – those with an access status of ‘closed’. Anyone who’s used the National Archives knows that the access examination process can sometimes be a bit frustrating. But it is just a process – there’s nothing magical or mysterious about crossing the threshold from ‘closed’ to ‘open’.

In this case there’s obviously no files for me to download, but there is data documenting when and why a file was closed. By harvesting that data from RecordSearch and feeding it through a new interface we can start to build up a picture of how that process works – we can investigate access itself as a historical phenomenon.

Working at scale

More than a decade ago, the pioneer digital historian Roy Rosenzweig talked about the challenges of digital abundance – how would historical methods change to deal with the vast quantities of digital content becoming available?

What does it mean when you search for something in Trove and find you have 10,000 matching results, or maybe 100,000, a million? We’re used to working in a linear fashion, interrogating our sources one at a time. So how can we extract understanding from a set of resources that we can never hope to examine individually?

It was this question that inspired me to create QueryPic, a simple tool for visualising searches in Trove’s digitised newspapers. You enter your keywords in the usual fashion, but instead of a list of search results you’re presented with a chart that shows you the number of articles each year that match your query. Instead of just the top twenty results, you see everything at once. Using QueryPic you can combine multiple queries to track changes in language or technology, or observe the impact of particular events. (In this case we’re comparing the use of the name ‘Santa Claus’ vs ‘Father Christmas’). It’s easy to create, save and share charts – have a go!
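The counting behind a chart like this can be sketched in a few lines of Python. This is only an illustration with made-up records held in memory; it does not show how QueryPic itself queries Trove.

```python
from collections import Counter

# Hypothetical harvested records: (year, article text).
articles = [
    (1900, "Santa Claus visited the children"),
    (1900, "Father Christmas arrived by train"),
    (1901, "A letter to Santa Claus"),
    (1901, "Santa Claus at the bazaar"),
    (1902, "Father Christmas and Santa Claus"),
]


def matches_per_year(records, phrase):
    """Count the articles in each year whose text contains the phrase."""
    needle = phrase.lower()
    counts = Counter()
    for year, text in records:
        if needle in text.lower():
            counts[year] += 1
    return dict(sorted(counts.items()))


def compare(records, phrases):
    """One row per year: (year, count for phrase 1, count for phrase 2, ...)."""
    series = {phrase: matches_per_year(records, phrase) for phrase in phrases}
    years = sorted({year for year, _ in records})
    return [(year, *(series[p].get(year, 0) for p in phrases)) for year in years]


for row in compare(articles, ["Santa Claus", "Father Christmas"]):
    print(row)
```

Each row is one point on the chart for every query, so plotting the rows gives one line per phrase.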

There are many similar tools around these days. Google’s Ngram viewer enables you to construct complex queries across the contents of millions of books. Bookworm lets you explore trends in a range of sources including US newspapers and the scripts of the Simpsons.

Most of these tools work by treating texts as data – by breaking texts down into their component parts, and then analysing the occurrence of particular words, phrases, or other patterns. Digital historians are lucky. Literary scholars have been working for decades on the computational analysis of text, and the tools they’ve created can be readily applied to historical sources.
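A minimal Python sketch of what ‘treating texts as data’ can mean in practice is shown below. The tokenising rule and the tiny stopword list are simplifying assumptions; tools like the ones mentioned here do considerably more.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed for the example.
STOPWORDS = frozenset({"the", "of", "and", "a", "to", "in", "is"})


def tokenize(text):
    """Break a text into lower-case words, keeping simple apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_counts(text):
    """Count how often each word occurs, ignoring stopwords."""
    return Counter(word for word in tokenize(text) if word not in STOPWORDS)


def phrase_counts(text, size=2):
    """Count runs of `size` consecutive words (bigrams by default)."""
    tokens = tokenize(text)
    return Counter(zip(*(tokens[i:] for i in range(size))))


sample = "The White Australia policy and the white Australia ideal"
print(word_counts(sample).most_common(2))
print(phrase_counts(sample).most_common(1))
```

Once a text is reduced to counts like these, the same functions can be run across thousands of documents and the results compared.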

Voyant Tools, for example, is a web-based text analysis platform. Feed it text files, web pages, XML, even PDFs, and it will slice and dice the language of your sources, opening the results for further exploration through a series of interactive tools. It’s powerful enough to handle many megabytes of data. In this case it’s analysing the talk that I’m giving right now.

If you want to start somewhere simpler, have a play with DataBasic.io, where you can learn the fundamentals of text analysis by diving deep into the lyrics of Beyonce.

QueryPic generates large scale pictures using Trove’s digitised newspapers, but what if you want something more fine-grained? I’ve also created a Trove Harvester that will save the details of all newspaper articles matching a particular query to your own computer. Perhaps you want to look for patterns in the language of all articles that include the phrase ‘White Australia’. Just grab the contents of the articles using my harvester and upload them to Voyant. Bam!

This form of analysis is often termed ‘distant reading’. Instead of examining individual documents we use computational methods to look for patterns across a large collection of documents. By zooming out, we can explore the historical record at different scales, finding new connections and meanings.

Have you come across the Old Bailey Online? It’s an astonishing resource, providing fully-searchable text of nearly 200,000 criminal trials from 1674 to 1913. But of course it’s not just text. The data is structured so you can search by name, crime, verdict and punishment. Zooming out of individual trials, historians can examine changes in the way the legal system itself operated over time. Similarly, the Prosecution Project is compiling data about Australian criminal trials from a variety of sources, including Trove. Who knows what we might learn about the nature of our criminal justice system?

The data gathered by projects such as these allow distant reading not just of language, or institutions, but of populations. Who were these people whose lives intersected with the administration of the law and the operations of the state?

The Digital Panopticon project is now linking up records from a number of these different databases to track people from the Old Bailey, through transportation to Australia, and beyond.

One of the things I love about this sort of work is that the data is so richly and profoundly human. In this age of so-called ‘big data’ there’s a tendency to imagine that a focus on computational methods will somehow firm up the scientific credentials of the humanities. We have data too! Indeed, Google’s Ngram viewer was announced to the world as the flag bearer of a new field called ‘culturomics’ that would bring statistical rigour to the historical study of language and culture.

Yes, it’s bullshit. Big data is made up of many small acts of living. And life, as we know, is messy and complicated, and resists our attempts at categorisation. That’s what makes history so much fun. As historians we have the opportunity, and indeed the obligation, to tease out the connections between the micro and the macro; between the unkempt trajectories of individual lives and the beautiful curves of our data visualisations.

Click any point on a QueryPic chart and you’ll see the first twenty matching results from Trove. QueryPic’s visualisations are not arguments, they’re starting points – ways of exploring ideas, or surveying new territory. Understanding comes from shifting scales – moving between individual articles and long term trends.

One of the things that excites me most about digital tools and techniques is the opportunity they create for navigating these changes in scale. We can create new resources and interfaces that bring together statistics and stories; that enrich our data with the power of narrative, and vice versa.

Doing it in public

A couple of years ago I started to collate information about websites that included links back to digitised newspaper articles on Trove. I wanted to understand more about the contexts in which the newspapers were being used and cited. The diversity of subjects and sites was astonishing, and sometimes disturbing. But what was particularly interesting was the amount of historical work some people were doing.

KnowThatProperty.com sounds a bit like a commercial real estate site, but in fact it provides potted histories of houses around Sydney, largely drawn from Trove. For example, the entry for number 2 Carrington Street in Strathfield provides details of the house’s construction and ownership from 1888 to 1927, with over 40 links to newspaper articles in Trove. The creator of the site is a web developer with an interest in architectural history. But is he a historian?

Who cares?

Ready access to historical sources through services such as Trove allows people to pursue their passions, around and beyond the demands of everyday life. We are no longer subject to the tyranny of the microfilm reader. The work of history – the chasing down of connections, the exploration of context, the compilation of references – is no longer confined to designated places of research. Nor is it expressed solely in traditional forms of historical production.

Digital tools help us find things, but they also help us share them. We write blog posts, we collect on Pinterest, we repost on Tumblr, we ‘like’ on Facebook. You might think that this is all a bit trivial, and sometimes it is, but the dynamics of sharing help us to look at history differently. The ‘public’ are no longer external to the process of history making – an imagined audience, or prospective consumers. We’re just all in there together.

Who’s made a list on Trove? Trove lists are just collections of interesting items. You create lists on particular topics and use them to keep track of relevant resources. You might create and share a list of newspaper articles relating to your family’s history for instance. But, once shared, a list becomes something more than a convenient bucket of content. Lists provide thematic entry points that aid discovery by creating implicit links between items. They’re also building blocks for new forms of access.

Last year I created a web application that takes the contents of Trove lists and turns them into online exhibitions. Using freely available web services, you can create your own exhibition in minutes (yes, really).

I’m sorry, but I really don’t care about questions of authority or professional identity. The guy who created KnowYourProperty.com isn’t waiting for the historians of Australia to pin a membership badge on him. Nor, I suspect, is the bloke who has created more than 200 Trove lists about lawnmowers worried about whether what he’s doing is really history. I happen to think it is, but more importantly it’s about passion. It’s about just doing something. In a world that champions consumption over creation, conflict over collaboration, we should celebrate any effort to make an authentic connection to the past.

There’s something quite liberating about the fluidity of the digital environment. Instead of trudging the well-worn path from research to product we can explore the possibilities of reuse; we can experiment with form and meaning; we can play; and we can feel.

The Vintage Face Depot is a Twitter bot. Tweet a picture of yourself to it and you will receive back a modified you – your face will be blended with one drawn at random from a collection of faces extracted from Trove’s digitised newspapers. It sounds creepy and I suppose it is, but the effect can be quite interesting, sometimes unsettling. What happens when you see your own eyes peering out of a face from the past? The image is accompanied by a link to the original article on Trove, so you can find out more about who you’ve been blended with. Like a lot of my work, the Vintage Face Depot was made quickly as an experiment. I still don’t know what to think of it. And that’s good.

Caleb McDaniel’s Twitter bot, @every3minutes, is a more deliberate intervention in our experience of the past. Historians of the American slave trade have estimated that a person was sold every three minutes between 1820 and 1860. So Caleb’s bot tweets, every three minutes – ‘someone just purchased a black person’s grandchild’, ‘a white slaver just sold a person’s friend’. It’s unrelenting and confronting.

We are finding new forms of historical expression that don’t merely visit the digital realm, they live there.

Show your working out

Earlier this year I sort of accidentally created and shared my own version of Commonwealth Hansard. It’s a site that lets you browse the proceedings of the House of Representatives and the Senate from 1901 to 1980. I was originally intending just to harvest the data that sits underneath Hansard on the Australian Parliament House website. I thought all that nicely-structured text would provide an interesting dataset to poke around in using the sorts of text analysis programs I’ve already described. But after downloading about 4GB worth of data, I started thinking about other things I could do with it.

If you’ve used Hansard on the Parliament site, you’ll know that it’s hard to read anything in context – you’re constantly navigating your way up and down a confusing hierarchy of debates and speeches. So I decided to make a version of Hansard that was focused on reading – one sitting day per page. That’s it.

But that simplicity has allowed me to do other things. Every year, and every sitting day, has a button that automatically opens the proceedings for that period in Voyant. That simple button turns a page of text into a laboratory for the historical analysis of political speech. I’ve also integrated Hypothes.is which allows anyone to annotate the text – adding notes, links, highlights, even images. Annotations created with Hypothes.is can be shared globally, turning each page of text into a site for collaborative research.

Both Voyant and Hypothes.is can be added to any web page with just a couple of lines of HTML code. It makes you wonder why we are still recreating traditional forms of publication online, when we could be doing so much more, so easily.

You would have noticed that this set of slides itself embeds a number of graphs, visualisations, and live web pages that you can play with inside the presentation. The creators of Voyant, Stefan Sinclair and Geoffrey Rockwell, believe it is important to create analytical tools, or hermeneutica, that can be embedded within works of scholarly interpretation. Voyant’s widgets encourage readers to play with the data and not merely consume the argument. They give power to readers to build their own interpretations, to make their own discoveries.

In a similar way, the historian Tom Griffiths has described footnotes as ‘generous signposts to anyone who wants to retrace the path and test the insights’. Footnotes too give power to the reader, but in a non-digital environment the ability to exercise that power is deferred, perhaps indefinitely. How can we make sure those ‘generous signposts’ actually point somewhere?

Perhaps you’ve heard of a thing called Linked Open Data – it’s really just a way of publishing nicely structured data on the web so that it can be easily connected up across sites and collections. Historians create Linked Open Data all the time, they just don’t know it. Think about all those spreadsheets or index cards you have listing people, linking them to other people, places, events, and documents. In LOD terms these are all nodes and edges – entities and relationships.
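To make the idea of nodes and edges a little more concrete, here is a small sketch in plain Python. It is a stand-in only: real Linked Open Data identifies things with web addresses and is published in formats such as RDF, and every identifier, name and relationship below (along with the `neighbours` helper) is invented for the example.

```python
from collections import defaultdict

# Each triple is (subject, relationship, object). All values are invented.
triples = [
    ("person:1", "name", "Example Person"),
    ("person:1", "born_in", "place:1"),
    ("place:1", "name", "Example Town"),
    ("person:1", "mentioned_in", "document:1"),
]


def neighbours(triples, node):
    """Group everything directly linked from a node by relationship type."""
    linked = defaultdict(list)
    for subject, relationship, obj in triples:
        if subject == node:
            linked[relationship].append(obj)
    return dict(linked)


print(neighbours(triples, "person:1"))
```

A row in a spreadsheet or a line on an index card can usually be rewritten as a handful of statements like these, which is why so much existing research data is already close to being linked data.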

Historical research frequently involves creating these sorts of complex data models. They represent a huge investment of skill, knowledge and experience. But what happens when we come to ‘write up’ our research? The data is squeezed out of the narrative, flattened down to comply with the conventions of linear storytelling. The connections are severed.

Kate and I are playing around with ways of combining historical narrative and Linked Open Data to make sure that the story remains in conversation with the data – to give readers the freedom to jump off at any time into the underlying network of people, places, events, and resources. And if those people, places, events, and resources are themselves linked to the holdings of libraries, archives, and museums, then every piece of writing becomes a gateway – a starting point for further exploration of our cultural collections. Generous signposts indeed.

Our latest LODBook experiment is available online. It’s buggy and incomplete, but gives a sense of what we want to do. I’d originally intended to have a nice, finished version to show you today, but I decided recently to throw out a lot of code and start again. Why? I always wanted something that was simple and sustainable – a set of practices and reusable components, rather than yet another publishing platform. Having been recently inspired by work going on around the idea of minimal computing, I decided I needed to focus more on the basics. So watch this space.

Digital tools and technologies give us the opportunity to experiment, and sometimes to fail. As I was harvesting text from Hansard I noticed a few oddities. After further investigation I realised that more than 90 Senate sitting days were missing from Parliament House’s online database, including about half of 1917. It’s unlikely anyone would have noticed the problem using the web interface unless they were looking for a specific date. Historians researching the World War I period just don’t know what they’ve missed. Fortunately the people at Parliament House are now investigating to see what they can do.
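One general way to check a harvest for gaps like this is to compare it against an independent list of the dates you expect to find. The sketch below assumes such a list exists; it does not describe how the missing Senate days were actually identified, and the dates and the `missing_days` helper are invented for illustration.

```python
from datetime import date


def missing_days(expected, harvested):
    """Return the expected dates that are absent from the harvested set, in order."""
    return sorted(set(expected) - set(harvested))


# Invented dates for illustration only, not real sitting days.
expected = [date(1917, 7, 11), date(1917, 7, 12), date(1917, 7, 13)]
harvested = [date(1917, 7, 11), date(1917, 7, 13)]
print(missing_days(expected, harvested))
```

The comparison is trivial once the data has been harvested, but nearly impossible to make through a search box.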

These things happen. Systems are never perfect. But when they do happen it’s important to talk about them. This is a great example of why we need to remain critical of search as a means of access – it can’t find what’s not there.

The story of the Senate black hole is documented in my open research notebook – it’s where I post notes and experiments relating to my current research projects. It’s an idea I’ve stolen from a number of colleagues working in digital history, including Caleb McDaniel, and I think it’s a good example of how the digital environment can encourage us to re-examine our practices.

The non-digital world privileges products as markers of achievement – things we can count, things we can launch, things we can sell. A conference on ‘working history’ seems like an ideal place to challenge that, to think about how we can use digital tools and techniques to expose more of the labour, the craft, the practice of history. To focus on the doing, not the done.

In 2008, the American historian William G Thomas suggested that ‘digital history should embrace the impermanence of the medium, use it to convey the changing nature of the past and of how we understand it’. The digital future is full of clamour and distraction, an overwhelming array of possibilities. Much like the past.

Our experiments, our incomplete thoughts, our works-in-progress, our failures, reflect the confusion and uncertainty we can never escape. By exposing them online, we are simply admitting what we’ve always known. History is constantly in the process of being made.