Playing with pipes

The ever-informative Twitter alerted me recently to the History Trust of South Australia’s object of the month. It made me think that it would be nice if there was some way of bringing together all those objects, photos and documents featured by our cultural institutions. Some sort of combined RSS feed perhaps?

Something like this…

Well, yes… I couldn’t resist having a go. My tool of choice for this was Yahoo Pipes which has various modules for manipulating and creating RSS feeds. Check out my script on the Yahoo Pipes site to create a badge like this, play some more or inspect its innards. If you’re feeling adventurous you can even clone the script and tinker away yourself – it’s the best way to learn.

At the moment the script aggregates content from the Flickr photostreams of:

  • National Archives of Australia
  • State Records NSW
  • State Library of NSW
  • State Library of Queensland
  • State Library of South Australia
  • Australian War Memorial
  • Powerhouse Museum

These are mixed up with the contents of the Powerhouse’s ‘Object of the week’ blog and the NAA’s ‘Find of the Month’. I’m happy to add more sources – leave your suggestions below.

Most of it was ridiculously easy. I just added the RSS feeds from Flickr and the Powerhouse blog, then fed them through a module to sort them into date order. ‘Find of the month’ was trickier because there was no existing RSS feed – time for some screen-scraping! First I scraped a list of the URLs for 2009, then for each month I pulled out the title and date, as well as the first paragraph to act as a description, and the first image. Then I turned all these bits and pieces into an RSS feed and joined it up with the rest.
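
Here’s a rough Python sketch of what the same scrape might look like outside of Pipes; the index URL and the HTML patterns are placeholders for illustration rather than the actual NAA markup.

```python
import re
import urllib.request

# Placeholder index page listing the 2009 'Find of the month' entries
INDEX_URL = 'http://www.example.org/find-of-the-month/2009.html'

def get_page(url):
    """Fetch a page and return its HTML as one long string."""
    return urllib.request.urlopen(url).read().decode('utf-8', 'replace')

def scrape_month(url):
    """Pull the title, first paragraph and first image out of a monthly page."""
    page = get_page(url)
    title = re.search(r'<h1[^>]*>(.*?)</h1>', page, re.S)
    para = re.search(r'<p[^>]*>(.*?)</p>', page, re.S)
    img = re.search(r'<img[^>]+src="([^"]+)"', page)
    return {
        'title': title.group(1).strip() if title else '',
        'description': para.group(1).strip() if para else '',
        'image': img.group(1) if img else '',
        'link': url,
    }

def to_rss(items):
    """Wrap the scraped items in a minimal RSS 2.0 envelope."""
    entries = ''.join(
        '<item><title>{title}</title><link>{link}</link>'
        '<description>{description}</description></item>'.format(**i)
        for i in items)
    return '<rss version="2.0"><channel>{0}</channel></rss>'.format(entries)

# Scrape the list of monthly URLs from the index, then each month in turn
index = get_page(INDEX_URL)
month_urls = re.findall(r'href="([^"]+find-of-the-month[^"]+)"', index)
print(to_rss([scrape_month(url) for url in month_urls]))
```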

Yahoo Pipes makes this sort of thing simple, even for non-coders. Interestingly, too, it’s not just a matter of creating an RSS feed – as you can see Yahoo Pipes emits the data in a variety of formats. You can subscribe to the RSS feed, create a badge or slurp up the data in JSON to power some new application.
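
As a rough illustration of the JSON option, a few lines of Python are enough to slurp up a pipe’s output. The pipe ID below is a placeholder (substitute the ID from the pipe’s URL), and the structure shown is what Pipes typically returns.

```python
import json
import urllib.request

# Placeholder pipe ID, taken from the pipe's URL on pipes.yahoo.com
PIPE_URL = ('http://pipes.yahoo.com/pipes/pipe.run'
            '?_id=YOUR_PIPE_ID&_render=json')

data = json.loads(urllib.request.urlopen(PIPE_URL).read().decode('utf-8'))

# The feed items usually sit under value -> items in the JSON output
for item in data['value']['items']:
    print(item.get('title'), item.get('link'))
```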

This work is licensed under a Creative Commons Attribution 4.0 International License.

Written by Tim Sherratt

I'm a historian and hacker who researches the possibilities and politics of digital cultural collections.

7 Comments

  1. Amanda French
    September 11, 2009

    I’m curious: how does one “screen scrape”? I always hear about it but never know how it’s done. Nice work, btw!

  2. September 11, 2009

    Amanda – Screen scraping is the often frustrating process of trying to extract structured data from a web page. In this case, Yahoo Pipes returns a web page as a long string of text, which you can then cut up into useful pieces using regular expressions. In other cases you might be working with the page’s DOM and using XPath expressions to find the elements you want, or a combination of XPath and regular expressions.
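
    For instance, here’s a toy comparison of the two approaches, pulling the first image out of a snippet of HTML with a regular expression and then with an XPath query. It’s just an illustration (it assumes the lxml library), not anything from the pipe itself.

    ```python
    import re
    from lxml import html  # assumes lxml is installed

    page = '<div><p>Intro</p><img src="/images/object.jpg" alt="Object"></div>'

    # String approach: cut the page up with a regular expression
    match = re.search(r'<img[^>]+src="([^"]+)"', page)
    src_by_regex = match.group(1) if match else None

    # DOM approach: parse the page and query it with XPath
    tree = html.fromstring(page)
    src_by_xpath = tree.xpath('//img/@src')[0]

    print(src_by_regex, src_by_xpath)  # both print /images/object.jpg
    ```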

    Here’s an example where I used PHP and XPath to build an RSS feed. Here’s another using Python.

    The wonderful Programming Historian site has lots of useful information for any aspiring screen scraper. Of course, screen scraping is also what powers many Zotero translators and there’s plenty of useful info in this tutorial.

  3. September 21, 2009

    Amanda – Tim is right, screen-scraping is frustrating and painful. I liken screen-scraping HTML and extracting structured data to trying to turn a hamburger into a cow! For a couple of data sources that contribute to the National Library of Australia’s People Australia program I’ve used two Linux apps, wget and tidy, to fetch HTML documents and turn them into well-formed ‘XHTML-like’ documents. I’ve then created (often complex) XSL Transformations to extract the data and output EAC records in XML format. I’ve then used PHP to apply the transformations to thousands of files, which can then be harvested. Painful, but worth it … wouldn’t it be nice if people used standard record formats and standard protocols for exchanging information? If you’d like further info please let me know: bdewhurs at nla dot gov dot au
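
    In outline, that pipeline might look something like the following rough Python sketch. The source URL, file names and stylesheet are placeholders, and wget, tidy and lxml all need to be installed.

    ```python
    import subprocess
    from lxml import etree  # assumes lxml is installed

    URL = 'http://example.org/source-page.html'  # placeholder source page
    XSLT_FILE = 'html-to-eac.xsl'                # placeholder stylesheet

    # Fetch the page, then tidy it into a well-formed 'XHTML-like' document
    subprocess.call(['wget', '-q', '-O', 'raw.html', URL])
    subprocess.call(['tidy', '-q', '-asxhtml', '-numeric',
                     '-o', 'clean.xhtml', 'raw.html'])

    # Apply the XSL transformation to extract the data as an EAC record
    transform = etree.XSLT(etree.parse(XSLT_FILE))
    result = transform(etree.parse('clean.xhtml'))
    result.write('record.xml', xml_declaration=True, encoding='utf-8')
    ```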

  4. January 17, 2010

    Hi Tim,

    Just had a go at adding your Pipes RSS badge to the provcommunity website and it worked beautifully. Have taken it off because I wanted to ask you if this is okay to do?… and because I would love to tinker with the script, save it, then embed it again… I have no programming skills but think I might be able to do it. Do I just follow the link to the Pipes URL you give, click on clone and then tinker, save and embed? Sounds too easy?

    cheers
    asa

  5. January 17, 2010

    Asa – Yep, that’s basically it. Let me know if you strike any probs, and yes of course you’re welcome to use, embed, tinker, whatever – that’s why I did it.

    Actually, I need to do some tinkering myself to fix the NAA Find of the Month link. Also, annoyingly, the NAA has started putting non-collection photos in its photostream. Bah.

  6. January 17, 2010

    Hi Tim,

    Saved a pipe in which I added the PROV Flickr stream to the RSS feeds. The title of the pipe is the same as yours. Not sure what the pipe file name is… couldn’t see any ‘save as’ option. When I run the pipe the PROV stream shows up fine. When I embed the badge code on the provcommunity Ning… the PROV images don’t show up… even the title of the pipe isn’t the same… will do it again now so you can see what I mean. Not sure what I’m doing wrong? Cheers, Asa.
