A few interesting data projects

The contagious data bug must be sweeping through the office, as several very different but very interesting data-driven publishing projects rolled out almost simultaneously.

First, infographics editor Paddy Allen explains the global recession through a very elegant interactive piece, “Where did all the money go?” Paddy has quite a collection of brilliant work, from interactive infographics such as the Energy-hungry houses piece to storytelling through interactive visualization like the map of Heathrow’s planned third runway.

Second, a strong team led by editor David Leigh has begun posting their investigations into “The tax gap,” a study of tax avoidance by big business.

“It has taken a team of specialists more than three months and involved checking scores of trademark registers and sets of company accounts in Luxembourg, the Netherlands, Switzerland and Ireland.”

One of the many outputs of the investigation is the raw data that is informing some of the work, such as the interactive guide to corporate tax. For example, you can see what British Airways has reported paying compared with what is notionally due against their stated profits. The information is available in XML format, such as this year-by-year feed.
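
The comparison described above can be read as simple arithmetic: the tax notionally due is the stated profit multiplied by a statutory rate, and the gap is whatever remains after subtracting the tax actually paid. The sketch below shows that reading only. The function name, the rate and the figures are illustrative assumptions, not values from the feed or the investigation's actual method.

```python
def tax_gap(stated_profit, tax_paid, statutory_rate):
    """Return the notional tax due minus the tax actually paid.

    All three inputs are assumptions supplied by the caller; the
    statutory rate in particular varies by year and jurisdiction.
    """
    notional_due = stated_profit * statutory_rate
    return notional_due - tax_paid


# Illustrative figures only: profit of 100, tax paid of 20, rate of 25%.
example_gap = tax_gap(100.0, 20.0, 0.25)
```

A positive result means less tax was paid than the headline rate would suggest. A negative result means more was paid.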

Third, and this is my personal favorite, the Football guys have outdone themselves with a new feature called Chalkboards. The Guardian’s head of sport Ben Clissett explains:

“No football debate will ever be the same again – it’s not about opinion any more, it’s about facts. And our chalkboards give you the ammunition to settle the argument. You can also compare two players side by side – if you want to compare Robbie Keane and Steven Gerrard in the same position for Liverpool, or Michael Essien and Mikel John Obi for Chelsea.

And when you have built your chalkboard, you can save it and start a discussion with your mates simply by pressing the save button and explaining your point. You can also embed images you have created on your blog, and use the tool with social networking sites.”

For example, I can see clearly for myself that Aston Villa’s draw against Wigan on Saturday was not due to a lack of offense. They had 16 attempts on goal, in fact, 4 on target and 3 shots blocked. The level of detail is amazing. I can also see where the teams focused their passing during the game.



This is the kind of data that typically only team owners and managers have access to. And even though the super fans can keep much of this in their heads, they can’t watch every game.

Now, perhaps the best part of this is the embeddable Chalkboard image. Since much of the Premier League discussion is happening in places all over the Internet, it makes sense to share the Chalkboards that both editors and users are creating both on and off guardian.co.uk.

Simple but very clever.

I love that each of these is so different. But there can be no doubt that data is starting to drive a lot of very creative approaches to the journalism process.

Building communities from Twitter posts

I spent a little time over the last couple of weeks playing around with some Twitter data. I had noticed that several people, myself included, were sharing the funny things their kids sometimes say.

So then I wondered whether there was a way to capture, prioritize and then syndicate the best Twitter posts into a ‘kiddie quote of the day’ or something like that.

My experiment only sort of works, but there are some lessons here that may be useful for community builders out there. Here’s what I did:

  1. Get the quotes: I ran some searches through Twitter Search and collected the RSS feeds from those results to create the pool of content to use for the project. In this case, I used ‘daughter said’ and ‘son said’. I put those feeds into Yahoo! Pipes and filtered out any posts with swear words. Then I had a basic RSS feed of quotes to work with.
  2. Prioritize the quotes: I’m not sure of the best way to prioritize a collection of sources and content, but the group voting method may do what you want. Jon Udell has another approach for capturing trusted sources using Del.icio.us. For voting, there’s an open source Digg clone called Pligg. I set it up on a domain at Dreamhost (I called it KidTwits…Dreamhost has a one-click Pligg installer that works great) and then pumped the RSS feed I had just made into it. In no time I had a view into all the Twitter posts, wrapped in all the typical social media features I needed (voting, comments, RSS, bookmarking, etc.).
  3. Resyndicate the quotes to Twitter: While you might be able to draw people into the web site, it made more sense in this case to be present where interested people might be socializing already. First, I created a Twitter account called KidTwits. Then I took a feed from the web site and sent it through an auto-post utility called twitterfeed. Now the KidTwits Twitter account gets updated when new posts bubble up to the home page of kidtwits.com.
  4. Link everywhere possible: When building the feed into Pligg I made sure that the twitter ID of each post was captured. This then made it possible to “retweet” with their IDs intact. Thus, the source of the quote would then see the KidTwit posts in their Twitter replies. It works really well. People were showing up at the web site and replying to me on Twitter the same day I began the project.

    Again, I used Yahoo! Pipes to clean up and format the feed back out to Twitter to include the ‘RT’ and @userid prefix to each entry. I played around a bit before arriving at this format.

    I also included a Creative Commons copyright on all the pages of the web site to make sure the rights ownership issues were clear.

    Lastly, I added a search criterion to my feed collector that looks for references to KidTwits. This means people can post directly to the web site by adding either @kidtwits or #kt to their posts. There was already a New Zealand Twitter community forming that had begun using ‘kt’ to join their posts (short for kiwitweets), but they gave it up. I then had to filter out references to the kidtwits Twitter posts to avoid an infinite loop.

  5. Improve post quality: Now, here’s where things have been failing for me. I can’t think of better search terms to capture the pool of quotes I want, but there are so many extraneous Twitter posts using those words that it seems like I’m getting between 5% and 10% accuracy. Not bad, but certainly not good enough. The good news is that it’s pretty easy to kill the posts you don’t want through the Pligg interfaces. I just don’t have the time or desire to maintain that.
  6. Optimize the site: I then did a bunch of the little things that wrapped up the project. I added Google Analytics tracking, created a simple logo and favicon, customized the Twitter background, and configured Pligg to import the Twitter Search pipe automatically.
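
The filtering and formatting steps above were done in Yahoo! Pipes, and that configuration isn't reproduced here. As a rough sketch of the same logic in Python: drop posts containing blocked words, drop anything posted by the KidTwits account itself so retweets are not re-imported, and prefix the rest with ‘RT’ and the author's @userid. The blocklist, the function names and the exact output format are assumptions made for illustration.

```python
import re

# Hypothetical blocklist standing in for the swear-word filter.
BLOCKED_WORDS = {"damn", "hell"}

# Posts from this account are skipped to avoid an infinite loop.
OWN_ACCOUNT = "kidtwits"


def is_clean(text):
    """Return True if no blocked word appears as a whole word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(BLOCKED_WORDS)


def build_retweets(posts):
    """Filter (user_id, text) pairs and format the survivors as retweets."""
    retweets = []
    for user_id, text in posts:
        if user_id.lower() == OWN_ACCOUNT:
            continue  # our own retweet coming back through the search feed
        if not is_clean(text):
            continue
        retweets.append(f"RT @{user_id} {text}")
    return retweets
```

Matching on whole words rather than substrings means a post containing “hello” is not rejected because of “hell”.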

There are several things I like and a few I dislike about this little project.

  • I really like the fluidity of Twitter’s platform. It’s amazingly easy to capture and resyndicate Twitter posts.
  • I love the effects of the @reply mechanism. I can essentially notify anyone who gets their Twitter post listed on the home page of kidtwits.com without lifting a finger. And they get credit automatically for their post.
  • I already knew this, but Yahoo! Pipes is just brilliant. I can’t imagine I would have even considered this project without it.
  • Pligg is pretty good, too. It does everything I want it to do.
  • I would love to hand over the management of the voting and quality checks to someone else. Voting naturally invites gaming. At the end of the day, however, the quality control and community management function is what makes a community service interesting to people. You can’t automate everything.
  • I’m actually not a fan of voting approaches to prioritizing content. It will ultimately result in dumbing down the quality. That’s less of an issue for highly niched topics like this, though.
  • The rights issues are a little weird. This wouldn’t be a problem when forming a community whose purpose is naturally noncommercial. But I’m not sure the Twitterverse would respond well to aggregators that make money off their posts without their knowledge or consent. (To be clear, KidTwits is not and never will be a commercial project…it’s just a fun experiment.)
  • Auto-retweeting feels a bit wrong. I wouldn’t be surprised if the KidTwits account gets banned. But I have explicitly included the source and clearly labeled each Twitter post with ‘RT’ to be clear about what I’m doing. I’m not building traffic to my account or the web site, nor am I intentionally misrepresenting anything.
  • By adding “RT @userid” I’ve used up 10 or so characters of each post that I’m retweeting. This means the end of the post, which is often the punchline, gets cut off, and that kills the meaning of the retweeted post.
  • Some conversational Twitter posts get through which include @replies to another user. When the KidTwits retweet of that post goes out it’s very confusing.
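
The last two problems in the list are easy to see in a few lines of Python. This is only a sketch. It assumes Twitter's 140-character limit, assumes the retweet is simply cut at that limit, and uses a simple pattern to spot @mentions. None of the function names come from the tools used in the project.

```python
import re

TWEET_LIMIT = 140  # Twitter's per-post character limit, assumed here


def naive_retweet(user_id, text, limit=TWEET_LIMIT):
    """Prefix the post and cut off whatever no longer fits at the end."""
    return f"RT @{user_id} {text}"[:limit]


def characters_lost(user_id, text, limit=TWEET_LIMIT):
    """Count how many characters of the original post fall off the end."""
    full = f"RT @{user_id} {text}"
    return max(0, len(full) - limit)


def has_mention(text):
    """Return True if the post contains an @mention of another user."""
    return re.search(r"(?<!\w)@\w+", text) is not None
```

A post that already runs close to the limit loses roughly as many characters as the prefix adds. Checking `has_mention` before retweeting would be one way to hold back conversational posts.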

The potential here, among other things, is in creating cohesive topical communities around what people are saying on Twitter. You can easily imagine thousands of communities forming in similar ways around highly focused interest areas.

In this method the community doesn’t necessarily have the typical collective or person-to-person dynamics to it, but the core Twitter account can act as a facilitator of connections. It can actually create some of the authority dynamics people have been wanting to see. It becomes a broker of contextually relevant connections.

In a very similar way the web site serves as a service broker or activity driver. It’s a functional tool for filtering and fine-tuning the community experience at the edge. The web site is not a destination but more of a dashboard or a control panel for the network.

The experiment feels very unfinished to me still. There’s much more that can be done to create better activity brokering dynamics across the network through the combination of a Twitter account and a web site, I’m sure.

Breaking through the attention barrier

For some reason I get a bit annoyed when people write about our information overload limits. This happened the other day when I saw Seth Godin’s piece titled “Warning: The internet is almost full.” He writes:

“The decentralized nature of the net means that it will never be physically full. As long as we can keep making hard drives, we won’t run out of space to store those inane videos of your Aunt Sally. What is full is our attention.”

I just refuse to believe that we’ve hit the ceiling of what the human brain can deal with.

There is no doubt that we have a lot of useless information available to us, much of it pushed at us, cluttering our lives in really irritating ways. But information overload is a symptom of some bigger issues that we can and should resolve.

I think it’s about better linguistics, technologies and education, to begin with. More broadly, it’s about how we collectively understand and apply abstraction layers to manage a more complex world.

Like everyone, I hit my attention limit nearly every day. Seth is right when he says “You can’t read every important blog… you can’t even read all the blogs that tell you what the important blogs are saying.”

That’s a reason to explore some more, not to give up. We shouldn’t become fatalistic about the future of information or look down our noses at all that messy stuff strewn about the Internet. I never want the flow of information to slow down or, worse, retract, no matter how much mess gets in the way of finding the stuff that matters to me.

What we may need are more dramatic changes in our language, more effective information discovery services, more experience-based education programs both for kids and adults, and, perhaps even more important than all that, an altered world view that can accommodate and make the most of the vast resources that are now part of our culture forever.

Hacking BNP data

Less than a week after trying out some new data mapping concepts at Guardian Hack Day, a big pile of data appeared on the Internet begging to be mapped. Using some of their new skills and a convenient constituency data tool, a small team of innovators got to work to produce some really interesting data-driven journalism.

Mat Wall details what happened behind the scenes:

“[Simon Willison] wrote a piece of code to extract the 12,000 BNP members’ postcodes through the They Work For You constituency API. Now he had a voting constituency for each person on the list. He then injected this data back into his hack day project to plot this information onto the map obtained from Wikipedia. This took about an hour.”
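
The code from that project isn't shown here, but the core step described in the quote, turning a list of postcodes into a count per constituency, can be sketched briefly. In this sketch the postcode lookup is passed in as a function, so the They Work For You API is represented by a stand-in. The function names and the sample postcodes are hypothetical.

```python
from collections import Counter


def normalise_postcode(postcode):
    """Upper-case a postcode and remove whitespace so lookups are consistent."""
    return "".join(postcode.split()).upper()


def count_by_constituency(postcodes, lookup):
    """Tally postcodes per constituency.

    `lookup` is any callable that maps a normalised postcode to a
    constituency name, or returns None when the postcode is unknown.
    """
    counts = Counter()
    for raw in postcodes:
        constituency = lookup(normalise_postcode(raw))
        if constituency is not None:
            counts[constituency] += 1
    return counts


# Made-up lookup table standing in for the real API.
sample_table = {"AB12CD": "Northtown", "EF34GH": "Southville"}
sample_counts = count_by_constituency(
    ["ab1 2cd", "AB1 2CD", "ef3 4gh", "zz9 9zz"], sample_table.get
)
```

The resulting counts are what a map layer would need: one number per constituency to decide how each area is shaded.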

Update: The project didn’t end there, it turns out. The infographics guys used the concept for the newspaper the next day. Here’s a picture of what was printed:
Hack Day Map in The Guardian Newspaper

Notes from Hack Day at The Guardian

We hosted our first Hack Day last week at The Guardian. Amazing fun.

Here’s a 15-minute highlight reel:

We did a lot of the standard stuff that makes Hack Day so interesting, but there were a few innovations to the event format itself that I thought worked really well, too:

  1. DabbleDB. Simon Willison set up a simple hack submission queue using DabbleDB, a handy online database tool. It’s as if the software was designed for this purpose. Two nice benefits: 1) you can upload a screenshot with your submission, which it displays nicely, and 2) it prints beautifully. I handed out a hard copy of the hack demo queue to each judge, who then used the list to take notes.
  2. Double Screens. We set up 2 projectors so we could jump back and forth between presentation locations and save some time. While one person was presenting, the next person was setting up on the other screen. I was a little worried it would be distracting, but that wasn’t a problem at all.

    I think this is primarily what kept the pace up. We got through 37 hacks in just about an hour. At that pace you couldn’t really afford to look away. Oh, and Simon’s lightning timer was hugely helpful, too.

    This then had the nice effect of giving the judges more time to deliberate…

  3. Comprehensive recognition. The judges went through every single hack and found a way to acknowledge each participant. Emily Bell did a sort of improv act dishing out the jokes. She first went through all the hacks that “we would have given an award to”. Then she handed out the trophies…
  4. Silly trophies. These worked perfectly. You can keep it on your desk. It makes no sense to anyone else. And it reinforces the idea that the recognition is for the work itself, not for winning a competition. We did hand out a couple of Flip cameras and Make Magazine generously offered some free subscriptions for the hardware hacks, but the emphasis was clearly on the hackers and their hacks, not the idea of ‘winning’.

Otherwise, it seemed to operate much like other Hack Days, except for the refreshing focus on hacks that mean something. I wasn’t sure what kind of hack quality to expect. It was in fact very high, but what I loved was that most of the hacks had the added dimension of context.

Many times a Hack Day results in a lot of amazing technology solutions for problems that don’t exist. I would never challenge the value of creativity for creativity’s sake, as that’s a big part of what Hack Day is about. But I was really happy to see that, in addition to the impressive technical hacks, things like Ben Griffiths’, Rob McKinnon’s and Simon Willison’s hacks (to name a few) presented data and information in new ways that could influence the way people think about what they are reading or interacting with.

Anyhow, the event was fantastic, and I’m really looking forward to doing it again.