Open source grid computing takes off

This has been fun to watch. The Hadoop team at Yahoo! is moving quickly to push the technology to reach its potential. They’ve now adopted it on one of the most important applications in the entire business, Yahoo! Search.

From the Hadoop Blog:

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

  • Number of links between pages in the index: roughly 1 trillion links
  • Size of output: over 300 TB, compressed!
  • Number of cores used to run a single Map-Reduce job: over 10,000
  • Raw disk used in the production cluster: over 5 Petabytes
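For a rough feel of what a Map-Reduce job over link data looks like, here is a toy sketch in plain Python that counts inbound links per page. It is not the Webmap build and it does not use Hadoop's API. The crawl data and the names (`CRAWL`, `map_links`, `shuffle`, `reduce_counts`) are made up for illustration, and everything runs in one process instead of across 10,000 cores.

```python
from collections import defaultdict

# Hypothetical toy crawl: each page maps to the pages it links to.
CRAWL = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "a.example"],
}


def map_links(page, outlinks):
    """Map step: emit a (target, 1) pair for every outgoing link on a page."""
    for target in outlinks:
        yield target, 1


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(target, ones):
    """Reduce step: sum the emitted counts for one target page."""
    return target, sum(ones)


def count_inbound_links(crawl):
    """Run map, shuffle, and reduce in sequence over the whole crawl."""
    emitted = (
        pair
        for page, outlinks in crawl.items()
        for pair in map_links(page, outlinks)
    )
    grouped = shuffle(emitted)
    return dict(reduce_counts(target, ones) for target, ones in grouped.items())


if __name__ == "__main__":
    print(count_inbound_links(CRAWL))
```

The appeal, as far as I can tell, is that the map and reduce steps only ever look at one page or one key at a time, so a framework can spread them over as many machines as it has.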

I’m still trying to figure out what all this means, to be honest, but Jeremy Zawodny helps to break it down. In this interview, he gets some answers from Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development).

The Hadoop project is opening up a really interesting discussion around computing scale. A few years ago I never would have imagined that the open source world would be contributing software solutions like this to the market. I don’t know why I had that perception, really. Perhaps all the positioning by enterprise software companies to discredit open source software started to sink in.

As Jeremy said, “It’s not just an experiment or research project. There’s real money on the line.”

For more background on what’s going on here, check out this article by Mark Chu-Carroll, “Databases are hammers; MapReduce is a screwdriver”.

This story is going to get bigger, I’m certain.

Freebase.com is hot

I don’t get a chance to review products often enough these days. But when I heard about Freebase I knew I needed to dive into that one as soon as I was able.


Fortunately, I was invited just yesterday to take a peek. And I’m officially joining the hype wagon on this one.

Someone once described it as Wikipedia for structured data. I think that’s a good way to think about it.

That image leaves out one of the most powerful aspects of the tool, though. The pivot points that are created when a piece of data can be interlinked automatically and dynamically with other pieces of data create a network of information that is more powerful than an edited page.

The Freebase screencast uses the movie database example to show this. You can dive in and out from actor to film and, if you wanted, carry on to topic, to location, to government, to politician, to gossip, and on and on and on. And everything is editable.
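To make the pivot idea concrete, here is a minimal sketch in Python. It is only my guess at the shape of the thing, not Freebase's data model: the topics, the relation names (`acted_in`, `filmed_in`, `governed_by`, `starring`) and the helpers `pivot` and `follow` are all invented for illustration.

```python
# Made-up topics; each one holds typed links to other topics.
TOPICS = {
    "actor_a": {"acted_in": ["film_x", "film_y"]},
    "actor_b": {"acted_in": ["film_x"]},
    "film_x": {"starring": ["actor_a", "actor_b"], "filmed_in": ["city_q"]},
    "film_y": {"starring": ["actor_a"], "filmed_in": ["city_r"]},
    "city_q": {"governed_by": ["politician_p"]},
    "city_r": {},
    "politician_p": {},
}


def pivot(topic_ids, relation):
    """Follow one relation from a list of topics; return unique targets in order."""
    targets = []
    for topic_id in topic_ids:
        for target in TOPICS.get(topic_id, {}).get(relation, []):
            if target not in targets:
                targets.append(target)
    return targets


def follow(start, relations):
    """Chain several pivots, starting from a single topic."""
    current = [start]
    for relation in relations:
        current = pivot(current, relation)
    return current


if __name__ == "__main__":
    # Actor to films to locations to whoever governs them.
    print(follow("actor_a", ["acted_in", "filmed_in", "governed_by"]))
    # Actor to films and back out to everyone who starred in them.
    print(follow("actor_a", ["acted_in", "starring"]))
```

Every hop is just a lookup on a typed link, which is why the same record can serve as the entry point for so many different paths.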

Now, they didn’t stop at making the ultimate community-driven relational database. They exposed all the data in conveniently shareable formats like JSON. This means that I could build a web site that leverages that data and makes it available to my site visitors. I only need to link back to Freebase.com.
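Here is a rough sketch, in Python, of what reusing that kind of JSON on my own site might look like. I am assuming a lot: the payload in `SAMPLE`, its field names, the `render_topic` helper and the link-back URL are all hypothetical, and none of this reflects Freebase's actual response format or API.

```python
import json
from html import escape

# Hypothetical payload; the real service's field names may differ entirely.
SAMPLE = '{"name": "Example Film", "type": "film", "starring": ["Actor A", "Actor B"]}'


def render_topic(raw_json, source_url="http://freebase.com/"):
    """Turn one JSON topic into an HTML snippet that credits the data source."""
    topic = json.loads(raw_json)
    # Escape every value before it goes into markup.
    name = escape(topic["name"])
    cast = ", ".join(escape(person) for person in topic.get("starring", []))
    return (
        '<div class="topic">'
        f"<h2>{name}</h2>"
        f"<p>Starring: {cast}</p>"
        f'<p>Data from <a href="{escape(source_url)}">Freebase</a></p>'
        "</div>"
    )


if __name__ == "__main__":
    print(render_topic(SAMPLE))
```

The link back at the bottom of the snippet is the only obligation I would have to the source in this picture.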

But that’s not all. In combination with the conveniently accessible data, they allow people to submit data to Freebase programmatically through their APIs. They will need to create some licensing controls for this to really work for data owners (NBA stats data and NYSE stock data, for example). But that’s getting easier to solve, and you can see that they are moving in that direction already.

Here’s a brief clip of the screencast which shows some other interesting concepts in action, too:

Suddenly, you can imagine that Freebase becomes a data clearinghouse, a place where people post information, perhaps even indirectly through third parties, and make money or attract customers as others redistribute their data from the Freebase distribution point. They have a self-contained but infinitely scalable data ecosystem.

I can imagine people wanting to manage their personal profile in this model and creating friends lists much like the typical social network except that it’s reusable everywhere on the Internet. I can imagine consumer goods producers weaving coupons and deals data with local retailer data and reaching buyers in highly relevant ways we haven’t seen yet.

Freebase feels very disruptive to me. I’m pretty sure that this is one to watch. And I’m not alone…

Michael Arrington: “Freebase looks to be what Google Base is not: open and useful.”

Jon Udell: “Freebase is aptly named, I am drawn like a moth to its flame.”

Tim O’Reilly: “Unlike the W3C approach to the semantic web, which starts with controlled ontologies, Metaweb adopts a folksonomy approach, in which people can add new categories (much like tags), in a messy sprawl of potentially overlapping assertions.”

John Markoff: “On the Web, there are few rules governing how information should be organized. But in the Metaweb database, to be named Freebase, information will be structured to make it possible for software programs to discern relationships and even meaning”

In some ways, it seems like the whole Web 2.0 era was merely an incubation period for breakthroughs like Freebase. Judging by the amount of data already submitted in the alpha phase, I suspect this is going to explode when it officially launches.

Gatekeepers need to stop calling themselves gatekeepers

Time business columnist Justin Fox questioned the success of the new media methods in a recent post, “The reign of the enthusiasts”.

He suggests the algorithms that proudly surface the deep dark corners of the Internet are actually just self-referential popularity contests. When searching for his name, Justin found that the articles he’s written that are likely most influential in the real world fail to rank higher than the articles he’s written which attracted the most link love from media-obsessed blogger types, like myself.

“There are web2topians out there–Battelle and my friend Matt McAlister immediately spring to mind–who are convinced that the Googles (and Diggs and del.icio.uses and Amazons and Last.fms) of the future will do a vastly better job of steering people to what they want, such a good job that most of the gatekeepers of the current media universe will prove wholly extraneous.”

This isn’t the first time someone has accused me of being a Web 2.0 blogger. Coincidentally, the same day Justin posted this, I was mocked by a local construction worker waiting for the bus with his buddies as I passed on my way to the office. He shouted to nobody in particular,

“Man, you know what I hate? Dotcommers.” He watched me walk by stonefaced and waited for a response. The guys standing around him turned to look. Unsure still, he blurted out, “Architects, too. Hate all of them.” He got the laugh he was looking for.

Jeez, am I that boring? Or that obvious and annoying? (Please don’t say anything. I think I know the answer.)

Anyhow, Justin’s question is top-of-mind for a lot of people in the media business. Where I disagree with him and the wisdom of the media industry crowd is on the notion of “gatekeepers” or rather the need for them at all.

Perhaps the most important part of being successful in media is distribution, and the reason we’re asking what the role of the gatekeeper is today is that the Internet has disintermediated the media distribution models that helped them become gatekeepers in the first place.

Online search changed the way people access relevant information, and those who once thought of themselves as gatekeepers suddenly found themselves at the mercy of the link police, the new gatekeepers, the search engines.

Yet, Justin’s explanation of the weakness of Google’s algorithm is exactly what I think many people who get mocked for their trendy glasses, old man sport coats, carefully orchestrated facial hair events, designer shoes and man purses (I don’t have a man purse) all see improving with the introduction of explicit and implicit human data into the media distribution model. The act of hyperlinking to a web page is not a strong enough currency to hold together a market of information as big as the Internet has become in recent years. It’s a false economy.

But the link currency opened the door to the idea of using behavior to help people find things. I love Last.fm not just for the music it recommends to me but because it proves this to be true. The Internet is made of people, people with a wide range of knowledge, tastes, and interests.

Now, there will always be a role for experts, and there are many cases where being an expert is not just subjective. Experts are hugely influential on the Internet as they are in other media. But I don’t see that a gatekeeper is an expert by definition.

There will also always be a role for enablers. Good enablers are often community builders who understand the rhythms of human psychology and emotion. Henry Luce was such a man, and I think he might have been a very successful web2topian today.

If those who call themselves “gatekeepers” want to share their expertise in valuable ways, then they will need to understand how the role of human data helps with distribution of that expertise. If those who aim to be enablers of communities want to be relevant, they will find ways to do that in many of the social technologies that have proven successful in this new world.

Similarly, if the people Justin affectionately refers to as web2topians appear smug, glib or arrogant when talking about media, then they are only doing themselves and everyone in the business a disservice. Gatekeepers know better than anyone that expertise does not by definition make you important. That’s a lesson the Internet generation will learn the hard way when someday they become irrelevant, too, I’m sure.

Answering the Answers question

It wasn’t until someone much more tapped into pop culture than I am told me that the Yahoo! Answers product was cool that I considered it to be true. I didn’t get it at first. I wondered, “What’s the incentive to contribute? Maybe it works for kids. And when was the last time Yahoo! launched a cool product of its own anyway?”


Photo: Mr. Mark (reclining buddy)

I still don’t understand the incentive to answer questions, but despite that I’m amazed at the responses to the questions people post.

First, I love some of the philosophical dialog in the system. Deepak Chopra appeared in Answers with a question, and the answers were fantastic:

Q: “What do you think the role of individual transformation is in manifesting world peace?”
A: “…The question to me is not the role of individual transformation in manifesting world peace; but can mankind agree upon what the symbol of peace represents and if so, how might this further or progress all mankind’s evolution of the psychobiotic self.”

Second, it’s very social in a new kind of way. It’s like walking through a festival where you jump into a conversation with totally random people without any awkward formalities. You ask a question, hear what people have to say and move on having a new perspective to take with you.

I asked one question about the need for our educational system to teach personal responsibility in the online world, and the responses were primarily from what appeared to be teens. I have no connection to the universe that is teenage, but with this question I suddenly found myself in a very brief but relevant dialog with the people who are affected by the question.

Third, and I guess this shouldn’t be a surprise, you can get better information from people in this world than you can from your limited scope of offline friends.


Photo: _Faith

I was watching Prince perform on American Idol and kept asking myself, “What is it with this guy? Why is Prince such a big deal?” I started asking friends the same question, even people who are big Prince fans, and I couldn’t get a good answer. So, I posted a question on Answers. I figured that if someone could tell me why Prince matters then maybe it was actually useful in addition to being cool.

Sure enough, I got a couple of funny answers within a few minutes, but within about half an hour somebody convinced me that Prince was worth caring about:

“Prince is able to play multitude of instruments and genres. He mastered the piano at seven, and 6 instruments by 12. He is the youngest to ever produce his own albums at the age of 19. He takes risks and help define the sounds of the 80’s. He was the first black artist to appear on MTV. Not Michael Jackson.”

But there are a few things I’d love to see Answers do better. It’s so random and dense that I need some kind of UI for surfacing stuff that might matter to me. I like that when I post a question it tries to point me to similar questions that have already been answered.

I also want to have some kind of natural incentive for answering other people’s questions. Points won’t do it. Maybe I don’t get it still, but I don’t have any desire to add my knowledge into the pool, yet.

Lastly, I’d love to see the back end opened up as a service…completely. Community sites of all types including publishers should be able to skin the service for their users which would then contribute more data to the wider knowledge pool. I can imagine a site like PCWorld.com using Answers to help their users help each other answer laptop fix-it types of questions or maybe storage device shopping advice.

Obviously, my comments have to be taken with a grain of salt since I share the same employer as the Answers team. But the purpose in writing this was less about promotion and more about exploring social incentives online with a new real world example. There’s more to learn from Answers, I’m sure, but there’s lots to be emulated, as well.