Open source grid computing takes off

This has been fun to watch. The Hadoop team at Yahoo! is moving quickly to push the technology to reach its potential. They’ve now adopted it on one of the most important applications in the entire business, Yahoo! Search.

From the Hadoop Blog:

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

  • Number of links between pages in the index: roughly 1 trillion links
  • Size of output: over 300 TB, compressed!
  • Number of cores used to run a single Map-Reduce job: over 10,000
  • Raw disk used in the production cluster: over 5 Petabytes
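For a rough feel of what a Map-Reduce job over a link graph does, here is a toy sketch in Python. It is not the Webmap code and it doesn't use Hadoop at all; the crawl data and the function names are made up for illustration, and everything runs in one process instead of across thousands of cores. The idea is simply that a map step emits a pair for every link, the pairs get grouped by target page, and a reduce step sums them up.

```python
from collections import defaultdict

# Hypothetical toy crawl: page -> pages it links to (not real Webmap data).
CRAWL = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}


def map_links(page, outlinks):
    """Map step: emit (target, 1) for every link found on a page."""
    for target in outlinks:
        yield target, 1


def reduce_counts(target, counts):
    """Reduce step: sum the emitted values for one target page."""
    return target, sum(counts)


def run_job(crawl):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for page, outlinks in crawl.items():
        for key, value in map_links(page, outlinks):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Inbound link counts per page for the toy crawl.
INBOUND = run_job(CRAWL)
```

On a real cluster the map and reduce steps would be spread over many machines, with the framework handling the grouping in between. That distribution is the part Hadoop provides.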

I’m still trying to figure out what all this means, to be honest, but Jeremy Zawodny helps to break it down. In this interview, he gets some answers from Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development):

The Hadoop project is opening up a really interesting discussion around computing scale. A few years ago I never would have imagined that the open source world would be contributing software solutions like this to the market. I don’t know why I had that perception, really. Perhaps all the positioning by enterprise software companies to discredit open source software started to sink in.

As Jeremy said, “It’s not just an experiment or research project. There’s real money on the line.”

For more background on what’s going on here, check out this article by Mark Chu-Carroll, “Databases are hammers; MapReduce is a screwdriver”.

This story is going to get bigger, I’m certain.

Targeting ads at the edge, literally

Esther Dyson wrote about a really interesting area of the advertising market in an article for The Wall Street Journal.

She’s talking about user behavior data arbiters, companies that capture what users are doing on the Internet through ISPs and sell that data to advertisers.

These companies put tracking software between the ISP and a user’s HTTP requests. They then build dynamic and anonymous profiles for each user. NebuAd, Project Rialto, Phorm, Frontporch and Adzilla are among several companies competing for space on ISPs’ servers. And there’s no shortage of ad networks who will make use of that data to improve performance.

Esther gives an example:

“Take user number 12345, who was searching for cars yesterday, and show him a Porsche ad. It doesn’t matter if he’s on Yahoo! or MySpace today — he’s the same number as yesterday. As an advertiser, would you prefer to reach someone reading a car review featured on Yahoo! or someone who visited two car-dealer sites yesterday?”

Behavioral and demographic targeting is going to become increasingly important this year as marketers shift budgets away from blanket branding campaigns toward direct response marketing. Over the next few years advertisers plan to spend more on behavioral, search, geographic, and demographic targeting, in that order, according to Forrester. AdWeek has been following this trend:

“According to the Forrester Research report, marketer moves into areas like word of mouth, blogging and social networking will withstand tightened budgets. In contrast, marketers are likely to decrease spending in traditional media and even online vehicles geared to building brand awareness.”

We tried behavioral targeting campaigns back at InfoWorld.com with mild success using Tacoda. The main problem was traffic volume. Though performance was better than broad content-targeted campaigns, the target segments were too small to sell in meaningful ways. The idea of an open exchange for auctioning inventory might have helped, but at the time we had to sell what we called “laser targeting” in packages that started to look more like machine gun fire.

This “edge targeting” market, for lack of a better term, is very compelling. It captures data from a user’s entire online experience rather than just one web site. When you know what a person is doing right now you can make much more intelligent assumptions about their intent and, therefore, the kinds of things they might be more interested in seeing.
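To make that a bit more concrete, here is a minimal sketch in Python of the kind of anonymous profile this implies. It is my own guess at the shape of the thing, not how NebuAd, Phorm or anyone else actually does it. The class name, the category labels and the "most frequent category wins" rule are all assumptions, and a real system would also need to classify pages, expire old activity and honor opt-outs.

```python
from collections import Counter, defaultdict


class EdgeProfiles:
    """Hypothetical store of per-user category counts keyed only by an opaque number."""

    def __init__(self):
        # user number -> Counter of content categories seen across all sites
        self._profiles = defaultdict(Counter)

    def record(self, user_number, category):
        """Note that this user number just visited a page in the given category."""
        self._profiles[user_number][category] += 1

    def top_interest(self, user_number):
        """Return the most frequently seen category, or None for an unknown number."""
        profile = self._profiles.get(user_number)
        if not profile:
            return None
        return profile.most_common(1)[0][0]


profiles = EdgeProfiles()
profiles.record(12345, "autos")   # a car-dealer site yesterday
profiles.record(12345, "autos")   # a second car-dealer site
profiles.record(12345, "travel")  # something unrelated
best_guess = profiles.top_interest(12345)
```

Nothing in the profile says who user 12345 is. It only says what that number has been looking at, no matter which site the next ad shows up on.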

It’s important to emphasize that edge targeting doesn’t need to know anything personally identifiable about a person. ISPs legally can’t watch what known individuals are doing online, and they can’t share anything they know about a person with an advertiser. AdWeek discusses the issue of advertising data optimization in a report titled “The New Gold Standard”:

“As it stands now, consumers don’t have much control over their information. Direct marketing firms routinely buy and sell personal data offline, and online, ad networks, search engines and advertisers collect reams of information such as purchasing behavior and Web usage. Google, for instance, keeps consumers’ search histories for up to two years, not allowing them the option of erasing it.

Legalities, however, preclude ad networks from collecting personally identifiable information such as names and addresses. Ad networks also allow users to opt out of being tracked.”

Though a person is only identified as a number in edge targeting, that number is showing very specific intent. That intent, if profiled properly, is significantly more accurate than a single search query at a search engine.

I suspect this is going to be a very important space to watch in the coming years.

Local news is going the wrong way

Google’s new Local News offering misses the point entirely.

As Chris Tolles points out, Topix.net and others have been doing exactly this for years. Aggregating information at the hyperlocal level isn’t just about geotagging information sources. Chris explains why they added forums:

“…there wasn’t enough coverage by the mainstream or the blogosphere…the real opportunity was to become a place for people to publish commentary and stories.”

He shouldn’t worry about Google, though. He should worry more about startups like Outside.in, which upped the ante by adding a slightly more social and definitely more organic experience to the idea of aggregating local information.

Yet information aggregation still only dances around the real issue.

People want to know what and who are around them right now.

The first service that really nails how we identify and surface the things that matter to us when and where we want to know about them is going to break ground in a way we’ve never seen before on the Internet.

We’re getting closer and closer to being able to connect the 4 W’s: Who, What, Where and When. But those things aren’t yet connecting to expose value to people.

I think a lot of people are still too focused on how to aggregate and present data to people. They expect people to do the work of knowing what they’re looking for, diving into a web page to find it and then consuming what they’ve worked to find.

There’s a better way. When services start mixing and syndicating useful data from the 4 W vectors then we’ll start seeing information come to people instead.

And there’s no doubt that big money will flow with it.

Dave Winer intuitively noted, “Advertising will get more and more targeted until it disappears, because perfectly targeted advertising is just information. And that’s good!”

I like that vision, but there’s more to it.

When someone connects the way information surfaces for people and the transactions that become possible as a result, a big new world is going to emerge.

How to launch an online platform (part II)

The MySpace guys won the latest launch party battle. About 200 people met at the new MySpace house last night in San Francisco to see what the company was going to do to compete with Facebook on the developer front.

They had a fully catered event including an open bar with some good whiskey. The schwag bag included the Flip digital video camera (wow!). There was a small handful of very basic demos on the floor from the usual suspects (Slide, iLike, Flixster, etc.). And the presentation was short and sweet so we could get back to socializing.

Nicely executed.

The party wasn’t without flaw, mind you.

First, the date. Why throw a launch party on the same day as the biggest political event in our time, Super Tuesday? The headlines were on everything but the MySpace launch. The right people knew what was going on, but the impact was severely muted. I was somewhat eager to leave to find out what was happening out there in the real world.

Second, the presentation. You have to appreciate them keeping it super short. Once the drinks start flowing, it gets very hard to keep people quiet for more than a few minutes. But I think most everyone there was actually very interested in hearing something meaty or a future vision or something. Bullets on a PowerPoint slide rarely impress.

Neither of those things really mattered, in the end. The party served its purpose.

It also occurred to me afterward that it would have been a shame if the co-founders and executive team weren’t there. But they were very much in this and made themselves accessible to chat. This isn’t a sideshow move for MySpace. It matters to them.

Contrast this with the standard formula followed by the Bebo guys, and you can see why MySpace does so well in social networking. They embody it as a company.

Now, whether or not they can raise the bar on app quality or improve on distribution for apps is yet to be seen. By giving developers a month to get their submissions in for the end-user rollout they are resetting the playing field. That’s great. But I’m not sure whether the MySpace user experience will encourage the sharing of apps as fluidly as the Facebook UE. I don’t use it enough to know, to be honest.

As far as the platform itself goes, I’m curious about the impact the REST API will have. I’ve wondered how the social networks would make themselves more relevant in the context of the world outside the domain.

Will the REST API be used more by services that want to expose more data within MySpace or by services that want to leverage the MySpace data in their own environments outside myspace.com? I suspect the latter will matter more over time but that won’t mean anything until people adopt the apps.

Overall, good show. This should help bring back some of the MySpace cool that was lost the last year or so.

Preview of the del.icio.us publisher API

I just posted a short screencast on the YDN blog of the cool new publisher API coming from del.icio.us soon. I’ve also embedded the video below. Lots of interesting possibilities with this new service, for sure.

Embed video: