AI needs the open Internet and other things we learned because of Brexit

Sir Tim Berners-Lee has championed a major upgrade to the World Wide Web for several years now. He calls it the Semantic Web. It’s a design standard for connecting data, not just web pages.

Most people have never heard of the Semantic Web. I kind of thought it was dead, too, but as we've been linking data at Kaleida to understand how things in the news affect each other, it appears Sir Tim's idea may have actually happened while nobody was looking.

It started to appear for us when we linked concepts together and charted them over time, like Google Trends but focused on news data.

Last week was a vivid, if horrifying, example as Syria reentered the news cycle. On Monday came reports of ISIS taking control of Palmyra. Then reports emerged that the Syrian army was going from house to house and executing residents on the spot.

These stories resonated with readers in a way that earlier coverage of the horrors in Syria had not.

Number of articles about Syria vs. number of shares of those articles, Sept–Dec 2016. Source: Kaleida Data

This graph shows the number of stories about Syria produced by leading publishers compared with the number of shares of those stories on Facebook.

It shows that publishers have been covering Syria for a while, as they should, but that people haven't shared those stories very much. It's good to see publishers committed to covering an important issue that offers little direct benefit to them as a business. And now we can finally see signs of impact.
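The aggregation behind a chart like this is straightforward. Here is a minimal sketch, using hypothetical numbers rather than Kaleida's actual data, of counting articles and summing Facebook shares per topic per week:

```python
from collections import defaultdict

# Hypothetical per-article records: (week, topic, facebook_shares).
# A real pipeline would pull these from a CMS feed and a social-metrics API.
articles = [
    ("2016-W49", "Syria", 1200),
    ("2016-W49", "Syria", 300),
    ("2016-W50", "Syria", 480000),  # a breakout piece
    ("2016-W50", "Syria", 510000),  # another breakout piece
    ("2016-W50", "Syria", 2100),
]

def weekly_volume_vs_shares(records, topic):
    """Return {week: (article_count, total_shares)} for one topic."""
    counts = defaultdict(int)
    shares = defaultdict(int)
    for week, t, s in records:
        if t == topic:
            counts[week] += 1
            shares[week] += s
    return {w: (counts[w], shares[w]) for w in counts}

print(weekly_volume_vs_shares(articles, "Syria"))
# {'2016-W49': (2, 1500), '2016-W50': (3, 992100)}
```

The interesting signal is the divergence between the two series: steady article counts alongside a sudden spike in shares is exactly the pattern described above.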

It's hard to pinpoint precisely where the change began, but there can be no doubt that The Telegraph's piece by Julie Lenarz and Huffington Post UK's piece by Sarah Ann Harris both had a huge impact in turning up the volume. Together, those two articles account for nearly 1 million shares on Facebook.

We don’t yet understand the relationship between the news and search trends, but we see similar activity for the term “Aleppo” in Google.

Source: Google Trends

While we’re primarily focused on the stories in the data, the stories about the data are pretty interesting, too.

Charting this kind of impact is possible not just because data visualization tools have grown up (which they have), but because machines can tell us when lots of different things share common meanings.

When we started deconstructing articles and then connecting them to data with shared meaning, we discovered some surprising terms.

The most obvious example is Brexit. The official term for “Brexit”, according to Wikipedia, is “United Kingdom European Union membership referendum, 2016”.

The recurrence of this bizarre term in multiple APIs reminded me of how map providers sometimes insert a nonexistent street into their data. It acts almost like a watermark so they can track uses of their data. When you see that nonexistent street, you know where the data came from.

Brexit has become a sort of watermark for Wikipedia. The longer, more technically accurate, rolls-off-your-tongue-like-peanut-butter official term appears in Google APIs, Facebook APIs, and in many other major data sets that make the Internet work.

Wikipedia is already connecting tons of things on the Internet much in the way the Semantic Web promised. It can even connect different languages.
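This connective-tissue role comes down to entity resolution: many surface forms, in many languages, collapsing to one canonical Wikipedia title. Here is a minimal sketch with an illustrative alias table; a real system would query the Wikipedia or Wikidata APIs rather than hard-code the mappings:

```python
# The canonical Wikipedia title acts as the shared identifier ("watermark")
# that different services converge on.
CANONICAL = "United Kingdom European Union membership referendum, 2016"

# Illustrative surface forms only; real alias data comes from
# Wikipedia redirects and Wikidata labels across languages.
ALIASES = {
    "brexit": CANONICAL,
    "eu referendum": CANONICAL,
    "brexit-referendum": CANONICAL,  # e.g. a German surface form
    "united kingdom european union membership referendum, 2016": CANONICAL,
}

def resolve(surface_form):
    """Map a surface form to its canonical Wikipedia title, or None."""
    return ALIASES.get(surface_form.strip().lower())

# Two very different strings resolve to the same entity,
# which is what lets separate data sets link up.
print(resolve("Brexit") == resolve("EU referendum"))  # True
```

Once everything resolves to the same identifier, articles, search trends, and social data about "Brexit" can be joined regardless of which phrasing each source used.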

It makes perfect sense, actually. Why shouldn’t Wikipedia serve as connective tissue for the Internet? There is no better, more reliable, more comprehensive source of knowledge in the world…at least not yet. And even when it’s wrong the self-healing capabilities of Wikipedia’s data are incredible. Mind-boggling really.

As machine learning and AI infiltrate every digital service we engage with, Natural Language Processing becomes more and more important. All of those technologies need to understand what words mean and what people intend when they use them. Guess what data those services use to learn about words and their meanings? Wikipedia.

Our preferred NLP service at Kaleida is Aylien, but many of the others also use Wikipedia to train their algorithms.

It makes me wonder if the incredible advances in machine learning and AI over the last 3–5 years would have even been possible without Wikipedia.

Now, Sir Tim Berners-Lee's idea for the Semantic Web is more clever than simply identifying words. It has a structure to it that maps nicely to the way we think: human language and the relationships that combinations of words form. He's trying to make the Internet into a global brain.

“The brain has no knowledge until connections are made between neurons. All that we know, all that we are, comes from the way our neurons are connected.” — Sir Tim Berners-Lee

The full Semantic Web is indeed much further out than what we have now. I’m not convinced we’ll get there using the standards that have been proposed (see the incomprehensible chart on the left here from the team behind the standard titled, “The Semantic Web Made Easy”). People are messy creatures. And even when we thrive on organizing things in certain ways the practicalities of achieving goals and making money will generally beat out the formation of perfect structures.

But the idea was right, and it’s actually happening.

The inventor of the Web is well placed to see where the value lies within the Internet's nervous system. He was probably right about linked data. And just like the World Wide Web before it, the Semantic Web has probably grown far beyond the scope of his imagination.

PS — This post wasn't meant to be an appeal for Wikipedia donations. I have no affiliation with Wikipedia. It's just that I recently realized I've misjudged how important it is and how deeply woven it is into the world's consciousness.

Take a moment to imagine the world without Wikipedia. Securing its future is a small thing all of us can do to combat forces that wish to control public opinion or, worse, shut down independent voices.