Enter The Data Wars

Let’s start with a thought experiment:

At what point does the following text stop describing the current timeline, and transition into science fiction? Read carefully.

The year is 2023. A small number of tech conglomerates have mobilized their resources to develop large language models and lay the groundwork for advanced artificial intelligence. These AI models are trained using massive amounts of data gathered from the public internet, and an arms race quickly develops between competing labs to accumulate as much of it as possible. This is the beginning of the historical event we refer to as The Data Wars.

Insulated from the inner workings of the tech world, the public is largely unaware of the competition until its consequences begin to seep into daily life. Realizing that they own the proprietary data sets needed for AI models to be trained, major social media sites begin to erect barriers around their content, triggering civil strife between system administrators and unpaid moderation teams. Blackouts become common, rate limits are placed on content, and identity verification captchas begin to appear on ordinary websites with more and more regularity. There is a sense that the internet is transitioning into some undefined new chapter, but nobody seems to understand how or why.

Finally, for most people, these events occur against a bleak economic backdrop following years of dovish monetary policy. For decades, wealth inequality has skyrocketed around the globe as assets appreciated while labor was devalued. Ordinary people understand this, but feel they were never given access to investment opportunities early enough to improve their situation. The average internet user simply views the Data Wars as infighting between factions of the Silicon Valley elite with no possible benefit to themselves, and the public assumption is that Facebook will extract our data, Google will train AI to replace our jobs, and ordinary people will be left with no stake in the game at all.

Alternative Futures

As you’ve probably gathered by now, none of this is actually fiction. All of these events are real, and they’re unfolding in real time. The LLMs emerging today demand vast amounts of data, and developers like DeepMind and OpenAI are now locked in a negotiation process with Reddit and Twitter to determine who has access to it and at what price. When Reddit forums go dark, or when Twitter limits the number of tweets that users can view, it is simply collateral damage from this ongoing negotiation between AI labs and social media sites.

There is a third party, however, who is conspicuously missing from these negotiations: you. And while the content we’re referring to in these conversations is hosted by websites like Reddit or Twitter, it’s important to never forget who actually produces it. We do, every day. So why does the public not benefit when information is sold from the public web?

To date, AI labs have either harvested this public data for free, as in the Reddit API, or paid substantial sums for it, as in the Facebook API. Both of these are outgrowths of the Web 2.0 business model, in which social media companies extract our content and build their business distributing it. As blockchain technology introduces the promises of Web3, however, there is now another alternative to these options. That alternative is Grass.

As the value of public web data begins to go parabolic, Grass is an attempt to democratize the networks through which this data is gathered. Instead of Microsoft getting it for free, and instead of Reddit privatizing it for its own gain, Grass is a third path that lets the public collectively take hold of the rails by which massive tech conglomerates access information, and share in the proceeds.

But first, let’s back up.

How is this data procured in the first place, and how is Grass different?

Existing Methods for Gathering Data

Over the past few months, social media sites have pursued two different strategies as they respond to the escalating data war: limiting the extent to which people can scrape free information from their websites, or maximizing the profits from acting as its gatekeeper. The highest profile examples are Reddit and Twitter, and their respective policy changes reflect both of these approaches. By examining them, we can learn a bit about the different ways that AI companies gather data and how Grass fits in with the existing systems.

1. Raising Prices on the Reddit API

One method for gathering data from the public web — the most straightforward and least clandestine — is to perform an API call. An API is simply a set of URLs that return data in a structured format, and websites employ them as the official method for third parties to interact with the site. Some sites will charge for their proprietary datasets; others give them away for free — but in both cases, the data is accessed through the official API.
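To make the idea concrete, here is a minimal sketch of what an API call looks like from the client's side. The endpoint, the query parameters, and the response fields are all hypothetical and do not describe Reddit's or any other site's real API. To keep the example runnable offline, a canned JSON string stands in for the server's reply.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not a real service.
BASE_URL = "https://api.example.com/v1/posts"


def build_request_url(forum: str, limit: int) -> str:
    """Build the URL a client would request from the hypothetical API."""
    return f"{BASE_URL}?{urlencode({'forum': forum, 'limit': limit})}"


# Canned response body standing in for what the server would send back.
SAMPLE_RESPONSE = """
{"posts": [
  {"id": 1, "author": "alice", "text": "First post"},
  {"id": 2, "author": "bob", "text": "A reply"}
]}
"""


def extract_texts(response_body: str) -> list[str]:
    """Parse the structured JSON body and return the text of every post."""
    payload = json.loads(response_body)
    return [post["text"] for post in payload["posts"]]


if __name__ == "__main__":
    print(build_request_url("news", 2))
    print(extract_texts(SAMPLE_RESPONSE))
```

The point to notice is that the data arrives already structured: the client asks for it at a sanctioned URL and reads named fields, with no guesswork about page layout.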

Until April of this year, Reddit was one of the sites that gave its data away for free. As far back as 2008, anyone could call on the Reddit API to retrieve structured data sets of everything ever posted to the site. As the data wars began to heat up, however, Reddit emerged as a natural repository for massive amounts of written language, which LLMs comb through for patterns to hone their linguistic abilities. Around the time IPO rumors began to circulate, Reddit realized it was providing the raw material for developing the most lucrative technology ever devised by man, and promptly decided to get in on the action. Prices were raised for the first time in the website’s history, and Reddit now officially charges to download its structured data.

As an unintended consequence of that decision, many of the third party mobile apps that people use to access the site will now be pushed out of business. These apps call on the API any time they display Reddit content to their users and allow them to interact with it, and those calls will simply cost too much to continue making. Enraged at the loss of their beloved front ends, moderators across the Reddit ecosystem shuttered their forums in protest, some for a few days, some indefinitely. Many users vowed to leave the site for good as well, though it remains to be seen how serious a threat this is to the company’s robust user base.

However you feel about the Reddit controversy, it provides a valuable example of how APIs work and how they mediate the relationship between AI labs and social media sites. Simply put, APIs provide a sanctioned method for gathering data that allows websites to retain profits from the sale of their content. This is the method that most benefits the social media companies. Let’s see if there are any other options, though…

2. Twitter’s Struggle to Limit Web Scraping

Hot on the heels of Reddit’s moderator rebellion, Twitter owner Elon Musk issued a handful of cryptic tweets to suggest the following: data scraping on the platform had spiraled out of control, new accounts would be limited to viewing 300 tweets per day, and no further explanation would be provided. While users had a variety of reactions, few took Musk’s statement seriously, or even bothered to ask what he meant by it. What exactly is web scraping, and why would it be such a problem?

As it turns out, web scraping is our second method for gathering data from the public internet, and it can be used, at times, to escape the limitations of calling an API. While interacting with an API accesses a website’s data on the backend, web scraping involves viewing the publicly available websites and extracting data from the HTML without assistance. Scraping offers more flexibility than using APIs, and significantly, it can also be used to circumvent the cost of purchasing proprietary data sets.
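For contrast with an API call, here is a minimal sketch of scraping using only Python's standard library. The page markup and the class names ("post", "author", "text") are invented for illustration and do not match any real site. A canned HTML string stands in for a downloaded page so the example runs offline.

```python
from html.parser import HTMLParser

# Canned page standing in for HTML fetched from a public website.
SAMPLE_PAGE = """
<html><body>
  <div class="post"><span class="author">alice</span><p class="text">First post</p></div>
  <div class="post"><span class="author">bob</span><p class="text">A reply</p></div>
</body></html>
"""


class PostTextScraper(HTMLParser):
    """Collect the text inside every <p class="text"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_text = False
        self.texts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "p" and ("class", "text") in attrs:
            self.in_text = True
            self.texts.append("")

    def handle_data(self, data):
        if self.in_text:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_text = False


def scrape_texts(page_html: str) -> list[str]:
    """Return the post texts found in the page, with whitespace trimmed."""
    scraper = PostTextScraper()
    scraper.feed(page_html)
    scraper.close()
    return [text.strip() for text in scraper.texts]


if __name__ == "__main__":
    print(scrape_texts(SAMPLE_PAGE))
```

Unlike the API case, the scraper has to know how the page is laid out, and it breaks if the site changes its markup. That fragility is the price of not needing the site's permission.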

Web scraping is typically carried out from data centers, so when hosts crack down on it, they do so by blocking data center IP addresses. This is where decentralized alternatives come into play. In essence, they route traffic through the IP addresses of residential internet users who run nodes on the network. Because that traffic arrives from ordinary household connections rather than known data center ranges, public websites can be accessed once more, and clients can often circumvent attempts to prevent web scraping.
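A rough sketch of the blocking logic described here, seen from the website's side, might look like the following. The blocklist is an assumption made for illustration: the ranges are reserved documentation addresses, not real data center or residential networks, and real sites use far more elaborate detection.

```python
import ipaddress

# Hypothetical blocklist of "data center" ranges. These are reserved
# documentation networks, chosen so the example names no real provider.
BLOCKED_DATACENTER_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside a blocked range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_DATACENTER_RANGES)


if __name__ == "__main__":
    # A request sent straight from a listed data center range is refused.
    print(is_blocked("198.51.100.25"))
    # The same request relayed through an address outside the blocklist
    # (a stand-in for a residential node) gets through.
    print(is_blocked("192.0.2.44"))
```

The filter only sees the address the request arrives from, which is why relaying traffic through residential nodes sidesteps it.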

Suffice it to say, for Twitter to take the steps it did in direct response to the explosion of scraping from the ongoing data wars, it must recognize that scraping poses a serious threat to a potential revenue stream for a social media site. If the API method of data gathering privileges websites in their quest to maximize profits, web scraping may put the power back in the hands of the AI labs seeking to amass low cost data. This is the method that most benefits the AI companies.

The Third Path

Looking at these examples, we can draw a number of conclusions about the data wars. One of them, however, stands out above all others: that it makes no difference at all to individual users how data is gathered from the web! Public web data is rapidly becoming a high value commodity for its role in developing AI, but whether APIs are called or the web is scraped, no compensation is distributed to the masses whatsoever. These options simply exist on a spectrum of who profits more: Microsoft-backed OpenAI, or Elon Musk’s Twitter.

Here’s the thing.

As we hinted before, Grass is a third alternative that democratizes these processes. It is a network, supported by users just like you, that can be used to access public web data directly. For now, this data is accessed as is, but soon the protocol will even clean and prepare it, to be sold as structured datasets to labs who require it for AI training.
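What "cleaning and preparing" means in practice has not been specified here, so the following is only a guess at the general shape of such a step and not a description of how Grass works. It strips markup from raw snippets, normalizes whitespace, drops empty and duplicate entries, and emits one JSON record per line. The function names and the record format are hypothetical.

```python
import html
import json
import re

TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(" ", raw)
    unescaped = html.unescape(without_tags)
    return WHITESPACE_PATTERN.sub(" ", unescaped).strip()


def to_jsonl(raw_records: list[str]) -> str:
    """Clean each record, skip empties and duplicates, and emit JSON lines."""
    seen: set[str] = set()
    lines: list[str] = []
    for raw in raw_records:
        text = clean_text(raw)
        if not text or text in seen:
            continue
        seen.add(text)
        lines.append(json.dumps({"text": text}))
    return "\n".join(lines)


if __name__ == "__main__":
    scraped = ["<p>Hello &amp;   world</p>", "Hello & world", "   ", "<b>Bye</b>"]
    print(to_jsonl(scraped))
```

A regular expression is a crude way to remove tags and is used here only to keep the sketch short; a real pipeline would more likely use a proper HTML parser.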

And the most important part? Grass will give users a way to contribute to AI data provisioning and earn a stake in the AI revolution. Finally, the world can get a fair shake.

Details will continue to be released about Grass’s mechanics and design, as well as more articles about the data wars and the role that you can play as a Grass user.

The beta is now live, so come and check it out. Follow us on Twitter and join us in the Discord for more information.