#21 Corpus Prime
My claim that data from 2006-2016 will *always* be the most valuable resource for AI training
I propose a bounding definition for the period stretching roughly from the creation of Twitter in July 2006 to the election of Donald Trump in November 2016. It’s a 10-year slice that feels full of a kind of optimism about the internet, one that drove almost every thinker, writer, creative, and scientist deep into online social media culture. I claim that this period of time will forever be the most honest and active period of internet history, and that we (data archeologists and their algorithms) will be poring over this particular era for data for the rest of time, as it stands out from both its past and future.
I call this period Corpus Prime.
The three claims I make in support of this idea are as follows:
Most of the developed world had internet access (different from the past)
Internet users were generally honest and open in their online behavior (different from the present)
There were very few bots online creating content yet (different from the future)
This time period was the first time that most of humanity had mobile web access, smartphones with cameras, and access to global forums of speech like Twitter, YouTube, etc. People were still excited about these new tools and outlets, and they generally spoke the truth online, which is certainly different from today. In the future, we will act more dishonestly online, but deeper truths will be known about us by corporations and governments - the worst of both worlds, really. So much web content now is created by bots, and this will only increase in the future - so this period of time will forever be unique in its level of human-activity purity.
Why 2006-2016 specifically? 2006 is when Twitter was founded, and 2016 is when Donald Trump was elected - and I view him as the direct result of Twitter and the culture it created. 2016 was also the year that Instagram introduced “Stories”, a feature which directly copied Snapchat’s core product. This was the first significant event in a series of moves that Facebook made to control the social networking market, and it’s another good marker of 2016 as the end of the run of that first flourishing of social startups that captured humanity’s attention. From 2016 forward, Facebook dominated the web, and data aggregators became more sinister in the public eye. It’s also worth noting the shift that occurred when we moved from posts with URLs and a permanent place on the internet to “ephemerality” as our primary context for the social web.
We will never get another web like the one we had then, but we will be examining that period of time forever for its richness and depth of character. I hope you got to experience it, because the web is entirely different today.
et cetera
The Good
Google’s AlphaFold, part of its acqui-hired DeepMind research department, has made significant progress in our ability to predict protein folding (Technology Review)
Biden orders crackdown on selling Americans’ personal data abroad (The Verge)
The Bad
DOT to investigate data security and privacy practices of top US airlines (TechCrunch)
Ticketmaster hacked. Breach affects more than half a billion users. (Mashable)
The Ugly
The Incognito Mode myth has fully unraveled (Wired)
Various and Sundry
Today's selfie is tomorrow's biometric profile (Adam Harvey Studio)
EU Council has withdrawn the vote on Chat Control (StackDiary)
Interesting concept. However, even if we accept all your premises as true, we still must contend with one of the most important concepts in data analytics: recency. With each passing year, this dataset becomes less useful as a model for predicting future behavior, which is really the whole point of collecting and analyzing data in the first place, right?