This is a linkpost for https://dynomight.net/data-wall/

Say you have a time machine. You can only use it once, to send a single idea back to 2005. If you wanted to speed up the development of AI, what would you send back? Many people suggest attention or transformers. But I’m convinced that the answer is “brute-force”—to throw as much data at the problem as possible.

AI has recently been improving at a harrowing rate. If trends hold, we are in for quite a show. But some suggest AI progress might falter due to a “data wall”. Current language models are trained on datasets fast approaching “all the text, ever”. What happens when that text runs out?

Many argue this data wall won’t be a problem, because humans have excellent language and reasoning despite seeing far less language data. They say that humans must be leveraging visual data and/or using a more data-efficient learning algorithm. Whatever trick humans are using, they say, we can copy it and avoid the data wall.

I am dubious of these arguments. In this post, I will explain how you can be dubious, too.

1 comment

Ok. Firstly, I do think your "Embodied information" is real. I just think it's pretty small. You need the molecular structure of the 4 DNA bases and of 30-ish proteins, plus this Wikipedia page: https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables

That seems to be in the kilobytes. It's a rather small amount of information compared to DNA.

Epigenetics is about extra tags that get added to the DNA. So theoretically the amount of epigenetic information could be nearly as much as in the DNA itself. For example, methylation can happen on A and C, so that's 1 bit per base pair, in theory.
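As a rough sanity check on that "1 bit per base pair" ceiling, here is a back-of-the-envelope calculation. The genome length of roughly 3.1 billion base pairs and the 2 bits per base pair for the sequence itself are my assumptions, not figures from this comment, and the methylation number is only the theoretical upper bound stated above, not a measured quantity.

```python
# Back-of-the-envelope comparison of raw information capacity.
# Assumptions (not from the original comment): a haploid human genome of
# roughly 3.1 billion base pairs, and 2 bits per base pair for the sequence
# (four possible letters per position).
GENOME_BASE_PAIRS = 3_100_000_000
SEQUENCE_BITS_PER_BASE_PAIR = 2
# Theoretical ceiling from the text: one methylated/unmethylated tag per pair.
METHYLATION_BITS_PER_BASE_PAIR = 1


def megabytes(bits: int) -> float:
    """Convert a bit count to megabytes (10**6 bytes)."""
    return bits / 8 / 1_000_000


sequence_mb = megabytes(GENOME_BASE_PAIRS * SEQUENCE_BITS_PER_BASE_PAIR)
methylation_mb = megabytes(GENOME_BASE_PAIRS * METHYLATION_BITS_PER_BASE_PAIR)

print(f"sequence:    {sequence_mb:.1f} MB")
print(f"methylation: {methylation_mb:.1f} MB (upper bound)")
```

Under these assumptions the epigenetic ceiling is half the size of the sequence, so both sit in the hundreds of megabytes, far above the kilobytes estimated for the embodied information.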

Also, the structure of DNA hasn't changed much since early micro-organisms existed. Neither has a lot of the other embodied information. 

Therefore this information doesn't contain optimization for intelligence, because every life form with a brain has had the same DNA structure.

 

Humans are better than LLMs at highly abstract tasks like quantum physics or Haskell programming.

You can't argue that this is a result of billions of years of evolution. Sea sponges weren't running crude Haskell programs a billion years ago.

 

Therefore, whatever data the human brain has, it is highly general information about intelligence.

 

Suppose we put the full human genome, plus a lot of data about DNA and protein structure, into the LLM training data. In theory, the LLM has all the data that evolution worked so hard to produce. In practice, LLMs aren't smart enough to come up with fundamental insights about the nature of intelligence from the raw human genome.

 

So there is some piece of data, with a length between a few bits and several megabytes, that is implicitly encoded in the human genome, and that describes an algorithm for higher intelligence in general. 

“If it’s a collection of millions of unintelligible interacting ‘hacks’ tuned to statistical properties of the environment, then maybe not.”

 

Well, those "hacks" would have to generalize well. Modern humans operate WAY out of distribution and work on very different problems.

Would interacting hacks that were optimized to hunt mammoths also happen to work in solving abstract maths problems? 

So how would this work? There would need to be a set of complicated hacks that work on all sorts of problems, including abstract maths. Abstract maths has limitless training data in theory. And if the hacks apply to all sorts of problems, then data on all sorts of problems is useful in finding the hacks.

If the hacks contain a million bits of information, and help answer a million true/false questions, then they are in principle findable with sufficient compute. 
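A toy version of this counting argument, scaled down from a million bits to 12 so it can be brute-forced. Everything here is a made-up illustration: the hidden bit string, the parity-style questions, and the names are hypothetical, and real "hacks" would not be this tidy. The point is only that each well-chosen true/false answer can cut the set of surviving hypotheses in half, so N answers can pin down N bits if you can afford the search.

```python
# Toy model: the "hacks" are a hidden bit string, and each true/false
# question asks for the parity of some subset of its bits.
# All values below are hypothetical, chosen only for illustration.
N_BITS = 12
TRUE_HACKS = 0b101100111010


def answer(hacks: int, question: int) -> bool:
    """True if an odd number of the bits selected by `question` are set."""
    return bin(hacks & question).count("1") % 2 == 1


def consistent_hypotheses(questions, answers):
    """Brute force: keep every candidate that reproduces all the answers."""
    return [
        h for h in range(2 ** N_BITS)
        if all(answer(h, q) == a for q, a in zip(questions, answers))
    ]


# Question i asks about the lowest i+1 bits. These questions are linearly
# independent, so each one carries a full bit of new information.
QUESTIONS = [(1 << (i + 1)) - 1 for i in range(N_BITS)]
ANSWERS = [answer(TRUE_HACKS, q) for q in QUESTIONS]

# Number of hypotheses still standing after the first m answers.
survivors = [
    len(consistent_hypotheses(QUESTIONS[:m], ANSWERS[:m]))
    for m in range(N_BITS + 1)
]
print(survivors)
print(consistent_hypotheses(QUESTIONS, ANSWERS) == [TRUE_HACKS])
```

The catch is the "sufficient compute" part: this search enumerates all 2^N candidates, which is fine for 12 bits and hopeless for a million, so the argument is about what is findable in principle, not about a practical method.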

 

Also, bear in mind that evolution is INCREDIBLY data inefficient. Yes, there are a huge number of ancestors. But evolution only finds out how many children got produced. A human can look at a graph and realize that a 1% increase in parameter X causes a 1% improvement in performance. Evolution randomly makes some individual with 1% more X, and they get killed by a tiger. Bad luck.
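To put a rough number on the "killed by a tiger" problem, here is a sketch using Kimura's diffusion approximation for the probability that a single new beneficial mutation eventually spreads through a population. This is a standard population-genetics formula that I am bringing in for illustration; the comment itself doesn't rely on it, and the population size and the reading of "1% more X" as a 1% fitness advantage are assumptions.

```python
import math


def fixation_probability(advantage: float, population_size: int) -> float:
    """Kimura's approximation for one new mutant copy in a diploid population.

    `advantage` is the selection coefficient s (0.01 means 1% fitter).
    """
    if advantage == 0:
        # Neutral case: the mutant is just one of 2N gene copies.
        return 1 / (2 * population_size)
    return (1 - math.exp(-2 * advantage)) / (
        1 - math.exp(-4 * population_size * advantage)
    )


# Assumed values for illustration only.
p_fix = fixation_probability(advantage=0.01, population_size=10_000)
print(f"chance the 1% improvement spreads: {p_fix:.4f}")
print(f"chance it is lost to bad luck:     {1 - p_fix:.4f}")
```

Under these assumptions the improvement is lost about 98% of the time, roughly matching the classic rule of thumb that the fixation probability is about twice the selective advantage.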

And again, for most of those billions of years there were no brains at all. The gap between humans and monkeyish creatures is a few million years.

AIXI is a theoretical model of an ideal intelligence, and it's a few lines of maths.
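For reference, this is one common way of writing the AIXI action rule, following Hutter's formulation as I recall it, so treat it as a sketch and check it against the source before relying on it. Here $a$, $o$ and $r$ are actions, observations and rewards, $k$ is the current step, $m$ is the horizon, $U$ is a universal Turing machine, $q$ ranges over programs, and $\ell(q)$ is the length of $q$ in bits.

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[ r_k + \cdots + r_m \bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The formula is short, but it is not computable: the inner sum runs over every program consistent with the history so far.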

I'm not saying it's totally impossible that there is some weird form of evolution data wall. But mostly this looks like a fairly straightforward insight, one that can be possessed but isn't yet possessed by us. I think it's pretty clear that the human algorithm makes at least a modest amount of sense and isn't too hard to find with trial and error on the same training dataset. (When the dataset is large, and the amount of outer optimization is fairly modest, the risk of overfitting in the outer stage is small.)