11 Comments
Jason Benn

Thank you for the fascinating writeup!

Tim Hughes

I was also intrigued by your thoughts about deduction. In particular, that LLMs are clearly mimicking deduction (quite successfully), that CoT is a way to prompt the LLM to mimic deduction, but that *real* deductive functionality will require another approach.

But as you say (and show), very good mimicry of deduction can get us a long way. So, while we wait for true deductive functionality, I wonder if an LLM's deductive abilities could be greatly enhanced by fine-tuning it on vast amounts of deductive reasoning (both correct and incorrect). We could prompt an LLM to write code (the general principles) to carry out a transformation of an input_1 (it would not actually matter whether the generated code correctly implemented the requested transformation). We would then run the code on input_1 and record the output as the correct output (an example of correct deduction: applying the code's "general principles" to a specific input to deduce a specific output). We could also make modifications to this output and record the modified outputs as incorrect outputs given the program and input_1 (examples of erroneous deduction). We can repeat this for a whole set of inputs: input_2, input_3, etc.

The above can be repeated for a vast number of transformations (not limited to those relevant to ARC). The resulting dataset of programs and their associated sets of correct and incorrect input-output pairs could be used to fine-tune an LLM to become a top-notch deductive-reasoning mimic. I would have thought that creating this fine-tuning dataset would be relatively straightforward. It would be interesting to see whether this materially improves LLM performance on ARC or other tests requiring deductive reasoning.
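
A minimal sketch of the generation loop I have in mind (Python; generate_program is only a stand-in for the real LLM call and returns a fixed toy transform here, and perturbing a single cell is just one arbitrary way of manufacturing incorrect outputs):

```python
import random


def generate_program(task_description: str) -> str:
    """Stand-in for an LLM call that writes code for `task_description`.
    Returns a fixed toy transform so the sketch runs end to end."""
    return "def transform(grid):\n    return [row[::-1] for row in grid]"


def build_examples(task_description: str, inputs: list) -> list:
    """Create (program, input, output, label) examples of correct and incorrect deduction."""
    code = generate_program(task_description)
    namespace = {}
    exec(code, namespace)  # assumes the generated code defines transform(grid)
    transform = namespace["transform"]

    examples = []
    for grid in inputs:
        # Correct deduction by construction: whatever the code produces is the "correct" output.
        correct = transform(grid)
        examples.append({"program": code, "input": grid, "output": correct, "label": "correct"})

        # Erroneous deduction: perturb one cell of the correct output.
        wrong = [row[:] for row in correct]
        r = random.randrange(len(wrong))
        c = random.randrange(len(wrong[r]))
        wrong[r][c] = (wrong[r][c] + 1) % 10
        examples.append({"program": code, "input": grid, "output": wrong, "label": "incorrect"})
    return examples


# Example usage on two small grids:
dataset = build_examples("mirror each row", [[[1, 2], [3, 4]], [[5, 6, 7]]])
```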

Tim Hughes

It is possible that the training corpus of LLMs already contains the type of data I describe above. Do you know if it does?

I know it contains vast amounts of code (including comments) and, in many cases, the accompanying requirements and use cases; in other words, the general principles. But does it contain many examples of correct and incorrect application of the code to specific inputs?

Tim Hughes

I also think such a dataset might be useful for training and evaluating a language model that had *true* deduction functionality (like you describe in your writeup). A lot of text (and even a lot of code) contains very little deductive reasoning. A language model with an architecture that can learn true deductive reasoning would be more efficiently trained on a dataset that is dense in deductive-reasoning examples.

Tim Hughes

Thanks for a very understandable writeup.

Following up on the evolutionary approach:

* Your search strategy has two independent arms: non-pooling and pooling. What about trying to integrate the two, e.g. by alternating between picking the x best programs (and running a revision round) and pooling complementary programs? This might allow for a better exploration of program space. Monte Carlo Tree Search tries to find a good balance when alternating between exploitation and exploration, and something similar could perhaps be achieved here instead of separating the two approaches.

* In evolution, the mutation rate is also a very important factor (in addition to pooling, known as sex in biology). The temperature and top_p settings in ChatGPT (in Sonnet it probably has another name) may be considered an analog of this. Perhaps one could dynamically dial the temperature of the LLM up and down to obtain better results. For example, for a task where it is proving difficult to get a program that is close to a solution, one could turn the temperature up.

* Another idea would be to investigate how to coordinate changes in temperature with switching between pooling and non-pooling: maybe temperature should be reduced in pooling generations relative to non-pooling generations (since pooling already adds diversity and we do not need even more diversity through a temperature increase), or maybe temperature needs to be increased to allow the LLM to be "flexible" enough to create a child that successfully integrates the strengths of both parents. A rough sketch of the kind of loop I have in mind follows below.
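
Purely illustrative sketch (Python): revise, pool, and score are caller-supplied placeholders for the LLM revision call, the LLM pooling call, and the grader against the training examples, and the schedule constants are arbitrary.

```python
def evolve(task, seed_programs, revise, pool, score,
           generations=10, base_temp=0.7, max_temp=1.5):
    """Alternate non-pooling (revision) and pooling generations, raising the
    sampling temperature whenever the best score stops improving.

    revise(task, program, temperature) -> revised program
    pool(task, program_a, program_b, temperature) -> combined program
    score(task, program) -> float, e.g. fraction of training examples solved
    """
    population = list(seed_programs)
    temp = base_temp
    best = max(score(task, p) for p in population)

    for gen in range(generations):
        parents = sorted(population, key=lambda p: score(task, p), reverse=True)[:4]

        if gen % 2 == 0:
            # Non-pooling generation: revise each top program independently.
            children = [revise(task, p, temperature=temp) for p in parents]
        else:
            # Pooling generation: combine complementary parents. Pooling already adds
            # diversity, so sample a bit more conservatively here.
            children = [pool(task, parents[i], parents[i + 1], temperature=temp * 0.8)
                        for i in range(0, len(parents) - 1, 2)]

        population = parents + children
        new_best = max(score(task, p) for p in population)

        # Stuck? Turn the temperature up to explore more; otherwise return to the base value.
        temp = min(max_temp, temp + 0.2) if new_best <= best else base_temp
        best = max(best, new_best)

    return max(population, key=lambda p: score(task, p))
```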

Tim Hughes

Here https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683 they suggest running with a big increase in temperature for exploratory code writing ("Generates code that explores alternative solutions and creative approaches. Output is less constrained by established patterns").
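
For what it's worth, with the OpenAI Python SDK this is just a matter of passing the sampling parameters per request; something like the sketch below (the model name and the specific values are placeholders of mine, not the cheat sheet's exact recommendations):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def propose_program(prompt: str, exploratory: bool) -> str:
    """Ask the model for a candidate program; sample hotter in exploratory mode."""
    response = client.chat.completions.create(
        model="gpt-4o",                           # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=1.2 if exploratory else 0.4,  # higher temperature, less constrained output
        top_p=1.0,
    )
    return response.choices[0].message.content
```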

Sam Pettus

Awesome article; your section “ARC and the Path to AGI” mirrors something I heard from Naval recently on a podcast.

I’m curious: for your second point, “test-time compute is a proxy for deductive reasoning,” don’t you feel as though a large category of problems people currently work on are probably verifiable? Certainly large swaths of software engineering fall into this category.

Jeremy Berman

I do think large swaths are verifiable. For those fields I’d expect next-token LLMs to reach human-level ability.

toast

Isn't this just overfitting to training data?

Dat LQ.

Thank you for the exciting notes. Ryan's CoT prompts are cited quite often here. Could you please add the link too?
