Thank you for the fascinating writeup!
I was also intrigued by your thoughts on deduction: in particular, that LLMs are clearly mimicking deduction (quite successfully), that CoT is a way to prompt the LLM to mimic deduction, but that *real* deductive functionality will require another approach.
But as you say (and show), very good mimicry of deduction can get us a long way. So, while we wait for true deductive functionality, I wonder whether an LLM's deductive abilities could be greatly enhanced by fine-tuning it on vast amounts of deductive reasoning, both correct and incorrect. We could prompt an LLM to write code (the general principles) that carries out a transformation of an input_1; it would not actually matter whether the generated code correctly implemented the requested transformation. We would then run the code on input_1 and record its output as the correct output (an example of correct deduction: applying the code, i.e. the "general principles", to a specific input to deduce a specific output). We could also modify this output and record the modified versions as incorrect outputs for that program and input_1 (examples of erroneous deduction). We can repeat this for a whole set of inputs: input_2, input_3, etc.
The above can be repeated for a vast number of transformations (not limited to those relevant to ARC). The resulting dataset of programs and their associated sets of correct and incorrect input-output pairs could then be used to fine-tune an LLM into a top-notch deductive-reasoning mimic. I would have thought that creating this fine-tuning dataset would be relatively straightforward. It would be interesting to see whether it materially improves LLM performance on ARC or other tests requiring deductive reasoning.
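To make the data-generation step concrete, here is a minimal sketch in Python, assuming ARC-style integer grids; `corrupt`, the record schema, and the example transform are my own placeholders, not anything from your write-up:

```python
import copy
import random

def corrupt(grid):
    """Perturb one cell so the result is a plausible but *incorrect* output.
    (Hypothetical corruption strategy; a real dataset would want varied error types.)"""
    bad = copy.deepcopy(grid)
    r = random.randrange(len(bad))
    c = random.randrange(len(bad[r]))
    bad[r][c] = (bad[r][c] + 1) % 10  # assumes ARC-style cell values 0-9
    return bad

def build_examples(program_source, transform, inputs, n_negatives=3):
    """Turn one generated program into labelled deduction examples.

    `program_source` is the code text the LLM wrote; `transform` is the callable
    obtained from it. It does not matter whether it implements the originally
    requested transformation -- whatever it outputs is recorded as 'correct'.
    """
    examples = []
    for x in inputs:
        y = transform(x)  # correct deduction: program applied to a specific input
        examples.append({"program": program_source, "input": x,
                         "output": y, "label": "correct"})
        for _ in range(n_negatives):
            examples.append({"program": program_source, "input": x,
                             "output": corrupt(y), "label": "incorrect"})
    return examples

# Usage with a trivial hand-written transform standing in for LLM-generated code:
flip_source = "def transform(grid): return [row[::-1] for row in grid]"
flip = lambda grid: [row[::-1] for row in grid]
dataset = build_examples(flip_source, flip,
                         inputs=[[[1, 2], [3, 4]], [[5, 0], [0, 5]]])
```

Scaling this over many generated programs and many inputs per program would give the (program, input, output, label) tuples I have in mind.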
It is possible that the training corpus of LLMs already contains the type of data I describe above. Do you know if it does?
I know it contains vast amounts of code (including comments) and, in many cases, the accompanying requirements and use cases, in other words, the general principles. But does it contain many examples of correct and incorrect application of that code to specific inputs?
I also think such a dataset might be useful for training and evaluating a language model that has *true* deduction functionality (like the one you describe in your write-up). A lot of text (and even a lot of code) contains very little deductive reasoning, so a language model with an architecture that can learn true deductive reasoning would be trained more efficiently on a dataset that is dense in deductive-reasoning examples.
Thanks for a very understandable writeup.
Following up on the evolutionary approach:
* your search strategy has two independent arms: non-pooling and pooling. What about trying to integrate the two, e.g. by alternating between picking the x best programs (and running a revision round) and pooling complementary programs? This might allow for better exploration of program space. Monte Carlo Tree Search tries to strike a good balance between exploitation and exploration, and something similar could perhaps be achieved here instead of keeping the two approaches separate.
* in evolution, the mutation rate is also a very important factor (in addition to pooling, known as sex in biology). The temperature and top_p settings in ChatGPT (Sonnet has similar settings) may be considered an analog of this. Perhaps one could dynamically dial the LLM's temperature up and down to obtain better results. For example, for a task where it is proving difficult to get a program close to a solution, one could turn the temperature up.
* Another idea would be to investigate how to coordinate changes in temperature with the switching between pooling and non-pooling: maybe temperature should be reduced in pooling generations relative to non-pooling generations (since pooling already adds diversity, we do not need even more from a higher temperature), or maybe it needs to be increased to allow the LLM to be "flexible" enough to create a child that successfully integrates the strengths of both parents.
Here https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683 they suggest running with a substantially higher temperature for exploratory code writing ("generates code that explores alternative solutions and creative approaches; output is less constrained by established patterns"). A rough sketch of how such a schedule could be combined with the pooling/non-pooling alternation follows below.
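For what it's worth, here is how the alternation and the temperature schedule could be wired together; `revise`, `pool`, and `fitness` are hypothetical callables standing in for the actual LLM calls and scoring in your pipeline:

```python
def run_search(seed_programs, revise, pool, fitness, generations=20,
               base_temp=0.7, pool_temp_delta=-0.2, stall_temp_delta=0.3, top_k=4):
    """Alternate revision (non-pooling) and pooling generations with a simple
    temperature schedule: cooler when pooling, hotter when progress stalls.

    `revise(parents, temperature)` and `pool(parents, temperature)` stand in for
    whatever LLM calls produce child programs, and `fitness(p)` scores a program
    on the training examples -- all three are placeholders, not the actual pipeline.
    """
    population = list(seed_programs)
    best_score = max(fitness(p) for p in population)
    stalled = 0
    for gen in range(generations):
        pooling = (gen % 2 == 1)          # even generations revise, odd generations pool
        temp = base_temp
        if pooling:
            temp += pool_temp_delta       # pooling already adds diversity: cool down
        if stalled >= 2:
            temp += stall_temp_delta      # little progress: turn the "mutation rate" up
        parents = sorted(population, key=fitness, reverse=True)[:top_k]
        children = pool(parents, temperature=temp) if pooling \
            else revise(parents, temperature=temp)
        population = parents + list(children)
        new_best = max(fitness(p) for p in population)
        stalled = 0 if new_best > best_score else stalled + 1
        best_score = max(best_score, new_best)
    return max(population, key=fitness)
```

The exact deltas and the stall threshold are arbitrary; the point is only that the schedule can react both to which arm is active and to whether the search is making progress.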
Awesome article; your section “ARC and the Path to AGI” mirrors something I heard from Naval recently on a podcast.
I’m curious: for your second point, “test-time compute is a proxy for deductive reasoning”, don’t you feel as though a large category of the problems people currently work on are probably verifiable? Certainly large swaths of software engineering fall into this category.
I do think large swaths are verifiable. For those fields, I’d expect next-token LLMs to reach human-level ability.
Isn't this just overfitting to training data?
Thank you for the exciting notes. Ryan's CoT prompts are cited quite often here. Could you please add the link too?
https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt