The paper introduces LaTRO (LaTent Reasoning Optimization), a framework for improving Large Language Models' (LLMs) reasoning abilities during training. LaTRO formulates reasoning as sampling a rationale from a latent distribution and optimizes it via a self-rewarding mechanism: the model's own estimated probability of producing the correct answer, given a sampled rationale, serves as the reward. This lets an LLM jointly improve its reasoning process and its ability to evaluate reasoning quality, all without external feedback. Experiments on the GSM8K and ARC-Challenge datasets show consistent accuracy gains across multiple LLM architectures, outperforming both base models and supervised fine-tuning. The method thereby avoids the reliance of prior approaches on external reward models or task-specific examples.
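To make the self-rewarding idea concrete, here is a minimal sketch in PyTorch: the reward for a sampled rationale z is the model's own log-likelihood of the gold answer y given the prompt x and z, and rewards from several sampled rationales are compared against a leave-one-out baseline for a REINFORCE-style advantage. This is an illustrative reconstruction, not the authors' implementation; the model name, example prompt, and the exact baseline scheme are assumptions.

```python
# Hedged sketch of LaTRO's self-reward signal. Not the official code;
# model choice, prompts, and baseline scheme below are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # hypothetical stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_logprob(prompt: str, rationale: str, answer: str) -> torch.Tensor:
    """Self-reward: summed log-prob of `answer` tokens given prompt+rationale."""
    prefix_ids = tok(prompt + rationale, return_tensors="pt").input_ids
    answer_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so shift logits left by one.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    # Keep only the positions that predict the answer tokens.
    ans_logprobs = logprobs[:, prefix_ids.shape[1] - 1 :]
    token_lp = ans_logprobs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum()

# Toy usage: score two sampled rationales for the same question.
x = "Q: A pen costs 3 dollars. How much do 4 pens cost?\nLet's think step by step.\n"
y = " The answer is 12."
rationales = [
    " 4 pens at 3 dollars each is 4 * 3 = 12.",
    " 3 + 3 + 3 + 3 = 12.",
]
rewards = torch.stack([answer_logprob(x, z, y) for z in rationales])
# Leave-one-out baseline (assumed averaging scheme) gives each rationale
# an advantage relative to the other samples, usable in a policy-gradient step.
baseline = (rewards.sum() - rewards) / (len(rewards) - 1)
advantages = rewards - baseline
```

Because the reward is just the model's own answer likelihood, no external reward model or human feedback is needed, which is the key point the summary above makes.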
Project page: github.com/SalesforceAIResearch/LaTRO
Paper PDF: arxiv.org/pdf/2411.04282