Iterative preference optimization methods have recently been shown to perform
well for general instruction tuning tasks, but typically make little
improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this
work we develop an iterative approach that optimizes the preference between
competing generated Chain-of-Thought (CoT) candidates, preferring winning
reasoning steps that lead to the correct answer over losing ones. We train using a
modified DPO loss (Rafailov et al., 2023) with an additional negative
log-likelihood term, which we find to be crucial. We show reasoning improves
across repeated iterations of this scheme. Relying only on examples in
the training set, our approach yields increasing accuracy on GSM8K, MATH,
and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based
models not relying on additionally sourced datasets. For example, we see a
large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with
majority voting out of 32 samples.
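
As a concrete illustration of the training objective, the following is a minimal PyTorch-style sketch of a DPO loss augmented with a negative log-likelihood term. The function name, the `alpha` weight on the NLL term, and the length normalization are illustrative assumptions under the standard DPO formulation (Rafailov et al., 2023), not the exact implementation.

```python
import torch
import torch.nn.functional as F

def dpo_plus_nll_loss(
    policy_chosen_logps: torch.Tensor,    # summed log-probs of the winning CoT + answer under the policy
    policy_rejected_logps: torch.Tensor,  # same for the losing candidate
    ref_chosen_logps: torch.Tensor,       # log-probs under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,         # token lengths, used to length-normalize the NLL term
    beta: float = 0.1,                    # DPO temperature (illustrative value)
    alpha: float = 1.0,                   # weight on the NLL regularizer (illustrative value)
) -> torch.Tensor:
    # Standard DPO term: margin of policy-vs-reference log-ratios
    # between the winning and losing sequences (Rafailov et al., 2023).
    margin = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps
    )
    dpo_term = -F.logsigmoid(beta * margin)

    # Additional NLL term on the winning sequence (length-normalized here,
    # as an assumption): keeps probability mass on correct chains of thought
    # rather than only widening the preference margin.
    nll_term = -policy_chosen_logps / chosen_lengths

    return (dpo_term + alpha * nll_term).mean()
```

In this sketch, the NLL term directly rewards likelihood of the winning CoT, which is the role the abstract describes as crucial alongside the pairwise preference signal.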