Chain-of-thought (CoT) reasoning in vision-language
models (VLMs) is crucial for improving
interpretability and trustworthiness. However,
current training recipes often rely on
datasets dominated by short annotations with
minimal rationales. In this work, we show that
training VLMs on short answers leads to poor
generalization on reasoning tasks that require
more detailed explanations. To address this limitation,
we propose a two-stage post-training
strategy that extends the use of short-answer
data for enhanced CoT reasoning. First, we
augment short answers with CoT reasoning
generated by GPT-4o, enhancing the VLM's
CoT capabilities through fine-tuning.
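As a rough illustration of this first stage, the sketch below expands a short ground-truth answer into a step-by-step rationale with GPT-4o; the prompt wording, the `augment_with_cot` helper, and the use of the OpenAI Python client are our own assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (an assumption, not the paper's exact pipeline): ask GPT-4o to
# write a step-by-step rationale that ends in the known short answer, yielding
# (image, question, rationale) triples for CoT fine-tuning.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def augment_with_cot(image_url: str, question: str, short_answer: str) -> str:
    """Generate a CoT rationale consistent with the given short answer (hypothetical helper)."""
    prompt = (
        f"Question: {question}\n"
        f"The correct short answer is: {short_answer}\n"
        "Explain step by step how to reach this answer from the image, "
        "then state the final answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```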
Second, we leverage short answers as outcome rewards
for reinforcement learning. Specifically, short
answers are used as correctness indicators to
construct positive (correct) and negative (incorrect)
pairs from model-generated reasoning
chains. These pairs are then used to calibrate
the model’s reasoning via Direct Preference Optimization.
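To make the outcome-reward idea concrete, here is a small sketch of how such preference pairs could be assembled: reasoning chains sampled from the model are marked correct or incorrect by comparing their final answers against the ground-truth short answer, and correct/incorrect chains for the same question are paired in a chosen/rejected layout accepted by common DPO implementations. The `extract_final_answer` helper and the all-pairs policy are illustrative assumptions, not the paper's exact procedure.

```python
import re
from itertools import product


def extract_final_answer(chain: str) -> str:
    """Pull the final short answer out of a generated reasoning chain (hypothetical format)."""
    match = re.search(r"final answer:\s*(.+)", chain, flags=re.IGNORECASE)
    return match.group(1).strip().lower() if match else chain.strip().lower()


def build_dpo_pairs(prompt: str, sampled_chains: list[str], short_answer: str) -> list[dict]:
    """Use the short answer as an outcome reward: chains ending in the correct
    answer become 'chosen', the rest become 'rejected'."""
    correct = [c for c in sampled_chains if extract_final_answer(c) == short_answer.lower()]
    incorrect = [c for c in sampled_chains if extract_final_answer(c) != short_answer.lower()]
    return [
        {"prompt": prompt, "chosen": pos, "rejected": neg}
        for pos, neg in product(correct, incorrect)
    ]
```

The resulting prompt/chosen/rejected records can be fed to a standard DPO trainer; the paper may filter or subsample pairs differently.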
Our experiments show significant
improvements in CoT reasoning on benchmark
datasets, along with enhanced generalization to
direct-answer prediction. This work provides
a critical data resource for VLM CoT training
and demonstrates the effectiveness of outcome
rewards for post-training multimodal models.