Training and Evaluating Language Models with Template-based Data Generation

University of California, Los Angeles

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part I: TemplateGSM, a foundational dataset of over 7 million synthetically generated grade school math problems. Each problem is accompanied by a programmatically verifiable solution, offering an unprecedented level of quality at scale. This resource not only resolves the data scarcity issue for supervised fine-tuning but also provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR). Our approach elevates data augmentation by employing GPT-4 for meta-template creation, guaranteeing diverse and complex problem structures. By providing a scalable solution to the data and verification bottleneck, TDG and TemplateGSM pave the way for a new generation of LLMs with powerful, reliable reasoning skills. The code and data are available at https://github.com/iiis-ai/TemplateMath.

Template-based Data Generation

Template-based Data Generation (TDG) is a method designed to systematically produce a vast array of mathematical problems along with their corresponding solutions by leveraging parameterized templates. To elevate data augmentation to a new level, we employ GPT-4 to generate these meta-templates, capturing a wide variety of problem structures and linguistic styles. By varying parameters within these GPT-4-generated templates, TDG ensures both scalability and quality in the generated data. This approach enables the creation of diverse and complex problem sets, which are essential for training and evaluating large language models in mathematical reasoning tasks.
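The core idea above can be sketched in a few lines. The meta-template structure and field names below are illustrative assumptions, not the repository's actual format: a template pairs a parameterized problem statement with solution code, so every sampled instance carries a programmatically checkable answer.

```python
import random

# A hypothetical meta-template of the kind GPT-4 might produce under TDG:
# natural-language problem text with named parameters, plus a solution
# expressed as executable code so every instance is verifiable by construction.
TEMPLATE = {
    "problem": (
        "{name} buys {n} apples at {price} dollars each. "
        "How much does {name} spend in total?"
    ),
    "solution_code": "total = {n} * {price}",
}

NAMES = ["Ava", "Ben", "Chloe"]

def instantiate(template, rng):
    """Sample parameters, render the problem text, and execute the
    solution code to obtain a verified answer."""
    params = {
        "name": rng.choice(NAMES),
        "n": rng.randint(2, 20),
        "price": rng.randint(1, 9),
    }
    problem = template["problem"].format(**params)
    scope = {}
    exec(template["solution_code"].format(**params), scope)
    return problem, scope["total"]

rng = random.Random(0)
problem, answer = instantiate(TEMPLATE, rng)
print(problem)
print(answer)
```

Because parameters are sampled rather than enumerated, a single GPT-4-generated meta-template yields an effectively unbounded stream of distinct problem-solution pairs.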



Main Contributions

Template-based Data Generation (TDG): We introduce TDG, a scalable method for generating an effectively infinite amount of high-quality, domain-specific data using parameterized templates generated by GPT-4.

Elevated Data Augmentation: By leveraging GPT-4 to create meta-templates, we advance data augmentation to a new level, ensuring a diverse and rich set of problem structures for data synthesis.

Creation of TemplateGSM Dataset: We develop TemplateGSM, a dataset comprising over 7 million synthetically generated math problems with verified solutions, addressing the scarcity of large-scale mathematical datasets.

Enhancement of LLM Performance: We demonstrate the efficacy of the generated data for pre-training, fine-tuning, and evaluating LLMs, enhancing their performance in mathematical reasoning tasks.

Precise Supervision through Code Execution: We provide insights into how TDG offers precise supervision through code execution and verification, promoting the development of models with improved understanding and problem-solving abilities.
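The supervision signal described above can be sketched as a simple checker. The variable name `answer` is an assumed convention for illustration, not the dataset's actual interface: candidate solution code is executed in an isolated namespace and its result compared against the template's ground truth, yielding a binary outcome that can serve directly as a verifiable reward (RLVR).

```python
def verify(solution_code, expected_answer):
    """Execute candidate solution code in a fresh namespace and check
    whether the value it binds to `answer` matches the ground truth.
    The boolean result can be used directly as a reward signal."""
    scope = {}
    try:
        exec(solution_code, scope)
    except Exception:
        # Code that crashes is simply scored as incorrect.
        return False
    return scope.get("answer") == expected_answer

# Hypothetical generated solutions for "3 packs of 4 pencils":
print(verify("answer = 3 * 4", 12))   # correct solution
print(verify("answer = 3 + 4", 12))   # incorrect solution
```

In practice such a checker would run sandboxed with resource limits, since the solution code being scored is model-generated.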

Citation

If you use Template-based Data Generation (TDG), TemplateMath, or TemplateGSM, or find this work interesting or useful, please cite the paper and star this repo. Thanks!

@misc{zhang2024training,
    title={Training and Evaluating Language Models with Template-based Data Generation},
    author={Zhang, Yifan and others},
    journal={ICLR 2025 Workshop on Navigating and Addressing Data Problems for Foundation Models; arXiv preprint arXiv:2411.18104},
    year={2024},
}