Training and Evaluating Language Models with Template-based Data Generation

Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao

Tsinghua University

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, these models often struggle with tasks requiring complex reasoning, particularly mathematical problem-solving, due in part to the scarcity of the large-scale, high-quality, domain-specific datasets needed to train sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that uses an LLM (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a vast array of high-quality problems and solutions. Using TDG, we create TemplateMath Part I: TemplateGSM, a dataset of over 7 million synthetically generated grade school math problems, each accompanied by a code-based and a natural language solution, with the capacity to generate an effectively unlimited number more. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs on mathematical reasoning. Our method not only enables the generation of virtually unlimited data but also elevates data augmentation to a new level by using GPT-4 for meta-template generation, ensuring diverse and high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at https://huggingface.co/datasets/math-ai/TemplateGSM.

Template-based Data Generation

Template-based Data Generation (TDG) is a method designed to systematically produce a vast array of mathematical problems along with their corresponding solutions by leveraging parameterized templates. To elevate data augmentation to a new level, we employ GPT-4 to generate these meta-templates, capturing a wide variety of problem structures and linguistic styles. By varying parameters within these GPT-4-generated templates, TDG ensures both scalability and quality in the generated data. This approach enables the creation of diverse and complex problem sets, which are essential for training and evaluating large language models in mathematical reasoning tasks.
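
To make this concrete, below is a minimal sketch of how a single parameterized meta-template can be instantiated into many distinct problem and solution pairs. The template text, parameter domains, and helper names (META_TEMPLATE, instantiate) are illustrative assumptions; in TDG the meta-template itself would be produced by GPT-4 rather than written by hand.

    import random
    from string import Template

    # A hand-written stand-in for a GPT-4-generated meta-template: problem
    # text, a code-based solution, and the parameter domains to sample from.
    META_TEMPLATE = {
        "problem": Template(
            "$name buys $n boxes of pencils. Each box holds $k pencils. "
            "How many pencils does $name have in total?"
        ),
        "solution_code": Template("answer = $n * $k"),
        "params": {
            "name": ["Ava", "Ben", "Carla", "Diego"],
            "n": range(2, 20),
            "k": range(3, 12),
        },
    }

    def instantiate(template, rng):
        """Sample one concrete problem/solution pair from a meta-template."""
        values = {key: rng.choice(list(domain))
                  for key, domain in template["params"].items()}
        problem = template["problem"].substitute(values)
        code = template["solution_code"].substitute(values)
        scope = {}
        exec(code, scope)  # run the code-based solution to get the ground truth
        return {"problem": problem, "solution_code": code,
                "answer": scope["answer"]}

    rng = random.Random(0)
    for _ in range(3):
        ex = instantiate(META_TEMPLATE, rng)
        print(ex["problem"], "->", ex["answer"])

Each call to instantiate samples fresh parameter values, so the number of distinct problems grows combinatorially with the size of the parameter domains, which is what makes the generated data effectively unlimited.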


Main Contributions

Template-based Data Generation (TDG): We introduce TDG, a scalable method for generating an effectively infinite amount of high-quality, domain-specific data using parameterized templates generated by GPT-4.

Elevated Data Augmentation: By leveraging GPT-4 to create meta-templates, we advance data augmentation to a new level, ensuring a diverse and rich set of problem structures for data synthesis.

Creation of TemplateGSM Dataset: We develop TemplateGSM, a dataset comprising over 7 million synthetically generated math problems with verified solutions, addressing the scarcity of large-scale mathematical datasets.

Enhancement of LLM Performance: We demonstrate the efficacy of the generated data for pre-training, fine-tuning, and evaluating LLMs, enhancing their performance in mathematical reasoning tasks.

Precise Supervision through Code Execution: We provide insights into how TDG offers precise supervision through code execution and verification, promoting the development of models with improved understanding and problem-solving abilities.
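
As an illustration of this last point, here is a minimal sketch of how code execution can supply precise supervision: a candidate solution program is executed and kept only if its result matches the answer derived from the template's parameters. The convention that a solution script assigns its result to a variable named answer, and the function name verify_solution, are assumptions for illustration, not the exact interface used for TemplateGSM.

    def verify_solution(solution_code: str, expected_answer) -> bool:
        """Execute a candidate code-based solution and compare its result
        to the answer computed directly from the template parameters."""
        scope = {}
        try:
            exec(solution_code, scope)  # assumes the script assigns `answer`
        except Exception:
            return False  # solutions that fail to execute are discarded
        return scope.get("answer") == expected_answer

    # Usage: keep only problem/solution pairs whose code checks out.
    candidates = [
        {"solution_code": "answer = 6 * 7", "answer": 42},
        {"solution_code": "answer = 6 + 7", "answer": 42},  # wrong; filtered out
    ]
    verified = [c for c in candidates
                if verify_solution(c["solution_code"], c["answer"])]
    print(len(verified))  # -> 1

Because every retained example has a solution whose execution reproduces the target answer, the dataset provides exact, programmatically checked supervision rather than labels that are merely plausible.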

Citation

Please cite the paper and star this repo if you use Template-based Data Generation (TDG), TemplateMath, or TemplateGSM, or find them interesting or useful. Thanks!

@misc{zhang2024training,
      title={Training and Evaluating Language Models with Template-based Data Generation},
      author={Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew Chi-Chih},
      year={2024},
}