Published on Sep 5
Authors: Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, Jie Tang
Abstract
Previous studies have typically assumed that large language models are unable to accurately perform arithmetic operations, particularly multiplication of more than 8 digits and operations involving decimals and fractions, without the use of calculator tools. This paper aims to challenge this misconception. With sufficient training data, a 2 billion-parameter language model can perform multi-digit arithmetic operations with almost 100% accuracy without data leakage, significantly surpassing GPT-4 (whose multi-digit multiplication accuracy is only 4.3%). We also demonstrate that our MathGLM, fine-tuned from GLM-10B on a dataset with additional multi-step arithmetic operations and math problems described in text, achieves similar performance to GPT-4 on a 5,000-sample Chinese math problem test set.
Commentary
Based on the information provided, let's evaluate the potential impact of this paper:
Challenging Established Assumptions: The paper directly challenges the prevailing notion that large language models (LLMs) cannot accurately perform arithmetic, especially multiplication of more than 8 digits and operations on decimals and fractions, without calculator tools. Revising this assumption could influence how future LLMs are developed and applied.
Accuracy: Achieving almost 100% accuracy on multi-digit arithmetic operations without data leakage is a significant advancement. If that accuracy holds on the inputs a given task actually involves, such an LLM could be used for the arithmetic directly, without an external calculator.
Comparison with Previous Models: Demonstrating that a 2 billion-parameter model significantly surpasses GPT-4, whose multi-digit multiplication accuracy is only 4.3%, is a key contribution. It suggests that model size alone does not determine arithmetic ability; the amount and nature of the training data matter as well.
Fine-tuning on Math Problems: MathGLM, fine-tuned from GLM-10B on multi-step arithmetic operations and math problems described in text, achieves performance similar to GPT-4 on a 5,000-sample Chinese math problem test set. Because the evaluation uses a non-English dataset, the result suggests the approach is not limited to English, although the abstract reports no results for other languages.
Potential Real-world Applications: The model's capability can be beneficial for applications like tutoring, where step-by-step arithmetic problem solving is needed. It might also find uses in industries requiring quick arithmetic checks or where integrating a calculator tool is cumbersome.
Scope of Research: While the research demonstrates the model's arithmetic prowess, its application might be limited if it only excels at arithmetic operations. For broad real-world impact, LLMs often need to be versatile across various tasks.
Considering the above factors:
I'd rate the real-world impact of this paper as a 7 out of 10.
The paper showcases a commendable achievement in arithmetic computation by LLMs, but how widely it matters will depend on applications beyond arithmetic itself. It can be seen as a significant step towards enhancing the capabilities of LLMs in arithmetic domains.
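The accuracy figures quoted above (almost 100% versus 4.3%) are rates of exactly correct answers, so it may help to see how such a rate could be measured. The following is a minimal sketch, not the paper's evaluation protocol: the prompt format `a*b=`, the function names, and the two stand-in "models" (`oracle` and `off_by_one`) are all assumptions made for illustration. A real evaluation would pass a function that queries the language model instead.

```python
import random
from typing import Callable, List, Tuple

Problem = Tuple[str, str]  # (prompt, expected answer), both hypothetical formats


def make_multiplication_problems(
    n: int, min_digits: int, max_digits: int, seed: int = 0
) -> List[Problem]:
    """Generate n random multi-digit multiplication problems."""
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    low, high = 10 ** (min_digits - 1), 10 ** max_digits - 1
    problems = []
    for _ in range(n):
        a, b = rng.randint(low, high), rng.randint(low, high)
        # Python integers are arbitrary precision, so the reference is exact.
        problems.append((f"{a}*{b}=", str(a * b)))
    return problems


def exact_match_accuracy(
    answer_fn: Callable[[str], str], problems: List[Problem]
) -> float:
    """Fraction of problems whose answer matches the reference exactly."""
    if not problems:
        raise ValueError("problems must not be empty")
    correct = sum(
        1 for prompt, expected in problems if answer_fn(prompt).strip() == expected
    )
    return correct / len(problems)


def oracle(prompt: str) -> str:
    """Stand-in for a perfect model: parses the prompt and multiplies."""
    a, b = prompt.rstrip("=").split("*")
    return str(int(a) * int(b))


def off_by_one(prompt: str) -> str:
    """Stand-in for a model that is always close but never exact."""
    return str(int(oracle(prompt)) + 1)


problems = make_multiplication_problems(n=100, min_digits=5, max_digits=12)
print(exact_match_accuracy(oracle, problems))      # 1.0
print(exact_match_accuracy(off_by_one, problems))  # 0.0
```

Exact match is a strict criterion: an answer that differs from the reference by a single digit counts as wrong, which is why a model that is usually "almost right" on long multiplications can still score very low.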