Published on Sep 19
Authors: Juntao Li, Zecheng Tang, Yuyang Ding, Pinzheng Wang, Pei Guo, Wangjie You, Dan Qiao, Wenliang Chen, Guohong Fu, Qiaoming Zhu, Guodong Zhou, Min Zhang
Abstract
Large language models (LLMs) with billions of parameters have demonstrated outstanding performance on various natural language processing tasks. This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model, to contribute an LLM variant to the Chinese-oriented open-source model community. We enhance OpenBA with effective and efficient techniques and adopt a three-stage training strategy to train the model from scratch. Our solution achieves very competitive performance with only 380B tokens, outperforming LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides the main details needed to pre-train an analogous model, including pre-training data processing, Bilingual Flan data collection, the empirical observations that inspired our model architecture design, the training objectives of different stages, and other enhancement techniques. We have refactored our code to follow the design principles of the Huggingface Transformers Library, making it more convenient for developers to use, and released checkpoints of different training stages at https://huggingface.co/openBA. More details of our project are available at https://github.com/OpenNLG/openBA.git.
Commentary
The paper "OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch" introduces a bilingual large language model tailored for Chinese-oriented applications.
Key Takeaways:
Bilingual Model: OpenBA is a 15B English-Chinese asymmetric seq2seq model aimed at Chinese-oriented tasks, filling a gap in the open-source LLM community.
Efficient Techniques and Training Strategy: The authors apply effective and efficient training techniques and a three-stage strategy to train the model from scratch, keeping it competitive despite using far fewer tokens (380B) than comparable large models.
Competitive Performance: With fewer training tokens, OpenBA outperforms LLaMA-70B on BELEBELE, BLOOM-176B on MMLU, and GLM-130B on C-Eval (hard), pointing to efficient architecture and training choices.
Open Source: The model is open source, its checkpoints are released on the Huggingface Hub, and the code follows the design principles of the Huggingface Transformers Library, which makes adoption by developers and researchers straightforward (a minimal loading sketch follows this list).
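For orientation, here is a minimal sketch of how such a checkpoint could be loaded through the Huggingface Transformers API. The repository id "OpenBA/OpenBA-LM", the prompt, and the use of trust_remote_code are illustrative assumptions; the actual checkpoint names and usage should be taken from https://huggingface.co/openBA.

```python
# Minimal loading sketch (assumed repo id and settings, not the authors' official usage).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "OpenBA/OpenBA-LM"  # hypothetical id; replace with a released checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 15B model's memory footprint manageable
    trust_remote_code=True,      # assumption: the checkpoint may ship custom model code
).eval()

# OpenBA is an encoder-decoder (seq2seq) model, so inference maps an input text
# to a generated target sequence, in English or Chinese.
inputs = tokenizer(
    "Translate to Chinese: Open-source models benefit everyone.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```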
Potential Real-World Impact:
Bilingual Tasks: OpenBA can handle a wide range of tasks involving English and Chinese, expanding the applicability of LLMs in Chinese-speaking regions and in bilingual applications.
Promotion of Chinese-Oriented Research: The model's focus on Chinese-oriented tasks can encourage more research and applications catering to this significant language group.
Accessible Tool: With the checkpoints hosted on Huggingface and the associated code open-sourced, developers and researchers can readily adopt, modify, and extend the model for various applications.
Benchmark Performance: The strong benchmark results suggest the model could become a standard reference for bilingual NLP tasks involving Chinese.
Challenges:
Specialized Nature: While the model is strong on bilingual tasks involving Chinese, its English-Chinese specialization may limit its applicability to other languages.
Resource Intensiveness: As with other large models, real-time applications or deployments in resource-constrained environments might face challenges.
Given the potential for breakthroughs in bilingual tasks involving Chinese and its contribution to the open-source community:
I'd rate the real-world impact of this paper as an 8 out of 10.
OpenBA fills a specific niche in the LLM world by catering to bilingual tasks involving Chinese. The open-source nature and integration with popular platforms will likely promote its adoption and stimulate further research in the Chinese NLP community.