Section 2: Pretraining
Section 2 of the paper covers the pretraining of the Llama 2 models. Relative to Llama 1, the authors made several changes to improve performance: more robust data cleaning, an updated data mix, training on 40% more total tokens (about 2 trillion), a doubled context length (4,096 tokens), and grouped-query attention (GQA) to improve inference scalability for the larger model variants. The training corpus is a new mix of publicly available sources and does not include data from Meta's products or services. The authors also performed a variety of pretraining data investigations to better understand the potential capabilities and limitations of the models (pages 5-7).
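To make the GQA idea concrete, here is a minimal sketch of how a group of query heads can share a single key/value head, which shrinks the K/V cache and speeds up inference. This is an illustrative PyTorch example with assumed dimensions and function names, not Llama 2's actual implementation; causal masking and the KV cache itself are omitted for brevity.

```python
# Minimal grouped-query attention (GQA) sketch; dimensions are illustrative.
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """x: (batch, seq, d_model). K/V use fewer heads than Q; each K/V head
    is shared by a group of n_heads // n_kv_heads query heads."""
    bsz, seqlen, d_model = x.shape
    head_dim = d_model // n_heads
    group = n_heads // n_kv_heads  # query heads per shared K/V head

    q = (x @ wq).view(bsz, seqlen, n_heads, head_dim)
    k = (x @ wk).view(bsz, seqlen, n_kv_heads, head_dim)
    v = (x @ wv).view(bsz, seqlen, n_kv_heads, head_dim)

    # Repeat K/V heads so every query head in a group attends to the same K/V.
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, head_dim)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # no causal mask here
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(bsz, seqlen, d_model)

# Example: 8 query heads sharing 2 K/V heads -> the K/V projections (and cache)
# are 4x smaller than with standard multi-head attention.
x = torch.randn(1, 16, 64)
wq = torch.randn(64, 64)
wk = torch.randn(64, 16)  # n_kv_heads * head_dim = 2 * 8
wv = torch.randn(64, 16)
y = grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2)
print(y.shape)  # torch.Size([1, 16, 64])
```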
Key Insights:
The pretraining recipe for Llama 2 improves on the previous version in four main ways: more robust data cleaning, an updated data mix, a substantially larger training budget (40% more tokens and double the context length), and GQA for better inference scalability. The training data was drawn only from publicly available sources and was selected and cleaned with both quality and privacy in mind.
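The paper does not publish the exact cleaning rules, so the sketch below is a hypothetical illustration of what document-level cleaning of this kind might involve: exact deduplication, simple quality heuristics, and a crude privacy screen. The thresholds and regexes are assumptions, not Llama 2's pipeline.

```python
# Hypothetical document-level cleaning pass for a pretraining corpus.
import hashlib
import re

def clean_corpus(docs):
    seen = set()
    kept = []
    for text in docs:
        # 1. Exact deduplication via a content hash.
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)

        # 2. Simple quality heuristics: minimum length and alphabetic ratio.
        if len(text) < 200:
            continue
        alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
        if alpha_ratio < 0.6:
            continue

        # 3. Crude privacy screen: drop documents dense with email-like strings.
        if len(re.findall(r"\b\S+@\S+\.\S+\b", text)) > 5:
            continue

        kept.append(text)
    return kept

# Example usage with toy documents: the duplicate and the short fragment are dropped.
corpus = ["A long enough article about machine learning ... " * 10,
          "A long enough article about machine learning ... " * 10,
          "short"]
print(len(clean_corpus(corpus)))  # 1
```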
Actionable Advice:
For those building LLMs, investing in the pretraining process pays off. That means robust data cleaning, a carefully chosen data mix, and architectural techniques such as GQA that improve inference efficiency. It is also important to investigate the pretraining data itself in order to understand the likely capabilities and limitations of the resulting model; doing so helps surface potential issues early and can guide the rest of the development process. Finally, privacy and ethical considerations should inform the selection of training data from the start.
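As one small example of such an investigation, the sketch below profiles how tokens are distributed across data sources before training begins. The source names, the whitespace tokenizer stand-in, and the report format are all assumptions for illustration; they are not taken from the paper.

```python
# Hedged sketch of a pretraining-data investigation: per-source token share.
from collections import Counter

def profile_data_mix(docs):
    """docs: iterable of (source, text) pairs. Returns each source's token share."""
    token_counts = Counter()
    for source, text in docs:
        token_counts[source] += len(text.split())  # whitespace split as a proxy tokenizer
    total = sum(token_counts.values())
    return {src: count / total for src, count in token_counts.items()}

# Example usage with a toy mix; in practice this would run over the full corpus.
sample = [
    ("web_crawl", "the quick brown fox jumps over the lazy dog"),
    ("code", "def add(a, b): return a + b"),
    ("web_crawl", "another example sentence drawn from a crawled page"),
]
for source, share in profile_data_mix(sample).items():
    print(f"{source}: {share:.0%}")
```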