Original Link: MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML
Summary
In this episode of the Latent Space podcast, hosts Alessio and Swyx welcome Jonathan Frankle and Abhinav (Abhi) Venigalla of MosaicML.
Key Takeaways:
Introductions:
Jonathan studied at Princeton and MIT and is best known for his 2018 research on the "lottery ticket hypothesis." He delayed his PhD defense for two years because of his commitments at Mosaic and has been appointed as an assistant professor at Harvard.
Abhinav, an MIT graduate and former researcher at Cerebras, now works at Mosaic. He describes Cerebras' wafer-scale approach, which uses an entire silicon wafer as a single computing system for training models.
Mosaic's Journey:
Mosaic built its own model, MPT-7B, after profiling existing models and realizing that training costs could be lowered considerably. Mosaic's focus is not on a single standout model but on empowering clients to create their own optimized models using Mosaic's tools.
Training and Model Creation:
Mosaic trained MPT-7B as a base model, inspired by LLaMA-7B, on one trillion tokens. They intentionally over-trained it beyond the compute-optimal point so it would be more effective (and cheaper) at inference. Abhinav refers to the "Chinchilla laws," scaling laws that prescribe how to split a fixed compute budget between model size and number of training tokens, as one principle guiding their training decisions (see the sketch below).
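To make the over-training point concrete, here is a rough back-of-the-envelope sketch in Python using the commonly cited ~20 tokens-per-parameter rule of thumb from the Chinchilla paper. The 20x ratio and the resulting factor are assumptions for illustration, not figures quoted in the episode.

```python
# Back-of-the-envelope check of MPT-7B's training budget against the
# Chinchilla rule of thumb (~20 training tokens per model parameter).
# The 20x ratio is an approximation; exact compute-optimal ratios depend
# on the fitted scaling-law constants.

PARAMS = 7e9                       # MPT-7B: roughly 7 billion parameters
TOKENS_TRAINED = 1e12              # trained on ~1 trillion tokens
TOKENS_PER_PARAM = 20              # assumed rule-of-thumb compute-optimal ratio

compute_optimal_tokens = PARAMS * TOKENS_PER_PARAM
overtrain_factor = TOKENS_TRAINED / compute_optimal_tokens

print(f"Chinchilla-optimal tokens: {compute_optimal_tokens:.2e}")  # ~1.4e11
print(f"Actual tokens trained:     {TOKENS_TRAINED:.2e}")          # 1.0e12
print(f"Over-training factor:      {overtrain_factor:.1f}x")       # ~7.1x
```

Under these assumptions, training a 7B-parameter model on one trillion tokens is several times more data than the compute-optimal amount, which trades extra training compute for a smaller, cheaper-to-serve model.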
Data Choices:
The two discussed the challenge of balancing data quality against quantity when assembling a training set. Repeating high-quality data may be as effective as, if not more effective than, a larger volume of diverse but lower-quality data.