Latent Space Podcast 3/2/23 [Summary] - Ep. 2: 97% Cheaper, Faster, Better, Correct AI — with Varun Mohan of Codeium (Why you are holding your GPUs wrong)

Unlock AI efficiency with Varun Mohan of Codeium as he reveals methods that are 97% cheaper & faster. Dive into Latent Space Podcast Ep. 2 to optimize your GPU usage!

Prof. Otto Nomos · Oct 05, 2023 · 6 min read

Link to Original: 97% Cheaper, Faster, Better, Correct AI — with Varun Mohan of Codeium

Summary

Guest: Varun Mohan from Codeium / Exafunction.

Background:

  • Varun studied CS at MIT and later became the tech lead manager for autonomy at Nuro, focusing on self-driving cars and AI.

  • Subsequently, he co-founded Exafunction with colleagues from Nuro.

  • Varun's team successfully cloned GitHub Copilot in a short period.

Personal Insights:

  • Varun is passionate about endurance sports, such as triathlons and long-distance cycling.

  • These sports provide a mental break for him, allowing his mind to focus solely on the immediate physical challenge.

Exafunction:

  • Born from Varun's experience at Nuro, Exafunction aims to simplify the complexities of deep learning infrastructure for businesses.

  • The company offers solutions to optimize GPU utilization, ensuring that deep learning operations are cost-effective and efficient.

  • They introduced techniques such as dynamic multiplexing to address the widespread under-utilization of GPUs across companies.

  • Varun believes that most companies should use off-the-shelf architectures and fine-tune existing models like BERT and ResNet rather than building their own from scratch.
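The episode does not spell out how dynamic multiplexing works internally; as a rough illustration of the general idea (time-sharing one expensive GPU worker across several logical models through a shared request queue; all names and the toy "models" below are hypothetical), a minimal sketch might look like:

```python
import queue
import threading

# Hypothetical sketch: several logical models share one GPU worker.
# Requests are funneled through a single queue so the expensive device
# stays busy instead of sitting idle behind one under-utilized model.

class MultiplexedWorker:
    def __init__(self, models):
        self.models = models           # name -> callable (stand-in for a GPU-hosted model)
        self.requests = queue.Queue()  # shared request queue across all models
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            model_name, inputs, reply = self.requests.get()
            if model_name is None:     # shutdown sentinel
                break
            reply.put(self.models[model_name](inputs))

    def infer(self, model_name, inputs):
        reply = queue.Queue(maxsize=1)
        self.requests.put((model_name, inputs, reply))
        return reply.get()

    def close(self):
        self.requests.put((None, None, None))
        self.thread.join()

# Two toy "models" multiplexed onto one worker.
worker = MultiplexedWorker({
    "doubler": lambda x: [v * 2 for v in x],
    "summer": lambda x: sum(x),
})
print(worker.infer("doubler", [1, 2, 3]))  # [2, 4, 6]
print(worker.infer("summer", [1, 2, 3]))   # 6
worker.close()
```

A real implementation would batch requests and manage GPU memory; the point here is only that one device can serve many workloads.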

Code:

  • Exafunction's infrastructure efficiency inspired the team to consider consumer-facing products.

  • They see value in applications like GitHub Copilot, which provide real-time support for developers.

  • Varun and his team personally experienced the benefits of Copilot, especially when writing complex code, and emphasize its potential beyond being just a tool for basic completions.

The podcast highlighted the advancements in deep learning infrastructure, the challenges in GPU optimization, and the future of real-time coding assistance tools.


Decoding AI in Code Assistance: Insights from Swyx and Varun Mohan

swyx and Varun Mohan discuss the use of AI in coding, focusing on a GitHub Copilot report suggesting that 60-70% of the code Copilot generates ends up in code repositories. Mohan highlights changes in user behavior after adopting Copilot, such as writing more extensive documentation to prompt the model better.

Mohan explains their approach to sourcing data: they mostly used permissively licensed open-source code from the public internet rather than relying on EleutherAI's Pile, mainly due to timing and because the Pile covers only a limited subset of what they needed.

The discussion touches upon the decentralization of work, with Mohan expressing surprise at how effective it has been, given his preference for in-person collaboration. They discuss the EleutherAI community, with Mohan noting he was more of an observer than an active participant.

They debunk the notion that AI work is straightforward, with Mohan describing various challenges and trade-offs, particularly with server infrastructure, latency, and optimizing performance.

Mohan clarifies that their product wasn't intended to be free from the start, but they realized they could offer it for free due to their efficiencies and a desire for a larger user base. They currently boast over 10,000 users, with impressive daily active numbers and growth fueled by word of mouth.

Alessio Fanelli joins the conversation, focusing on the intricacies of AI in coding, like how to estimate scale, latency versus quality trade-offs, and model outputs. Mohan advises new founders to bootstrap on top of an existing AI unless they have specific, large-scale data needs.

Mohan introduces the concept of "correctability" in AI outputs, emphasizing its importance when the AI does not produce accurate results. He elaborates on the challenges and potential solutions in AI-generated PRs in coding.


Copilot's Evolution and the Landscape of AI-Driven Legal Assistance

Alessio Fanelli initiated the discussion with a question about the direction of "Copilot for X" and which areas show promise. Varun Mohan highlighted the rise of Harvey AI, an AI legal assistant, emphasizing the importance of accuracy and groundedness for such applications. swyx drew parallels between legal language and programming, emphasizing the need for precision in both.

The conversation also touched on the intricacies of AI-driven models, their comprehensiveness, and the potential pitfalls they face. Varun noted that despite the advancements, AI companies often struggle to master even a single product, and emphasized the need for focus. The discussion then turned to the fast growth of Harvey AI, which has secured contracts with firms employing thousands of lawyers, and to its potential competition with Copilot.

Varun then discussed the challenges in scaling AI models, noting the delicate balance between training larger models and the associated compute costs. The two pondered the reasons for specific model sizes leading to breakthroughs in performance and concluded by addressing the potential limitations and future prospects of transformers in AI modeling.


Expanding the Horizons of AI Contextual Understanding

In a spirited exchange between Alessio Fanelli, Varun Mohan, and swyx, the trio delves deep into the challenges and potential evolutions of large-scale language models (LLMs). Varun sheds light on the architectural challenges, emphasizing the importance of token limitations in model understanding. As models grow in complexity, there's an increasing need to determine the ideal token capacity for optimal functioning, especially for expansive applications like extensive codebases.

The conversation also touches upon recent research and ideas. One highlight is DeepMind's RETRO (Retrieval-Enhanced Transformer) concept. The method draws on vast embedding databases, potentially even larger than the original training set: a new prompt is embedded, related documents are retrieved from the database, and those documents expand the model's context and thus its understanding. This is a significant leap because it infuses dynamic context, enabling the model to decide the type of context it requires. In essence, it is akin to the model fetching specific data based on its needs.
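RETRO's actual architecture is considerably more involved, but the core retrieval step described above can be sketched in miniature. In this toy version (every name is made up, and the "embedding" is a bag-of-characters stand-in for a real learned encoder), the prompt is embedded, the nearest stored documents are found by cosine similarity, and those documents would then be prepended as extra context:

```python
import math

# Toy retrieval step in the spirit of retrieval-augmented models:
# embed the prompt, find the closest stored documents, and prepend
# them to expand the model's context.

def embed(text):
    """Bag-of-characters 'embedding', L2-normalized. A stand-in for a real encoder."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(prompt, store, k=2):
    """Return the k documents most similar to the prompt."""
    q = embed(prompt)
    ranked = sorted(store, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

store = [
    "def parse_config(path): ...",
    "GPU kernels for matrix multiplication",
    "how to parse a YAML config file",
]
context = retrieve("parsing configuration files", store)
# The retrieved documents would then be prepended to the model prompt.
```

A production system would replace the linear scan with an approximate nearest-neighbor index over a database of billions of embedded chunks.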

Varun further expands on the notion that future models might integrate even more deeply with embedding databases, requiring a new breed of database companies capable of meeting their demands. However, he's skeptical about current database companies' readiness for this looming paradigm shift, based on their pricing structures and technological foundations.

The conversation pivots to LLM operations. The intricacies of building and evaluating these models underscore the unique challenges of handling massive amounts of code. Data cleaning, a significant aspect, varies according to the context, be it code, legal documents, or others. The trio points out the divergence of LLM ops from traditional ML ops.

When it comes to testing and deployment, Varun mentions their in-house, old-school A/B testing, underscoring the significance of internal testing for the continuous development of their AI models.
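The episode does not describe their A/B setup in detail, but the most common "old-school" pattern is deterministic bucketing: hash a user ID so each user consistently lands in the same variant without any stored state. A minimal sketch (the experiment name and variant labels are hypothetical) might look like:

```python
import hashlib

# Deterministic A/B bucketing: hashing the user ID means the same user
# always sees the same variant, with no assignment table to maintain.

def assign_variant(user_id, variants=("control", "treatment"), salt="exp-001"):
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always gets the same variant.
assert assign_variant("user-42") == assign_variant("user-42")
```

Changing the salt reshuffles all assignments, which is how a new experiment gets an independent split of the same user base.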


Understanding AI's Value in Real-time Evaluation and Progress

  • Open-source services are available for current needs, but other tools such as Google Analytics are still utilized.

  • The real measure of a model's performance is user acceptance across languages, as indicated by users accepting its completions.

  • GitHub Copilot creatively checks whether accepted code remains unaltered after a set time, since users may accept a completion and later modify it.

  • A significant amount of data, in the order of hundreds of thousands to millions of completions, is necessary to gauge actual value from user interactions.

  • The value of having real users for testing cannot be overstated, as they provide crucial feedback for models.

  • A brief discussion of Quora's Poe platform highlighted its potential in question answering, given Quora's long-standing presence and strong leadership.

  • It's emphasized that while general intelligence models are good, a high-quality, task-specific dataset is still invaluable.

  • The intricate workings of deep learning involve complex communication networks that allow swift and efficient model training across vast data sets.

  • An ideal AI infrastructure would feel like a massive computer with no communication overhead but limitless computation and memory bandwidth.

  • Varun Mohan appreciates the Midjourney AI product for its consistent upleveling and style-focused approach.

  • In the AI community, EleutherAI stands out for its open-source approach and impressive model comparable to GPT-3, all achieved by a dedicated community on Discord.

  • A projection for the next year in AI is that while models will improve, human creativity will likely lead to more innovative applications of existing models.

  • As people persistently tinker, they'll witness groundbreaking developments in AI in the coming year.
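The acceptance-and-retention signal described in the bullets above (a completion counts fully only if the user accepts it and the inserted code is still unchanged after a waiting period) can be computed roughly as follows. This is a sketch, not anyone's actual pipeline; the event field names are made up:

```python
# Rough sketch of an acceptance/retention metric for code completions.
# A completion is "retained" only if it was accepted and the inserted
# text was still unchanged after the waiting period.

def completion_metrics(events, retention_seconds=120):
    accepted = 0
    retained = 0
    for e in events:
        if not e["accepted"]:
            continue
        accepted += 1
        # Still unchanged after the waiting period?
        if e["unchanged_after"] >= retention_seconds:
            retained += 1
    total = len(events)
    return {
        "acceptance_rate": accepted / total if total else 0.0,
        "retention_rate": retained / accepted if accepted else 0.0,
    }

events = [
    {"accepted": True, "unchanged_after": 300},   # kept as-is
    {"accepted": True, "unchanged_after": 30},    # accepted, then edited
    {"accepted": False, "unchanged_after": 0},    # rejected
    {"accepted": True, "unchanged_after": 600},   # kept as-is
]
print(completion_metrics(events))
# {'acceptance_rate': 0.75, 'retention_rate': 0.6666666666666666}
```

As the summary notes, such rates only become meaningful over hundreds of thousands to millions of completion events.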
