Latent Space Podcast 7/26/23 [Summary] FlashAttention 2: making Transformers 800% faster - Tri Dao of Together AI

Discover how FlashAttention revolutionized AI speed with Tri Dao, as he unveils the power of FlashAttention 2 and discusses Stanford's Hazy Research lab and the future of AI.

Prof. Otto Nomos ∙ Oct 05, 2023 ∙ 7 min read

Introduction

On the Latent Space podcast, Alessio, a Partner and CTO-in-Residence at Decibel Partners, hosts a discussion with guest Tri Dao. Tri recently completed his PhD at Stanford and is the lead author of the FlashAttention paper, a pivotal work of the Transformer era. He shares insights into efficient transformer training, inference, and long-range sequence models. He is set to become an assistant professor of Computer Science at Princeton in the coming year, and he recently joined Together, the company behind RedPajama, as Chief Scientist.

Tri reveals a personal tidbit that he initially intended to major in economics during his early days at Stanford, but after taking math classes, he shifted his focus to mathematics. This decision played a significant role in steering him towards his current career in math, computer science, and AI research.

The discussion delves into FlashAttention and its recently released successor, FlashAttention 2. The key innovation in FlashAttention is that its memory use scales linearly with sequence length, rather than quadratically as in standard attention, while still computing exact attention. Tri emphasizes the importance of avoiding approximation: while other methods approximate attention, his team's objective was to make exact attention faster and more memory-efficient. Their approach achieved a wall-clock speedup of 2 to 4 times, which makes it possible to train with sequences 2 to 4 times longer at no added cost.
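To make the linear-versus-quadratic distinction concrete, here is a rough back-of-the-envelope sketch. The function name, the fp16 element size, the head dimension of 64, and the choice of two fp32 statistics per query row are all illustrative assumptions, not figures from the episode.

```python
def attention_memory_bytes(seq_len, head_dim, bytes_per_elem=2):
    """Rough per-head memory estimate (illustrative assumptions only)."""
    # Standard attention materializes the full seq_len x seq_len score matrix.
    score_matrix = seq_len * seq_len * bytes_per_elem
    # An exact tiled variant is assumed to keep only the output rows plus
    # two fp32 softmax statistics per query row (hypothetical bookkeeping).
    output = seq_len * head_dim * bytes_per_elem
    row_stats = 2 * seq_len * 4
    return score_matrix, output + row_stats


for n in (2048, 8192, 32768):
    full, tiled = attention_memory_bytes(n, head_dim=64)
    print(f"n={n:6d}  score matrix: {full / 2**20:9.1f} MiB"
          f"   tiled storage: {tiled / 2**20:6.2f} MiB")
```

Doubling the sequence length quadruples the first number but only doubles the second, which is the scaling difference the summary refers to.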

A significant aspect of their innovation is combining ideas from machine learning and systems design, particularly kernel fusion. This technique reduces memory reads and writes, which account for most of the time spent in attention. Tri acknowledges that kernel fusion may limit flexibility, especially for researchers keen on tweaking the attention computation. The payoff, however, comes from doing more of the work in fast on-chip memory (SRAM) rather than the larger but slower off-chip memory (HBM), capitalizing on the asymmetric memory hierarchy of GPUs.
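The arithmetic that lets attention be computed one block at a time, without ever holding the full score matrix, can be sketched in plain Python. This is a minimal illustration of blockwise attention with a running ("online") softmax, not the fused CUDA kernel: the function names and `block_size` are hypothetical, and the real implementation also tiles over queries and keeps each tile in SRAM.

```python
import math


def naive_attention(Q, K, V):
    """Reference: builds a full row of scores before applying softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Same result, but K and V are visited one block at a time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator under running_max
        acc = [0.0] * d_v         # unnormalized weighted sum of values
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the old maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Small made-up inputs: 2 queries, 5 keys/values.
Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, -1.0, 0.0],
     [0.5, 0.5, 0.5], [-2.0, 1.0, 1.0]]
reference = naive_attention(Q, K, V)
blockwise = tiled_attention(Q, K, V, block_size=2)
```

Because the rescaling step is exact, `blockwise` matches `reference` up to floating-point rounding, which is why this style of computation needs no approximation.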

Memory Hierarchies in Hardware: There are multiple levels of memory storage in hardware systems:

SRAM (Static Random-Access Memory) is faster but much smaller. Its size is unlikely to grow substantially due to spatial constraints on-chip.
HBM (High Bandwidth Memory) is larger and resides off-chip. Its growth potential is larger due to having more space.

Challenges and Evolution:

The inherent spatial constraints on SRAM might prevent it from getting much larger in size.
HBM's growth is forecasted, both in size and speed.
There's a strong emphasis on designing algorithms that account for this memory disparity. Just like CPUs have small cache sizes but vast DRAM, algorithms must learn to efficiently utilize these variances.

FlashAttention's Relevance:

Attention mechanisms, along with efficient implementations of them such as FlashAttention, have proven their worth over time. Even though the exact implementations might change, the foundational ideas are expected to remain.
Attention is anticipated to be pivotal in state-of-the-art architectures in the coming years.

Research Popularity & Utility:

Tri discusses the unpredicted popularity of FlashAttention and emphasizes the significance of code as an artifact in research. It's not just about presenting an idea but ensuring that it can be efficiently and effectively utilized by others.

Hazy Research Group:

Hazy Research is a diverse research group at Stanford. It brings together experts across domains, from algorithms to systems to applications.
This diversity facilitates a robust feedback loop where theoretical ideas can be built into systems and then practically applied. Direct feedback from applications helps refine and improve theoretical concepts.
Chris Ré, who leads Hazy Research, emphasizes understanding fundamental concepts, which aids in creating more impactful research.

Academia vs Industry in AI/ML Research:

Alessio inquired about the balance and comparison between academia and industry, especially in the field of AI/ML.
Tri believes both sectors play complementary roles. Industry has the advantage in scaling due to access to resources, such as computing power. However, many foundational ideas, like the Attention mechanism, originated from academia.
After the success of models like GPT-2, companies like OpenAI emphasized scaling, achieving remarkable results.
Academia focuses on evaluations, understanding the underpinnings of models, and taking riskier research bets. They have the freedom to delve deeper into understanding and even undertake projects with a lower chance of success.
Tri suggests industry may offer better compensation and work-life balance, while academia offers more intellectual freedom. Career choice depends on individual preference.

Role of Evaluations:

Alessio highlighted how benchmarks can influence model development since models need to score well on them to gain attention and funding.
Tri emphasized the importance of evaluations and benchmarks. He notes that both academia and industry contribute to the field, understanding emerging use cases and ensuring advancements.

FlashAttention 2 & NVIDIA:

Tri introduced FlashAttention 2, a project developed over several months, which started as an exploration of NVIDIA’s CUTLASS library but evolved into an implementation roughly twice as fast as the original FlashAttention.
Currently, FlashAttention 2 works on NVIDIA GPUs. However, the main idea of addressing memory hierarchy asymmetry is universal and can be applied across different hardware.

Hardware Lottery:

Alessio referred to Sara Hooker’s idea of the "hardware lottery", where potentially better architectures may never see the light of day because they aren't optimized for the dominant hardware, such as NVIDIA GPUs.
Tri acknowledged the hardware lottery and the feedback loop it creates with software frameworks. For example, since transformers are currently dominant, most optimization work centers around them.
Compilers might offer a way out of this cycle. They allow for efficient performance across diverse hardware platforms. Tri cited the Mojo language as an example, as it aims to make AI models run efficiently on various devices.

AI Chips and On-chip Memory:

Alessio inquires about AI chip companies like Cerebras that focus on integrating everything onto the chip to combat memory bandwidth issues.
Tri acknowledges the promising direction, mentioning Tesla's Dojo supercomputer, which seeks to maximize on-chip memory speed and eliminate repetitive data transfers. A challenge is the high manufacturing cost of on-chip memory, which is pricier per gigabyte than off-chip memory. Tri cites Cerebras, which has overcome some of these obstacles with its proprietary software stack and compiler.
He also points out the complexity of supporting tools like PyTorch on such hardware, given the rapid evolution of AI models and the longer time frame required for hardware development.

Influence of Industry Pace on PhD Research:

Alessio queries the influence of the rapidly progressing industry on research, particularly in cases where newer model architectures might make older topics obsolete.
Tri reflects on the challenges faced by researchers and emphasizes the importance of understanding the fundamentals. He shares his own PhD experience and believes that acquiring foundational knowledge and skills is crucial for evolving as a researcher.

Transformer Alternatives:

Alessio brings up the potential alternatives to Transformer models.
Tri references a wager between Jonathan Frankle and Sasha Rush on this subject. He highlights several promising Transformer alternatives that have emerged, such as state space methods, which can capture long-range information without quadratic scaling. He also discusses the resurgence of recurrent neural networks (RNNs) adapted for today's AI landscape.
Tri emphasizes the academic quest to determine whether attention in models is essential. He suggests that alternative architectures might be more suited for applications with long sequences (like high-resolution images or audio) or those that require high-throughput generation. Tri is optimistic about RNNs for their potential in batch processing.
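To illustrate why recurrent and state space approaches avoid quadratic scaling, here is a minimal sketch of a linear recurrence that processes a sequence one step at a time with a fixed-size state. The diagonal parameterization and the names `a`, `b`, `c`, and `ssm_scan` are assumptions chosen for brevity; this is not the formulation of any particular model mentioned in the episode.

```python
def ssm_scan(a, b, c, inputs):
    """Run a diagonal linear state-space recurrence over a scalar sequence.

    Per step (elementwise over the state dimensions):
        h_t = a * h_{t-1} + b * u_t
        y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        # Work and memory per step depend only on the state size,
        # not on how many tokens came before.
        state = [ai * hi + bi * u for ai, hi, bi in zip(a, state, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, state)))
    return outputs


# An impulse followed by zeros: the output decays by the factor a each step.
decay = ssm_scan(a=[0.5], b=[1.0], c=[1.0], inputs=[1.0, 0.0, 0.0])
```

Each new token costs the same amount of work regardless of sequence length, so total cost grows linearly, and generation only needs the current state rather than a cache of all previous keys and values.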

Open-source AI:

Alessio comments on the evolution and ambiguity of what defines "open-source" in AI, drawing comparisons between software licenses and the transparency of AI models and datasets. He mentions the introduction of models such as RedPajama, LLaMA, and Llama 2 and their impact on the AI industry. Llama 2 is especially notable because its weights are publicly available, which he describes as roughly $3 million worth of compute given away to the public.
Tri acknowledges the contribution of Meta in training LLaMA and Llama 2 and praises the reduced restrictions on the latter. He predicts a significant impact on the open-source AI landscape due to the usability of models like Llama 2 in business settings. Tri emphasizes the shift in the balance of power from closed-source models to open models. He highlights the importance of democratizing decision-making in AI rather than concentrating it in the hands of a few corporations.
The conversation shifts to datasets, with Alessio opining that open datasets have a greater impact than open models. Tri speaks about the challenges and rewards of releasing datasets, pointing out the need to incentivize data release. He cites the Dolly-15K dataset as a positive instance of a company championing open-source datasets.
Discussing his journey, Tri reveals his reasons for joining "Together," a company focused on open-source models. He appreciates the company's philosophy, alignment with his values, and the chance to conduct research in areas he's passionate about.

Concluding lightning round:

Tri mentions he was surprised by AI's newfound ability to understand jokes.
He cites "reasoning" as an exciting unsolved question in AI, emphasizing the potential need for dedicated reasoning modules in future AI models.
Tri's takeaway message is the importance of understanding both algorithms and the systems they run on, emphasizing the excitement and results found at the intersection of machine learning and systems.
