Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)

Abstract Commentary & Rating

Prof. Otto Nomos | May 27, 2024 | 2 min read

Published on Sep 16

Authors: Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh


The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). While these models excel at understanding and generating human-like text, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference for deep neural networks. It leverages network modularity to create sub-models with varying computational loads, sorting them based on computation/accuracy characteristics in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any pretraining and by only replacing standard Supervised Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT) at the same costs. Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference. We show that using this approach, we are able to unlock the potential of intermediate layers of transformers in generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. By applying this approach on LLaMa 2 13B for tuning on the Stanford Alpaca dataset and comparing it to normal tuning and early exit via PandaLM benchmark, we show that Sorted Fine-Tuning can deliver models twice as fast as the original model while maintaining or exceeding performance.



The paper "Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)" addresses the cost of deploying Large Language Models (LLMs). Its focus is enabling dynamic inference without any additional pretraining: the only change is replacing standard fine-tuning with Sorted Fine-Tuning.

Key Insights:

  1. Efficiency Problem: LLMs are computationally expensive, which makes real-world deployment difficult, especially in real-time or latency-sensitive applications.

  2. SortedNet Adaptation: The authors extend the SortedNet technique (previously applied to deep neural networks) to generative NLP tasks. SortedNet trains nested sub-models with different computational loads inside a single network, so the amount of computation used at inference can be matched to the available budget.

  3. SoFT Over SFT: The proposal is to replace standard Supervised Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT). According to the authors, SoFT costs the same as SFT and needs no pretraining, yet produces a single model that can serve several computational budgets.

  4. Potential of Intermediate Layers: A key takeaway is that not all layers in a transformer are necessarily required for every task. The potential of intermediate layers can be unlocked for target output generation, which can be more computationally efficient.

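  5. Performance Gains: Applying SoFT to LLaMA 2 13B on the Stanford Alpaca dataset, and comparing against normal tuning and early exit on the PandaLM benchmark, the authors report models twice as fast as the original while maintaining or exceeding its performance.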
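
To make insights 2 to 4 concrete, here is a minimal sketch of what a sorted fine-tuning objective could look like. It is a toy written for this commentary, not the authors' implementation: the tiny residual `layer`, the shared output `HEAD`, the exit depths, and the choice to average the per-depth losses are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target token id."""
    return -math.log(softmax(logits)[target])

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def layer(weights, hidden):
    """Toy stand-in for a transformer block: h + tanh(W h)."""
    return [h + math.tanh(z) for h, z in zip(hidden, matvec(weights, hidden))]

def sorted_loss(layers, head, hidden, target, exit_depths):
    """Average the loss of every nested sub-model.

    Sub-model d is the first d layers followed by the SAME output head
    (assumption: one shared head, losses averaged across depths).
    """
    per_depth = {}
    for depth, weights in enumerate(layers, start=1):
        hidden = layer(weights, hidden)
        if depth in exit_depths:
            per_depth[depth] = cross_entropy(matvec(head, hidden), target)
    if not per_depth:
        raise ValueError("no exit depth falls inside the model")
    return sum(per_depth.values()) / len(per_depth), per_depth

# Hypothetical toy model: 4 layers, hidden size 2, vocabulary size 3.
LAYERS = [
    [[0.5, -0.2], [0.1, 0.3]],
    [[-0.4, 0.6], [0.2, 0.1]],
    [[0.3, 0.3], [-0.5, 0.2]],
    [[0.1, -0.7], [0.4, 0.4]],
]
HEAD = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]

soft_total, soft_per_depth = sorted_loss(LAYERS, HEAD, [1.0, -0.5], 2, {2, 4})
sft_total, _ = sorted_loss(LAYERS, HEAD, [1.0, -0.5], 2, {len(LAYERS)})
```

With `exit_depths` set to only the final layer, the function reduces to an ordinary fine-tuning loss (`sft_total`), which is one way to see why the switch from SFT to SoFT does not require a different model or dataset. The extra terms train the intermediate layers to produce usable outputs through the same head.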

Potential Real-World Impact:

  • Cost-Effective Deployment: For companies or applications that leverage LLMs, this approach could significantly reduce computational costs, making widespread deployment more feasible.

  • Real-time Applications: Faster models benefit applications that need real-time language processing, such as chatbots and virtual assistants.

  • Storage and Transition Benefits: Because the sub-models remain part of the original model, a single set of weights covers every computational budget. Storage requirements stay minimal, and moving between budgets does not require swapping in a different model.

  • Customization: Depending on the computational constraints of a particular application, users can choose the appropriate model depth, providing flexibility.

  • Broad Applicability: Given that this is a fine-tuning approach, it could be applied to various LLMs across diverse domains.
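
The customization point can be sketched as a small selection rule: given a latency budget, pick the deepest sub-model that fits. This is an illustration only. The exit depths and the 100 ms full-model latency below are hypothetical, and the estimate assumes latency grows linearly with the number of layers, which ignores the embedding and output-head costs.

```python
def estimated_speedup(full_depth, depth):
    """Speedup over the full model, assuming cost is proportional to depth."""
    return full_depth / depth

def pick_depth(exit_depths, full_depth, full_latency_ms, budget_ms):
    """Return the deepest exit whose estimated latency fits the budget."""
    feasible = [
        d for d in sorted(exit_depths)
        if full_latency_ms * d / full_depth <= budget_ms
    ]
    if not feasible:
        raise ValueError("no sub-model fits the latency budget")
    return feasible[-1]

# Hypothetical setup: a 40-layer model with an exit every 4 layers from 12 up.
EXITS = (12, 16, 20, 24, 28, 32, 36, 40)
FULL_DEPTH = 40
FULL_LATENCY_MS = 100.0  # made-up measurement for the full model

chosen = pick_depth(EXITS, FULL_DEPTH, FULL_LATENCY_MS, budget_ms=50.0)
speedup = estimated_speedup(FULL_DEPTH, chosen)
```

Under these assumptions a 50 ms budget selects depth 20, a 2x estimated speedup. All candidates live in the same checkpoint, so changing the budget changes only where the forward pass stops.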


Challenges:

  • Adoption Time: It might take some time for businesses and developers to adopt and adjust to this new fine-tuning approach.

  • Domain-Specific Challenges: The effectiveness of this method across a diverse range of domains and tasks remains to be extensively tested.

Given the rising importance of LLMs in numerous applications and the ever-present need to optimize computational costs without compromising performance, I'd rate the real-world impact of this paper as a 9 out of 10.

The ability to harness the capabilities of LLMs more efficiently could drastically change how these models are deployed, making them more ubiquitous in various applications.
