- Updated: March 24, 2026
- 8 min read
TinyLoRA: 13-Parameter Fine-Tuning Method Hits 91.8% GSM8K on Qwen2.5-7B
Researchers from FAIR at Meta, Cornell University, and Carnegie Mellon University have demonstrated that large language models (LLMs) can learn to reason using a remarkably small number of trained parameters. The research team introduces TinyLoRA, a parameterization that can scale down to a single trainable parameter under extreme sharing settings. Using this method on a Qwen2.5-7B-Instruct backbone, the team achieved 91.8% accuracy on the GSM8K benchmark with only 13 trained parameters, totaling just 26 bytes in bf16.

Overcoming the Constraints of Standard LoRA

Standard Low-Rank Adaptation (LoRA) adapts a frozen linear layer $W \in \mathbb{R}^{d \times k}$ using trainable matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$. The trainable parameter count in standard LoRA still scales with layer width and rank, which leaves a nontrivial lower bound even at rank 1: for a model like Llama3-8B, the minimum update size is approximately 3 million parameters. TinyLoRA circumvents this by building on LoRA-XS, which uses the truncated Singular Value Decomposition (SVD) of the frozen weights.
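To make that rank-1 lower bound concrete, here is a back-of-the-envelope sketch. The layer shapes below are the commonly published Llama3-8B dimensions (assumed here purely for illustration); each adapted $d \times k$ linear still needs $d \cdot r + r \cdot k$ trainable values even at $r = 1$.

```python
# Rank-1 LoRA parameter floor for a Llama3-8B-style transformer.
# Shapes are the commonly published Llama3-8B dimensions (assumed for
# illustration): 32 layers, hidden size 4096, GQA KV width 1024,
# MLP intermediate size 14336.
def lora_params(d, k, r):
    """Trainable params for LoRA on one d x k linear: A is d x r, B is r x k."""
    return d * r + r * k

layers = 32
linears = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]

total = layers * sum(lora_params(d, k, r=1) for d, k in linears)
print(f"{total:,}")  # 2,621,440
```

Roughly 2.6M trainable parameters even at rank 1, consistent with the ~3 million figure cited above: this is the floor TinyLoRA is designed to break through.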
While LoRA-XS typically requires at least one trainable parameter per adapted module, TinyLoRA replaces the trainable matrix with a low-dimensional trainable vector $v \in \mathbb{R}^{u}$ projected through a fixed random tensor $P \in \mathbb{R}^{u \times r \times r}$. The update rule is defined as:

$$W' = W + U\Sigma\left(\sum_{i=1}^{u}v_{i}P_{i}\right)V^{\top}$$

By applying a weight-tying factor $n_{\mathrm{tie}}$, the total number of trainable parameters scales as $O(n_{m}u/n_{\mathrm{tie}})$, allowing updates to scale down to a single parameter when all modules across all layers share the same vector.

Reinforcement Learning: The Catalyst for Tiny Updates

A core finding of the research is that Reinforcement Learning (RL) is fundamentally more efficient than Supervised Finetuning (SFT) at extremely low parameter counts. The research team reports that models trained via SFT require updates 100 to 1,000 times larger to reach the same performance as those trained with RL. This gap is attributed to the 'information density' of the training signal. SFT forces a model to absorb many bits of information, including stylistic noise and irrelevant structure in human demonstrations, because its objective treats all tokens as equally informative. In contrast, RL, specifically Group Relative Policy Optimization (GRPO), provides a sparser but cleaner signal: because rewards are binary (e.g., exact match on a math answer), reward-relevant features correlate with the signal while irrelevant variations cancel out through resampling.

Optimization Guidelines for Developers

The research team isolated several strategies that maximize the efficiency of tiny updates:

- Optimal frozen rank (r): Analysis showed that a frozen SVD rank of r = 2 was optimal. Higher ranks introduced too many degrees of freedom, complicating the optimization of the small trainable vector.
- Tiling vs. structured sharing: The team compared 'structured' sharing (modules of the same type share parameters) with 'tiling' (nearby modules of similar depth share parameters). Surprisingly, tiling was more effective, showing no inherent benefit to forcing parameter sharing exclusively between specific projections such as Query or Key modules.
- Precision: In bit-constrained regimes, storing parameters in fp32 proved most performant bit-for-bit, even when accounting for its larger footprint compared to bf16 or fp16.

Benchmark Performance

The research team reports that Qwen2.5 models often needed around 10x fewer updated parameters than Llama-3 to reach similar performance in their setup.

| Model | Parameters Trained | GSM8K Pass@1 |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 0 | 88.2% |
| Qwen2.5-7B-Instruct | 1 | 82.0% |
| Qwen2.5-7B-Instruct | 13 | 91.8% |
| Qwen2.5-7B-Instruct | 196 | 92.2% |
| Qwen2.5-7B-Instruct (Full FT) | ~7.6 Billion | 91.7% |

On harder benchmarks such as MATH500 and AIME24, 196-parameter updates for Qwen2.5-7B-Instruct retained 87% of the absolute performance improvement of full finetuning across six difficult math benchmarks.

Key Takeaways

- Extreme parameter efficiency: A Qwen2.5-7B-Instruct model can be trained to 91.8% accuracy on the GSM8K math benchmark using only 13 parameters (26 total bytes).
- The RL advantage: Reinforcement Learning is fundamentally more efficient than Supervised Finetuning in low-capacity regimes; SFT requires 100-1000x larger updates to reach the same performance level as RL.
- TinyLoRA framework: The new parameterization uses weight tying and random projections to scale low-rank adapters down to a single trainable parameter.
- Optimizing the "micro-update": For these tiny updates, fp32 precision is more bit-efficient than half-precision formats, and 'tiling' (sharing parameters by model depth) outperforms structured sharing by module type.
- Scaling trends: As models grow larger, they become more 'programmable' with fewer absolute parameters, suggesting that trillion-scale models could potentially be tuned for complex tasks using just a handful of bytes.
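The TinyLoRA update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the toy dimensions ($d = k = 8$, frozen SVD rank $r = 2$, trainable vector length $u = 13$) are assumptions chosen to mirror the 13-parameter setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 8   # toy layer width (hypothetical, for illustration)
r = 2       # frozen SVD rank -- the value the paper found optimal
u = 13      # trainable vector length (13 params when fully tied)

W = rng.standard_normal((d, k))  # frozen pretrained weight

# Truncated SVD of the frozen weight: W ~ U Sigma V^T at rank r.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U, Sigma, V = U_full[:, :r], np.diag(s[:r]), Vt[:r, :].T

# Fixed random projection tensor P in R^{u x r x r} (frozen, never trained).
P = rng.standard_normal((u, r, r))

# The ONLY trainable parameters: v in R^u.
v = rng.standard_normal(u) * 0.01

# W' = W + U Sigma (sum_i v_i P_i) V^T
M = np.einsum("i,ijk->jk", v, P)   # sum_i v_i P_i, an r x r matrix
W_new = W + U @ Sigma @ M @ V.T

print(W_new.shape)  # (8, 8)
```

Note the weight-tying angle: because `P` is fixed and shared, the same 13-element `v` can drive the update for every adapted module in every layer, which is how the total trainable count stays at $u$ regardless of model size.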
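The binary-reward GRPO signal discussed above can be illustrated with a toy group-relative advantage computation. This is a simplified sketch under assumed inputs, not the authors' training loop; real GRPO additionally uses a clipped policy-gradient objective and KL regularization.

```python
# Toy GRPO-style group-relative advantages with a binary
# exact-match reward (simplified sketch, hypothetical data).
def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

gold = "42"
samples = ["42", "41", "42", "7"]  # hypothetical sampled answers
rewards = [1.0 if s == gold else 0.0 for s in samples]  # binary reward
advs = group_advantages(rewards)
print([round(a, 2) for a in advs])  # [1.0, -1.0, 1.0, -1.0]
```

Stylistic differences between two correct samples receive identical rewards, so they contribute no gradient pressure; only answer correctness survives the resampling, which is the 'cleaner signal' the paper credits for RL's efficiency at tiny parameter counts.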
Check out the Paper.