Structured-Then-Unstructured Pruning for Scalable MoE Pruning [Paper Reflection]

by softbliss Ā· June 9, 2025 Ā· Machine Learning


Mixture-of-Experts (MoE) architectures offer a promising way to keep inference costs in check as models grow: they sparsely activate only specific parts of the model for each input. However, even with MoEs, the sheer number of parameters and experts makes deployment and serving costly.

Pruning is an established method to reduce the number of parameters of a trained model while maintaining its task performance. Typically, we distinguish two kinds of approaches. Unstructured pruning removes individual weights, while structured pruning removes entire model components.
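
To make the distinction concrete, here is a minimal PyTorch-style sketch. The function names, the magnitude criterion, and the representation of experts as a list of weight tensors are illustrative simplifications, not the procedure of any particular method:

```python
import torch

def unstructured_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude individual weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def structured_prune(expert_weights: list[torch.Tensor], keep: set[int]) -> list[torch.Tensor]:
    """Drop entire components (here: experts), keeping only the listed indices."""
    return [w for i, w in enumerate(expert_weights) if i in keep]
```

Note that unstructured pruning leaves tensor shapes intact (it only zeroes entries), whereas structured pruning actually shrinks the model.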

Because experts are clearly delineated components, structured pruning seems like an ideal match for MoEs: by removing redundant experts, we can shrink the total model size. However, current approaches for expert pruning require many forward passes, whose number grows exponentially with the number of experts. Further, structured pruning does not reduce the number of active weights during inference.

In our paper STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning, which was accepted for presentation at ACL 2025, we combine the two classes of pruning methods and introduce an approach that works exceptionally well for MoEs with over 100 experts. In a nutshell, STUN first removes redundant experts and then performs unstructured pruning inside the individual experts that remain.

Scaling barriers for Mixture-of-Experts models

MoEs are an effective technique to increase the total number of model parameters while keeping computational demands in check. By dividing the model into specialized structures, called experts, and selectively activating them based on the input, MoEs achieve efficiency gains in training and inference.
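
As a rough sketch of what sparse activation means in practice, the following generic top-k router mixes the outputs of only the selected experts per token. It is a simplified gating scheme for illustration, not the exact routing used by any specific model:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_weight, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:             (tokens, d_model) input activations
    router_weight: (d_model, n_experts) gating matrix
    experts:       list of callables, one per expert (d_model -> d_model)
    """
    logits = x @ router_weight                          # (tokens, n_experts)
    gates, idx = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                    # tokens sent to expert e
            if mask.any():
                out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```

Only k of the n experts run for each token, which is where the compute savings come from; the memory needed to store all n experts, however, remains.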

More experts allow the model to capture a broader range of representations and specializations, improving performance on diverse tasks or complex data. Unsurprisingly, we see a clear trend towards an increased number of experts in MoEs. To illustrate this evolution: Mistral’s Mixtral 8x7B (December 2023) builds on eight experts, Databricks’ DBRX (March 2024) on 16, and Snowflake’s Arctic (April 2024) on 128.

However, as models scale further, the efficiency gains provided by the MoE architecture alone are insufficient. Here, pruning becomes essential, refining the architecture by removing redundant parameters without compromising overall performance. Combining MoEs with pruning techniques can optimize inference speed and memory consumption, making it a promising direction for further scaling models.

Solving the exponential scaling challenge in structured MoE pruning

Structured pruning removes specific patterns, such as rows or entire weight tensors. In MoEs, each expert corresponds to exactly such a pattern, so pruning experts is a natural fit for structured pruning.

While an increase from 8 to 128 experts may seem modest, it renders current pruning methods unviable. Roughly speaking, they take a "combinatorial" approach to determining which structures to remove: they enumerate all possible subsets of experts to find the optimal configuration. To illustrate, when the number of experts increases from 8 to 128, the forward passes of combinatorial pruning algorithms grow exponentially, from 70 to 2.4 Ɨ 10³⁷.
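
The quoted figures follow from simple combinatorics: if a method scores every way of keeping half of the experts (our assumed reading, chosen because it reproduces the numbers above), it needs C(n, n/2) forward passes.

```python
from math import comb

# Forward passes needed to score every way of keeping half of the experts.
for n_experts in (8, 16, 128):
    print(n_experts, comb(n_experts, n_experts // 2))
# 8   -> 70
# 16  -> 12870
# 128 -> about 2.4e37
```

At 128 experts, evaluating that many candidate subsets with forward passes is clearly out of reach.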

In contrast, STUN leverages the behavioral similarity between experts to make informed pruning decisions. Specifically, it first identifies clusters of behaviorally similar experts, and this similarity can be determined at minimal cost by inspecting the model’s weights: if the weight rows corresponding to two experts have similar values, the pair tends to activate on similar inputs and exhibit similar outputs, thus forming a cluster.

By pruning all but one representative expert from each cluster, STUN effectively reduces the model size while preserving its overall functionality. This approach drastically reduces the exponential complexity of exhaustively enumerating combinations to constant O(1), making it highly scalable for massive MoEs.
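
A minimal sketch of this selection step, assuming each expert is represented by a flattened weight vector and using cosine similarity with a greedy one-representative-per-cluster rule; the paper's actual similarity measure and clustering procedure differ in detail:

```python
import torch
import torch.nn.functional as F

def select_representative_experts(expert_weights: torch.Tensor,
                                  threshold: float = 0.95) -> list[int]:
    """Keep one representative per cluster of behaviorally similar experts.

    expert_weights: (n_experts, d) one flattened weight vector per expert.
    Returns indices of the experts to keep; all others are pruned.
    """
    w = F.normalize(expert_weights, dim=-1)
    sim = w @ w.T                                   # pairwise cosine similarity
    keep: list[int] = []
    for e in range(sim.shape[0]):
        # Keep this expert only if no already-kept expert is too similar to it.
        if all(sim[e, r] < threshold for r in keep):
            keep.append(e)
    return keep
```

Because the similarity comes from a single pass over the weights rather than forward passes over candidate subsets, the cost no longer grows with the number of expert combinations.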

Exploring the potential of a two-phase approach to MoE pruning

A key question in our research was: How much can we gain from an additional unstructured pruning phase? After we remove all redundant experts, there might be less "margin" for further pruning compared to a scenario where we exclusively apply unstructured pruning.

We can quantify this margin as the kurtosis of the model weights’ distribution, colloquially known as its "tailedness." As unstructured pruning removes near-zero weights, it reduces the weight distribution’s kurtosis.
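
A toy check of this effect, using synthetic Laplace-distributed values in place of real model weights (scipy reports excess kurtosis, so higher means heavier tails):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
weights = rng.laplace(scale=0.02, size=100_000)   # heavy-tailed toy "weights"

print(kurtosis(weights))          # roughly 3 for a Laplace distribution

# Unstructured pruning: drop the 40% of weights closest to zero.
threshold = np.quantile(np.abs(weights), 0.4)
survivors = weights[np.abs(weights) >= threshold]
print(kurtosis(survivors))        # clearly lower: the near-zero mass is gone
```

Removing the near-zero center inflates the variance relative to the tails, which is exactly the drop in kurtosis described above.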

Unlike unstructured pruning, which selectively targets weights that minimally impact the model’s output, structured pruning removes groups of parameters (in our case, experts) based on redundancy or low importance. Thus, structured pruning does not significantly decrease kurtosis, leaving plenty of margin for unstructured pruning.

For instance, if two experts in an MoE perform identically, one can be removed without altering the model’s output. Still, this does not significantly influence the overall weight distribution—it only reduces the model’s size.

Since structured pruning primarily reduces architectural redundancy rather than reshaping the underlying weight distribution, our two-phase approach—leveraging unstructured pruning after structured pruning—outperforms unstructured-only pruning.
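
Putting both phases together in terms of the sketches above; this is illustrative glue code under the assumption that each expert is a single nn.Linear, not the paper's exact pipeline:

```python
import torch
from torch import nn

def stun_style_prune(experts: list[nn.Linear], sparsity: float = 0.4,
                     threshold: float = 0.95) -> list[nn.Linear]:
    """Two-phase sketch: structured expert pruning, then unstructured pruning.

    Reuses select_representative_experts() and unstructured_prune() from the
    sketches above; assumes all experts share the same weight shape.
    """
    # Phase 1 (structured): represent each expert by its flattened weights
    # and keep one representative per cluster of similar experts.
    flat = torch.stack([e.weight.detach().flatten() for e in experts])
    kept = [experts[i] for i in select_representative_experts(flat, threshold)]

    # Phase 2 (unstructured): prune individual weights inside the survivors.
    for expert in kept:
        expert.weight.data = unstructured_prune(expert.weight.data, sparsity)
    return kept
```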

Putting STUN to the test

Our evaluations show that STUN achieves high sparsity with no loss in performance on various MoE architectures, including Snowflake’s Arctic, a 480B-sized MoE with 128 experts.

We achieved nearly no loss in performance at 40% sparsity, even on challenging generative tasks like GSM8K (Grade School Math 8K), a widely adopted benchmark of grade-school math word problems that require multi-step reasoning.

Figure: GSM8K 5-shot accuracy for Snowflake Arctic, a 480B Mixture-of-Experts model, after applying different pruning strategies to varying degrees. Structured-only pruning exhibits a significant performance loss as more and more experts are removed (a sparsity of 30% corresponds to just 90 of the original 128 experts left). Unstructured-only pruning maintains unchanged performance up to the point where 30% of the weights are removed. With STUN, the combination of both approaches, benchmark performance remains virtually unaffected up to a sparsity of 40%. This demonstrates that the strategic removal of redundant experts, followed by unstructured pruning, outperforms structured-only and unstructured-only pruning.

In some cases, STUN performed orders of magnitude better than unstructured pruning methods. Our O(1) expert pruning method also outperformed existing, more computationally expensive methods, such as that of Lu et al. (2024), highlighting the effectiveness of our approach.

What’s next in MoE pruning?

Since STUN does not make any assumptions about the base MoE model, it generalizes to other MoE families, such as Mixtral. Our code is available on GitHub. We encourage you to read our paper and adapt STUN to your own MoE models.

Beyond applying and evaluating STUN, a crucial next area of optimization is hardware acceleration for models pruned in an unstructured fashion. Unstructured pruning removes individual weights without considering their location or arrangement in the model. As a result, the model’s sparsity pattern is random and unaligned: some rows, columns, or even small sections may become very sparse, while others remain dense.

This irregularity is challenging because hardware like GPUs or TPUs assumes regular, contiguous memory layouts. While structured pruning yields a predictable sparsity pattern that allows for memory optimization, the irregularly sparse models resulting from unstructured pruning prevent efficient memory access and parallel processing.

Specialized hardware support can reorganize memory access patterns to reduce overheads from irregularity. Such co-evolution of hardware and software support will likely further establish pruning as a cornerstone of scaling and applying MoE models.
