Mixture-of-Experts (MoE) architectures offer a promising solution by sparsely activating only parts of the model for each input, which reduces inference overhead. However, even with MoEs, the sheer number of parameters and experts makes deployment and serving costly.
Pruning is an established method to reduce the number of parameters of a trained model while maintaining its task performance. Typically, we distinguish two kinds of approaches. Unstructured pruning removes individual weights, while structured pruning removes entire model components.
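To make the distinction concrete, here is a minimal NumPy sketch (ours, not taken from any particular pruning method): unstructured pruning zeroes individual low-magnitude weights and keeps the matrix shape, while structured pruning drops whole components, here entire rows.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # toy weight matrix: 4 output rows, 8 inputs

# Unstructured pruning: zero the 50% of individual weights with the smallest magnitude.
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) < threshold, 0.0, W)   # same shape, now sparse

# Structured pruning: remove entire rows (e.g., whole neurons or experts).
row_importance = np.linalg.norm(W, axis=1)      # score each row by its norm
keep = np.sort(np.argsort(row_importance)[-2:]) # keep the 2 highest-scoring rows
W_structured = W[keep]                          # a smaller, still dense matrix

print(W_unstructured.shape, float((W_unstructured == 0).mean()))  # (4, 8) 0.5
print(W_structured.shape)                                         # (2, 8)
```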
Because MoEs have such a clear structure, structured pruning seems like an ideal match for them. By removing redundant experts, we can shrink the total model size. However, current approaches for expert pruning require many forward passes, and their number grows exponentially with the number of experts. Further, expert pruning alone does not reduce the number of weights active during inference, since the router still activates the same number of experts per token.
In our paper STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning, which was accepted for presentation at ACL 2025, we combine the two classes of pruning methods and introduce an approach that works exceptionally well for MoEs with over 100 experts. In a nutshell, STUN first removes redundant experts and then performs unstructured pruning inside the remaining experts.
Scaling barriers for Mixture-of-Experts models
MoEs are an effective technique to increase the total number of model parameters while keeping computational demands in check. By dividing the model into specialized structures, called experts, and selectively activating them based on the input, MoEs achieve efficiency gains in training and inference.
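The sketch below shows this mechanism in PyTorch, assuming a standard top-k softmax router; the layer sizes and routing details are illustrative only and do not correspond to any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal MoE layer: a router selects the top-k experts for each token."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                          # x: (n_tokens, d_model)
        logits = self.router(x)                    # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                # only the selected experts are evaluated
            for e in idx[:, k].unique().tolist():
                token_mask = idx[:, k] == e
                out[token_mask] += weights[token_mask, k:k+1] * self.experts[e](x[token_mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)                         # torch.Size([10, 64])
```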
More experts allow the model to capture a broader range of representations and specializations, improving performance on diverse tasks or complex data. Unsurprisingly, we see a clear trend towards an increased number of experts in MoEs. To illustrate this evolution: Mistral’s Mixtral 8x7B (December 2023) uses eight experts, Databricks’ DBRX (March 2024) uses 16, and Snowflake’s Arctic (April 2024) uses 128.
However, as models scale further, the efficiency gains provided by the MoE architecture alone are insufficient. Here, pruning becomes essential, refining the architecture by removing redundant parameters without compromising overall performance. Combining MoEs with pruning techniques can optimize inference speed and memory consumption, making it a promising direction for further scaling models.
Solving the exponential scaling challenge in structured MoE pruning
Structured pruning removes specific patterns, such as rows or entire weight tensors. Since each expert in a trained MoE corresponds to such a pattern, pruning experts is a natural fit for structured pruning.
While an increase from 8 to 128 experts may seem modest, it renders current pruning methods unviable. Roughly speaking, they take a “combinatorial” approach: to find the optimal configuration, they enumerate all possible subsets of experts. To illustrate, when the number of experts increases from 8 to 128, the number of forward passes required by such combinatorial pruning algorithms grows exponentially, from 70 to roughly 2.4 × 10³⁷.
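A back-of-the-envelope calculation makes the gap tangible. The sketch below assumes the combinatorial method evaluates every way of keeping half of the experts, one forward pass per candidate subset, which matches the counts quoted above.

```python
from math import comb

# Forward passes a combinatorial method needs when it evaluates every way of
# keeping half of the experts (one forward pass per candidate subset).
for n_experts in (8, 16, 128):
    print(n_experts, comb(n_experts, n_experts // 2))
# 8   -> 70
# 16  -> 12,870
# 128 -> about 2.4e37
```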
In contrast, STUN leverages the behavioral similarity between experts to make informed pruning decisions. Specifically, it first identifies clusters of behaviorally similar experts. This similarity can be determined at minimal cost by inspecting the model’s weights: if the rows corresponding to two experts hold similar values, the pair tends to activate on similar inputs and produce similar outputs, and thus forms a cluster.
By pruning all but one representative expert from each cluster, STUN effectively reduces the model size while preserving its overall functionality. This approach drastically reduces the exponential complexity of exhaustively enumerating combinations to constant O(1), making it highly scalable for massive MoEs.
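Here is a minimal sketch of this idea. It assumes expert similarity is measured on the rows of the router’s weight matrix (one row per expert) and uses a simple greedy grouping, so it illustrates the principle rather than reproducing the paper’s exact algorithm.

```python
import numpy as np

def cluster_and_prune_experts(router_weights, sim_threshold=0.9):
    """Group experts whose router rows are highly similar; keep one per group.

    router_weights: (n_experts, d_model) array, one row per expert.
    Returns the indices of the experts to keep.
    """
    W = router_weights / np.linalg.norm(router_weights, axis=1, keepdims=True)
    sim = W @ W.T                          # pairwise cosine similarity, no forward passes
    keep, assigned = [], set()
    for e in range(len(W)):                # greedy: the first unassigned expert
        if e in assigned:                  # represents its whole cluster
            continue
        cluster = np.where(sim[e] >= sim_threshold)[0]
        assigned.update(cluster.tolist())
        keep.append(e)
    return keep

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))                       # 4 distinct "behaviors"
router = np.repeat(base, 2, axis=0) + 0.01 * rng.normal(size=(8, 16))
print(cluster_and_prune_experts(router))              # e.g. [0, 2, 4, 6]
```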
Exploring the potential of a two-phase approach to MoE pruning
A key question in our research was: How much can we gain from an additional unstructured pruning phase? After we remove all redundant experts, there might be less “margin” for further pruning compared to a scenario where we exclusively apply unstructured pruning.
We can quantify this margin as the kurtosis of the model weights’ distribution, colloquially known as its “tailedness.” Since unstructured pruning removes near-zero weights, it reduces the distribution’s kurtosis and, with it, the remaining margin.
Unlike unstructured pruning, which selectively targets weights that minimally impact the model’s output, structured pruning removes groups of parameters (in our case, experts) based on redundancy or low importance. Thus, structured pruning does not significantly decrease kurtosis, leaving plenty of margin for unstructured pruning.
For instance, if two experts in an MoE perform identically, one can be removed without altering the model’s output. Still, this does not significantly influence the overall weight distribution—it only reduces the model’s size.
Since structured pruning primarily reduces architectural redundancy rather than reshaping the underlying weight distribution, our two-phase approach—leveraging unstructured pruning after structured pruning—outperforms unstructured-only pruning.
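The toy experiment below (ours, not from the paper) illustrates this argument: dropping one of two duplicated experts barely changes the weight distribution or its kurtosis, whereas magnitude-based unstructured pruning removes exactly the near-zero mass that kurtosis reflects.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

# Eight toy "experts" with heavy-tailed (Laplace) weights; two are exact duplicates.
shared = rng.laplace(scale=0.02, size=(512, 1024))
experts = [shared, shared.copy()] + [rng.laplace(scale=0.02, size=(512, 1024))
                                     for _ in range(6)]
all_weights = np.concatenate([e.ravel() for e in experts])

# Structured pruning: drop one of the two duplicated experts.
structured = np.concatenate([e.ravel() for e in experts[1:]])

# Unstructured pruning: drop the 40% of weights closest to zero.
threshold = np.quantile(np.abs(all_weights), 0.4)
unstructured = all_weights[np.abs(all_weights) >= threshold]

print(f"original:           kurtosis = {kurtosis(all_weights):.2f}")
print(f"after structured:   kurtosis = {kurtosis(structured):.2f}")    # barely changes
print(f"after unstructured: kurtosis = {kurtosis(unstructured):.2f}")  # clearly lower
```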
Putting STUN to the test
Our evaluations show that STUN achieves high sparsity with virtually no loss in performance across various MoE architectures, including Snowflake’s Arctic, a 480B-parameter MoE with 128 experts.
We achieved nearly no loss in performance at 40% sparsity, even on challenging generative tasks like GSM8K (Grade School Math 8K), a widely adopted benchmark of grade-school math word problems that require multi-step reasoning.

In some cases, STUN performed orders of magnitude better than unstructured pruning methods. Our O(1) expert pruning also outperformed existing, far more computationally expensive methods, such as that of Lu et al. (2024), highlighting the effectiveness of our approach.
What’s next in MoE pruning?
Since STUN does not make any assumptions about the base MoE model, it generalizes to other MoE families, such as Mixtral. Our code is available on GitHub. We encourage you to read our paper and apply STUN to your own MoE models.
Beyond applying and evaluating STUN, a crucial next area of optimization is hardware acceleration for models pruned in an unstructured way. Unstructured pruning removes individual weights without considering their location or arrangement in the model. As a result, the model’s sparsity pattern is random and unaligned: some rows, columns, or small blocks may become very sparse, while others remain dense.
This irregularity is challenging because hardware like GPUs or TPUs assumes regular, contiguous memory layouts. While structured pruning yields a predictable sparsity pattern that allows for memory optimization, the irregularly sparse models resulting from unstructured pruning prevent efficient memory access and parallel processing.
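The NumPy sketch below illustrates the memory-layout issue; it is a schematic illustration, not a model of any particular accelerator. Randomly placed zeros keep the dense shape and only become exploitable through sparse formats with irregular index structures, whereas removing whole rows simply yields a smaller dense matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)

# Unstructured pruning: zero 60% of individual weights at random positions.
W_unstructured = W.copy()
W_unstructured[rng.random(W.shape) < 0.6] = 0.0

# Structured pruning: drop 60% of the rows entirely.
keep = np.sort(rng.choice(1024, size=410, replace=False))
W_structured = W[keep]

# The randomly placed zeros still occupy memory and flow through a dense matmul;
# exploiting them requires a sparse format whose index arrays break the
# contiguous access patterns accelerators are optimized for.
sparse = csr_matrix(W_unstructured)
print(W_unstructured.nbytes)   # ~4.2 MB, unchanged by unstructured pruning
print(W_structured.nbytes)     # ~1.7 MB, a smaller dense matrix
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)  # values + index overhead
```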
Specialized hardware support can reorganize memory access patterns to reduce overheads from irregularity. Such co-evolution of hardware and software support will likely further establish pruning as a cornerstone of scaling and applying MoE models.