Researchers use control theory to shed needless complexity from AI models during training, reducing compute costs without sacrificing performance.
Training a large artificial intelligence model is expensive, not just in dollars, but in time, energy, and computational resources. Traditionally, obtaining a smaller, faster model has meant either training a large one first and then trimming it down, or training a small one from scratch and accepting weaker performance.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the Max Planck Institute for Intelligent Systems, the European Laboratory for Learning and Intelligent Systems, ETH, and Liquid AI have now developed a new approach that sidesteps this trade-off entirely, compressing models during training rather than after.
The approach, called CompreSSM, focuses on a family of AI architectures known as state-space models, which power applications ranging from language processing to audio generation and robotics. By using mathematical tools from control theory, the researchers can detect which parts of a model are pulling their weight and which are dead weight, then surgically remove the unnecessary components early in the training process.
“It’s basically a way to make models smaller and faster as they’re training,” said Makram Chahine, a PhD student in electrical engineering and computer science, CSAIL affiliate, and lead author of the paper. “As they learn, they also shed parts that aren’t useful to their development.”
The main insight is that the relative importance of the various components within these models stabilizes remarkably early in training. Using a mathematical quantity called Hankel singular values, which measures how much each internal state contributes to the model’s overall behavior, the team showed they can reliably rank which dimensions matter and which don’t after only about 10 percent of the training process. Once those rankings are established, the less important components can be safely discarded, and the remaining 90 percent of training proceeds at the speed of a much smaller model.
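To make the idea concrete, here is a minimal sketch (not the authors’ code) of how Hankel singular values are computed for a toy discrete-time linear time-invariant state-space model, and how they can be used to decide which state dimensions to keep. The model matrices, the threshold, and the 16-dimensional toy size are all illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Toy discrete-time LTI state-space model: x' = A x + B u, y = C x.
# A must be stable (spectral radius < 1) for the Gramians to exist.
rng = np.random.default_rng(0)
n = 16                                    # state dimension (illustrative)
A = np.diag(rng.uniform(0.1, 0.95, n))    # stable diagonal dynamics
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))

# Controllability and observability Gramians via discrete Lyapunov equations:
#   P = A P A^T + B B^T,   Q = A^T Q A + C^T C
P = solve_discrete_lyapunov(A, B @ B.T)
Q = solve_discrete_lyapunov(A.T, C.T @ C)

# Hankel singular values: square roots of the eigenvalues of P Q.
# Each one measures how much a state contributes to input-output behavior.
eig = np.clip(np.linalg.eigvals(P @ Q).real, 0.0, None)
hsv = np.sort(np.sqrt(eig))[::-1]

# Discard states whose contribution falls below a (hypothetical) threshold.
keep = hsv > 1e-3 * hsv[0]
print(f"full dim: {n}, kept dim: {int(keep.sum())}")
```

In CompreSSM, a ranking like this is computed early in training, and the low-ranked dimensions are dropped before the bulk of the training compute is spent.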
“What’s interesting about this work is that it turns compression from an afterthought into part of the learning process itself,” said senior author Daniela Rus, MIT professor and director of CSAIL. “Rather than training a large model and then figuring out how to make it smaller, CompreSSM lets the model discover its own efficient structure as it learns. That’s a fundamentally different way to think about building AI systems.”
The results are striking. On image classification benchmarks, compressed models retain almost the same accuracy as their full-sized counterparts while training up to 1.5 times faster. A model compressed to roughly a quarter of its original state size achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared with just 81.8 percent for a model trained at that smaller size from scratch. On Mamba, one of the most widely used state-space architectures, the method achieved training speedups of about 4x, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.
“You get the performance of the larger model, because you capture most of the complicated dynamics during the warm-up phase, then keep only the most useful states,” Chahine said. “The model remains able to perform at a higher level than one trained small from the start.”
What makes CompreSSM distinct from existing methods is its theoretical grounding. Conventional pruning techniques train a complete model and then strip away parameters after the fact, meaning you still pay the full computational cost of training the large model. Knowledge distillation, another popular approach, requires training a large “teacher” model to completion and then training a second, smaller “student” model to imitate it, essentially doubling the training effort. CompreSSM avoids both of these costs by making informed compression decisions mid-stream.
The team benchmarked CompreSSM head-to-head against both alternatives. Compared to Hankel nuclear norm regularization, a recently proposed spectral method for encouraging compact state-space models, CompreSSM was more than 40 times faster while also achieving higher accuracy. The regularization approach slowed training by around 16 times because it required expensive eigenvalue computations at every single gradient step, and even then the resulting models underperformed. Against knowledge distillation on CIFAR-10, CompreSSM held a clear advantage for highly compressed models: at smaller state dimensions, distilled models suffered large accuracy drops, while CompreSSM-compressed models maintained near-full performance. And because distillation requires a forward pass through both the teacher and the student at each training step, even its smaller student models trained more slowly than the full-sized baseline.
The researchers proved mathematically that the importance of individual model states changes smoothly throughout training, thanks to an application of Weyl’s theorem, and showed empirically that the relative rankings of these states remain stable. Together, these findings give practitioners confidence that dimensions identified as negligible early on will not suddenly become essential later.
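The stability argument can be stated informally. Assuming the relevant form is the matrix-perturbation version of Weyl’s inequality for singular values (an assumption on our part, not a detail given in the article), it reads:

```latex
% Weyl's inequality for singular values: a small perturbation E to a
% matrix A (here, a small gradient update to the model's state matrices)
% shifts every singular value by at most the norm of the perturbation.
\[
  \bigl| \sigma_i(A + E) - \sigma_i(A) \bigr| \;\le\; \|E\|_2
  \quad \text{for all } i .
\]
% So as long as successive training updates are small, the singular
% values, and hence the importance ranking of the states, drift slowly.
```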
The approach also comes with a practical safety net. If a compression step causes an unexpected performance drop, practitioners can revert to a previously saved checkpoint. “It gives people control over how much they’re willing to pay in terms of performance, instead of having to define a much less intuitive energy threshold,” Chahine explains.
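That safety net amounts to a simple compress-evaluate-rollback loop. The following is a hypothetical sketch, not the authors’ implementation; the function names, the dict-based “model,” and the tolerance value are all invented for illustration.

```python
import copy

def try_compress(model_state, compress_fn, evaluate_fn, tolerance=0.01):
    """Apply one compression step, but revert to the saved checkpoint
    if validation accuracy drops by more than `tolerance`."""
    checkpoint = copy.deepcopy(model_state)   # the safety net
    baseline = evaluate_fn(model_state)
    compressed = compress_fn(model_state)
    if evaluate_fn(compressed) < baseline - tolerance:
        return checkpoint                     # roll back to the checkpoint
    return compressed                         # accept the smaller model

# Toy usage: a "model" that is just a dict carrying its accuracy.
model = {"dims": 128, "acc": 0.90}
ok = try_compress(model, lambda m: {"dims": 12, "acc": 0.89},
                  lambda m: m["acc"])         # small drop: accepted
bad = try_compress(model, lambda m: {"dims": 4, "acc": 0.70},
                   lambda m: m["acc"])        # large drop: reverted
print(ok["dims"], bad["dims"])
```

The tolerance plays the role Chahine describes: a direct, intuitive cap on how much performance the practitioner is willing to trade away.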
There are a few practical limitations to the method. CompreSSM works best on models that exhibit a strong correlation between internal state dimension and performance, a property that varies across tasks and architectures. The approach is most effective on multi-input, multi-output (MIMO) models, where the relationship between state size and expressivity is strongest. For per-channel, single-input, single-output architectures, the gains are more modest, since those models are less sensitive to state-size changes in the first place.
The theory applies most cleanly to linear time-invariant systems, although the team has developed extensions for the increasingly popular input-dependent, time-varying architectures. And because the state-space model family extends to architectures like linear attention, a growing area of interest as an alternative to standard transformers, the potential scope of application is broad.
Chahine and his collaborators see the work as a stepping stone. The team has already demonstrated an extension to linear time-varying systems like Mamba, and future directions include pushing CompreSSM further into the matrix-valued dynamical systems used in linear attention mechanisms, which could bring the approach closer to the transformer architectures that underpin most of today’s largest AI systems.
“This needed to be the first step, because this is where the theory is clean and the method can remain principled,” Chahine says. “It’s the stepping stone to then extending to other architectures that people are using in industry today.”
“The work of Chahine and his colleagues gives an intriguing, theoretically grounded angle on compression for modern state-space models (SSMs),” said Antonio Orvieto, ELLIS Institute Tübingen principal investigator and MPI for Intelligent Systems independent group leader, who wasn’t involved in the research. “The approach presents evidence that the state dimension of these models can be effectively reduced during training, and that a control-theoretic perspective can efficiently support this process. The work opens new avenues for future studies, and the proposed algorithm has the potential to become a standard approach when pre-training large SSM-based models.”
The work, which was accepted as a conference paper at the International Conference on Learning Representations 2026, will be presented later this month. It was supported, in part, by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the U.S. Office of Naval Research.












