
Can We Trust Complex Statistical Models?
From public health studies to political polling and recommendation systems, mixed models are a key tool for analyzing complex data. They let researchers account for both fixed effects, such as age, income, or education, and random effects, such as differences between regions, institutions, or individual behaviors.
But the more detailed these models become, the harder they are to compute. In real-world settings, where data may involve thousands of groups or millions of observations, it becomes increasingly difficult not just to make predictions but to know how much we can trust those predictions.
In a recent study published in Biometrika, “Partially Factorized Variational Inference for High-Dimensional Mixed Models”, Omiros Papaspiliopoulos and Giacomo Zanella (Department of Decision Sciences, Bocconi), together with Max Goplerud (University of Texas at Austin), introduce a new method that offers more reliable results in less time, even for very large and complex models.
The pitfall of cutting corners
When working with such models, researchers often rely on shortcut methods to get results quickly. One of the most common is known as variational inference, which approximates the model’s full set of possibilities using a simplified version that is easier to compute.
However, the most popular version of this shortcut—called mean-field variational inference—has a major drawback: it underestimates uncertainty. That means it may suggest high confidence even when the model is actually unsure. The authors show that this problem gets worse as the model includes more data or more variables.
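A toy calculation makes the effect concrete. The sketch below is not the authors' code: it assumes a hypothetical Gaussian target with unit variances and a common correlation rho, and the helper names are illustrative. It relies on a standard fact: for a Gaussian target, the best fully factorized approximation has variances equal to the reciprocals of the diagonal of the precision matrix, and these never exceed the true variances.

```python
import numpy as np


def toy_covariance(rho, dim=2):
    """Hypothetical target: unit variances and a common correlation rho."""
    return (1.0 - rho) * np.eye(dim) + rho * np.ones((dim, dim))


def mean_field_variances(cov):
    """Variances of the optimal fully factorized approximation to N(0, cov).

    For a Gaussian target these equal 1 / diag(precision matrix).
    """
    precision = np.linalg.inv(cov)
    return 1.0 / np.diag(precision)


for rho in (0.0, 0.5, 0.9, 0.99):
    cov = toy_covariance(rho)
    true_var = np.diag(cov)[0]
    mf_var = mean_field_variances(cov)[0]
    print(f"rho={rho:4.2f}  true variance={true_var:.3f}  "
          f"mean-field variance={mf_var:.3f}")
```

In this toy setting the true variance stays at 1 while the mean-field variance shrinks toward zero as rho approaches 1: the stronger the dependence that the approximation ignores, the more overconfident it becomes.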
This isn't just a technical detail. Underestimating uncertainty can have real-world consequences: misleading policymakers, overpromising to investors, or pushing flawed decisions in medicine, marketing, or public planning.
A smarter approximation
To address this, the authors introduce partially factorized variational inference. Instead of simplifying the model by treating all variables as unrelated, their method keeps key connections intact—allowing for better approximations of both predictions and how confident we should be about them.
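Schematically, the contrast can be written as follows. The notation is an assumption made here for illustration, not a quotation from the paper: θ collects all model parameters, split into blocks θ_k, and C denotes a hypothetical "core" subset of blocks whose dependence with the rest is retained.

```latex
% Fully factorized (mean-field): every block is treated as independent.
q_{\mathrm{MF}}(\theta) = \prod_{k} q_k(\theta_k)

% Partially factorized (schematic): blocks outside the core C are
% independent of each other only conditionally on the core \theta_C.
q_{\mathrm{PF}}(\theta) = q(\theta_C) \prod_{k \notin C} q_k(\theta_k \mid \theta_C)
```

With an empty core the second form reduces to the first, so the partially factorized family can never do worse than mean-field at approximating the target.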
Even better, the method is fast. The authors' results show that when the approximation is more accurate, the algorithm used to compute it also converges faster. In other words, the better answers are not only more trustworthy; they also take less time to find.
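The link between accuracy and speed can be seen in a small experiment. The sketch below is an illustration under stated assumptions, not the authors' algorithm: it runs coordinate-ascent updates for the means of a fully factorized approximation to a hypothetical zero-mean bivariate Gaussian with correlation rho, and counts the sweeps needed to reach the optimum. The function name, starting point, and tolerance are all illustrative choices.

```python
import numpy as np


def cavi_sweeps(rho, tol=1e-8, max_sweeps=100_000):
    """Count coordinate-ascent sweeps until the mean-field means converge.

    Target (assumed for illustration): zero-mean bivariate Gaussian with
    unit variances and correlation rho. The optimal means are (0, 0).
    """
    precision = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    m = np.array([1.0, 1.0])  # arbitrary starting point away from the optimum
    for sweep in range(1, max_sweeps + 1):
        for i in range(2):
            j = 1 - i
            # Coordinate update: conditional mean of coordinate i given m[j].
            m[i] = -precision[i, j] * m[j] / precision[i, i]
        if np.max(np.abs(m)) < tol:
            return sweep
    return max_sweeps


for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho={rho:4.2f}  sweeps to converge: {cavi_sweeps(rho)}")
```

The sweep count grows sharply as rho approaches 1, which is exactly the regime where a fully factorized approximation is least accurate. An approximation that keeps the strong dependencies faces a better-conditioned problem, in terms of both accuracy and speed.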
Tested in the real world
The team tested their method on large-scale examples, including sophisticated models of voter turnout that consider geography, demographics, and education levels. Their technique consistently outperformed standard alternatives: it was both more accurate and more scalable.
They also released their method in an open-source package, making it freely available to analysts, researchers, and data scientists.
More data, better results
Perhaps the most surprising insight from the study is what the authors call the blessing of dimensionality. Unlike many statistical methods that struggle as data complexity increases, this new approach actually performs better when more groups, categories, or data points are added—if the model is structured in the right way. This happens because more data creates richer patterns, which the algorithm can exploit to improve accuracy.
This conclusion draws on concepts from graph theory, showing that well-connected data—where variables interact frequently across groups—naturally supports better estimation.
Practical advice
Of course, not every dataset behaves nicely. In certain hierarchical settings—such as school systems nested within districts—relationships between variables can complicate things. The authors offer a simple strategy: when modeling complex interactions, include the basic components (e.g. region, age, or gender) directly in the core of the approximation. This makes the calculations manageable without sacrificing reliability.
M. Goplerud, O. Papaspiliopoulos, G. Zanella, "Partially factorized variational inference for high-dimensional mixed models", Biometrika, Volume 112, Issue 2, 2025, DOI https://doi.org/10.1093/biomet/asae067