Can We Trust Complex Statistical Models?

24 Jun 2025, by Andrea Costa

A new method improves the reliability and speed of predictions in large-scale data analysis

From public health studies to political polling and recommendation systems, mixed models are a key tool for analyzing complex data. They help researchers consider both fixed factors—like age, income or education—and random effects, such as differences between regions, institutions, or individual behaviors.

But the more detailed these models become, the harder they are to compute. In real-world settings, where data may involve thousands of groups or millions of observations, it becomes increasingly difficult not just to make predictions but to know how much we can trust those predictions.

A recent study published in Biometrika by Omiros Papaspiliopoulos and Giacomo Zanella (Department of Decision Sciences, Bocconi) with Max Goplerud (University of Texas at Austin), “Partially Factorized Variational Inference for High-Dimensional Mixed Models”, introduces a new method that offers more reliable results in less time, even for very large and complex models.

The pitfall of cutting corners

When working with such models, researchers often rely on shortcut methods to get results quickly. One of the most common is known as variational inference, which approximates the model's posterior distribution (the full range of parameter values consistent with the data) with a simpler distribution that is easier to compute.

However, the most popular version of this shortcut—called mean-field variational inference—has a major drawback: it underestimates uncertainty. That means it may suggest high confidence even when the model is actually unsure. The authors show that this problem gets worse as the model includes more data or more variables.
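The effect can be seen in a toy case that is not taken from the study: a two-variable Gaussian with correlation rho, where the best fully factorized Gaussian approximation (in the usual variational sense of minimizing the Kullback-Leibler divergence from the approximation to the target) has variances equal to the reciprocals of the diagonal of the precision matrix. The sketch below assumes that setting; the function name and the value of rho are illustrative only.

```python
import numpy as np

def mean_field_variances(cov):
    """Variances of the optimal fully factorized Gaussian approximation
    to a Gaussian target with covariance `cov` (toy setting, an assumption)."""
    precision = np.linalg.inv(cov)
    # For a Gaussian target, each factor has variance 1 / precision[i, i].
    return 1.0 / np.diag(precision)

rho = 0.9  # illustrative correlation between the two variables
cov = np.array([[1.0, rho],
                [rho, 1.0]])

true_var = np.diag(cov)               # exact marginal variances
mf_var = mean_field_variances(cov)    # what mean-field reports

print("true variances:      ", true_var)
print("mean-field variances:", mf_var)
```

In this toy case the reported variance is 1 - rho^2, so the stronger the dependence between the two variables, the more overconfident the factorized approximation becomes.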

This isn't just a technical detail. Underestimating uncertainty can have real-world consequences: misleading policymakers, overpromising to investors, or pushing flawed decisions in medicine, marketing, or public planning.

A smarter approximation

To address this, the authors introduce partially factorized variational inference. Instead of simplifying the model by treating all variables as unrelated, their method keeps key connections intact—allowing for better approximations of both predictions and how confident we should be about them.
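One way to picture what keeping key connections intact can buy is a toy random-intercept model, y = mu + b_k + noise, with a Gaussian posterior over the global mean mu and the group effects b_k. The sketch below is an illustration under stated assumptions, not the authors' algorithm or their package: it assumes known variances, and it assumes the approximation takes the form q(core) times independent factors q(b_k | core), with mu chosen as the core. All function names and numbers are hypothetical.

```python
import numpy as np

def posterior_precision(group_sizes, noise_var=1.0, effect_var=1.0, mean_prior_var=100.0):
    """Posterior precision of (mu, b_1..b_K) for the toy model y_kj = mu + b_k + noise."""
    n = np.asarray(group_sizes, dtype=float)
    K = n.size
    Q = np.zeros((K + 1, K + 1))
    Q[0, 0] = n.sum() / noise_var + 1.0 / mean_prior_var
    Q[0, 1:] = Q[1:, 0] = n / noise_var
    Q[1:, 1:] = np.diag(n / noise_var + 1.0 / effect_var)
    return Q

def fully_factorized_variances(Q):
    """Mean-field: every coordinate is treated as independent of the others."""
    return 1.0 / np.diag(Q)

def partially_factorized_variances(Q, core):
    """Marginal variances under q(core) * prod_k q(rest_k | core) for a Gaussian target.

    Assumed form of the family: the core block keeps its exact marginal, and
    the remaining coordinates are independent only conditionally on the core.
    """
    core = np.asarray(core)
    rest = np.setdiff1d(np.arange(Q.shape[0]), core)
    cov = np.linalg.inv(Q)
    Q_rr = Q[np.ix_(rest, rest)]
    # Conditional mean of the rest given the core is A @ core.
    A = -np.linalg.solve(Q_rr, Q[np.ix_(rest, core)])
    var = np.empty(Q.shape[0])
    var[core] = np.diag(cov)[core]
    # Variance carried through the core, plus the conditional mean-field variance.
    var[rest] = np.einsum("ij,jk,ik->i", A, cov[np.ix_(core, core)], A) + 1.0 / np.diag(Q_rr)
    return var

Q = posterior_precision([5, 10, 20])   # three groups of illustrative sizes
exact = np.diag(np.linalg.inv(Q))
mean_field = fully_factorized_variances(Q)
partial = partially_factorized_variances(Q, core=[0])

print("exact:     ", exact)
print("mean-field:", mean_field)
print("partial:   ", partial)
```

In this toy model the group effects are independent once mu is known, so conditioning on mu is enough for the partially factorized variances to match the exact ones, while the fully factorized variances come out too small. Real mixed models with several crossed factors are not this clean, which is the case the study addresses.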

Even better: the method is fast. Their results show that when the approximation is more accurate, the algorithm used to compute it also converges faster. In other words, the best answers don’t just feel more trustworthy: they actually take less time to find.

Tested in the real world

The team tested their method on large-scale examples, including sophisticated models of voter turnout that consider geography, demographics, and education levels. Their technique consistently outperformed standard alternatives: it was both more accurate and more scalable.

They also released their method in an open-source package, making it freely available to analysts, researchers, and data scientists.

More data, better results

Perhaps the most surprising insight from the study is what the authors call the blessing of dimensionality. Many statistical methods struggle as data complexity increases, but this approach actually performs better when more groups, categories, or data points are added, provided the model is structured in the right way. More data creates richer patterns of overlap between groups, which the algorithm can exploit to improve accuracy.

This conclusion draws on concepts from graph theory, showing that well-connected data—where variables interact frequently across groups—naturally supports better estimation.

Practical advice

Of course, not every dataset behaves nicely. In certain hierarchical settings—such as school systems nested within districts—relationships between variables can complicate things. The authors offer a simple strategy: when modeling complex interactions, include the basic components (e.g. region, age, or gender) directly in the core of the approximation. This makes the calculations manageable without sacrificing reliability.

M. Goplerud, O. Papaspiliopoulos, G. Zanella, "Partially factorized variational inference for high-dimensional mixed models", Biometrika, Volume 112, Issue 2, 2025, DOI https://doi.org/10.1093/biomet/asae067

OMIROS PAPASPILIOPOULOS

Bocconi University

Department of Decision Sciences

Full Professor

GIACOMO ZANELLA

Bocconi University

Department of Decision Sciences

Associate Professor
