Research

Thank you for your interest in finding out more about my research!



Figure to the right: List of 20 most common keywords from my first 10 papers

My research develops methodologies in trustworthy machine learning (ML) and in predictive and prescriptive analytics, aiming to support the full pipeline of data-driven decision-making, primarily within healthcare and sustainability. In such high-stakes applications, the components of trustworthy analytics—interpretability and explainability, stability and robustness, fairness, and privacy—are often not just desirable but essential for the successful deployment of predictive and prescriptive models.

In each of these cases, traditional performance goals remain critical—healthcare models must still achieve high predictive accuracy, renewable energy investments must be economically viable, and vaccine distribution must optimize lives saved. Often, the solutions must also balance multiple dimensions of trustworthy analytics, such as combining the aforementioned requirements with interpretability. This gives rise to fundamental trade-offs.

I focus on both developing models that incorporate trustworthiness considerations a priori and on understanding, evaluating, and navigating the trade-offs among the components of trustworthy analytics a posteriori. This is challenging due to the inherent complexities of modern data science problems, which involve variability in data distribution, large volumes of data and decisions, and the high velocity of dynamically arriving data—collectively known as “V-complexity.” To address these challenges, I leverage the rich modeling power of optimization paradigms such as mixed-integer (MIO), robust (RO), and multi-objective (MOO) optimization, and I design tailored algorithms that capture problem-specific structures like temporal or spatial variability and sparsity.

In what follows, I lay out my research program, focusing on each component of trustworthy ML separately. While each aspect of my research tends to emphasize a specific dimension of trustworthy ML—such as stability in slowly varying regression under sparsity or fairness in vaccine distribution—the reality is that these dimensions often coexist and interact in ways that are not necessarily competitive. For example, the stability achieved in slowly varying models also enhances interpretability, as the smooth variations in model coefficients across temporal or spatial structures make the models easier to understand and trust. To fully leverage these synergies, my ongoing work studies these dimensions in conjunction.

Stability




Stability is important both for ensuring consistent performance and for enhancing interpretability: significant changes in a model’s structure or the resulting analytical insights can lead to skepticism and hesitation to adopt the model (Babic et al., 2021). My research has focused on structural stability—requiring that the model's structure remain stable in the face of "small changes" in the data—which is a stronger condition than the more commonly used notion of semantic stability, which requires only stable predictions (Dwyer and Holte, 2007). I have developed methodologies that promote stability across various dimensions, including time, space, and retraining iterations. For example, the important features for predicting a building's energy consumption may change depending on the hour of the day but are likely to do so in a "smooth" way over time; the same holds true for air quality prediction models over adjacent spatial areas.

The framework of slowly varying regression under sparsity (Bertsimas et al., 2024) enables the rigorous estimation of sparse linear regression models where the underlying regression coefficients vary slowly and sparsely under some temporal, spatial, or any graph-based structure. Our formulation minimizes the total prediction error across all time periods or spatial areas, and includes both standard regularization considerations (for robustness and sparsity purposes) and new ones to prevent sharp variations in coefficients. A key theoretical contribution is an exact binary convex reformulation of the problem using a new equality on Moore-Penrose inverses that convexifies the non-convex objective function while coinciding with the original objective on all feasible binary points; this approach also extends to a more general class of sparse quadratic problems. From an algorithmic standpoint, we developed a cutting plane-type algorithm that efficiently solves the reformulated convex problem to optimality, a fast heuristic alternative that produces high-quality solutions, and an effective hyperparameter tuning strategy with provable recovery guarantees; we make our Julia implementation available open-source. Our computational studies demonstrate that this approach outperforms existing methods across multiple metrics in numerous applications, including energy consumption prediction, air quality monitoring, and preventive maintenance (Bertsimas et al., 2020).
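To make the objective concrete, here is a minimal sketch of the smoothness-regularized estimation problem. It drops the combinatorial sparsity constraint (the hard part the paper's binary convex reformulation addresses) and keeps only the ridge and slow-variation penalties, which admit a closed-form solution via a single block-tridiagonal linear system; it is an illustration of the modeling idea, not the paper's exact formulation:

```python
import numpy as np

def slowly_varying_regression(Xs, ys, lam=0.1, mu=1.0):
    """Ridge-regularized regression with a smoothness penalty tying
    consecutive coefficient vectors together. Solves, in closed form,

        min_b  sum_t ||y_t - X_t b_t||^2 + lam * sum_t ||b_t||^2
                + mu * sum_{t>1} ||b_t - b_{t-1}||^2

    via one block-tridiagonal linear system (sparsity constraint omitted)."""
    T, p = len(Xs), Xs[0].shape[1]
    A = np.zeros((T * p, T * p))
    rhs = np.zeros(T * p)
    for t in range(T):
        sl = slice(t * p, (t + 1) * p)
        deg = (t > 0) + (t < T - 1)  # number of temporal neighbors of period t
        A[sl, sl] = Xs[t].T @ Xs[t] + lam * np.eye(p) + mu * deg * np.eye(p)
        if t > 0:
            prev = slice((t - 1) * p, t * p)
            A[sl, prev] = -mu * np.eye(p)  # coupling with the previous period
            A[prev, sl] = -mu * np.eye(p)
        rhs[sl] = Xs[t].T @ ys[t]
    return np.linalg.solve(A, rhs).reshape(T, p)
```

Larger `mu` pulls consecutive coefficient vectors closer together, trading a little per-period fit for smoother (and hence more interpretable) coefficient paths.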

Assessing stability in decision trees presents new challenges compared to regression models. In Bertsimas and Digalakis Jr. (2023), we introduce a novel distance metric for decision trees and use it to determine a tree's level of stability. Building on this metric, we develop a methodology to train stable decision trees and identify trade-offs between stability, predictive power, and interpretability. We release our decision tree distance and stability calculator as the open-source Python package "dt-distance." Beyond regression and decision trees, in Bertsimas et al. (2024), we propose a holistic framework for stabilizing general ML models during retraining. We develop an MIO algorithm that tackles the retraining problem across different data batch updates in a global fashion. By incorporating custom-defined distance metrics directly into the optimization process, our method ensures consistent analytical insights, which are essential for model interpretability, ease of implementation, and fostering trust with users. We also provide theoretical guarantees on the Pareto optimality and generalization of the solutions obtained by our methodology. In both cases, through real-world production case studies in healthcare, we find that a minimal and controllable impact on predictive power can yield significant improvements in model stability.
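To give a flavor of what a tree distance can look like, here is a toy structural distance between two trees, each represented by a dict of its internal splits. It compares which features the trees split on and, for shared features, how far apart the thresholds are. This is a hypothetical illustration only; the metric in the paper (and in the `dt-distance` package) is substantially more refined:

```python
# Toy structural distance between two decision trees, each given as a
# dict {node_id: (feature_index, threshold)} of internal splits.
# Illustrative proxy only, NOT the metric from the paper.

def tree_distance(splits_a, splits_b, threshold_scale=1.0):
    feats_a = {f for f, _ in splits_a.values()}
    feats_b = {f for f, _ in splits_b.values()}
    # Structural term: features used by one tree but not the other.
    structural = len(feats_a ^ feats_b)
    # Threshold term: for shared features, compare mean split values.
    thr = 0.0
    for f in feats_a & feats_b:
        ta = [t for ft, t in splits_a.values() if ft == f]
        tb = [t for ft, t in splits_b.values() if ft == f]
        thr += abs(sum(ta) / len(ta) - sum(tb) / len(tb)) / threshold_scale
    return structural + thr
```

A distance of zero means the two trees use the same features with the same (average) thresholds; a retraining procedure could then penalize this distance to keep successive trees close.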

Robustness




Robustness is another critical component of trustworthy analytics, especially when it comes to dealing with "unknown unknowns," such as climate change. In Bertsimas et al. (2024), we develop a multi-period robust optimization (RO) framework to assist OCP, a major phosphate mining and fertilizer company that accounts for 5.6% of Morocco's GDP, in their goal of becoming carbon neutral by 2040. Our approach guides both strategic decisions on the placement of solar panels and batteries (worth $2 billion) and the corresponding hourly operational decisions over a 20-year horizon. We reduce their reliance on non-green energy by up to 95% while generating a net present value of $500 million. To handle the problem's large dimensionality, we introduce a novel machine learning-based scenario reduction technique, allowing planning based on typical days rather than the entire horizon. To guard against uncertainties in solar generation forecasts and climate change, we are the first to combine RO with distributionally robust optimization. Our framework is now in use by OCP's senior management. Currently, we study the integration of wind power and the broader interplay between renewable energy sources for industrial decarbonization beyond OCP.
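The "typical days" idea behind the scenario reduction can be illustrated with a minimal clustering routine: group daily generation profiles into a few representative days and weight each by how often it occurs, so that planning can be done over the representatives rather than the full horizon. The sketch below uses plain k-means with a deterministic farthest-point initialization; the ML-based reduction technique in the paper is more sophisticated:

```python
import numpy as np

def typical_days(daily_profiles, k=3, iters=50):
    """Cluster daily (e.g., 24-hour solar) profiles into k 'typical days'.
    Returns representative profiles, each day's cluster label, and the
    weight (relative frequency) of each typical day."""
    X = np.asarray(daily_profiles, dtype=float)
    # Deterministic farthest-point initialization.
    centroids = [X[0].copy()]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[int(dists.argmax())].copy())
    centroids = np.array(centroids)
    # Standard Lloyd iterations.
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    weights = np.bincount(labels, minlength=k) / len(X)
    return centroids, labels, weights
```

The weights let a multi-period model charge each typical day's operational cost in proportion to how frequently that kind of day occurs over the planning horizon.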

Mixed-integer optimization (MIO) can naturally represent interpretable machine learning models, such as sparse regression and decision trees, and often improves their predictive power while certifying the optimality of the fit to the training data. Over the past decade, MIO-based methods for interpretable supervised ML, e.g., sparse regression, have scaled from being intractable beyond a few hundred features to handling hundreds of thousands. Nevertheless, applications in, e.g., genomics and medical imaging commonly involve ultra-high-dimensional datasets with millions of features. The backbone method (Bertsimas et al., 2022) is a general framework to scale interpretable ML techniques to ultra-high-dimensional datasets. The proposed framework operates in two phases: first, we extract a "backbone set" of potentially relevant features by solving a number of specially chosen, tractable subproblems; then, we use traditional techniques to solve a reduced problem to optimality or near-optimality, considering only the backbone features. For sparse linear regression, our theoretical analysis shows that, under certain conditions and with high probability, the backbone set consists only of the truly relevant features. Our method solves, in minutes, sparse regression problems with tens of millions of features and decision tree problems with hundreds of thousands of features, and competes with state-of-the-art methods without sacrificing the interpretability of the learned model. In follow-up work (Digalakis et al., 2023), we release backbone-learn, an open-source Python package and framework for scaling general MIO problems with indicator variables to high-dimensional problems; in the future, I plan to turn this into a hub for MIO-based interpretable ML.
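A minimal sketch of the two-phase idea, using block decomposition and simple correlation screening as a stand-in for the specially chosen, tractable subproblems of the paper (the real backbone framework is far more general and comes with recovery guarantees):

```python
import numpy as np

def backbone_sparse_regression(X, y, k_per_block=5, block_size=50):
    """Two-phase 'backbone' sketch for sparse regression on wide data.
    Phase 1: split the features into blocks, solve a cheap screening
    subproblem per block, and take the union of the survivors as the
    backbone set. Phase 2: fit a model on backbone features only.
    (Correlation screening + least squares are illustrative stand-ins
    for the paper's subproblems and final MIO solve.)"""
    n, p = X.shape
    backbone = set()
    for start in range(0, p, block_size):
        block = np.arange(start, min(start + block_size, p))
        # Cheap subproblem: rank block features by |correlation with y|.
        scores = np.abs(X[:, block].T @ y)
        top = block[np.argsort(scores)[-k_per_block:]]
        backbone.update(top.tolist())
    backbone = sorted(backbone)
    # Phase 2: least squares restricted to the (much smaller) backbone.
    coef, *_ = np.linalg.lstsq(X[:, backbone], y, rcond=None)
    return backbone, coef
```

The point of the construction is that each subproblem is tractable regardless of the total feature count, and the final solve only sees the backbone, whose size is controlled by `k_per_block`.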

In streaming settings, massive amounts of data arrive dynamically with high velocity and must be processed in real-time, e.g., finding trending topics by counting queries to a search engine. Since the stream of queries (and even its frequency distribution) is too large to store, state-of-the-art techniques typically rely on random hashing (hence sacrificing both optimality and interpretability) to combine frequencies for different queries and so approximate the full distribution. In Bertsimas et al. (2021), we propose an MIO- and ML-based approach to hashing in this setting, learning patterns in the initial part of the stream that help avoid costly collisions between frequencies. From a technical standpoint, we develop exact (MIO- and dynamic programming-based) and heuristic (block coordinate descent) algorithms to learn an optimal (or near-optimal) hashing scheme from queries that appeared in the observed stream prefix, and use ML to hash unseen queries. From a conceptual standpoint, we show how MIO and ML can be of value in streaming settings, where their use has been scarce. Our method outperforms existing approaches by one to two orders of magnitude in approximation error per query on real-world search query data.
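The benefit of learning from the stream prefix can be sketched as follows: items that were heavy in the prefix get dedicated counters (so they never collide), while the long tail shares a small array of hashed buckets. This is a simplified stand-in for the learned hashing scheme in the paper, which optimizes the full bucket assignment via MIO:

```python
from collections import Counter

class LearnedHashCounter:
    """Frequency-estimation sketch: items heavy in an observed stream
    prefix get dedicated collision-free counters; everything else is
    hashed into a small shared bucket array. Simplified illustration,
    not the paper's MIO-optimized scheme."""

    def __init__(self, prefix, n_dedicated=10, n_buckets=100):
        heavy = [x for x, _ in Counter(prefix).most_common(n_dedicated)]
        self.dedicated = {x: 0 for x in heavy}
        self.buckets = [0] * n_buckets

    def update(self, item):
        if item in self.dedicated:
            self.dedicated[item] += 1          # exact count, no collisions
        else:
            self.buckets[hash(item) % len(self.buckets)] += 1

    def estimate(self, item):
        if item in self.dedicated:
            return self.dedicated[item]
        # Hashed estimates can only overcount (collisions add, never subtract).
        return self.buckets[hash(item) % len(self.buckets)]
```

Dedicated items are counted exactly, and hashed items are never undercounted, which mirrors the one-sided error of count-min-style sketches while spending the collision-free memory where the prefix says it matters most.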

Fairness




As vaccines became available during the later stages of the COVID-19 pandemic, ensuring equitable access to vaccination emerged as a critical challenge. The Federal Emergency Management Agency (FEMA) reached out to our group to develop a strategy for the fair distribution of vaccines across the U.S. To address this, we created an optimization model that minimizes future pandemic deaths while also prioritizing fairness in the allocation of limited vaccine supplies (Bertsimas et al., 2022). The model integrates an epidemiological model and relies on a scalable coordinate descent algorithm, allowing us to handle the problem’s large volume—initially involving millions of decision variables. The proposed approach increases the effectiveness of the vaccination campaign by an estimated 20%, saving an extra 4,000 lives in the U.S. over a three-month period, while achieving critical fairness objectives (by reducing the death toll in several states without hurting others) and being robust to uncertainties and forecast errors. Our conclusions influenced how FEMA allocated its vaccines, e.g., increasing its focus on Texas and Florida.
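The interaction between effectiveness and fairness can be illustrated with a toy allocation routine: every state first receives a fairness floor proportional to its population, and the remaining supply is allocated greedily to the state with the highest current marginal benefit. The diminishing-returns benefit model below is hypothetical; the paper optimizes over a full epidemiological model with a coordinate descent algorithm:

```python
def allocate_doses(populations, risk, supply, fairness_share=0.2, batch=1):
    """Toy dose allocation across states. A fixed share of supply is first
    split by population (fairness floor); the rest goes greedily to the
    state with the highest marginal benefit, modeled (hypothetically) as
    risk[s] * (1 - allocated[s] / populations[s]), i.e., diminishing returns.
    Illustration only, not the paper's epidemiological optimization."""
    n = len(populations)
    total_pop = sum(populations)
    alloc = [fairness_share * supply * populations[s] / total_pop for s in range(n)]
    remaining = supply - sum(alloc)

    def marginal(s):
        return risk[s] * max(0.0, 1.0 - alloc[s] / populations[s])

    while remaining > 1e-9:
        b = min(batch, remaining)
        s = max(range(n), key=marginal)  # best marginal benefit right now
        alloc[s] += b
        remaining -= b
    return alloc
```

Raising `fairness_share` shrinks the greedily allocated portion, directly exposing the trade-off between total benefit and equitable coverage.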

Privacy




In response to the need for rigorous privacy guarantees, I have focused on developing differentially private algorithms, ensuring that any analysis conducted on sensitive datasets does not leak information about the individuals whose data are contained therein. In Digalakis et al. (2021), we address the problem of estimating the density of a stream of users, i.e., the fraction of all users who actually appear in the stream, while maintaining one of the strongest privacy standards: user-level pan-privacy. This standard protects user privacy even against adversaries who might occasionally observe the internal state of the algorithm. By optimally utilizing the allocated "privacy budget," we demonstrate superior performance, both theoretically and experimentally, compared to conventional sampling methods. In the early stages of the COVID-19 pandemic, our research group at MIT worked to develop both epidemiological and risk assessment models for the disease (Bertsimas et al., 2021). My contributions initially focused on the curation of a database of clinical studies for use in training our ML models. One particularly interesting strand of research, given the sensitivity of patient-level data, focused on an optimization-based approach for synthetic data generation that matched the aggregate statistics reported in published studies. I wish to pursue such techniques further in the future: such synthetic data release approaches can, under certain conditions, be shown to preserve privacy.
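The classical mechanism behind user-level pan-private density estimation (after Dwork et al.) can be sketched in a few lines: the internal state stores one noisy bit per candidate user, initialized from the "absent" distribution and redrawn from a slightly biased "present" distribution when the user appears, so inspecting the state reveals almost nothing about any individual. The version below is a simplified illustration, not the budget-optimal algorithm of our paper:

```python
import random

class PanPrivateDensity:
    """User-level pan-private density estimation sketch. The state holds
    one noisy bit per candidate user: Bernoulli(1/2) if never seen,
    redrawn from Bernoulli(1/2 + eps/4) on arrival (valid for small eps).
    Even an adversary inspecting the state cannot reliably tell whether
    a given user appeared. Simplified illustration of the classical
    mechanism, not the paper's exact algorithm."""

    def __init__(self, universe, epsilon, seed=None):
        self.eps = epsilon
        self.rng = random.Random(seed)
        self.bits = {u: self.rng.random() < 0.5 for u in universe}

    def observe(self, user):
        # Redraw this user's bit from the biased 'present' distribution.
        self.bits[user] = self.rng.random() < 0.5 + self.eps / 4

    def estimate(self):
        # Debias: E[avg] = 1/2 + density * eps/4.
        avg = sum(self.bits.values()) / len(self.bits)
        return (avg - 0.5) * 4 / self.eps
```

Because each bit is noisy whether or not its user appeared, the state looks nearly identical under neighboring streams, while averaging over a large universe still yields an accurate density estimate.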