Research

For a list of publications and preprints, please visit my Google Scholar page.

Slowly Varying Machine Learning with Applications to Sustainability and Healthcare

We consider data science problems where the underlying machine learning model varies slowly and sparsely according to a graph-based temporal or spatial structure. Using mixed-integer optimization techniques, we develop interpretable machine learning models that capture this structure, with an emphasis on real-world applications in sustainability and healthcare.

Slowly Varying Regression under Sparsity

with Dimitris Bertsimas, Michael Li, Omar Skali-Lami
Major Revision in Operations Research. Preprint. Code available at https://github.com/vvdigalakis/SSVRegression.git.

We introduce the framework of slowly varying regression under sparsity, which allows sparse regression models to vary slowly and sparsely. We formulate the problem of parameter estimation as a mixed-integer optimization problem and demonstrate that it can be reformulated exactly as a binary convex optimization problem through a novel exact relaxation. The relaxation utilizes a new equality on Moore-Penrose inverses that convexifies the non-convex objective function while coinciding with the original objective on all feasible binary points. This allows us to solve the problem significantly more efficiently and to provable optimality using a cutting-plane-type algorithm. We develop a highly optimized implementation of such an algorithm, which substantially improves upon the asymptotic computational complexity of a straightforward implementation. We further develop a fast heuristic method that is guaranteed to produce a feasible solution and, as we empirically illustrate, generates high-quality warm-start solutions for the binary optimization problem. We show, on both synthetic and real-world datasets, that the resulting algorithm outperforms competing formulations in comparable times across a variety of metrics, including estimation accuracy, predictive power, and computational time, and is highly scalable, enabling us to train models with tens of thousands of parameters. We make our implementation available open-source at https://github.com/vvdigalakis/SSVRegression.git.
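
For intuition, and using a temporal chain for concreteness (the framework handles general graph structures, and the exact mixed-integer formulation is given in the paper), the estimation problem has roughly the following flavor:

    \min_{\beta_1,\dots,\beta_T} \;\; \sum_{t=1}^{T} \| y_t - X_t \beta_t \|_2^2 \;+\; \lambda \sum_{t=2}^{T} \| \beta_t - \beta_{t-1} \|_2^2
    \text{s.t.} \quad \|\beta_t\|_0 \le k \;\; \forall t, \qquad \sum_{t=2}^{T} \| \beta_t - \beta_{t-1} \|_0 \le s,

where k caps each period's sparsity, s caps the total number of coefficient changes across periods (sparse variation), and lambda controls how quickly the coefficients are allowed to drift (slow variation).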

Improving Stability in Decision Tree Models

with Dimitris Bertsimas
Under review. Preprint.

Owing to their inherently interpretable structure, decision trees are commonly used in applications where interpretability is essential. Recent work has focused on improving various aspects of decision trees, including their predictive power and robustness; however, their instability, albeit well documented, has been addressed to a lesser extent. In this paper, we take a step towards stabilizing decision tree models through the lens of real-world healthcare applications, where both stability and interpretability are particularly relevant. We introduce a new distance metric for decision trees and use it to determine a tree's level of stability. We propose a novel methodology to train stable decision trees and investigate the trade-offs that are inherent to decision tree models, including between stability, predictive power, and interpretability. We demonstrate the value of the proposed methodology through an extensive quantitative and qualitative analysis of six case studies from real-world healthcare applications, and we show that, on average, a small 4.6% decrease in predictive power buys a significant 38% improvement in the model's stability.
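
As a rough, self-contained illustration of the kind of stability question involved (a simplified proxy based on prediction disagreement, not the tree-specific distance metric proposed in the paper), one could retrain a tree on bootstrap resamples of the data and measure how much the resulting models disagree:

    # Illustrative stability check: retrain a decision tree on bootstrap resamples
    # and measure average pairwise disagreement between the trees' predictions.
    # This is a simplified proxy, not the tree distance metric from the paper.
    import numpy as np
    from itertools import combinations
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    X, y = load_breast_cancer(return_X_y=True)

    trees = []
    for seed in range(10):
        Xb, yb = resample(X, y, random_state=seed)   # perturb the training data
        trees.append(DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xb, yb))

    # Lower average disagreement on a common evaluation set = more stable model.
    disagreement = np.mean([np.mean(t1.predict(X) != t2.predict(X))
                            for t1, t2 in combinations(trees, 2)])
    print(f"mean pairwise prediction disagreement: {disagreement:.3f}")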

Slowly Varying Machine Learning

with Dimitris Bertsimas, Yu Ma, Phevos Paschalidis
In preparation.

We extend the slowly varying framework beyond sparse regression.

Optimization and Machine Learning for Sustainable Operations

In collaboration with industry partners, we develop robust optimization and machine learning methodologies to promote sustainability in industrial settings. For example, we develop an optimization framework to facilitate industrial decarbonization via investment in renewable energy; and we use machine learning and optimization to improve the efficiency of maintenance operations.

Decarbonizing OCP: Towards Industrial Decarbonization via Robust Solar Capacity Expansion

with Dimitris Bertsimas, Ryan Cory-Wright
Minor Revision in M&SOM. Preprint.

2023 M&SOM Practice-Based Research Competition - Finalist.

2023 MIT ORC Best Student Paper Competition - Honorable Mention.

2022 MIMO Symposium - Cognex Prize, Most Impactful Project Award.

We present our collaboration with the OCP Group, one of the world's largest producers of phosphate and phosphate-based products, in support of a green initiative designed to reduce OCP's carbon emissions significantly. We study the problem of decarbonizing OCP's electricity supply by installing a mixture of solar panels and batteries to minimize its time-discounted investment cost plus the cost of satisfying its remaining demand via the Moroccan national grid. OCP is currently designing its renewable investment strategy, using insights gleaned from our optimization model, and has pledged to invest 130 billion MAD (12.3 billion USD) in a green initiative by 2027, a subset of which involves decarbonization.


We immunize our model against deviations between forecast and realized solar generation output via a combination of robust and distributionally robust optimization. To account for variability in daily solar generation, we propose a data-driven robust optimization approach that prevents excessive conservatism by averaging across uncertainty sets. To protect against variability in seasonal weather patterns induced by climate change, we invoke distributionally robust optimization techniques. Under a ten billion MAD (approx. one billion USD) investment by OCP, the proposed methodology reduces the carbon emissions which arise from OCP's energy needs by more than 70% while generating a net present value (NPV) of five billion MAD over a twenty-year planning horizon. Moreover, a twenty billion MAD investment induces a 90% reduction in carbon emissions and generates an NPV of around two billion MAD.


To fulfill the Paris climate agreement, rapidly decarbonizing the global economy in a financially sustainable fashion is imperative. Accordingly, this work develops a robust optimization methodology that enables OCP to decarbonize at a profit by purchasing solar panels and batteries. Moreover, the methodology could be applied to decarbonize other industrial consumers. Indeed, our approach suggests that decarbonization's profitability depends on solar capacity factors, energy prices, and borrowing costs.
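
For intuition, a stripped-down deterministic core of such a solar-plus-storage capacity expansion model (illustrative notation only; the paper's formulation additionally models battery dynamics and efficiency, discounting, and the robust and distributionally robust counterparts described above) can be sketched as:

    \min_{x,\, b,\, g,\, s \,\ge\, 0} \;\; c_{\mathrm{pv}}\, x + c_{\mathrm{bat}}\, b + \sum_{t=1}^{T} p_t\, g_t
    \text{s.t.} \quad \rho_t\, x + g_t + s_{t-1} - s_t \ge d_t, \qquad 0 \le s_t \le b, \qquad \forall\, t,

where x and b are the installed solar and battery capacities, g_t the energy purchased from the grid at price p_t, s_t the battery state of charge, d_t the electricity demand, and rho_t the uncertain solar capacity factor; robustness is obtained by requiring feasibility for every realization of rho in a data-driven uncertainty set.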

Machine Learning for Preventive Maintenance at OCP

with Dimitris Bertsimas, Omar Skali-Lami
In preparation.

We propose a machine learning-based framework to improve the efficiency of maintenance operations, from description to prediction to intervention.

Optimization and Machine Learning for Improved Healthcare Operations

We develop analytics and optimization methodologies to improve healthcare operations, with a focus on the COVID-19 pandemic. For example, in collaboration with FEMA, we address the questions of where to locate COVID-19 mass vaccination sites and how to distribute vaccines in the US using a data-driven approach.


Where to locate COVID-19 mass vaccination facilities?

with Dimitris Bertsimas, Alexandre Jacquillat, Michael Li, Alessandro Previero
Published in Naval Research Logistics, 2021. Paper

We address the questions of where to locate COVID-19 mass vaccination sites and how to distribute vaccines in the US using a data-driven approach.

The outbreak of COVID-19 led to a record-breaking race to develop a vaccine. However, the limited vaccine capacity creates another massive challenge: how to distribute vaccines to mitigate the near-end impact of the pandemic? In the United States in particular, the new Biden administration is launching mass vaccination sites across the country, raising the obvious question of where to locate these clinics to maximize the effectiveness of the vaccination campaign. This paper tackles this question with a novel data-driven approach to optimize COVID-19 vaccine distribution. We first augment a state-of-the-art epidemiological model, called DELPHI, to capture the effects of vaccinations and the variability in mortality rates across age groups. We then integrate this predictive model into a prescriptive model to optimize the location of vaccination sites and subsequent vaccine allocation. The model is formulated as a bilinear, nonconvex optimization model. To solve it, we propose a coordinate descent algorithm that iterates between optimizing vaccine distribution and simulating the dynamics of the pandemic. As compared to benchmarks based on demographic and epidemiological information, the proposed optimization approach increases the effectiveness of the vaccination campaign by an estimated 20%, saving an extra 4,000 lives in the United States over a 3-month period. The proposed solution achieves critical fairness objectives—by reducing the death toll of the pandemic in several states without hurting others—and is highly robust to uncertainties and forecast errors—by achieving similar benefits under a vast range of perturbations.
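
To convey the flavor of the alternation at the heart of the algorithm (a toy sketch only: simple SIR dynamics stand in for DELPHI, the facility location decisions and the bilinear formulation are dropped, and every number is hypothetical), one can alternate between simulating the epidemic under a candidate allocation and improving that allocation one region at a time:

    # Toy coordinate-descent-style loop: alternate between simulating epidemic
    # dynamics under a fixed vaccine allocation and improving the allocation one
    # region at a time. Simple SIR dynamics stand in for DELPHI; all numbers are
    # hypothetical and for illustration only.
    import numpy as np

    K, T = 5, 90                                         # regions, days
    pop = np.array([5e6, 3e6, 2e6, 1e6, 8e5])
    beta = np.array([0.22, 0.25, 0.20, 0.28, 0.24])      # contact rates
    gamma = 0.1                                          # recovery rate
    ifr = np.array([0.004, 0.006, 0.005, 0.008, 0.007])  # fatality rates
    daily_doses = 50_000

    def total_deaths(alloc):
        """Simulated deaths over the horizon when region k receives share alloc[k]."""
        S, I, deaths = pop * 0.99, pop * 0.01, 0.0
        for _ in range(T):
            new_inf = beta * S * I / pop
            S = np.maximum(S - new_inf - alloc * daily_doses, 0.0)
            I = I + new_inf - gamma * I
            deaths += ifr @ (gamma * I)
        return deaths

    alloc = np.ones(K) / K                               # start from a uniform split
    for _ in range(10):                                  # sweeps of coordinate descent
        for k in range(K):
            best = alloc
            for share in np.linspace(0.05, 0.6, 12):     # candidate shares for region k
                cand = alloc * (1 - share) / (1 - alloc[k] + 1e-9)
                cand[k] = share
                if total_deaths(cand) < total_deaths(best):
                    best = cand
            alloc = best
    print("allocation:", np.round(alloc, 3), " deaths:", round(total_deaths(alloc)))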

From predictions to prescriptions: A data-driven response to COVID-19

with Dimitris Bertsimas et al.
Published in Health Care Management Science, 2021. Paper

First place, 2020 INFORMS Health Applications Society Pierskalla Paper Award.

We design analytics-based tools to combat the COVID-19 pandemic.

The COVID-19 pandemic has created unprecedented challenges worldwide. Strained healthcare providers make difficult decisions on patient triage, treatment and care management on a daily basis. Policy makers have imposed social distancing measures to slow the disease, at a steep economic price. We design analytical tools to support these decisions and combat the pandemic. Specifically, we propose a comprehensive data-driven approach to understand the clinical characteristics of COVID-19, predict its mortality, forecast its evolution, and ultimately alleviate its impact. By leveraging cohort-level clinical data, patient-level hospital data, and census-level epidemiological data, we develop an integrated four-step approach, combining descriptive, predictive and prescriptive analytics. First, we aggregate hundreds of clinical studies into the most comprehensive database on COVID-19 to paint a new macroscopic picture of the disease. Second, we build personalized calculators to predict the risk of infection and mortality as a function of demographics, symptoms, comorbidities, and lab values. Third, we develop a novel epidemiological model to project the pandemic’s spread and inform social distancing policies. Fourth, we propose an optimization model to re-allocate ventilators and alleviate shortages. Our results have been used at the clinical level by several hospitals to triage patients, guide care management, plan ICU capacity, and re-distribute ventilators. At the policy level, they are currently supporting safe back-to-work policies at a major institution and vaccine trial location planning at Janssen Pharmaceuticals, and have been integrated into the US Centers for Disease Control and Prevention’s pandemic forecast.
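
To give a concrete flavor of the prescriptive step, a heavily simplified version of a ventilator re-allocation model (illustrative notation only; the model in the paper captures additional operational detail) could be written as:

    \min_{x,\, u \ge 0} \;\; \sum_{s,t} u_{s,t} \;+\; \lambda \sum_{s \neq s',\, t} x_{s,s',t}
    \text{s.t.} \quad v_{s,t} = v_{s,t-1} + \sum_{s'} \left( x_{s',s,t} - x_{s,s',t} \right), \qquad u_{s,t} \ge d_{s,t} - v_{s,t}, \qquad \forall\, s, t,

where v_{s,t} denotes the ventilators available in state s on day t, x_{s,s',t} the units transferred from s to s', d_{s,t} the projected demand, and u_{s,t} the resulting shortage, which is traded off against transfer effort via the weight lambda.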

Optimization and Machine Learning for Massive-Scale Analytics

In this thread of work, we develop analytics, optimization (especially mixed-integer optimization), and machine learning methodologies for massive-scale problems, that is, data science problems characterized by large volume (of data, decisions, or both) and high velocity (large amounts of data arriving dynamically and needing to be processed in real time).

The Backbone Method for Ultra-High Dimensional Sparse Machine Learning

with Dimitris Bertsimas
Published in Machine Learning, 2022. Paper

We present the backbone method, a general framework that enables sparse and interpretable supervised machine learning methods to scale to ultra-high dimensional problems, i.e., problems with a massive number of features. We solve sparse regression problems with 10^7 features in minutes and 10^8 features in hours, as well as decision tree problems with 10^5 features in minutes. The proposed method operates in two phases: we first determine the backbone set, consisting of potentially relevant features, by solving a number of tractable subproblems; then, we solve a reduced problem, considering only the backbone features. For the sparse regression problem, our theoretical analysis shows that, under certain assumptions and with high probability, the backbone set consists of the truly relevant features. Numerical experiments on both synthetic and real-world datasets demonstrate that our method outperforms or competes with state-of-the-art methods in ultra-high dimensional problems, and competes with optimal solutions in problems where exact methods scale, both in terms of recovering the truly relevant features and in its out-of-sample predictive performance.
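
A stylized, self-contained illustration of the two-phase idea (using lasso as the screening and final model purely for concreteness; the construction, guarantees, and supported methods in the paper differ) might look as follows:

    # Stylized two-phase "backbone"-style procedure: screen features by solving
    # many small subproblems, keep the union of selected features (the backbone),
    # then fit the sparse model on the reduced problem only. Lasso is used here
    # purely for concreteness; details differ from the method in the paper.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    n, p, chunk = 500, 5_000, 500
    X = rng.randn(n, p)
    coef_true = np.zeros(p)
    coef_true[:10] = 1.0                                  # 10 truly relevant features
    y = X @ coef_true + 0.1 * rng.randn(n)

    # Phase 1: every feature is screened in several random partitions of the columns.
    backbone = set()
    for _ in range(3):
        perm = rng.permutation(p)
        for block in perm.reshape(-1, chunk):
            fit = Lasso(alpha=0.5).fit(X[:, block], y)
            backbone.update(block[np.abs(fit.coef_) > 1e-6])
    backbone = np.array(sorted(backbone))

    # Phase 2: solve the (otherwise expensive) sparse problem on the backbone only.
    final = Lasso(alpha=0.1).fit(X[:, backbone], y)
    selected = backbone[np.abs(final.coef_) > 1e-6]
    print(f"backbone size: {backbone.size}, recovered features: {np.sort(selected)}")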

Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

with Dimitris Bertsimas
Published in IEEE Transactions on Knowledge and Data Engineering, 2021. Paper

We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high-quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
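
The intuition behind exploiting an observed prefix can be conveyed with a deliberately simplified scheme (dedicated counters for elements that are heavy in the prefix, random hashing for everything else); the paper instead computes near-optimal hashing schemes via mixed-integer optimization and learns how to place unseen elements:

    # Deliberately simplified illustration of exploiting an observed stream prefix:
    # elements that are frequent in the prefix get dedicated counters, while the
    # remaining elements share randomly hashed buckets. The paper instead computes
    # (near-)optimal hashing schemes via mixed-integer optimization.
    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    stream = rng.zipf(1.3, size=200_000) % 10_000         # skewed synthetic stream
    prefix = stream[:20_000]                              # observed prefix

    num_dedicated, num_shared = 100, 400
    heavy = [x for x, _ in Counter(prefix).most_common(num_dedicated)]
    slot = {x: i for i, x in enumerate(heavy)}            # learned dedicated slots

    def bucket(x):
        return slot[x] if x in slot else num_dedicated + hash(int(x)) % num_shared

    counters = np.zeros(num_dedicated + num_shared)
    for x in stream:
        counters[bucket(x)] += 1                          # maintain the sketch

    true_counts = Counter(stream)
    errors = [abs(counters[bucket(x)] - c) for x, c in true_counts.items()]
    print(f"average per-element estimation error: {np.mean(errors):.1f}")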

Improved Pan-Private Stream Density Estimation

with Minos Garofalakis, George Karystinos
2019. Preprint

In this work, we develop new differentially private algorithms to analyze streaming data. Specifically, we consider the problem of estimating the density of a stream of users (or, more generally, elements), which expresses the fraction of all users that actually appear in the stream. We focus on one of the strongest privacy guarantees for the streaming model, namely user-level pan-privacy, which ensures that the privacy of any user is protected, even against an adversary that, on rare occasions, observes the internal state of the algorithm. Our proposed algorithms make optimal use of the allocated privacy budget, are specially tailored to the streaming model, and hence outperform the conventional sampling-based approach both theoretically and experimentally.
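
As a rough, self-contained sketch of the kind of baseline involved (a sampling-based, user-level pan-private density estimator in the spirit of Dwork et al.; constants are illustrative, and this is not the improved estimator proposed in the paper):

    # Rough sketch of a sampling-based, user-level pan-private density estimator in
    # the spirit of Dwork et al.: the internal state stores only one noisy bit per
    # sampled user, so observing it reveals (almost) nothing about any individual.
    # Constants are illustrative; the paper improves on this type of baseline.
    import numpy as np

    rng = np.random.default_rng(1)
    universe, eps, m = 100_000, 1.0, 5_000
    sample = rng.choice(universe, size=m, replace=False)     # sampled user ids
    index = {int(u): i for i, u in enumerate(sample)}
    bits = rng.random(m) < 0.5                               # initially unbiased bits

    appearing = rng.choice(universe, size=30_000, replace=False)
    stream = rng.choice(appearing, size=200_000)             # users appear repeatedly

    for u in stream:
        i = index.get(int(u))
        if i is not None:
            bits[i] = rng.random() < 0.5 + eps / 4           # redraw with a slight bias

    density_estimate = (bits.mean() - 0.5) * 4 / eps
    print(f"estimated density: {density_estimate:.3f}  (true: {appearing.size / universe:.3f})")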


A preliminary version of the work presented in this paper was developed during my Diploma Thesis:

Data Analytics with Differential Privacy

Advisors: Minos Garofalakis, George Karystinos
2018. Thesis.

Differential privacy is the state-of-the-art definition of privacy, which “addresses the paradox of learning nothing about an individual while learning useful information about a population”. In other words, differential privacy guarantees that any analysis performed on a sensitive dataset leaks no information about the individuals whose data are contained therein. In this thesis, we develop differentially private algorithms to analyze distributed and streaming data. In Chapter 3, we develop and analyze algorithms to learn Bayesian networks in a distributed and differentially private fashion. In Chapter 4, we develop an improved pan-private stream density estimator.


