About Me

I lead the Home Personalization Engineering organization at Spotify, a 60-person organization responsible for generating, ranking and distributing personalized content recommendations across music, podcasts and audiobooks on the Spotify homepage for 640+ million listeners, in a way that maximizes listeners’ joy while supporting Spotify’s strategic needs. The 10 teams of Home Personalization Engineering innovate across a range of ML disciplines – contextual bandits, reinforcement learning, transformer-based architectures, multi-objective optimization & causal inference – as well as ML infrastructure, data engineering and backend systems at scale. In addition, I serve as the Strategy Lead of the Company Bet focused on content discovery on Spotify, which spans 120+ FTEs in engineering, product, data science and design across several organizations at Spotify.

Before joining Spotify, I was at Netflix, first driving causal recommendation initiatives and subsequently building and leading the Adaptive Experimentation team, a cross-functional team of researchers, engineers and data scientists focused on delivering new experimentation capabilities at Netflix (adaptive bandit-based online tests instead of A/B tests, valid inference from adaptive online tests, ML-based short-term proxy metrics for long-term online metrics, etc.). Prior to Netflix, I did my PhD on Reinforcement Learning and Causal Inference at Stanford, where I was advised by Benjamin Van Roy and Susan Athey. Before my PhD, I worked at Google Research on the design and deployment of large-scale optimization algorithms for Google Technical Infrastructure and the Google Ad Exchange.

I enjoy tennis, swimming, traveling around the world and exploring impressionist & surrealist art.

Research

Calibrated Recommendations as a Minimum-Cost Flow Problem

Himan Abdollahpouri, Zahra Nazari, Alex Gain, Clay Gibson, Maria Dimakopoulou, Jesse Anderton, Benjamin Carterette, Mounia Lalmas, Tony Jebara (WSDM 2023)

Calibration in recommender systems has recently gained significant attention. In the recommended list of items, calibration ensures that the various (past) areas of interest of a user are reflected with their corresponding proportions. For instance, if a user has watched, say, 80 romance movies and 20 action movies, then it is reasonable to expect the recommended list of movies to consist of about 80% romance and 20% action movies as well. Calibration is particularly important given that optimizing towards accuracy often leads to the user’s minority interests being dominated by their main interests, or by a few overall popular items, in the recommendations they receive. In this paper, we propose a novel approach based on the minimum-cost flow problem for generating calibrated recommendations. In a series of experiments using two publicly available datasets, we demonstrate the superior performance of our proposed approach compared to the state-of-the-art in generating relevant and calibrated recommendation lists.
[Paper]
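
To give a flavor of the formulation (this is a toy construction with made-up items, relevance scores and genre quotas, not the exact graph used in the paper), one can select a k-item slate whose genre mix tracks a user's 80/20 history by solving a small minimum-cost flow, e.g. with networkx:

```python
# Toy calibrated selection via min-cost flow: source -> genre nodes carry the
# genre quotas, genre -> item edges carry negative relevance as cost, items
# drain into the sink. All items, scores and quotas below are invented.
import networkx as nx

k = 5
quotas = {"romance": 4, "action": 1}          # roughly 80% / 20% of k
items = {                                      # item -> (genre, relevance)
    "r1": ("romance", 0.9), "r2": ("romance", 0.8), "r3": ("romance", 0.7),
    "r4": ("romance", 0.6), "r5": ("romance", 0.5),
    "a1": ("action", 0.95), "a2": ("action", 0.4), "a3": ("action", 0.3),
}

G = nx.DiGraph()
G.add_node("S", demand=-k)                     # supplies k units of flow
G.add_node("T", demand=k)                      # absorbs k units of flow
for genre, quota in quotas.items():
    G.add_edge("S", genre, capacity=quota, weight=0)
for item, (genre, rel) in items.items():
    # negative cost rewards picking relevant items within each genre quota
    G.add_edge(genre, item, capacity=1, weight=-int(1000 * rel))
    G.add_edge(item, "T", capacity=1, weight=0)

flow = nx.min_cost_flow(G)
selected = [i for i, (g, _) in items.items() if flow[g].get(i, 0) == 1]
print(selected)   # a calibrated, relevance-aware selection of k items
```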

Society of Agents: Regret Bounds of Concurrent Reinforcement Learning

Yan Chen, Perry Dong, Qinxun Bai, Maria Dimakopoulou, Wei Xu, Zhengyuan Zhou (NeurIPS 2022)

We consider the concurrent reinforcement learning problem where N agents simultaneously learn to make decisions in the same environment by sharing experience with each other. Existing works in this emerging area have empirically demonstrated that Thompson sampling (TS) based algorithms provide a particularly attractive alternative for inducing cooperation, because each agent can independently sample a belief environment (and compute a corresponding optimal policy) from the joint posterior computed by aggregating all agents’ data, which induces diversity in exploration among agents while benefiting from the shared experience of all agents. However, theoretical guarantees in this area remain under-explored; in particular, no regret bound is known for TS-based concurrent RL algorithms. In this paper, we fill this gap by considering two settings: the finite-horizon episodic RL setting and the infinite-horizon RL setting. For both settings, we establish a per-agent regret bound that decreases at an optimal rate of Θ(1/√N), which demonstrates the power of cooperation in concurrent RL.
[Paper]
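
As a toy illustration of the cooperation mechanism (a Bernoulli bandit rather than the episodic or infinite-horizon MDP settings analyzed in the paper), the sketch below has N agents share a single posterior built from everyone's data, while each agent draws its own posterior sample, which keeps their exploration diverse:

```python
# Concurrent Thompson sampling on a toy Bernoulli bandit: one shared
# Beta posterior, N independent posterior samples per round. All numbers
# below are made up for the example.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
K, N, T = len(true_means), 10, 200

alpha = np.ones(K)   # shared Beta(alpha, beta) posterior over arm means
beta = np.ones(K)

for t in range(T):
    # each agent independently samples a belief from the *joint* posterior
    samples = rng.beta(alpha, beta, size=(N, K))
    arms = samples.argmax(axis=1)                   # each agent's chosen arm
    rewards = rng.binomial(1, true_means[arms])     # simultaneous feedback
    # aggregate all N observations into the shared posterior
    np.add.at(alpha, arms, rewards)
    np.add.at(beta, arms, 1 - rewards)

print("posterior means:", alpha / (alpha + beta))
```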

Online Multi-Armed Bandits with Adaptive Inference

Maria Dimakopoulou, Zhimei Ren, Zhengyuan Zhou (NeurIPS 2021)

During online decision making in Multi-Armed Bandits, one needs to conduct inference on the true mean reward of each arm based on data collected so far at each step. However, since the arms are adaptively selected, thereby yielding non-i.i.d. data, conducting inference accurately is not straightforward. Our thesis in this paper is that more sophisticated inference schemes that take into account the adaptive nature of the sequentially collected data can unlock further performance gains, even though both UCB and TS type algorithms are optimal in the worst case. In particular, we propose a variant of TS-style algorithms, which we call doubly adaptive TS, that leverages recent advances in causal inference and adaptively reweights the terms of a doubly robust estimator on the true mean reward of each arm. Through 20 synthetic domain experiments and a semi-synthetic experiment based on data from an A/B test of a web service, we demonstrate that using an adaptive inferential scheme (while still retaining the exploration efficacy of TS) provides clear benefits in online decision making. In addition, we also provide a finite-time regret bound of doubly adaptive TS that matches (up to log factors) those of UCB and TS algorithms, thereby establishing that its improved practical benefits do not come at the expense of worst-case suboptimality.
[Paper]
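
The sketch below illustrates only the ingredients, on a Bernoulli bandit: a doubly robust per-arm mean estimate built from logged propensities (approximated here by Monte Carlo), plugged into a Gaussian Thompson sampling step. The paper's doubly adaptive TS further reweights the doubly robust terms adaptively, which is omitted here:

```python
# Rough sketch: Thompson sampling driven by doubly robust arm-mean estimates
# computed from adaptively logged (arm, reward, propensity) tuples.
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.4, 0.5, 0.6])
K, T, M = len(true_means), 500, 200

arms, rewards, props = [], [], []          # logged data: (A_t, R_t, p_t(A_t))

def dr_estimates():
    """Doubly robust estimate of each arm's mean from the logged data."""
    a, r, p = np.array(arms), np.array(rewards), np.array(props)
    mu_plug = np.array([r[a == k].mean() if (a == k).any() else 0.5 for k in range(K)])
    dr = np.empty(K)
    for k in range(K):
        correction = np.where(a == k, (r - mu_plug[k]) / p, 0.0)
        dr[k] = mu_plug[k] + correction.mean()
    return dr

for t in range(T):
    if t < K:                               # forced initial pulls, chosen w.p. 1
        arm, prop = t, 1.0
    else:
        mu = dr_estimates()
        scale = 1.0 / np.sqrt(t)            # heuristic posterior width
        draws = rng.normal(mu, scale, size=(M, K))
        counts = np.bincount(draws.argmax(axis=1), minlength=K)
        p_t = np.maximum(counts / M, 1e-3)  # Monte Carlo propensities, floored
        arm = int(rng.normal(mu, scale).argmax())
        prop = p_t[arm]
    arms.append(arm)
    rewards.append(rng.binomial(1, true_means[arm]))
    props.append(prop)

print("DR estimates:", dr_estimates(), "pull counts:", np.bincount(arms, minlength=K))
```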

Post-Contextual-Bandit Inference

Aurélien Bibaut, Maria Dimakopoulou, Nathan Kallus, Antoine Chambaz, Mark van der Laan (NeurIPS 2021)

Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking because they can both improve outcomes for study participants and increase the chance of identifying good or even best policies. Nonetheless, to support credible inference on novel interventions at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or the value of new policies. The adaptive nature of the data collected by contextual bandit algorithms, however, makes this difficult: standard estimators are no longer asymptotically normally distributed and classic confidence intervals fail to provide correct coverage. While this has been addressed in non-contextual settings by using stabilized estimators, the contextual setting poses unique challenges that we tackle for the first time in this paper. We propose the Contextual Adaptive Doubly Robust (CADR) estimator, the first estimator for policy value that is asymptotically normal under contextual adaptive data collection. The main technical challenge in constructing CADR is designing adaptive and consistent conditional standard deviation estimators for stabilization. Extensive numerical experiments using 57 OpenML datasets demonstrate that confidence intervals based on CADR uniquely provide correct coverage.
[Paper]
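
The stabilization idea, stripped down to a non-contextual example with IPW rather than doubly robust pseudo-outcomes (so this is not the CADR estimator itself), is to weight each step's pseudo-outcome by the inverse of a predictable estimate of its conditional standard deviation, after which a martingale CLT restores normal confidence intervals:

```python
# Stabilized (adaptively weighted) estimate of one arm's mean from data
# collected with a decaying exploration probability. Everything here is a
# simplified illustration of the stabilization trick, not the CADR estimator.
import numpy as np

rng = np.random.default_rng(2)
theta, T = 0.6, 5000             # true mean of the target arm; horizon

psi, w = [], []                  # pseudo-outcomes and weights 1 / sigma_hat_t
pulls, successes = 1.0, 0.5      # running counts for a predictable plug-in estimate
for t in range(1, T + 1):
    theta_hat = successes / pulls                 # uses only past data (predictable)
    p = max(0.5 / np.sqrt(t), 0.05)               # decaying pull probability
    pulled = rng.random() < p
    r = float(rng.binomial(1, theta)) if pulled else 0.0
    psi_t = (float(pulled) / p) * r               # IPW pseudo-outcome, E[psi_t | past] = theta
    var_t = max(theta_hat / p - theta_hat**2, 1e-3)   # plug-in conditional variance
    psi.append(psi_t)
    w.append(1.0 / np.sqrt(var_t))
    if pulled:
        pulls += 1.0
        successes += r

psi, w = np.array(psi), np.array(w)
est = (w * psi).sum() / w.sum()                   # stabilized point estimate
half_width = 1.96 * np.sqrt(T) / w.sum()          # normal CI from the martingale CLT
print(f"{est:.3f} +/- {half_width:.3f}  (truth {theta})")
```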

Risk Minimization from Adaptively Collected Data

Aurélien Bibaut, Nathan Kallus, Maria Dimakopoulou, Antoine Chambaz, Mark van der Laan (NeurIPS 2021)

Empirical risk minimization (ERM) is the workhorse of machine learning, whether for classification and regression or for off-policy learning, but its model-agnostic guarantees can fail when we use adaptively collected data. We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class and provide first-of-their-kind generalization guarantees and fast convergence rates. Our results are based on a new maximal inequality that leverages the importance sampling structure to obtain rates with the right dependence on the exploration rate. For regression, we provide fast rates that leverage the strong convexity of squared-error loss. For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero, as is the case for bandit-collected data. An empirical investigation validates our theory.
[Paper]
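
A minimal illustration of importance-sampling weighted ERM for policy learning from bandit feedback (with a fixed logging policy and a toy class of threshold policies, rather than the adaptively collected data and general hypothesis classes treated in the paper):

```python
# Importance-weighted ERM over a tiny policy class: pick the threshold policy
# maximizing the IPS estimate of its value. Data-generating process is invented.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(-1, 1, n)                         # contexts
p = np.full(n, 0.5)                               # logging propensity of action 1
a = rng.binomial(1, p)                            # logged actions
r = rng.binomial(1, np.where(a == 1, 0.3 + 0.4 * (x > 0), 0.5))  # logged rewards

def ipw_value(policy):
    """Importance-weighted objective: estimated value of a deterministic policy."""
    match = np.where(policy(x) == a, 1.0, 0.0)
    return np.mean(match / np.where(a == 1, p, 1 - p) * r)

# hypothesis class: "play action 1 iff x > threshold"
thresholds = np.linspace(-1, 1, 41)
values = [ipw_value(lambda z, t=t: (z > t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(values))]
print(f"best threshold ~ {best:.2f}, estimated value {max(values):.3f}")
```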

Sequential Causal Inference in a Single World of Connected Units

Aurélien Bibaut, Maya Petersen, Nikos Vlassis, Maria Dimakopoulou, Mark van der Laan (arXiv preprint)

We consider adaptive designs for a trial involving multiple individuals that we follow along their trajectories over time. We allow for the variables of one individual to depend on its past and on the past of other individuals. Our goal is to learn a mean outcome, averaged across the individuals, that we would observe, if we started from some given initial state, and we carried out a given sequence of counterfactual interventions for several time steps. We show how to identify a statistical parameter that equals this mean counterfactual outcome, and how to perform inference for this parameter, while adaptively learning an oracle design defined as a parameter of the true data generating distribution. Oracle designs of interest include the design that maximizes the efficiency for a statistical parameter of interest, or designs that mix the optimal treatment rule with a certain exploration distribution. We also show how to design adaptive stopping rules for sequential hypothesis testing.
[Paper]

Doubly Robust Off-Policy Evaluation with Shrinkage

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudik (ICML 2020)

We design a new family of estimators for off-policy evaluation in contextual bandits. Our estimators are based on the asymptotically optimal approach of doubly robust estimation, but they shrink importance weights to obtain a better bias-variance tradeoff in finite samples. Our approach adapts importance weights to the quality of a reward predictor, interpolating between doubly robust estimation and direct modeling. When the reward predictor is poor, we recover previously studied weight clipping, but when the reward predictor is good, we obtain a new form of shrinkage. Extensive experiments on bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.
[Paper]
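
A compact illustration of the idea (not the paper's estimators or their data-driven choice of the shrinkage parameter): a doubly robust off-policy value estimate in which the importance weights are shrunk before use, trading a little bias for less variance. The shrinkage map below is one simple family of this kind:

```python
# Doubly robust off-policy evaluation with shrunk importance weights on a toy
# multi-armed bandit log. Logging/target policies and rewards are invented.
import numpy as np

rng = np.random.default_rng(4)
n, K = 10000, 5
logging = np.full(K, 1.0 / K)                      # uniform logging policy
a = rng.integers(0, K, size=n)
true_mean = 0.2 + 0.1 * a                          # reward depends on the action
r = rng.normal(true_mean, 1.0)

target = np.zeros(K); target[K - 1] = 1.0          # target: always play arm K-1
q_hat = np.full((n, K), r.mean())                  # a deliberately crude reward model

w = target[a] / logging[a]                         # importance weights

def dr_estimate(w_used):
    direct = (q_hat * target).sum(axis=1)          # model-based term
    return np.mean(direct + w_used * (r - q_hat[np.arange(n), a]))

for lam in [np.inf, 100.0, 10.0, 1.0]:
    w_shrunk = w if np.isinf(lam) else lam * w / (w**2 + lam)   # one simple shrinkage family
    print(f"lambda={lam}: {dr_estimate(w_shrunk):.3f}")
print("truth:", 0.2 + 0.1 * (K - 1))
```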

ADMM SLIM: Sparse Recommendations for Many Users

Harald Steck, Maria Dimakopoulou, Nickolai Riabov, Tony Jebara (WSDM 2020)

The Sparse Linear Method (SLIM) is a well-established approach for top-N recommendations. This article proposes several improvements to SLIM that are enabled by the Alternating Directions Method of Multipliers (ADMM). We evaluate our approach against the original SLIM and other state-of-the-art approaches on three well-known datasets. In our experiments, we find that our approach not only reduces training time considerably but also achieves up to a 25% improvement in recommendation accuracy due to better optimization. Finally, we compare the approaches in experiments that simulate cold-start scenarios and scenarios with a large catalog relative to a small user base, which often occur in practice.
[Paper]
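
A simplified sketch of the ADMM splitting in numpy (the paper's exact handling of the zero-diagonal constraint differs; here it is simply folded into the projection step on the auxiliary variable):

```python
# Simplified ADMM for a SLIM-style objective: ridge + L1 + nonnegativity +
# zero diagonal on an item-item weight matrix B, used to score via X @ B.
import numpy as np

def admm_slim(X, l1=1.0, l2=10.0, rho=100.0, n_iters=50):
    n_items = X.shape[1]
    G = X.T @ X                                        # item-item Gram matrix
    # cached inverse: the B-step is the same ridge-like solve every iteration
    P = np.linalg.inv(G + (l2 + rho) * np.eye(n_items))
    B = np.zeros((n_items, n_items))
    C = np.zeros_like(B)                               # auxiliary copy of B
    U = np.zeros_like(B)                               # scaled dual variable
    for _ in range(n_iters):
        B = P @ (G + rho * (C - U))                    # smooth ridge subproblem
        C = np.clip(B + U - l1 / rho, 0.0, None)       # soft-threshold + nonnegativity
        np.fill_diagonal(C, 0.0)                       # no self-similarity
        U = U + B - C                                  # dual update
    return C

X = (np.random.default_rng(5).random((200, 50)) < 0.1).astype(float)
B = admm_slim(X)
scores = X @ B                                         # top-N recommendation scores
print(B.shape, float(scores.max()))
```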

Marginal Posterior Sampling for Slate Bandits

Maria Dimakopoulou, Nikos Vlassis, Tony Jebara (IJCAI 2019)

We introduce a new Thompson sampling-based algorithm, called marginal posterior sampling, for online slate bandits, which is characterized by three key ideas. First, it postulates that the slate-level reward is a monotone function of the marginal unobserved rewards of the actions in the slate's slots, which it does not attempt to estimate. Second, it maintains posterior distributions for the marginal reward of each slot's actions rather than a slate-level reward posterior. Third, it optimizes at the slot level rather than the slate level, which makes it computationally efficient. Simulation results show substantial advantages of marginal posterior sampling over state-of-the-art alternatives.
[Paper]

Balanced Linear Contextual Bandits

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, Guido Imbens (AAAI 2019; oral)

Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make it less prone to problems of bias. We prove that our algorithms match the state of the art regret bound guarantees. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model misspecification and prejudice in the initial training data.
[Paper] [Poster]
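
A small sketch of the balancing idea, using epsilon-greedy so that assignment probabilities are explicit (rather than the Thompson sampling and UCB variants developed in the paper): each arm's linear reward model is fit by ridge regression in which every observation is weighted by the inverse of the probability with which that arm was chosen at the time.

```python
# Balanced (inverse-propensity weighted) ridge regression inside a linear
# contextual bandit loop. The environment and hyperparameters are invented.
import numpy as np

rng = np.random.default_rng(7)
d, K, T, eps, lam = 5, 3, 2000, 0.1, 1.0
theta_true = rng.normal(size=(K, d))

data = [[] for _ in range(K)]                  # per arm: list of (x, r, propensity)

def balanced_fit(k):
    if not data[k]:
        return np.zeros(d)
    X = np.array([x for x, _, _ in data[k]])
    r = np.array([r_ for _, r_, _ in data[k]])
    w = np.array([1.0 / p for _, _, p in data[k]])   # inverse-propensity weights
    A = X.T @ (w[:, None] * X) + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ (w * r))

for t in range(T):
    x = rng.normal(size=d)
    theta_hat = np.array([balanced_fit(k) for k in range(K)])
    greedy = int((theta_hat @ x).argmax())
    probs = np.full(K, eps / K)
    probs[greedy] += 1 - eps                   # explicit assignment probabilities
    arm = int(rng.choice(K, p=probs))
    reward = float(theta_true[arm] @ x + rng.normal(scale=0.5))
    data[arm].append((x, reward, probs[arm]))

print("estimation error per arm:",
      np.round(np.linalg.norm(theta_hat - theta_true, axis=1), 2))
```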

On the Design of Estimators for Bandit Off-Policy Evaluation

Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, Tony Jebara (ICML 2019)

Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. We describe a framework for designing estimators for bandit off-policy evaluation. Given a base estimator and a parametrized class of control variates, we seek a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we discuss approaches for optimizing this function. We present our main results in the context of multi-armed bandits, and we describe a contextual bandit estimator that is shown to perform well on multi-class cost-sensitive classification datasets.
[Paper]
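
A minimal illustration of the framework's core step for multi-armed bandits, with a single scalar control-variate coefficient estimated from the data (the paper treats a parametrized class and analyzes the resulting risk): take IPS as the base estimator and the centered importance weights, whose expectation is known to be zero, as the control variate.

```python
# Off-policy evaluation with a control variate: subtract beta * (w - 1) from
# the IPS terms, with beta chosen to minimize variance. Policies and rewards
# below are invented.
import numpy as np

rng = np.random.default_rng(8)
n, K = 20000, 4
logging = np.array([0.4, 0.3, 0.2, 0.1])
target = np.array([0.1, 0.2, 0.3, 0.4])
true_means = np.array([0.2, 0.4, 0.6, 0.8])

a = rng.choice(K, size=n, p=logging)
r = rng.binomial(1, true_means[a]).astype(float)

w = target[a] / logging[a]                 # importance weights
base = w * r                               # IPS terms; their mean estimates the value
cv = w - 1.0                               # control variate with known expectation 0
cov = np.cov(base, cv)
beta = cov[0, 1] / cov[1, 1]               # variance-minimizing coefficient
est_ips = base.mean()
est_cv = (base - beta * cv).mean()
print(f"IPS: {est_ips:.4f}  with control variate: {est_cv:.4f}  "
      f"truth: {float(target @ true_means):.4f}")
```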

Scalable Coordinated Exploration in Concurrent Reinforcement Learning

Maria Dimakopoulou, Ian Osband, Benjamin Van Roy (NeurIPS 2018)

We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on seed sampling (Dimakopoulou and Van Roy, 2018) and randomized value function learning (Osband et al., 2016). We demonstrate that, for simple tabular contexts, the approach is competitive with previously proposed tabular model learning methods (Dimakopoulou and Van Roy, 2018). With a higher-dimensional problem and a neural network value function representation, the approach learns quickly with far fewer agents than alternative exploration schemes.
[Paper] [Poster] [Demo]

Coordinated Exploration in Concurrent Reinforcement Learning

Maria Dimakopoulou, Benjamin Van Roy (ICML 2018; long talk)

We consider a team of reinforcement learning agents that concurrently learn to operate in a common environment, while sharing data in real-time. We identify three properties that are essential to efficient coordinated exploration: real-time adaptivity to shared observations, commitment to carry through with action sequences that reveal new information, and diversity across learning opportunities pursued by different agents. We demonstrate that optimism-based approaches fall short with respect to diversity, while naive extensions of Thompson sampling lack commitment. We propose seed sampling that offers a general approach to designing effective coordination algorithms for concurrent reinforcement learning and has substantial advantages over alternative exploration schemes.
[Paper] [Demo] [ICML 2018 Slides] [ICML 2018 Video]
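
A toy sketch of the seed-sampling idea on a Gaussian bandit (the paper treats MDPs): each agent holds a fixed seed consisting of a prior sample and a personal stream of observation perturbations; all agents share the growing dataset in real time, but each refits against its own perturbed version of that data and acts greedily. The fixed seed provides commitment and diversity, while refitting on shared data provides real-time adaptivity.

```python
# Seed-sampling-style coordinated exploration on a toy Gaussian bandit.
import numpy as np

rng = np.random.default_rng(9)
K, N, T, noise_sd, prior_sd = 5, 4, 100, 1.0, 1.0
true_means = rng.normal(size=K)

# each agent's seed: a prior sample per arm and a personal perturbation stream
prior_samples = rng.normal(0.0, prior_sd, size=(N, K))
perturbations = rng.normal(0.0, noise_sd, size=(N, N * T))

shared_arms, shared_rewards = [], []
for t in range(T):
    arms_now = []
    for agent in range(N):
        # refit arm estimates against the shared data, with this agent's
        # seed-determined perturbation added to every shared observation
        est = prior_samples[agent].copy()      # prior sample as one pseudo-observation
        counts = np.ones(K)
        for i, (a, r) in enumerate(zip(shared_arms, shared_rewards)):
            est[a] += r + perturbations[agent, i]
            counts[a] += 1
        arms_now.append(int((est / counts).argmax()))   # act greedily on the perturbed fit
    rewards_now = rng.normal(true_means[arms_now], noise_sd)
    shared_arms.extend(arms_now)
    shared_rewards.extend(rewards_now)

print("best arm:", int(true_means.argmax()),
      "pull fraction of best arm:",
      float(np.mean(np.array(shared_arms) == true_means.argmax())))
```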

Estimation Considerations in Contextual Bandits

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, Guido Imbens

We study a new consideration in the exploration-exploitation framework: the way exploration is conducted in the present may affect the bias and variance of the potential outcome model estimation in subsequent stages of learning. We show that contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We propose new contextual bandit designs, combining parametric and nonparametric statistical estimation methods with causal inference methods in order to reduce the estimation bias, and provide empirical evidence that guides the choice among the alternatives in different scenarios.
[Paper]

Market-based dynamic service mode switching in wireless networks

Maria Dimakopoulou, Nicholas Bambos, Martin Valdez-Vivas, John Apostolopoulos (PIMRC 2017)

We consider a virtualized wireless networking architecture, where infrastructure access points of different carriers form a marketplace of resources and bid service deals to a mobile device. At each point in time the mobile device evaluates the available service deals and dynamically decides which one to accept and use in the next transmission interval. Its objective is to minimize the long-term cumulative service cost and latency cost of transmitting the packets in its buffer. We develop a model of this architecture, which allows for the formulation and computation of the optimal control with which the mobile accepts one of the offered deals and switches into the corresponding service mode. The performance of the optimal control and of low-complexity heuristic controls is probed via simulation.
[Paper]
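
As a toy illustration of the flavor of this control problem (the deal menu, arrival rate and holding cost below are invented; the paper's marketplace model and dynamics are richer), a mobile with a packet buffer can choose in each slot which offered service deal to accept so as to minimize discounted service-plus-latency cost, solved here by value iteration:

```python
# Toy dynamic program: buffer state, a fixed menu of (price, service rate)
# deals, minimize discounted service + holding (latency) cost.
import numpy as np

B, gamma, arrival_p, hold_cost = 10, 0.95, 0.4, 1.0
# deal menu: (price per slot, probability a buffered packet is transmitted)
deals = [(0.0, 0.0), (1.0, 0.5), (3.0, 0.9)]

V = np.zeros(B + 1)
for _ in range(500):                               # value iteration
    Q = np.zeros((B + 1, len(deals)))
    for b in range(B + 1):
        for d, (price, q) in enumerate(deals):
            cost_now = price + hold_cost * b
            q_eff = q if b > 0 else 0.0            # nothing to serve if buffer is empty
            exp_next = 0.0
            for served, ps in [(1, q_eff), (0, 1 - q_eff)]:
                for arrive, pa in [(1, arrival_p), (0, 1 - arrival_p)]:
                    nb = min(max(b - served + arrive, 0), B)
                    exp_next += ps * pa * V[nb]
            Q[b, d] = cost_now + gamma * exp_next
    V = Q.min(axis=1)

policy = Q.argmin(axis=1)
print("optimal deal index per buffer level:", policy)
```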

Reliable and Efficient Performance Monitoring in Linux

Maria Dimakopoulou, Stéphane Eranian, Nectarios Koziris, Nicholas Bambos (Supercomputing 2016)

We address a published erratum in the Performance Monitoring Unit (PMU) of Intel Sandy Bridge, Ivy Bridge and Haswell processors with hyper-threading enabled, which causes cross-hyper-thread hardware counter corruption and may produce unreliable results. We propose a cache-coherence style protocol, which we implement in the Linux kernel to address the issue by introducing cross-hyper-thread dynamic event scheduling. Additionally, we improve event scheduling efficiency by introducing a bipartite graph matching algorithm which optimally and consistently schedules events onto hardware counters. The improvements have been contributed to the upstream Linux kernel v4.1.
[Paper]
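
A small illustration of the scheduling step with made-up events and counter constraint masks (the kernel patch implements its own scheduler tailored to the PMU constraints): each event may only run on a subset of the counters, and a maximum bipartite matching finds an assignment that a greedy, order-dependent scheduler can miss.

```python
# Schedule performance events onto hardware counters via maximum bipartite
# matching (Hopcroft-Karp from networkx). Events and constraints are invented.
import networkx as nx

# event -> counters it is allowed to run on (toy constraint masks)
constraints = {
    "cycles":        {"ctr0", "ctr1", "ctr2", "ctr3"},
    "l1d_misses":    {"ctr0", "ctr1"},
    "br_mispredict": {"ctr0"},
    "llc_misses":    {"ctr2"},
}

G = nx.Graph()
events = list(constraints)
counters = sorted({c for cs in constraints.values() for c in cs})
G.add_nodes_from(events, bipartite=0)
G.add_nodes_from(counters, bipartite=1)
for e, cs in constraints.items():
    G.add_edges_from((e, c) for c in cs)

matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=events)
schedule = {e: matching[e] for e in events if e in matching}
print(schedule)   # every event gets a counter iff a full matching exists
```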

Invited Talks & Workshops

(Dec 7th 2023) Invited speaker at the Stanford/UCSF Bay Area Tech Economics Seminar.

(Sep 22nd 2022) Co-organizer of RecSys 2022 REVEAL workshop on Reinforcement Learning for Recommender Systems.

(Mar 17th 2022) Invited speaker at the Women in Data Science (WiDS) conference giving a talk on "Causal Adaptive Learning".

(Dec 14th 2021) Co-organizer of the NeurIPS 2021 Causal Sequential Decisions workshop.

(Oct 25th 2021) Invited speaker at Hamsa Bastani's class at the Wharton School, University of Pennsylvania.

(Jul 23rd 2021) Invited speaker and panelist at the ICML 2021 RL4RealLife workshop giving a talk on "Estimation and Exploration Considerations for Practical RL".

(Jun 11th 2021) Invited speaker at the Personalization, Ranker and Search (PRS) annual summit organized by Netflix giving a joint talk with Chris Alvino and Devesh Parekh on "Messaging at Netflix: Causal Modeling in Practice".

(Dec 6th 2020) Invited speaker at the NeurIPS 2020 Expo giving a talk on "Slate Bandit Online Learning & Off-Policy Evaluation".

(Sep 26th 2020) Co-organizer of the RecSys 2020 "Bandit and Reinforcement Learning from User Interactions" (REVEAL) workshop.

(Jul 13th 2020) Invited speaker at the ICML 2020 WiML workshop giving a talk on "Slate Bandit Online Learning & Off-Policy Evaluation".

(Oct 20th 2019) Invited speaker at INFORMS 2019 "Bandits and Reinforcement Learning" session giving a talk on "Coordinated Exploration in Concurrent Reinforcement Learning".

(Sep 20th 2019) Co-organizer of the RecSys 2019 "Bandit and Reinforcement Learning from User Interactions" (REVEAL) workshop and invited speaker giving a talk on "Marginal Posterior Sampling for Slate Bandits".

(Sep 16th 2019) Invited speaker at DeepMind London giving a talk on "Coordinated Exploration in Concurrent Reinforcement Learning".

Get in touch