Maria Dimakopoulou

Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix

Vickie Zhang, Michael Zhao, Anh Le, Maria Dimakopoulou, Nathan Kallus (CODE 2023)

Surrogate index approaches have recently become a popular method of estimating longer-term impact from shorter-term outcomes. In this paper, we leverage 1098 test arms from 200 A/B tests at Netflix to empirically investigate to what degree would decisions made using a surrogate index utilizing 14 days of data would align with those made using direct measurement of day 63 treatment effects. Focusing specifically on linear “auto-surrogate” models that utilize the shorter-term observations of the long-term outcome of interest, we find that the statistical inferences that we would draw from using the surrogate index are ~95% consistent with those from directly measuring the long-term treatment effect. Moreover, when we restrict ourselves to the set of tests that would be “launched” (i.e. positive and statistically significant) based on the 63-day directly measured treatment effects, we find that relying instead on the surrogate index achieves 79% and 65% recall..
[Paper]

Calibrated Recommendations as a Minimum-Cost Flow Problem

Himan Abdollahpouri, Zahra Nazari, Alex Gain, Clay Gibson, Maria Dimakopoulou, Jesse Anderton, Benjamin Carterette, Mounia Lalmas, Tony Jebara (WSDM 2023)

Calibration in recommender systems has recently gained significant attention. In the recommended list of items, calibration ensures that the various (past) areas of interest of a user are reflected with their corresponding proportions. For instance, if a user has watched, say, 80 romance movies and 20 action movies, then it is reason-able to expect the recommended list of movies to be comprised of about 80% romance and 20% action movies as well. Calibration is particularly important given that optimizing towards accuracy often leads to the user’s minority interests being dominated by their main interests, or by a few overall popular items, in the recommendations they receive. In this paper, we propose a novel approach based on the minimum-cost flow problem for generating calibrated recommendations. In a series of experiments using two publicly available datasets, we demonstrate the superior performance of our proposed approach compared to the state-of-the-art in generating relevant and calibrated recommendation lists.
[Paper]

Society of Agents: Regrets Bounds of Concurrent Reinforcement Learning

Yan Chen, Perry Dong, Qinxun Bai, Maria Dimakopoulou, Wei Xu, Zhengyuan Zhou (NeurIPS 2022)

We consider the concurrent reinforcement learning problem where N agents simultaneously learn to make decisions in the same environment by sharing experience with each other. Existing works in this emerging area have empirically demonstrated that Thompson sampling (TS) based algorithms provide a particularly attractive alternative for inducing cooperation, because each agent can independently sample a belief environment (and compute a corresponding optimal policy) from the joint posterior computed by aggregating all agents’ data , which induces diversity in exploration among agents while benefiting shared experience from all agents. However, theoretical guarantees in this area remain under-explored; in particular, no regret bound is known on TS based concurrent RL algorithms. In this paper, we fill in this gap by considering two settings; the finite-horizon episodic RL setting and the infinite-horizon RL problem. For both settings, we establish a per-agent regret bound that decreases at an optimal rate of Θ(1/√Ν), which manifests the power of cooperation in concurrent RL.
[Paper]

Online Multi-Armed Bandits with Adaptive Inference

Maria Dimakopoulou, Zhimei Ren, Zhengyuan Zhou (NeurIPS 2021)

During online decision making in Multi-Armed Bandits, one needs to conduct inference on the true mean reward of each arm based on data collected so far at each step. However, since the arms are adaptively selected–thereby yielding non-iid data–conducting inference accurately is not straightforward. Our thesis in this paper is that more sophisticated inference schemes that take into account the adaptive nature of the sequentially collected data can unlock further performance gains, even though both UCB and TS type algorithms are optimal in the worst case. In particular, we propose a variant of TS-style algorithms–which we call doubly adaptive TS–that leverages recent advances in causal inference and adaptively reweights the terms of a doubly robust estimator on the true mean reward of each arm. Through 20 synthetic domain experiments and a semi-synthetic experiment based on data from an A/B test of a web service, we demonstrate that using an adaptive inferential scheme (while still retaining the exploration efficacy of TS) provides clear benefits in online decision making. In addition, we also provide a finite-time regret bound of doubly adaptive TS that matches (up to log factors) those of UCB and TS algorithms, thereby establishing that its improved practical benefits do not come at the expense of worst-case suboptimality.
[Paper]

Post-Contextual-Bandit Inference

Aurélien Bibaut, Maria Dimakopoulou, Nathan Kallus, Antoine Chambaz, Mark van der Laan (NeurIPS 2021)

The adaptive nature of the data collected by contextual bandit algorithms, however, makes inference difficult: standard estimators are no longer asymptotically normally distributed and classic confidence intervals fail to provide correct coverage. While this has been addressed in non-contextual settings by using stabilized estimators, the contextual setting poses unique challenges that we tackle for the first time in this paper. We propose the Contextual Adaptive Doubly Robust (CADR) estimator, the first estimator for policy value that is asymptotically normal under contextual adaptive data collection. The main technical challenge in constructing CADR is designing adaptive and consistent conditional standard deviation estimators for stabilization. Extensive numerical experiments using 57 OpenML datasets demonstrate that confidence intervals based on CADR uniquely provide correct coverage.
[Paper]

Risk Minimization from Adaptively Collected Data

Aurélien Bibaut, Nathan Kallus, Maria Dimakopoulou, Antoine Chambaz, Mark van der Laan (NeurIPS 2021)

Empirical risk minimization (ERM) is the workhorse of machine learning, whether for classification and regression or for off-policy learning, but its model-agnostic guarantees can fail when we use adaptively collected data. We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class and provide first-of-their-kind generalization guarantees and fast convergence rates. Our results are based on a new maximal inequality that leverages the importance sampling structure to obtain rates with the right dependence on the exploration rate. For regression, we provide fast rates that leverage the strong convexity of squared-error loss. For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero, as is the case for bandit-collected data. An empirical investigation validates our theory.
[Paper]

Sequential Causal Inference in a Single World of Connected Units

Aurélien Bibaut, Maya Petersen, Nikos Vlassis, Maria Dimakopoulou, Mark van der Laan (arXiv preprint)

We consider adaptive designs for a trial involving multiple individuals, where the variables of one individual depend on its past and on the past of other individuals. Our goal is to learn a mean outcome, averaged across the individuals, that we would observe if we carried out a given sequence of counterfactual interventions for several time steps. We show how to identify a statistical parameter that equals this mean counterfactual outcome, and how to perform inference for this parameter, while adaptively learning an oracle design defined as a parameter of the true data generating distribution.
[Paper]

Doubly Robust Off-Policy Evaluation with Shrinkage

Yi Su, Maria Dimakopoulou,
Akshay Krishnamurthy, Miroslav Dudik (ICML 2020)

We design a new family of estimators for off-policy evaluation in contextual bandits. Our estimators are based on the asymptotically optimal approach of doubly robust estimation, but they shrink importance weights to obtain a better bias-variance tradeoff in finite samples. Our approach adapts importance weights to the quality of a reward predictor, interpolating between doubly robust estimation and direct modeling. When the reward predictor is poor, we recover previously studied weight clipping, but when the reward predictor is good, we obtain a new form of shrinkage. Extensive experiments on bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.
[Paper]

ADMM SLIM: Sparse Recommendations for Many Users

Harald Steck, Maria Dimakopoulou,
Nickolai Riabov, Tony Jebara (WSDM 2020)

The Sparse Linear Method (SLIM) is a well-established approach for top-N recommendations. This article proposes several improvements to SLIM that are enabled by the Alternating Directions Method of Multipliers (ADMM). We evaluate our approach against the original SLIM and other state-of-the-art approaches on three well-known data-sets. We find that not only our approach reduces training time considerably but also achieves an up to 25% improvement in recommendation accuracy due to better optimization. We compare the approaches in experiments that simulate scenarios of cold-starting and large catalog sizes compared to a small user base.
[Paper]

Marginal Posterior Sampling for Slate Bandits

Maria Dimakopoulou, Nikos Vlassis, Tony Jebara (IJCAI 2019)

We introduce a new Thompson sampling-based algorithm, called marginal posterior sampling, for online slate bandits, that is characterized by three key ideas. First, it postulates that the slate-level reward is a monotone function of the marginal unobserved rewards of the actions in the slate's slots, which it does not attempt to estimate. Second, it maintains posterior distributions for the marginal reward of each slot's actions rather than a slate-level reward posterior. Third, it optimizes at the slot-level rather than the slate-level, which makes it computationally efficient. Simulation results show substantial advantages of marginal posterior sampling over state-of-the-art alternatives.
[Paper]

Balanced Linear Contextual Bandits

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, Guido Imbens (AAAI 2019; oral)

Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models. We develop algorithms for contextual bandits that integrate balancing methods from causal inference in their estimation to make it less prone to problems of bias. We prove that our algorithms match the state of the art regret bound guarantees and demonstrate their strong practical advantage on a large number of supervised learning datasets and on synthetic examples that simulate model misspecification and covariate shift.
[Paper] [Poster]

On the Design of Estimators for Bandit Off-Policy Evaluation

Nikos Vlassis, Aurelien Bibaut,
Maria Dimakopoulou, Tony Jebara (ICML 2019)

Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. We describe a framework for designing estimators for bandit off-policy evaluation. Given a base estimator and a parametrized class of control variates, we seek a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we discuss approaches for optimizing this function. We present our main results in the context of multi-armed bandits, and we decribe a contextual bandits estimator that is shown to perform well in multi-class cost-sensitive classification datasets.
[Paper]

Scalable Coordinated Exploration in Concurrent Reinforcement Learning

Maria Dimakopoulou, Ian Osband, Benjamin Van Roy (NeurIPS 2018)

We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on seed sampling (Dimakopoulou and Van Roy, 2018) and randomized value function learning (Osband et al., 2016). We demonstrate that, for simple tabular contexts, the approach is competitive with previously proposed tabular model learning methods (Dimakopoulou and Van Roy, 2018). With a higher-dimensional problem and a neural network value function representation, the approach learns quickly with far fewer agents than alternative exploration schemes.
[Paper] [Poster][Demo]

Coordinated Exploration in Concurrent Reinforcement Learning

Maria Dimakopoulou, Benjamin Van Roy (ICML 2018; long talk)

We consider a team of reinforcement learning agents that concurrently learn to operate in a common environment, while sharing data in real-time. We identify three properties that are essential to efficient coordinated exploration: real-time adaptivity to shared observations, commitment to carry through with action sequences that reveal new information, and diversity across learning opportunities pursued by different agents. We demonstrate that optimism-based approaches fall short with respect to diversity, while naive extensions of Thompson sampling lack commitment. We propose seed sampling that offers a general approach to designing effective coordination algorithms for concurrent reinforcement learning and has substantial advantages over alternative exploration schemes.
[Paper] [Demo] [ICML 2018 Slides] [ICML 2018 Video]

Estimation Considerations in Contextual Bandits

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, Guido Imbens

We study a new consideration for the exploration vs. exploitation framework which is that the way exploration is conducted in the present may affect the bias and variance in the potential outcome model estimation in subsequent stages of learning. We show that contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We propose new contextual bandit designs, combining parametric and nonparametric statistical estimation methods with causal inference methods in order to reduce the estimation bias and provide empirical evidence that guides the choice among the alternatives in different scenarios.
[Paper]

Market-based dynamic service mode switching in wireless networks

Maria Dimakopoulou, Nicholas Bambos, Martin Valdez-Vivas, John Apostolopoulos (PIMRC 2017)

We consider a virtualized wireless networking architecture, where infrastructure access points of different carriers form a marketplace of resources and bid service deals to a mobile device. At each point in time the mobile evaluates the available service deals and dynamically decides which one to accept and use in the next transmission interval. Its objective is to minimize the long term cumulative service cost and latency cost to transmit packets in its buffer. We develop a model of this architecture, which allows for the formulation and computation of the optimal control for the mobile to accept an offered deal amongst many and switch into the corresponding service mode. The performance of the optimal and low-complexity heuristic controls is probed via simulation.
[Paper]

Reliable and Efficient Performance Monitoring in Linux

Maria Dimakopoulou, Stéphane Eranian, Nectarios Koziris, Nicholas Bambos (Supercomputing 2016)

We address a published eratum in the Performance Monitoring Unit (PMU) of Intel Sandy Bridge, Ivy Bridge and Haswell processors with hyper-threading enabled which causes cross hyper-thread hardware counter corruption and may produce unreliable results. We propose a cache-coherence style protocol, which we implement in the Linux kernel to address the issue by introducing cross hyper-thread dynamic event scheduling. Additionally, we improve event scheduling efficiency by introducing a bipartite graph matching algorithm which optimally schedules events onto hardware counters consistently. The improvements have been contributed to the upstream Linux kernel v4.1.
[Paper]

Uber | Stanford