Residual Overfit Method of Exploration
Abstract
Exploration is a crucial aspect of bandit and reinforcement learning algorithms. The uncertainty quantification necessary for exploration often comes from either closed-form expressions based on simple models or resampling and posterior approximations that are computationally intensive. We propose instead an approximate exploration methodology based on fitting only two point estimates, one tuned and one overfit. The approach, which we term the residual overfit method of exploration (Rome), drives exploration towards actions where the overfit model exhibits the most overfitting compared to the tuned model. The intuition is that overfitting occurs the most at actions and contexts with insufficient data to form accurate predictions of the reward. We justify this intuition formally from both a frequentist and a Bayesian information theoretic perspective. The result is a method that generalizes to a wide variety of models and avoids the computational overhead of resampling or posterior approximations. We compare Rome against a set of established contextual bandit methods on three datasets and find it to be one of the best performing.
1 Introduction
The use of machine learning in interactive environments such as recommender systems [14, 26, 18] and display ads [15, 6, 11] motivates the study of how to balance taking high value actions (exploitation) with gathering diverse data to learn better models (exploration). The framework of contextual multi-armed bandits, and its extension to reinforcement learning in dynamic environments, provides guidance for addressing this important task [25]. Generally, efficient algorithms tackle the tradeoff by encouraging actions with high model uncertainty, either by adding an explicit bonus for uncertainty as in upper confidence bound (UCB) algorithms [13] or by sampling from the posterior distribution over the parameters to promote uncertain actions as in Thompson sampling [6]. In either case, some quantification of uncertainty is needed.
There are several challenges to uncertainty quantification for both UCB and Thompson sampling algorithms. With the exception of simple models, such as linear models or those with conjugate priors, the sampling or posterior distributions are not analytically known, and instead approximation methods are needed, such as bootstrapping, Markov chain Monte Carlo (MCMC), or variational inference (VI) [2]. Both bootstrapping and MCMC are computationally intensive, and the latter requires diagnostics to assess convergence. VI is scalable but tends to require a specialized algorithm for each class of model to be effective, and it has the property of underestimating posterior variance due to its objective being an expectation with respect to the approximating distribution [17].
Against this background, our motivation is to develop an effective methodology for exploration that is scalable and adaptable to a wide range of models. Crucially, we seek a method of uncertainty quantification that applies to complex predictive models that may be biased. Bias is introduced because the best estimators in terms of mean-squared error use tools to prevent overfitting, such as L1/L2 regularization, bagging [3], dropout [1], early stopping [20, 4], and the like. While improving prediction, these make uncertainty quantification more difficult as the uncertainty consists of more than just variance. We show how fitting one additional model without these tools, i.e., formulating an overfit model of the data, and combining its predictions with those of the tuned estimator enables approximate uncertainty quantification. We formalize this as the residual overfit and we use it to drive exploration in what we call the residual overfit method of exploration (Rome).
From a frequentist perspective, the residual overfit provides an upper approximation of pointwise uncertainty. From a Bayesian information theoretical perspective, the residual overfit provides an upper approximation of the information gain from exploring at any one new point. These two perspectives suggest it is a possible approximation for driving exploration. To explore this in practice, we consider a bandit experimental setup and compare both UCB and Thompson sampling algorithms based on Rome to benchmark methods that either use resampling to tackle complex models or use exact uncertainty quantification for simple models. Across our experiments, we find Rome performs competitively, often getting the best performance, despite its simplicity and tractability. Together, our results suggest Rome is a good option for driving exploration in practical settings with complex predictive models.
2 The Residual Overfit
Definitions
Given a design $x_{1:n}$, we consider data consisting of noisy observations $y_i = f(x_i) + \varepsilon_i$ of a function $f$, where $\varepsilon_i$ has mean zero and variance $\sigma^2$ and is independent of $\varepsilon_j$ for $j \neq i$. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ represent the observed data. Considering the design fixed, the randomness of $\mathcal{D}$ reduces to the randomness of $\varepsilon_{1:n}$. The design may also be random and all the conclusions would still follow since they hold for any one design. Let $\hat{f}$ and $\tilde{f}$ be estimators for the unknown function $f$ based on the training data $\mathcal{D}$. Function $\hat{f}$ is generally understood to be trained for estimation, i.e., for the lowest mean squared error (MSE). Function $\tilde{f}$ is trained for unbiasedness, i.e., it satisfies the constraint $\mathbb{E}[\tilde{f}(x)] = f(x)$ for any $x$, where expectations are taken with respect to the training data, that is, with respect to the randomness of $\varepsilon_{1:n}$. We define the residual overfit at $x$ as,
$\rho(x) = \tilde{f}(x) - \hat{f}(x).$  (1)
When $\hat{f}$ and $\tilde{f}$ are independent, the expected squared residual overfit at $x$ is equal to the mean squared error of $\hat{f}$ plus the variance of $\tilde{f}$,

$\mathbb{E}[\rho(x)^2] = \mathrm{MSE}(\hat{f}(x)) + \mathrm{Var}(\tilde{f}(x)),$  (2)

where $\mathrm{MSE}(\hat{f}(x)) = \mathbb{E}[(\hat{f}(x) - f(x))^2]$. Recall $\mathbb{E}$ and $\mathrm{Var}$ are taken with respect to the training data. Note $x$ is a fixed input and is not random.
We have

$\mathbb{E}[\rho(x)^2] = \mathbb{E}\big[\big((\tilde{f}(x) - f(x)) - (\hat{f}(x) - f(x))\big)^2\big]$  (3)
$= \mathbb{E}[(\tilde{f}(x) - f(x))^2] + \mathbb{E}[(\hat{f}(x) - f(x))^2] - 2\,\mathbb{E}[(\tilde{f}(x) - f(x))(\hat{f}(x) - f(x))]$  (4)
$= \mathrm{Var}(\tilde{f}(x)) + \mathrm{MSE}(\hat{f}(x)),$  (5)

where the last equality is by the unbiasedness of $\tilde{f}$ and the independence of $\hat{f}$ and $\tilde{f}$.
The estimators $\hat{f}$ and $\tilde{f}$ can be made independent by training them on two disjoint random splits of the data.
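To make the construction concrete, the sketch below fits a regularized model (standing in for the tuned $\hat{f}$) and an unregularized one (standing in for the unbiased $\tilde{f}$) on two disjoint random splits, then evaluates the residual overfit on a grid. The polynomial/ridge setup, the function names, and the synthetic data are illustrative choices, not from the paper:

```python
import numpy as np

def fit_poly(X, y, degree, l2=0.0):
    """Least-squares polynomial fit with an optional L2 (ridge) penalty."""
    # Vandermonde design matrix: columns 1, x, x^2, ..., x^degree.
    Phi = np.vander(X, degree + 1, increasing=True)
    A = Phi.T @ Phi + l2 * np.eye(degree + 1)
    w = np.linalg.solve(A, Phi.T @ y)
    return lambda x: np.vander(np.atleast_1d(x), degree + 1, increasing=True) @ w

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                       # unknown reward function
X = rng.uniform(-1, 1, size=60)
y = f(X) + rng.normal(scale=0.3, size=60)

# Independent estimators via two disjoint random splits of the data.
perm = rng.permutation(60)
a, b = perm[:30], perm[30:]
f_hat = fit_poly(X[a], y[a], degree=5, l2=1.0)    # tuned: regularized
f_tilde = fit_poly(X[b], y[b], degree=5, l2=0.0)  # overfit: unregularized

x_grid = np.linspace(-1, 1, 200)
rho = f_tilde(x_grid) - f_hat(x_grid)             # residual overfit, Eq. 1
```

In regions with little training data the two fits tend to disagree more, which is the signal Rome exploits.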
Why not fit a model to the prediction error of $\hat{f}$ instead?
An appealing alternative in studying the error of $\hat{f}$ may be to directly fit a predictive model of the squared error of $\hat{f}$. This, however, involves both the estimation error and the noise. Namely, fix $x$ and consider $y = f(x) + \varepsilon$; then predicting the squared error of $\hat{f}$ at $x$ would estimate $\mathbb{E}_{\varepsilon}[(y - \hat{f}(x))^2] = (f(x) - \hat{f}(x))^2 + \sigma^2$. Taking expectations over the data as well yields $\mathrm{MSE}(\hat{f}(x)) + \sigma^2$. Thus, at best, a model for the prediction error of $\hat{f}$ would involve the irreducible variance $\sigma^2$, which does not vanish. Therefore, this may not well reflect the estimation error of $\hat{f}$. This is known as the white noise problem in reinforcement learning [23]. In contrast, the expected squared residual overfit only involves estimation errors. In addition, if the data is heteroskedastic, then $\sigma^2(x)$ will vary across $x$ even under no model uncertainty and will incorrectly bias exploration.
See Figure 1 for a visual comparison of the two approaches.
3 Residual Overfit Method of Exploration (ROME)
The properties of the squared residual overfit in Eq. 2 are suggestive of the variance term in standard approaches for exploration-exploitation. Consider the discrete action set $\mathcal{A}$ with context $x$ comprising continuous and/or discrete features. Before each interaction with the environment, the bandit observes a context $x$ and must score each action $a \in \mathcal{A}$. The action with the maximum score is taken greedily. The residual overfit may be applied to this setting using the following scores,
$s_{\mathrm{UCB}}(a, x) = \hat{f}(a, x) + \beta\,\lvert \tilde{f}(a, x) - \hat{f}(a, x) \rvert$  (6)
$s_{\mathrm{TS}}(a, x) = \hat{f}(a, x) + \beta\,\eta_a \big( \tilde{f}(a, x) - \hat{f}(a, x) \big), \qquad \eta_a \sim \mathcal{N}(0, 1),$  (7)

for some exploration hyperparameter $\beta > 0$. Exploration is guided towards actions where either the reward or error of $\hat{f}$ is high, or the variance of $\tilde{f}$ is high. This approach is called the residual overfit method of exploration (Rome). Note that Rome uses a single-sample (i.e., a single dataset) Monte Carlo estimate of the expectation presented in Eq. 2. If pure exploration is required, the $\hat{f}(a, x)$ value outside of the residual overfit terms in Eq. 6 and Eq. 7 may be replaced with 0.
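The scores in Eqs. 6 and 7 are cheap to compute once both models are trained. Below is a minimal sketch, assuming a deterministic bonus $\beta\lvert\rho\rvert$ for the UCB variant and a Gaussian-perturbed bonus for the Thompson-style variant; the function name `rome_scores` and the numeric prediction values are illustrative, not from the paper:

```python
import numpy as np

def rome_scores(f_hat_preds, f_tilde_preds, beta=1.0, thompson=False, rng=None):
    """Score actions with the residual overfit.

    f_hat_preds / f_tilde_preds: arrays of tuned / overfit reward
    predictions, one entry per action in the current context.
    """
    rho = f_tilde_preds - f_hat_preds             # residual overfit per action
    if thompson:
        rng = rng or np.random.default_rng()
        # Stochastic bonus: one Gaussian perturbation per action.
        return f_hat_preds + beta * rng.standard_normal(rho.shape) * rho
    # Deterministic (UCB-style) bonus.
    return f_hat_preds + beta * np.abs(rho)

# Greedy action for one context with three candidate actions:
f_hat_preds = np.array([0.50, 0.40, 0.45])
f_tilde_preds = np.array([0.52, 0.70, 0.44])
action = int(np.argmax(rome_scores(f_hat_preds, f_tilde_preds, beta=1.0)))
# Action 1 wins: its large residual overfit signals high uncertainty.
```

Setting `thompson=True` randomizes the bonus so that near-ties are broken stochastically across steps.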
3.1 Exponential Family Moment Matching
We consider how to apply the method to exponential families. Many distributions of interest belong to the exponential family, e.g., the univariate or multivariate Gaussian, Bernoulli, multinomial, and Poisson [27]. In deep neural networks and decision trees, the output layer typically parameterizes a member of the exponential family, which induces a loss with respect to the observations and the latent representation.
Exponential family distributions over a random variable $y$ take the form,

$p(y \mid \eta) = h(y) \exp\big( \eta^{\top} T(y) - A(\eta) \big),$  (8)

with natural parameter $\eta$ and sufficient statistics $T(y)$.
Distributions in the exponential family have the property that the distribution minimizing the KL divergence from a target distribution with a given mean and variance is obtained by matching moments. Here, we use $\hat{f}(x)$ as the mean and the squared residual overfit $\rho(x)^2$ as the variance.
3.1.1 Binary Outcomes
Use the distribution $\mathrm{Beta}(\alpha, \beta)$ to model the probability of binary outcomes. If $\hat{f}$ has been trained to minimize Bernoulli (logistic) loss then its output is a suitable candidate for the mean of the Beta distribution, given its conjugate relationship with the Bernoulli (i.e., updating a Beta prior with Bernoulli evidence via Bayes' rule yields another Beta distribution). We can calculate the Beta pseudocounts by matching the mean and variance,
$\alpha = \hat{f}(x) \left( \frac{\hat{f}(x)\,(1 - \hat{f}(x))}{\rho(x)^2} - 1 \right)$  (9)
$\beta = (1 - \hat{f}(x)) \left( \frac{\hat{f}(x)\,(1 - \hat{f}(x))}{\rho(x)^2} - 1 \right)$  (10)
If we interpret the output of $\tilde{f}$ as the Bernoulli probability of a binary outcome with a $\mathrm{Beta}(\alpha, \beta)$ prior, then the posterior parameters are equivalent to,

$\alpha' = \alpha + \tilde{f}(x), \qquad \beta' = \beta + 1 - \tilde{f}(x).$  (11)
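The moment-matching step in Eqs. 9 and 10 is a two-line computation. The sketch below derives Beta pseudocounts from a given mean and variance; the numeric values standing in for the tuned prediction and the residual overfit are illustrative:

```python
def beta_pseudocounts(mu, var):
    """Beta(alpha, beta) parameters matching a given mean and variance.

    Valid when 0 < var < mu * (1 - mu).
    """
    nu = mu * (1.0 - mu) / var - 1.0  # total pseudocount alpha + beta
    return mu * nu, (1.0 - mu) * nu

# Rome-style use: mean from the tuned model, variance from the
# squared residual overfit (values below are illustrative).
mu = 0.30                      # tuned model prediction at this action
rho = 0.30 - 0.42              # residual overfit at this action
alpha, beta = beta_pseudocounts(mu, rho ** 2)
# Thompson sampling then draws a score from Beta(alpha, beta).
```

Note the validity condition: the construction breaks down when the squared residual overfit exceeds the maximum Bernoulli variance $\hat{f}(x)(1 - \hat{f}(x))$, in which case the variance should be clipped.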
4 Bayesian Information Theoretical Perspective
The analysis so far has required the errors of $\hat{f}$ and $\tilde{f}$ to be independent, necessitating random data splits. It is not ideal to split the data because it leads to worse estimates, especially with small data sizes. Is it possible to analyze Rome when $\hat{f}$ and $\tilde{f}$ are not trained on different splits of the training data? To address this question and provide a broader perspective on the residual overfit, we now consider it in the setting of Bayesian information theory.
There are various information theoretic criteria for choosing the next action with Bayesian inference. A prominent class of methods maximizes the information gain for the parameters $\theta$ from the unknown response $y_x$ given dataset $\mathcal{D}$,

$x^{*} = \arg\max_{x}\; I(\theta;\, y_x \mid \mathcal{D}),$  (12)
where the subscript $x$ is used to indicate an arbitrary fixed query point. Eq. 12 is equivalent to maximizing the decrease in posterior entropy after the new observation [17]. Due to the symmetry of information gain, Eq. 12 can be expressed in terms of the entropy of the predicted target, avoiding unnecessary posterior updates [16, 10],

$I(\theta;\, y_x \mid \mathcal{D}) = H[y_x \mid \mathcal{D}] - \mathbb{E}_{\theta \mid \mathcal{D}}\, H[y_x \mid \theta].$  (13)
The model appears twice in Eq. 13: in the first term with the parameters marginalized out and in the second term with posterior-averaged entropy. Due to this, the approach is unreliable when the predictive uncertainty of the model is underestimated, since it determines both the entropy and the conditional entropy. In fact, in the limiting case of a point estimate for the posterior, such as the maximum a posteriori (MAP) estimate, the information gain is zero across all $x$.
Our diagnosis of the problem of approximating information gain is that it is misleading to use the predictive entropy of $y_x$ under the typical parameter in the first term of Eq. 13 (e.g., using variational inference or MAP). The peril is that the uncertainty of some actions is underestimated and leads to their elimination in an interactive setting, a type II error in identifying actions requiring exploration. In contrast, overestimating uncertainty (type I error) leads to self-correction over time as more samples are gathered. To reflect this explorative asymmetry, we consider the lowest upper bound of the predictive entropy induced by an approximating distribution $q$. To formalize this notion, for any $q$, add a non-negative slack (recall, any KL divergence is non-negative) to the information gain,

$I(\theta;\, y_x \mid \mathcal{D}) + \mathrm{KL}\big( p(y_x \mid \mathcal{D}) \,\big\|\, q(y_x) \big),$  (14)
then look for the $q$ that minimizes this upper bound under the empirical average of the posterior. This is equivalent to minimizing

$\mathbb{E}_{\theta \mid \mathcal{D}}\big[ \mathrm{KL}\big( p(y_x \mid \theta) \,\big\|\, q(y_x) \big) \big].$  (15)

Eq. 15 may be estimated as follows. The approximation for the distribution of the outer expectation must be close to the true posterior $p(\theta \mid \mathcal{D})$. Hence, it is amenable to established methods for approximate Bayesian inference such as MCMC, variational inference, or MAP. As discussed before, the empirical approximation for the RHS of the KL divergence in Eq. 15 minimizes the negative log likelihood of the observed data, i.e., it is the maximum likelihood estimate, which corresponds to the overfit model $\tilde{f}$.
For a univariate Gaussian observation distribution with fixed variance $\sigma^2$, Eq. 15 reduces to a closed-form expression,

$\frac{\big( \hat{f}(x) - \tilde{f}(x) \big)^2}{2 \sigma^2},$  (16)

where $\hat{f}(x)$ is the mean prediction of a MAP-inferred model, $\tilde{f}(x)$ is the mean prediction of the $q$ model, and $\sigma^2$ is the irreducible variance. This recovers Rome in the deterministic exploration setting, since the bonus is a monotone transformation of the residual overfit magnitude $\lvert \tilde{f}(x) - \hat{f}(x) \rvert$.
Eq. 15 extends to other observation likelihoods; e.g., the Bernoulli observation likelihood yields,

$\hat{p}(x) \log \frac{\hat{p}(x)}{\tilde{p}(x)} + (1 - \hat{p}(x)) \log \frac{1 - \hat{p}(x)}{1 - \tilde{p}(x)},$  (17)

where $\hat{p}(x)$ and $\tilde{p}(x)$ are the MAP and $q$ model probabilities of success, respectively. The Poisson observation likelihood results in,

$\hat{\lambda}(x) \log \frac{\hat{\lambda}(x)}{\tilde{\lambda}(x)} + \tilde{\lambda}(x) - \hat{\lambda}(x),$  (18)

where $\hat{\lambda}(x)$ is the mean predicted rate of the MAP model and $\tilde{\lambda}(x)$ is the mean predicted rate of the $q$ model.
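The closed forms in Eqs. 16-18 are one-liners. The sketch below implements them as written, with the argument order following the direction $\mathrm{KL}(\text{MAP model} \,\|\, q\text{ model})$ of Eq. 15; the function names are ours:

```python
import math

def kl_gaussian(mu_hat, mu_tilde, sigma2):
    """Eq. 16: KL between equal-variance Gaussians with means mu_hat, mu_tilde."""
    return (mu_hat - mu_tilde) ** 2 / (2.0 * sigma2)

def kl_bernoulli(p_hat, p_tilde):
    """Eq. 17: KL(Bernoulli(p_hat) || Bernoulli(p_tilde))."""
    return (p_hat * math.log(p_hat / p_tilde)
            + (1.0 - p_hat) * math.log((1.0 - p_hat) / (1.0 - p_tilde)))

def kl_poisson(lam_hat, lam_tilde):
    """Eq. 18: KL(Poisson(lam_hat) || Poisson(lam_tilde))."""
    return lam_hat * math.log(lam_hat / lam_tilde) + lam_tilde - lam_hat
```

Each bonus is zero when the two models agree and grows with their disagreement, mirroring the residual overfit.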
4.1 Practicality of ROME
When deploying a model in practice, it is standard procedure to perform a hyperparameter search to find the model architecture and training settings that perform best on held-out validation data. In this way, an overfit model, which we call $\tilde{f}$, is a byproduct of the search for the best $\hat{f}$. There are two main benefits to this observation. First, in this setting, Rome avoids additional training time. Second, there is low marginal engineering cost to deploying $\tilde{f}$, since it was at one point a candidate for $\hat{f}$ and so likely has much in common with $\hat{f}$, such as features, targets, training algorithm, and deployment pipeline.
What procedure should be used to select $\tilde{f}$ out of a candidate set of models? While it is likely advantageous to combine more than one overfit model, perhaps by cycling through them across iterations to encourage diversity, we have focused here on the case of a single $\tilde{f}$ for generality and ease of exposition. Theory suggests that $\tilde{f}$ should be selected to give the lowest variance unbiased estimate of the reward.
5 Related Work
There are various scalable algorithms that combine uncertainty quantification with model expressiveness. Stochastic gradient Langevin dynamics (SGLD) [28] adds noise to stochastic gradients in order to sample from a posterior distribution under appropriate conditions on the optimization surface and step size. SGLD applies to gradient-based models and, in a bandit setting, requires taking multiple gradient steps at prediction time to avoid correlated samples. Variational dropout [12, 8] adapts the method of dropout regularization to perform variational approximation. It applies to deep neural networks and, since it is based on VI, underestimates posterior variance [17]. Bootstrapped Thompson sampling [19] treats a set of models trained on bootstrap resamples of a dataset as samples of the parameters over which Thompson sampling may be applied. It can be used with the widest range of models but requires either resampling and then training a model on each step, or training multiple models from resamples in batch mode. If the rewards are sparse, then a large number of resamples is required in batch mode.
In practice, methods that perform shallow exploration are popular due to the ease with which they equip existing tuned models with exploration. Epsilon-greedy, Boltzmann exploration [5], and last-layer variance [24] admit highly expressive models but ignore the uncertainty of most or all of the parameters [21]. This inflexibility may be compensated in some cases by an accurate tuning and decay of the exploration rate or temperature.
6 Empirical Evaluation
In the empirical evaluation we compare Rome against several benchmarks and find that it performs competitively against both shallow and deep exploration methods.
Methods
The following methods are compared in the empirical evaluation,

Rome-TS: Rome with Thompson sampling. Sample the score from the Beta distribution with pseudocounts given by Eqs. 9 and 10.

Rome-UCB: Rome with upper confidence bound. Upper confidence bound score of the Beta distribution with pseudocounts given by Eqs. 9 and 10.

LinUCB: contextual bandits with linear payoffs using the upper confidence bound [7].

Epsilon greedy: pick the action with the highest predicted reward with probability $1 - \epsilon$ and a uniform random action with probability $\epsilon$ on each step.

Bootstrap-TS: bootstrap the replay buffer and train one model on each of the replications of the data [19]. Thompson sampling is implemented by sampling one model uniformly on each step and using its predicted rewards greedily to pick the action.

Uniform random: pick an action uniformly at random on each step. Equivalent to epsilon greedy with $\epsilon = 1$.
In the experiments, Bootstrap-TS uses 20 replications, making training 20 times as computationally intensive as epsilon greedy and 10 times as computationally intensive as Rome (which trains two models). For the UCB methods, the weighting $\beta$ for the upper bound is set to a fixed constant, as is $\epsilon$ in epsilon greedy.
With the exception of epsilon greedy and uniform random, all methods are controlled with the same implementation settings using,

a random forest reward classifier model with the default settings from the scikit-learn package (version 0.21.1, https://scikit-learn.org/0.21/), which uses an ensemble of 10 decision trees.

A constant explore rate of 0.01 to mimic a small number of organic observations arriving outside of the bandit channel [22].

The model is retrained every 100 iterations. In realworld settings it is usually infeasible to retrain after every interaction, necessitating batched interactions.
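The evaluation protocol above can be sketched as a small replay loop; partial feedback means only the chosen action's reward enters the history. The helper below is an illustrative skeleton (the names and the trivial constant policy are ours), with retraining batched every `batch` steps as in the experiments:

```python
import numpy as np

def run_bandit(contexts, labels, choose, retrain, batch=100):
    """Replay a classification dataset with partial feedback: the bandit
    earns reward 1 iff the chosen action equals the instance's true class."""
    history, total = [], 0.0
    for t, (x, label) in enumerate(zip(contexts, labels)):
        a = choose(x)                      # e.g. argmax of Rome scores
        r = 1.0 if a == label else 0.0
        history.append((x, a, r))          # only the chosen action's reward
        total += r
        if (t + 1) % batch == 0:           # batched retraining, as in Sec. 6
            retrain(history)
    return total

# Usage with a trivial constant policy and no-op retraining:
contexts = np.zeros((10, 3))
labels = [0, 1] * 5
reward = run_bandit(contexts, labels, choose=lambda x: 0,
                    retrain=lambda h: None, batch=4)
# Constant action 0 matches the 5 instances whose true class is 0.
```

In a full experiment, `choose` would score every action with the current models and `retrain` would refit both the tuned and the overfit model on the accumulated history.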
Table 1: Results of each method (LinUCB, Epsilon Greedy, Bootstrap-TS, Rome-UCB, Rome-TS, Uniform Random) on the three datasets. [Numeric table entries not recoverable.]
Datasets
Evaluating explore-exploit performance requires setting up an interactive environment to assess the impact of acquired data on subsequent performance. We consider three classification datasets; in each case, partial feedback is simulated by the environment providing a reward of 1 if the action corresponding to the true class is chosen and 0 otherwise. The instances are kept in an arbitrary fixed random order across the 10 repetitions of the experiments. Actions are performed uniformly at random until every action has been observed at least once. The random seed for the methods and the initial exploration varies across repetitions. The following datasets are studied,

Covertype: data comprises 7 classes of forest cover type predicted from 54 attributes over 581,012 instances (https://archive.ics.uci.edu/ml/datasets/covertype).

Bach Chorales: Bach chorale harmony dataset with 17 features to predict 65 classes of harmonic structure in 5,665 examples (https://archive.ics.uci.edu/ml/datasets/Bach+Choral+Harmony).

MovieLens-depleting: a matrix of 100,000 interactions between 610 users and 7,200 items [9]. To replicate the cold start task recommender systems face when introducing new items, we split the items randomly into two equal-sized groups: existing items and cold start items. The bandit has access to all historical interactions between users and existing items, but interactions between users and cold start items receive only partial feedback. In each step of the experiment, the bandit chooses which cold start item to recommend to a user based on the observed interaction and context history. The bandit makes 10 passes through the dataset. To replicate the depleting effect of consuming items, the same cold start item and user pair may give a reward of 1 no more than once, and 0 subsequently.
Results
In Table 1, we find that Rome performs well across the datasets. Bootstrap-TS performs best for small action spaces where the rewards are denser. As the number of actions grows, it becomes harder for a small number of positive examples to appear in a significant number of bootstrap samples. Across datasets, the Thompson sampling approaches (Rome-TS and Bootstrap-TS) outperformed the UCB methods. This is likely due to the benefit of stochasticity in the batch action setting, in addition to the strong empirical performance of Thompson sampling observed more generally [6].
Figure 2 shows the cumulative reward curves of the methods as a function of the number of interactions on the Bach Chorales dataset. Early on, LinUCB explores more than the other model-based approaches, and as a result, Bootstrap-TS achieves the highest cumulative reward at the end. The most challenging dataset was MovieLens-depleting, due to both the large action space (3,600 actions) and the depleting rewards. For this dataset, Figure 3 illustrates how only LinUCB, Rome-TS, and Rome-UCB were able to continue discovering high value actions after 10 passes through the dataset. (This holds trivially for Uniform Random, since it depletes the popular items much more slowly than the other methods.)
7 Conclusions
In this paper we developed theoretical and empirical justifications for the merits of combining a tuned and an overfit model for exploration. The residual overfit method of exploration (Rome) approximately identifies actions and contexts with the highest parameter variance. The method can be applied to explore-exploit settings by adding the best regularized estimate of the reward. We provided a frequentist interpretation and a Bayesian information theoretic interpretation showing that the residual overfit approximates an upper bound on the information gain of the parameters. Experiments comparing Rome with widely used alternatives show that it performs well at balancing exploration and exploitation.
We thank Nikos Vlassis, Ehsan Saberian, Dawen Liang, Pannaga Shivaswamy, Maria Dimakopoulou, Yves Raimond, Darío García-García, and Justin Basilico for their insightful feedback.
References
 [1] (2013) Understanding dropout. Advances in neural information processing systems 26, pp. 2814–2822. Cited by: §1.
 [2] (2006) Pattern recognition and machine learning. Springer. Cited by: §1.
 [3] (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §1.
 [4] (2001) Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. Advances in neural information processing systems, pp. 402–408. Cited by: §1.
 [5] (2017) Boltzmann exploration done right. In Advances in neural information processing systems, pp. 6287–6296. Cited by: §5.
 [6] (2011) An empirical evaluation of Thompson sampling. Advances in neural information processing systems 24, pp. 2249–2257. Cited by: §1, §6.
 [7] (2011) Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. Cited by: 3rd item.
 [8] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International conference on machine learning, pp. 1050–1059. Cited by: §5.
 [9] (2015) The Movielens datasets: history and context. Acm transactions on interactive intelligent systems (TIIS) 5 (4), pp. 1–19. Cited by: 3rd item.
 [10] (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §4.
 [11] (2019) Learning from bandit feedback: an overview of the state-of-the-art. arXiv preprint arXiv:1909.08471. Cited by: §1.
 [12] (2015) Variational dropout and the local reparameterization trick. Advances in neural information processing systems 28, pp. 2575–2583. Cited by: §5.
 [13] (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §1.
 [14] (2010) A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World Wide Web, pp. 661–670. Cited by: §1.
 [15] (2010) Exploitation and exploration in a performance based contextual advertising system. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 27–36. Cited by: §1.
 [16] (1956) On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pp. 986–1005. Cited by: §4.
 [17] (1992) Informationbased objective functions for active data selection. Neural computation 4 (4), pp. 590–604. Cited by: §1, §4, §5.
 [18] (2018) Explore, exploit, and explain: personalizing explainable recommendations with bandits. In Proceedings of the 12th ACM conference on recommender systems, pp. 31–39. Cited by: §1.
 [19] (2016) Deep exploration via bootstrapped DQN. Advances in neural information processing systems 29, pp. 4026–4034. Cited by: §5, 5th item.
 [20] (1998) Early stoppingbut when?. In Neural networks: tricks of the trade, pp. 55–69. Cited by: §1.
 [21] (2018) Deep Bayesian bandits showdown: an empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127. Cited by: §5.
 [22] (2020) BLOB: a probabilistic model for recommendation that combines organic and bandit signals. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 783–793. Cited by: 2nd item.
 [23] (2010) Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2 (3), pp. 230–247. Cited by: §2.
 [24] (2015) Scalable Bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §5.
 [25] (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
 [26] (2014) Exploreexploit in topn recommender systems via Gaussian processes. In Proceedings of the 8th ACM Conference on Recommender systems, pp. 225–232. Cited by: §1.
 [27] (2008) Graphical models, exponential families, and variational inference. Now Publishers Inc. Cited by: §3.1.
 [28] (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning, pp. 681–688. Cited by: §5.