Modeling human decision-making in multi-armed bandits

P. Reverdy, V. Srivastava, and N.E. Leonard
Proc. of the Multidisciplinary Conference on Reinforcement Learning and Decision Making, Princeton, NJ, 2013

(pdf)
We study the exploration-exploitation trade-off in human decision-making in the context of multi-armed bandit problems. We consider a Bayesian multi-armed bandit problem with Gaussian rewards and develop an efficient algorithm that captures empirically observed trends in human decision-making. In particular, the proposed algorithm exhibits the following features observed in human decision-making: (i) increased exploration with increasing time horizon of the decision task, (ii) an ambiguity bonus, and (iii) inherent decision noise. We characterize the efficiency of the algorithm in terms of the regret associated with the decision process. For the case without decision noise, we demonstrate that as the model parameters encoding the prior knowledge of the human are varied, the performance can range from efficient (logarithmic regret) to worst-case (linear regret).
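The abstract's ingredients can be sketched in code: conjugate Gaussian posterior updates, an upper credible limit whose uncertainty term acts as an ambiguity bonus, and softmax (Boltzmann) decision noise. The following is a minimal illustration in that spirit, not the paper's exact algorithm: the function name `stochastic_ucl_sketch`, the 1/(e·t) quantile schedule, and the `noise_temp` parameter are illustrative assumptions, and the paper's quantile constant and tuning may differ.

```python
import numpy as np
from statistics import NormalDist

def stochastic_ucl_sketch(pull, n_arms, horizon,
                          prior_mean=0.0, prior_var=5.0, reward_var=1.0,
                          noise_temp=0.5, seed=0):
    """Run one horizon of a UCL-style heuristic on a Gaussian bandit.

    `pull(arm)` returns a noisy reward for the chosen arm. Names and the
    quantile schedule are illustrative, not the paper's notation/tuning.
    """
    rng = np.random.default_rng(seed)
    inv_cdf = NormalDist().inv_cdf
    mean = np.full(n_arms, float(prior_mean))   # posterior means
    var = np.full(n_arms, float(prior_var))     # posterior variances

    for t in range(1, horizon + 1):
        # Upper credible limit: posterior mean plus an "ambiguity bonus"
        # proportional to the posterior std; a 1/t-type quantile schedule
        # makes early decisions more exploratory (assumed schedule).
        quantile = inv_cdf(1.0 - 1.0 / (np.e * t))
        ucl = mean + np.sqrt(var) * quantile

        # Decision noise: softmax (Boltzmann) choice over the UCL values.
        logits = (ucl - ucl.max()) / noise_temp
        probs = np.exp(logits) / np.exp(logits).sum()
        arm = rng.choice(n_arms, p=probs)

        # Conjugate Gaussian update for the sampled arm (known reward variance).
        r = pull(arm)
        post_prec = 1.0 / var[arm] + 1.0 / reward_var
        mean[arm] = (mean[arm] / var[arm] + r / reward_var) / post_prec
        var[arm] = 1.0 / post_prec

    return mean, var

# Toy usage: three arms with unknown means, unit-variance Gaussian rewards.
true_means = np.array([0.0, 0.5, 1.0])
env_rng = np.random.default_rng(1)
post_mean, post_var = stochastic_ucl_sketch(
    pull=lambda a: env_rng.normal(true_means[a], 1.0),
    n_arms=3, horizon=200)
print(post_mean)
```

In this sketch, the prior mean and variance play the role of the parameters encoding prior knowledge: an overconfident prior (small `prior_var` around a wrong `prior_mean`) can lock the softmax onto a suboptimal arm, consistent with the logarithmic-to-linear regret range the abstract describes.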