Despite the success of deep RL in recent years, its performance still depends heavily on proper hyperparameter tuning and careful implementation choices. To address these issues and make RL work reliably out-of-the-box, we are studying mechanisms to improve the stability (Voelcker et al., 2025), mitigate value overestimation (Hussing et al., 2024), and enable the replicability (Eaton et al., 2023) of RL algorithms. Among other contributions, this line of work developed REPPO, a tuning-free replacement for PPO (Voelcker et al., 2025). This line of work forms the PhD thesis of Marcel Hussing.
Building deep reinforcement learning (RL) agents that find a good policy with
few samples has proven notoriously challenging. To achieve sample efficiency,
recent work has explored updating neural networks with large numbers of gradient steps
for every new sample. While such high update-to-data (UTD) ratios have
shown strong empirical performance, they also introduce instability to the training
process. Prior approaches rely on periodic neural network parameter
resets to address this instability, but restarting the training process is infeasible
in many real-world applications and requires tuning the reset interval. In this
paper, we focus on one of the core difficulties of stable training with limited samples:
the inability of learned value functions to generalize to unobserved on-policy
actions. We mitigate this issue directly by augmenting the off-policy RL training
process with a small amount of data generated from a learned world model. Our
method, Model-Augmented Data for Temporal Difference learning (MAD-TD),
uses small amounts of generated data to stabilize high UTD training and achieves
competitive performance on the most challenging tasks in the DeepMind Control Suite.
Our experiments further highlight the importance of employing a good
model to generate data, MAD-TD’s ability to combat value overestimation, and
its practical stability gains for continued learning.
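As a rough sketch of the core mechanism (not the authors' implementation; the function name and the 5% mixing fraction are illustrative), augmenting off-policy training with model data can be as simple as drawing each replay batch with a small, fixed share of world-model rollout transitions:

```python
import numpy as np

def sample_mixed_batch(real_data, model_data, batch_size,
                       model_fraction=0.05, seed=0):
    """Draw a replay batch in which a small, fixed fraction of transitions
    comes from a learned world model and the rest from real experience."""
    rng = np.random.default_rng(seed)
    n_model = int(round(batch_size * model_fraction))
    n_real = batch_size - n_model
    real_idx = rng.integers(len(real_data), size=n_real)
    model_idx = rng.integers(len(model_data), size=n_model)
    return ([real_data[i] for i in real_idx]
            + [model_data[i] for i in model_idx])

# Toy usage: 1000 real transitions, 100 model-generated ones.
real = [("real", i) for i in range(1000)]
model = [("model", i) for i in range(100)]
batch = sample_mixed_batch(real, model, batch_size=256)
```

The point of keeping the model fraction small is that even a little fresh, on-policy-like data covers the actions the value function would otherwise never see, without letting model error dominate training.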
@inproceedings{Voelcker2025MADTD,
  author    = {Voelcker, Claas A and Hussing, Marcel and Eaton, Eric and Farahmand, Amir-massoud and Gilitschenski, Igor},
  title     = {{MAD-TD}: Model-Augmented Data Stabilizes High Update Ratio {RL} [Spotlight]},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
}
We show that deep reinforcement learning algorithms can retain their ability to
learn without resetting network parameters in settings where the number of gradient
updates greatly exceeds the number of environment samples by combatting value
function divergence. Under large update-to-data ratios, a recent study by Nikishin
et al. (2022) suggested the emergence of a primacy bias, in which agents overfit
early interactions and downplay later experience, impairing their ability to learn.
In this work, we investigate the phenomena leading to the primacy bias. We inspect
the early stages of training that were conjectured to cause the failure to learn and
find that one fundamental challenge is a long-standing acquaintance: value function
divergence. Overinflated Q-values are found not only on out-of-distribution but also
on in-distribution data, and can be linked to overestimation of values for unseen
actions, propelled by optimizer momentum. We employ a simple unit-ball normalization
that enables learning under large update ratios, show its efficacy on the widely
used dm_control suite, and obtain strong performance on the challenging dog tasks,
competitive with model-based approaches. Our results question, in parts, the prior
explanation for sub-optimal learning due to overfitting early data.
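As a hedged illustration of the idea (a minimal sketch, not the paper's exact layer), unit-ball normalization rescales each penultimate feature vector of the value network to unit norm, bounding how far value magnitudes can drift under many gradient updates:

```python
import numpy as np

def unit_ball_normalize(features, eps=1e-6):
    """Project each feature vector onto the unit ball: x -> x / max(||x||, eps).

    Bounding feature norms before the final linear layer keeps learned
    value estimates from growing without bound under high update ratios;
    eps guards against division by zero for (near-)zero features.
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)

feats = np.array([[3.0, 4.0], [0.1, 0.0]])
normed = unit_ball_normalize(feats)
# each nonzero row now has Euclidean norm 1
```

In a network, this would sit between the feature trunk and the final value head, so only the last layer's weights control the output scale.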
@inproceedings{Hussing2024DissectingDeepRL,
  author    = {Hussing, Marcel and Voelcker, Claas A and Gilitschenski, Igor and Farahmand, Amir-massoud and Eaton, Eric},
  title     = {Dissecting Deep {RL} with High Update Ratios: Combatting Value Overestimation and Divergence},
  booktitle = {1st Reinforcement Learning Conference (RLC)},
  year      = {2024},
}
The replicability crisis in the social, behavioral, and data sciences has led to the
formulation of algorithmic frameworks for replicability — i.e., a requirement that
an algorithm produce identical outputs (with high probability) when run on two
different samples from the same underlying distribution. While this area is still in
its infancy, provably replicable algorithms have been developed for many fundamental tasks
in machine learning and statistics, including statistical query learning, the heavy
hitters problem, and distribution testing. In this work we initiate the study of
replicable reinforcement learning, providing a provably replicable algorithm for
parallel value iteration, and a provably replicable version of R-max in the episodic
setting. These are the first formal replicability results for control problems, which
present different challenges for replication than batch learning settings.
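The informal requirement above has a standard formalization (a sketch in the style of ρ-replicability with shared internal randomness; notation is illustrative):

```latex
% An algorithm $A$ is $\rho$-replicable if, for i.i.d. samples
% $S_1, S_2$ drawn from the same distribution $D$ and shared
% internal randomness $r$,
\Pr_{S_1, S_2 \sim D^n,\; r}\bigl[\, A(S_1; r) = A(S_2; r) \,\bigr] \;\geq\; 1 - \rho.
```

The shared random string $r$ is what makes exact output agreement achievable with high probability, and extending this guarantee from batch learning to sequential control is what the work above addresses.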
@inproceedings{Eaton2023ReplicableRL,
  author    = {Eaton, Eric and Hussing, Marcel and Kearns, Michael and Sorrell, Jessica},
  title     = {Replicable Reinforcement Learning},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2023},
}