Some recent worthwhile papers:
Discrete Sequential Prediction of Continuous Actions for Deep RL (https://arxiv.org/abs/1705.05035) – a modern, competitive approach to discretizing continuous action spaces one dimension at a time, which makes the global maximum of the value function over actions recoverable by exact search (unlike DDPG / NAF, which rely on local approximations); see the sketch below.
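A minimal sketch of the sequential selection idea, not the paper's exact architecture; the `q_values` stand-in, bin count, and dimensions are all invented for illustration:

```python
import numpy as np

N_BINS = 11        # discretization level per action dimension (assumed)
ACTION_DIMS = 3
bins = np.linspace(-1.0, 1.0, N_BINS)

def q_values(state, partial_action, dim):
    """Stand-in for a learned Q-network over (state, a_1..a_{dim-1});
    returns one Q-value per candidate bin of dimension `dim`."""
    rng = np.random.default_rng(abs(hash((dim, tuple(partial_action)))) % 2**32)
    return rng.standard_normal(N_BINS)   # placeholder for a real network

def select_action(state):
    """Greedy action built one dimension at a time: each step is an
    exact argmax over bins, so no local optimizer over actions is needed."""
    action = []
    for dim in range(ACTION_DIMS):
        q = q_values(state, action, dim)
        action.append(float(bins[np.argmax(q)]))
    return np.array(action)

print(select_action(state=np.zeros(4)))
```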
Count-Based Exploration in Feature Space for Reinforcement Learning (https://arxiv.org/abs/1706.08090) – a new optimistic exploration algorithm whose novelty bonus is grounded in how often learned state features have been observed, so uncertainty estimates generalise across similar states; a sketch of the bonus follows.
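A hedged sketch of a count-based bonus over features; the feature map `phi`, the bonus form `beta / sqrt(n)`, and the constants are simplifications, not the paper's exact construction:

```python
from collections import defaultdict
import numpy as np

counts = defaultdict(int)   # visit counts per feature key
BETA = 0.1                  # bonus scale (assumed)

def phi(state):
    """Stand-in for a learned feature map: coarse discretization so that
    nearby states share a feature key (this is where the bonus
    generalises across similar states)."""
    return tuple(np.round(np.asarray(state), 1))

def exploration_bonus(state):
    key = phi(state)
    counts[key] += 1
    return BETA / np.sqrt(counts[key])   # optimism decays with visits

# the agent learns from environment reward plus novelty bonus
r_total = 1.0 + exploration_bonus([0.31, -0.72])
print(r_total)
```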
Noisy Networks for Exploration (https://arxiv.org/abs/1706.10295) – exploration driven by learnable (trained with SGD) noise added to the network parameters. The results look remarkable; a sketch of a noisy layer follows.
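A minimal PyTorch sketch of a noisy layer in the spirit of the paper (factorised Gaussian noise, weight = mu + sigma * eps); hyperparameters and initialisation are simplified:

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose weights carry learnable noise: both mu and
    sigma are trained with SGD, so the network learns how much to explore."""
    def __init__(self, in_f, out_f, sigma0=0.5):
        super().__init__()
        bound = 1 / math.sqrt(in_f)
        self.w_mu = nn.Parameter(torch.empty(out_f, in_f).uniform_(-bound, bound))
        self.w_sigma = nn.Parameter(torch.full((out_f, in_f), sigma0 * bound))
        self.b_mu = nn.Parameter(torch.zeros(out_f))
        self.b_sigma = nn.Parameter(torch.full((out_f,), sigma0 * bound))
        self.in_f, self.out_f = in_f, out_f

    @staticmethod
    def _f(x):                       # noise scaling used in the paper
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_in = self._f(torch.randn(self.in_f))
        eps_out = self._f(torch.randn(self.out_f))
        w = self.w_mu + self.w_sigma * torch.outer(eps_out, eps_in)
        b = self.b_mu + self.b_sigma * eps_out
        return x @ w.t() + b

layer = NoisyLinear(4, 2)
print(layer(torch.zeros(1, 4)))      # output varies with each noise sample
```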
Teacher-Student Curriculum Learning (https://arxiv.org/abs/1707.00183) – a smart, adaptive curriculum built from two separate models: a teacher that keeps proposing the tasks on which the student is currently learning fastest (sketched below).
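A hedged sketch of the teacher side only; the paper evaluates several bandit-style teachers, and this epsilon-greedy variant over absolute learning progress is just the simplest flavour, with made-up tasks:

```python
import numpy as np

N_TASKS = 3
history = [[] for _ in range(N_TASKS)]    # recent scores per task

def learning_progress(scores, window=10):
    """Absolute slope of recent scores: fast improvement and fast
    forgetting both make a task worth practising."""
    s = scores[-window:]
    if len(s) < 2:
        return 1.0                         # force initial exploration
    return abs(np.polyfit(np.arange(len(s)), s, 1)[0])

def pick_task(eps=0.1):
    if np.random.rand() < eps:             # keep progress estimates fresh
        return np.random.randint(N_TASKS)
    return int(np.argmax([learning_progress(h) for h in history]))

for step in range(200):
    t = pick_task()
    score = np.random.rand() + 0.01 * step * (t == 1)   # fake: only task 1 improves
    history[t].append(score)
print([len(h) for h in history])           # task 1 ends up practised most
```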
Hindsight Experience Replay (https://arxiv.org/abs/1707.01495) – an implicit curriculum that makes very sparse rewards tractable by replaying episodes as if the goal had been what the agent actually achieved; see the sketch below.
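A hedged sketch of the "final" relabelling strategy; the transition format and `achieved_goal` extractor are invented for illustration:

```python
def achieved_goal(state):
    """Stand-in: in a real task this would extract, e.g., the object's
    position from the state."""
    return round(state, 1)

def her_relabel(episode):
    """episode: list of (state, action, next_state, goal) transitions.
    Returns extra transitions where the desired goal is replaced by the
    goal actually achieved at the end of the episode, so some replayed
    transitions are guaranteed to carry reward signal."""
    achieved = achieved_goal(episode[-1][2])    # what the agent really reached
    relabelled = []
    for s, a, s2, _g in episode:
        r = 0.0 if achieved_goal(s2) == achieved else -1.0   # sparse reward
        relabelled.append((s, a, s2, achieved, r))
    return relabelled

episode = [(0.00, +1, 0.07, 0.9), (0.07, +1, 0.12, 0.9)]
print(her_relabel(episode))   # goal 0.9 replaced by the achieved 0.1
```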
Observational Learning by Reinforcement Learning (https://arxiv.org/abs/1706.06617) – agents learn behaviours by observing other agents act in the environment.
Programmable Agents (https://arxiv.org/abs/1706.06383) – agents are given a program expressed in a formal language, learn to map the language's terms onto their perceptions, and finally become able to generalise to unseen terms and unseen circumstances.
Uncertainty Decomposition in Bayesian Neural Networks with Latent Variables (https://arxiv.org/abs/1706.08495) – separates the two fundamental components of predictive uncertainty (epistemic and aleatoric) and proposes a novel risk-sensitive objective for safe reinforcement learning; the decomposition is sketched below.
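A hedged toy sketch of the standard Monte Carlo decomposition (law of total variance); the paper works with BNNs with latent variables, while here a tiny ensemble stands in for posterior samples:

```python
import numpy as np

def predict(model, x):
    """Stand-in: each posterior sample returns (mean, variance) of p(y|x)."""
    w = model
    return w * x, 0.1 + 0.05 * x**2        # toy heteroscedastic noise model

models = [0.9, 1.0, 1.1]                    # toy "posterior samples" of weights
x = 2.0
means, variances = zip(*(predict(m, x) for m in models))

aleatoric = np.mean(variances)   # irreducible data noise (won't shrink with data)
epistemic = np.var(means)        # disagreement between posterior samples
print(aleatoric, epistemic, aleatoric + epistemic)   # total predictive variance
```

A risk-sensitive agent can then treat the two terms differently, e.g. avoid epistemic uncertainty while not being penalised for pure aleatoric noise.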
Gated-Attention Architectures for Task-Oriented Language Grounding (https://arxiv.org/abs/1706.07230) – an end-to-end trainable neural architecture for reinforcement learning that takes natural-language instructions as input; the gating mechanism is sketched below.
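A hedged PyTorch sketch of the gating idea: the instruction embedding is squashed into per-channel gates that multiplicatively attend over the visual feature maps (dimensions invented):

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Hadamard-product attention: each conv channel is scaled by a gate
    in [0, 1] computed from the instruction embedding."""
    def __init__(self, instr_dim, n_channels):
        super().__init__()
        self.gate = nn.Linear(instr_dim, n_channels)

    def forward(self, conv_feats, instr_emb):
        # conv_feats: (B, C, H, W); instr_emb: (B, instr_dim)
        g = torch.sigmoid(self.gate(instr_emb))     # (B, C) gates
        return conv_feats * g[:, :, None, None]     # broadcast over H, W

ga = GatedAttention(instr_dim=32, n_channels=64)
out = ga(torch.randn(2, 64, 7, 7), torch.randn(2, 32))
print(out.shape)   # torch.Size([2, 64, 7, 7])
```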
Constrained Policy Optimization (https://arxiv.org/abs/1705.10528, http://bair.berkeley.edu/blog/2017/07/06/cpo/) – one of the first works devoted to keeping the actions generated by a policy safe while that policy is being optimized; a simplified sketch of the constrained objective follows.
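CPO itself solves a trust-region problem with an explicit constraint; as a simpler illustration of the underlying constrained objective (maximize J(pi) subject to J_C(pi) <= d), here is a hedged Lagrangian-relaxation sketch, a different, simpler algorithm for the same problem:

```python
import torch

lam = torch.zeros(1, requires_grad=True)   # Lagrange multiplier, kept >= 0
lam_opt = torch.optim.SGD([lam], lr=1e-2)
COST_LIMIT = 25.0                          # d: allowed expected cost (assumed)

def update(reward_surrogate, cost_surrogate, measured_cost, policy_opt):
    """reward_surrogate / cost_surrogate: differentiable policy-gradient
    surrogates for J(pi) and J_C(pi); measured_cost: float estimate of
    the current expected episode cost."""
    # policy step: minimize -J(pi) + lambda * J_C(pi)
    loss = -reward_surrogate + lam.detach() * cost_surrogate
    policy_opt.zero_grad(); loss.backward(); policy_opt.step()
    # multiplier step: lambda rises while the constraint is violated,
    # falls back toward zero once the policy is safely inside the limit
    lam_opt.zero_grad()
    (-lam * (measured_cost - COST_LIMIT)).backward()
    lam_opt.step()
    with torch.no_grad():
        lam.clamp_(min=0.0)
```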
Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics (https://arxiv.org/abs/1706.04317) – efficient training and remarkable policy transfer between task variations, achieved by modelling causality explicitly in RL.
End-to-End Learning of Semantic Grasping (https://arxiv.org/abs/1707.01932) – an early end-to-end algorithm for robotic grasping in which the arm is guided by a user-specified class of the desired object.
Emergence of Locomotion Behaviours in Rich Environments (https://arxiv.org/abs/1707.02286, https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/) – a rich environment helps promote the learning of complex behaviour; in particular, a novel scalable variant of policy gradients (distributed PPO) lets agents learn very complex behaviours guided by a simple reward (the distance travelled); a sketch of the underlying PPO objective follows.
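The paper's distributed variant parallelises data collection and optimisation; the objective underneath is the clipped PPO surrogate, sketched here with invented numbers:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """logp_new / logp_old: log pi(a|s) under the current / behaviour
    policy; advantages: advantage estimates for the sampled actions."""
    ratio = torch.exp(logp_new - logp_old)             # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()       # maximize the surrogate

loss = ppo_clip_loss(torch.tensor([-1.0, -0.5]),
                     torch.tensor([-1.1, -0.7]),
                     torch.tensor([0.5, -0.3]))
print(loss)   # clipping stops the new policy from straying too far
```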
Learning Human Behaviours from Motion Capture by Adversarial Imitation (https://arxiv.org/abs/1707.02201, https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/) – an adversarial approach to learning humanlike movement patterns from limited demonstrations that contain only partially observed state features, without access to actions, even when the demonstrations come from a body with different and unknown physical parameters; the imitation signal is sketched below.
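A hedged sketch of the adversarial signal: because the mocap demonstrations contain no actions, the discriminator sees only state features, and its confusion becomes the imitator's reward (network sizes invented):

```python
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))

def imitation_reward(state_feats):
    """GAIL-style reward -log(1 - D(s)): high where the discriminator
    believes the agent's states look like demonstration states."""
    with torch.no_grad():
        d = torch.sigmoid(disc(state_feats))
        return -torch.log(1.0 - d + 1e-8)

def discriminator_loss(demo_feats, agent_feats):
    """Train D to label demonstration states 1 and agent states 0."""
    bce = nn.BCEWithLogitsLoss()
    return (bce(disc(demo_feats), torch.ones(len(demo_feats), 1)) +
            bce(disc(agent_feats), torch.zeros(len(agent_feats), 1)))

print(imitation_reward(torch.randn(4, 16)).shape)   # one reward per state
```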
Robust Imitation of Diverse Behaviours (https://deepmind.com/documents/95/diverse_arxiv.pdf, https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/) – the proposed model is a new type of variational autoencoder over demonstration trajectories that learns semantic policy embeddings, making imitation less sensitive to discrepancies between training and test data and avoiding the mode collapse that GAN-based imitation suffers from; the core idea is sketched below.
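A hedged sketch of the core idea: encode a whole demonstration trajectory into an embedding z, then condition the imitation policy on z, so distinct behaviours keep distinct embeddings instead of collapsing to one mode (architecture and dimensions invented):

```python
import torch
import torch.nn as nn

class TrajectoryVAE(nn.Module):
    def __init__(self, obs_dim=10, act_dim=3, z_dim=8, hidden=64):
        super().__init__()
        self.enc = nn.GRU(obs_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        # decoder is a policy: actions given (observation, embedding z)
        self.policy = nn.Sequential(nn.Linear(obs_dim + z_dim, hidden),
                                    nn.Tanh(), nn.Linear(hidden, act_dim))

    def forward(self, traj_obs):                      # (B, T, obs_dim)
        _, h = self.enc(traj_obs)                     # summary of the trajectory
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparametrisation
        z_rep = z[:, None, :].expand(-1, traj_obs.size(1), -1)
        actions = self.policy(torch.cat([traj_obs, z_rep], dim=-1))
        return actions, mu, logvar    # reconstruction + KL losses computed elsewhere

model = TrajectoryVAE()
acts, mu, logvar = model(torch.randn(2, 20, 10))
print(acts.shape)   # torch.Size([2, 20, 3])
```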