Reinforcement Learning and Decision Making Conference

The fourth iteration of this conference was held in Montreal, July 7-10, 2019.
- Conference Website: http://rldm.org/
- Conference Brochure: http://otto.lab.mcgill.ca/temp/RLDM2019ProgramBrochure.pdf
- A PDF of all abstracts accepted: http://rldm.org/papers/abstracts.pdf
Table of Contents
- Neuro Intro Tutorial
- Associative Tasks
- Associative Learning
- Two Forms of Learning (Conditional Reinforcement)
- Neutral Associations
- Questions
- Dopamine Prediction Error
- Dynamic Decisions in Humans
- Irrationality: How Should We Define It?
- Naturalistic Decision Making
- Dynamic Decision Making (DDM)
- Consequential Choice Problems
- Choice from Sampling
- Control
- Post Office / Water Flow Microworld
- Questions Arising from their Work
- Instance Based Learning
- ACT-R
- Description-Experience Gap
- Description
- Experience
- IBLT
- Counterfactual RL
- Counterfactuals
- Batch Policy Optimization
- Policy Evaluation
- Interesting Idea Related to Policy Optimization
- A Big Idea
- Moving the Goalpost
- Direct Batch Policy Search
- Finding the Best Policy in a Class
- Distributional Reinforcement Learning
- History
- Distributional TD Learning
- Validation with Simple Experimental Tasks with Animals
- Count-based Exploration with Successor Representation
- Successor Representation
- Counting the Number of Visits to a State Along a Trajectory
- Updating Existing Algorithms
- Function Approximation
- Result
- Directions in Distributional Learning
- The Big Idea
- Why Does Distributional RL Work?
- The Virtuous Circle of RL
- Further Reading
- Substance Use Disorder
- Hyperbolic Discounting
- Heart Health with RL
- HeartSteps App
- Questions
- Termination Critic
- Temporal Abstraction
- Options
- Option Transition Model
- Intrinsic Motivation (a.k.a. Curiosity)
- The Learning Progress Hypothesis
- IAC Algorithm
- Anatomy of a Social Interaction
- Trust Game
- Prediction Error in the Brain
- What Makes Someone Trustworthy?
- Moral Strategy Model
- Theory of Mind
- Prediction Game
- Guilt Aversion as a Useful Component of Values
- Learning to Learn to Communicate
- Arbitration Between Imitation and Emulation During Human Observational Learning
- Computational Models
- Implications
- Can a Game Require Theory of Mind?
- Working Memory
- Simple Experiment
- Results
- Adding a Separate Working Memory Process
- But...
- Working Memory Interferes with RL
- So What's Happening?
- Why Is This Important?
- Reinforcement Learning for Robots
- Learning Reusable Models from Self-Supervision
- How We Know When Not to Think
- What Does "Possible" Mean?
- Play: Interplay of Goals and Subgoals in Mental Development
- The Reward Hypothesis Is Great But...
- Where Are We?
Melissa Sharpe , UCLA
These are well-established theories mapping associative tasks to brain structures. She uses them to test computational questions about behaviour.
The main principle is association between stimuli and rewards, based on Pavlovian conditioning.
Example: rats hear a tone associated with food
- if they already associate the tone with food, then when a light is added they don't learn the light-food association (blocking)
- they only learn when there are errors in prediction
- this leads to causal learning; correlation isn't enough
- value of the tone itself: like watching a cooking show
- value of a causal outcome of the reward (food):
- this shows up after the signal (tone)
- This explains some of the irrational behaviour of drug users. Signals can acquire value of their own.
- They will happily press a button to hear the bell even though they know they are full and don't want food. They might even "enjoy" hearing the bell because it reminds them of food.
- But the light, which is also associated by inference to food arriving, doesn't hold any value. They won't press the button to see the light.
- The order they learn about the light+sound and about the food, makes a difference here too!
- so are there completely different types of learning (conditioning), where some associate value in itself and others are just about prediction and causation?
- are these results arising from particular brain structures or an algorithm? Is there even a difference?
Dopamine neurons encode surprise and cause learning:
- once the association is learned, dopamine fires at the predictive cue rather than at the reward itself
- if no reward arrives, dopamine also encodes disappointment (a negative prediction error)
- They linked this to TD(0) learning
- So dopamine firing looks like a temporal difference error
- they can then explicitly inject TD errors into dopamine neurons and observe the effect
- it's like adding surprise even when they aren't surprised, or increasing the reward even though it's the same food
- if they kill off dopamine entirely the rats do still learn, but learning is reduced
- they also show it encodes sensory-specific situations, so something like Q(s, a)
- the timing of when the dopamine arrives seems to matter
- so small changes in when and how much dopamine or reward is received can lead to huge differences.
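The TD(0) link above can be made concrete with a tiny sketch (the two-state tone-food chain and all names below are my own illustration, not from the talk): the prediction error `delta` is the quantity proposed to be carried by dopamine firing, and it shrinks as the reward becomes predicted.

```python
# Minimal TD(0) sketch: delta is the "dopamine-like" prediction error.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V[s] toward r + gamma * V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]  # prediction error
    V[s] += alpha * delta
    return delta

# Toy chain: tone -> food -> end. Reward 1 arrives at the food state.
V = {"tone": 0.0, "food": 0.0, "end": 0.0}
food_deltas = []
for _ in range(300):
    td0_update(V, "tone", 0.0, "food")                     # tone predicts food
    food_deltas.append(td0_update(V, "food", 1.0, "end"))  # reward delivered

# Early on the reward is fully surprising (delta = 1); after learning the
# error at reward time vanishes and value has propagated back to the tone.
```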
Cleotilde Gonzalez , CMU
Whenever we do not choose the action that maximizes our reward based on our model of the world and values. Framing bias is relevant.
C. Klein, Oksana., Klinger
- human factors and ergonomics
- how-do people really make decisions
- expert decision makers are quite rational, but rely on lots of knowledge built up ahead of time
time sensitive, huge messy domains
- Some conclusions:
- expert decision makers often know what to do; they don't feel they really make a decision choice in the classical sense
- if experience doesn't provide a solution directly, they run a simulation forward in their head based on experience
- tree search? MCTS? UCB?
Properties of DDM:
- utility can be dependent on the
- decisions over time are interdependent
- limited time and cognitive resources
- delayed feedback
- choice - maximize total reward
- control-maintain system balance
Choices can be tried with no impact before a real consequential choice is made (eg. shopping for clothes)
The goal is to maintain a particular state of a 'stock' (eg. weight, temperature in environment)
(Gonzalez, Human Factors, 2004)
They get people to play these very challenging games
then analyse their strategies and heuristics they use
1. How do memory and intelligence affect this?
- highly skilled people learn regardless
- low skilled people give up and rely on the advised heuristic as time goes on
- No. People who learn with no time pressure learn better and perform better under time pressure
- learn slow, play fast
- learners who had more time are more willing to ignore simple heuristic advice once they master the task
- Practice doesn't help under time pressure, but in slow learning it helps a lot for robustness.
(Anderson & Lebiere, 1998)
A model for combining Memory and Symbolic representations and how it happens in the human mind.
- the experiment is like a simplified multi-armed bandit task carried out on people
- They discovered an interesting effect in human decision making
- you tell them the probabilities
- people overweight unlikely situations, Prospect Theory
- when people build their estimates from actual experience
- then they don't behave that way, because they underweight unlikely situations
- So Prospect Theory only seems to apply to probabilities that are described to people
- create new meta states, "instances", to evaluate based on multiple memorized events that are similar to the current situation
- they have a Python library to define their models
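As a rough, illustrative sketch of the instance-based idea (store experienced outcomes and value options from memory; real IBLT/ACT-R weights instances by memory activation and recency, which I replace here with a plain average - everything below is my own toy code, not their library):

```python
# Toy instance-based learning: value = average of remembered outcomes.
from collections import defaultdict
import random

random.seed(0)
memory = defaultdict(list)  # option -> list of experienced outcomes

def blended_value(option):
    """Stand-in for IBLT blending: mean of stored instances (0 if none)."""
    outcomes = memory[option]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def choose(options):
    """Pick the option with the highest blended value."""
    return max(options, key=blended_value)

# Rare large win: "risky" pays 32 with p=0.1, else 0; "safe" always pays 3.
# Because values come from sampled experience, rare events tend to be
# underweighted - the experience side of the description-experience gap.
for _ in range(100):
    memory["risky"].append(32 if random.random() < 0.1 else 0)
    memory["safe"].append(3)
```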
Also Check out her awesome tutorial on RL for the people which gives a great top-to-bottom perspective on the current state of RL.
How do we frame these learning concepts as RL?
She says Counterfactual learning is related to Batch learning and experience replay since both are learning based on old data.
We can't know what would have happened with different choices
Importance Sampling for RL Policy Evaluation (Precup, 2000):
- weight the trajectories using the product of the policy probability ratios
- as we'll see, this has high variance
Stationary Importance Sampling (Hallak and Mannor, 2017)
- This is a new method that has lower variance than original approach
but is still hard to estimate.
So, if your models are bad then picking the MLE for the dynamics isn't a good idea
even using importance sampling has problems because it can have very high variance even though it isn't biased.
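The ordinary importance-sampling estimator being discussed can be sketched in a few lines (toy one-step bandit episodes; the policies and data are my own illustration):

```python
# Ordinary importance sampling for off-policy evaluation: reweight returns
# collected under behaviour policy b by pi(a)/b(a). Unbiased, but the
# weights inflate variance - exactly the problem described above.
def is_estimate(trajectories, pi, b):
    """trajectories: list of (action, reward) one-step episodes."""
    total = 0.0
    for a, r in trajectories:
        w = pi[a] / b[a]  # importance weight
        total += w * r
    return total / len(trajectories)

b  = {"left": 0.5, "right": 0.5}   # behaviour policy that generated the data
pi = {"left": 0.9, "right": 0.1}   # target policy we want to evaluate
data = [("left", 1.0), ("right", 0.0), ("left", 1.0), ("left", 1.0)]
v_hat = is_estimate(data, pi, b)
# True value of pi here is 0.9, but the small-sample estimate overshoots
# to 1.35 - a taste of the estimator's variance.
```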
Unlike in supervised learning these are really hard in RL:
- structural risk minimization
- cross validation
There are promising methods for dealing with this in non i. i. d. domains but it's hard.
- doing policy gradients while using old data
- traditional approach for this leads to very high variance
- use importance sampling to reweight old trajectories and still converge
(Liu, Swaminathan UAI, 2019)
- they have a result for doing this using an advantage function
- restricted to domains with a single "when to act" decision (eg. when to start a drug treatment, when to sell a stock)
- one advantage of this is interpretability, since policies are more related to the actual human experience.
- discovering TD learning for AI methods
- then discovering that this looks similar to what happens in the brain with dopamine neurons
- estimates of value at all states are updated in the direction of reducing prediction error
- traditional TD learning updates for all states (or neurons) with the same scale
- but in DTD they weight the updates using the local distribution of rewards somehow
- switch from mean value update to distributional value update
- they find that it seems like learning the distribution helps to learn a better representation
- animals receive one of seven amounts of food, with a prob distribution
- some animals get a signal associated with each case
- traditional TD learning: if the reward is above average then positive learning happens
- but what about the distribution for each neuron / state?
- looking at experimental data it seems to align with what we'd expect from a distributional model rather than the old mean approach
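One concrete version of "weighting the updates using the local distribution of rewards" is to give each value unit asymmetric learning rates for positive and negative errors, so different units converge to different quantile-like statistics of the reward distribution rather than all tracking the mean. The toy code below is my own sketch of that mechanism, not the experimenters' model:

```python
# Distributional-TD sketch: asymmetric learning rates per "value neuron".
# A unit with optimism level tau scales positive errors by tau and
# negative errors by (1 - tau), so it settles at the tau-expectile of
# the reward distribution instead of the mean.
import random

random.seed(1)
taus = [0.1, 0.5, 0.9]          # pessimistic, neutral, optimistic units
values = [0.0 for _ in taus]

def sample_reward():
    """Reward is 0 or 10 with equal probability."""
    return 10.0 if random.random() < 0.5 else 0.0

for _ in range(20000):
    r = sample_reward()
    for i, tau in enumerate(taus):
        delta = r - values[i]
        alpha = 0.01 * (tau if delta > 0 else (1 - tau))
        values[i] += alpha * delta

# For this 50/50 {0, 10} reward, the tau-unit converges near 10 * tau:
# the population of units fans out to describe the whole distribution.
```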
Marlos Machado, Michael Bowling
- Function approximation requires that we actually see examples of the different states we want values for. If we never see them then we can't learn them.
- one simple way to bootstrap this is using proximity between states.
- but proximity can break in spatial domains or complex state spaces.
- what we really want is how many steps it would take to get between two 'nearby' states, rather than their Euclidean distance
- this can be estimated with TD learning
- there is a good way to do function approximation on this as well
- The Successor Representation (SR) naturally comes out of the dual approach to dynamic programming for RL
- there is also some evidence that it matches some of what is happening in the hippocampus
- this can be seen as an alternative to optimism under uncertainty used in R-Max and others
- add the L2 norm of the SR as an exploration bonus to standard SARSA
- intuition: if some state has not been visited much before it will get a bonus to encourage exploring it
- huge improvement on SARSA
- it also works to add it to model-based algorithms like R-MAX
- adding this idea to DQN seems to help as well, especially in domains where random exploration doesn't work well
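The bonus idea can be sketched as follows (a simplification of the approach described above: learn the SR by TD and use the norm of the current state's SR row as a novelty signal; the tabular setup and constants are mine):

```python
# SR-based exploration bonus sketch: states visited rarely have a small
# SR-row norm, so beta / ||SR(s)|| gives them a larger bonus.
import math

n_states = 5
# Initialize SR to the identity: each state predicts only itself.
sr = [[1.0 if i == j else 0.0 for j in range(n_states)] for i in range(n_states)]

def sr_td_update(s, s_next, alpha=0.1, gamma=0.9):
    """TD update of the SR row for s toward one_hot(s) + gamma * SR(s')."""
    for j in range(n_states):
        one_hot = 1.0 if j == s else 0.0
        target = one_hot + gamma * sr[s_next][j]
        sr[s][j] += alpha * (target - sr[s][j])

def exploration_bonus(s, beta=0.05):
    norm = math.sqrt(sum(v * v for v in sr[s]))
    return beta / norm  # add this to the reward in SARSA

# Visit the 0 -> 1 transition many times; state 4 is never visited,
# so its row norm stays small and its bonus stays high.
for _ in range(100):
    sr_td_update(0, 1)
```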
Will Dabney, Deep Mind
Another talk on this approach in general.
- Distributional RL says we should learn the true distribution of returns.
- The means can be used directly to update value estimates using the Bellman equation.
- But this doesn't work if you aren't using the mean (moments) because there may be multiple distributions that are consistent with that mean.
- So the big question is how to best impute the right distribution to explain the experiences.
- The way they've approached it is to fix the representation/projection to one that is a consistent estimator and preserves the mean, even though it's not necessarily the best one.
- helps maintain stability for complex domains for deep RL
- aids representation learning by providing a stronger signal about the structure of the domain
- it helps improve generalization: learning on some states can work well in very different unseen states, which can be shown to really help performance in RL.
- see the "Neuroscience-Inspired Artificial Intelligence" article by Demis Hassabis of DeepMind.
- The communication between cognitive science, psychology, neuroscience and CS/AI helps us all to learn useful things and contribute to the overall truth
- Marc G. Bellemare, Will Dabney, Rémi Munos. 2017. A Distributional Perspective on Reinforcement Learning.
- Deep Mind 2019 Arxiv:
- DRL algorithms can be decomposed as the combination of some statistical estimator and a method for imputing a return distribution consistent with that set of statistics
Drug users seem to be risk-seeking, so their team modelled ambiguity tolerance separately from risk tolerance to explain people's varying valuations of money and drugs. They found that ambiguity tolerance explains ongoing drug use better than risk tolerance does.
William Fedus, Yoshua Bengio — Google Brain
- The standard discount factor applies exponential discounting: the reward t steps ahead is weighted by gamma^t
- hyperbolic discounting instead weights it by 1/(1 + kt)
- (Sozou, 1998) derives discounting from s(t), the probability of surviving until timestep t
- we can derive standard exponential discounting from this for a domain with a fixed risk of dying at each step
- but if the hazard rate depends on state (or is uncertain) we get other discount functions, including hyperbolic ones
- they simplify the use of this by training the agent on multiple time horizons as an auxiliary task
- this is part of a larger discussion in RL that the common assumptions about the discount factor don't always hold
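The survival-based derivation above can be written out (following Sozou's argument as I understand it; the symbols lambda and k are the standard ones, not taken from the slides):

```latex
% Known constant hazard rate \lambda: survival to time t is exponential,
% recovering the usual discount factor:
s(t) = e^{-\lambda t} = \gamma^{t}, \qquad \gamma = e^{-\lambda}.
% If instead the hazard rate is unknown, with an exponential prior
% p(\lambda) = k e^{-k\lambda}, the expected survival is hyperbolic:
s(t) = \int_0^\infty k e^{-k\lambda}\, e^{-\lambda t}\, d\lambda
     = \frac{k}{k + t} = \frac{1}{1 + t/k}.
```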
Goal: Promote behaviour changes on taking medication, reducing addiction etc.
Existing health support apps take two main forms:
- pull : just info, depends on user
- push: deliver intervention when needed
Very customized to personal schedule and context.
They tried to build an RL agent to push messages but not have the person get used to it.
This problem has interesting challenges:
- very noisy data, unknown variables, complex rewards
- delayed penalties (eg. users becoming over-sensitised to notifications)
- in the short term, all actions look positive
Anna Harutyunyan, Doina Precup, Rémi Munos, et al. (DeepMind)
Reasoning at multiple timescales
- We can remember and reason about different levels of detail over time.
- why do we do it? The hope is that high level plans are more reusable.
option = policy + termination condition
Traditionally the policy is the focus, but the termination condition optimizes the same objective. Biases are added in to encourage options to last longer.
Their idea: a separate termination rule focused on when to end the option entirely.
- this is an MDP where we take options instead of actions
- provide a distribution over all states you could end up in
- however still need to learn the per-step Beta parameter
- they show how you can define the value function and Bellman equation for Beta then they can solve this with policy gradients.
Termination critic: They use the Actor-Critic approach but have a critic for the termination rule in addition to the policy.
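As a minimal illustration of the option interface this builds on (the class, the toy chain, and the fixed beta below are my own; the termination-critic work learns beta rather than fixing it):

```python
# An option = an internal policy plus a termination probability beta(s).
import random

random.seed(2)

class Option:
    def __init__(self, policy, beta):
        self.policy = policy  # state -> action
        self.beta = beta      # state -> probability of terminating

def run_option(option, s, step, max_steps=100):
    """Follow the option's policy until beta says stop; return final state."""
    for _ in range(max_steps):
        s = step(s, option.policy(s))
        if random.random() < option.beta(s):
            break
    return s

# Toy chain: action +1 moves right; the option surely terminates at state 5,
# so from the SMDP's point of view it is one "go to 5" macro-action.
go_right = Option(policy=lambda s: 1,
                  beta=lambda s: 1.0 if s >= 5 else 0.0)
final = run_option(go_right, 0, step=lambda s, a: s + a)
```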
(Pierre-Yves Oudeyer , INRIA)
Their lab looks at intrinsic motivation in humans and machines. They explore developmental learning in children and try to apply it to:
- building robots
- developing better education methods
"Interestingness" is not just about novelty or surprise.
It is about situations where a high level of learning is happening. Once something becomes mostly mastered, progress slows and the agent/child will lose interest.
(Oudeyer, IEEE TEC 2017)
- Build an "interestingness" metric using the change in learning progress for many points in state-action space.
- Choose actions to try where this metric is highest.
- Hierarchically divide the space into distinct regions by clustering on the metric.
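The learning-progress idea can be sketched in a few lines (my own toy regions and window size; real IAC clusters the space adaptively): track recent prediction errors per region and prefer the region where error is decreasing fastest, which is what distinguishes this from plain novelty or surprise.

```python
# Learning progress = drop in mean prediction error between the previous
# window and the most recent window of attempts in a region.
def learning_progress(errors, window=5):
    if len(errors) < 2 * window:
        return 0.0
    old = sum(errors[-2 * window:-window]) / window
    new = sum(errors[-window:]) / window
    return old - new  # positive when the learner is improving

# Three illustrative regions:
#  - "mastered": flat low error, nothing left to learn
#  - "improving": error falling steadily, high learning progress
#  - "unlearnable": flat high error (eg. pure noise), no progress
regions = {
    "mastered":    [0.05] * 10,
    "improving":   [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "unlearnable": [1.0] * 10,
}
best = max(regions, key=lambda r: learning_progress(regions[r]))
# The agent focuses on "improving": neither the mastered region nor the
# unlearnable one holds its interest.
```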
(Forestier et al. 2017) have a great video of robot learning.
They look at actual human social interactions and the complex dynamics of choices amongst multiple people.
- sequence of choices: join or don't join the interaction
- share or don't share information
- they found correlations between activations in part of the brain related to prediction error
- they performed a user study on the trust game where people play with agents that have fixed trustworthiness probabilities rather than actual people
- they scanned people's brains during playing the game to see what lights up in the brain
- result: values are higher when interacting with someone you trust
- Recent study on Trust around the world, lost wallet game Cohn et al, 2019, Science
- that study found that people are more likely to return the wallet if there was a lot of money in it.
- Economists think there is no existing theory to explain this
- Psychological Game Theory can explain this
- one agent has second order theory about other player and they converge on a solution
- this involves thinking about the disappointment the other person is going to experience, and this is partially valid: if the money is higher, obviously this weight is higher.
- TODO: see image
- They also find support for this by looking at brain scans
Many existing models of theory of mind are low-dimensional, with a few main types of qualities that are static over time.
There is a push to explore this using Inverse Reinforcement Learning and Bayesian Learning
- the goal is for a player to learn the likely strategy of another player from observed actions
- people can very quickly predict what the people they are watching will do
- an RL learner optimizing directly doesn't do well
- but IRL does better than RL here
- IRL still does not perform as well as humans
- Also, this domain gave the IRL learner a lot of state representation information which humans don't have
- It's important to include guilt or theory of mind about other's dissapointment or pain
- It is a real effect in human decision making and it is robust across culture and does not conform to the standard economics idea of expected utility maximization
- Advice being given for medical and other safety-critical domains needs to consider this.
Learning of simulated languages between agents, including programming languages and natural or artificial ones.
Interesting work that explicitly builds compositional models of agents learning languages so that they contain some of the properties of natural languages.
Caroline J Charpentier; Kiyohito Iigaya; John O’Doherty
Learning by observing people perform a task; there are two main approaches:
- imitation learning
- emulation: learning by inferring their goals and preferences
- bandit task: choose which machine to pull, with some features on the machines to identify them
- you get to watch another player follow a strategy, and you know they are a good player
- you also know that one of the features (tokens) is perfectly correlated with high reward
- imitation is slower to learn, but better when the system has lots of uncertainty
- emulation will be favoured for highly volatile domains
- people use both approaches
- they build a computational model for each strategy and an arbitration model which weights a tradeoff between the two strategies
- they show the arbitration model performs better
- then they test whether it explains the human behaviour better, and they find it does, very closely
- they perform fMRI scans and show which parts of the brain correlate with activity for each of the two strategies and the joint arbitration signal too
Game: Hanabi - Michael explained how this card game blends the notion of communication between explicit messages and observation of the actions of other players
Working Memory is fast to use but has limited capacity and is forgotten quickly.
Give people a small set (3-7) of images and have them remember positions based on another larger image with stimuli in it.
They test two aspects of working memory, treating it as if it were an RL system:
- time limiting factors, how long it lasts
- Size of memory, how many different elements can be remembered
RL alone isn't enough, so we need some kind of mixture model. Once this is added then the model corresponds closely to human performance.
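The mixture can be sketched as a weighted combination of two action distributions (my own toy numbers and names; the real model ties the weight to set size and WM decay): a slow incremental RL policy plus a fast, capacity-limited WM policy that simply recalls the last correct action.

```python
# Mixture-of-policies sketch: P(a) = w * P_WM(a) + (1 - w) * P_RL(a),
# where w stands in for the capacity- and decay-dependent WM weight.
def mixture_policy_probs(q_probs, wm_probs, w):
    return {a: w * wm_probs[a] + (1 - w) * q_probs[a] for a in q_probs}

q_probs  = {"a1": 0.4, "a2": 0.6}   # slow RL estimate, still uncertain
wm_probs = {"a1": 1.0, "a2": 0.0}   # WM perfectly recalls the last reward
probs = mixture_policy_probs(q_probs, wm_probs, w=0.8)
# With a small set size WM dominates (w high), so behaviour looks fast and
# accurate; as set size grows, w falls and behaviour falls back to slow RL.
```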
How do WM and RL interact?
- EEG studies show that RL reward prediction or reward history are correlated with the set size (the number of things trying to be remembered). So they are not independent. So WM is somehow blocking?
- EEG studies also show that the Q-value drops faster (improves faster) for small sets, so somehow WM is helping?
- Long term associations are learned better (this is harder) when the set of images is larger.
Keeping an open mind about how different observed brain systems could contribute to and interact with learning. Example domains help to motivate this:
- They can show that learning is impaired in schizophrenia patients and that this is due largely to WM problems separate from RL; this is only visible if WM is modelled explicitly.
- Age related learning
- they found that the learning rate seems to increase with age to compensate for a decrease in working memory
Chelsea Finn, Berkeley, Google Brain, Stanford
A nice aspect of doing RL on real robots is that you can't sidestep the problems of noise, bad reward models, and generalization the way you can in simulation. A major problem right now: training specialist robots is well understood, but they generalize very badly, even across mundane differences. So exploration, using raw sensory input, and continual improvement without supervision are needed.
Their approach (Visual Foresight) is to do two things
- Learn general policies through unsupervised exploration
- Learn fundamental physics and dynamics from pre-existing videos
A standard dichotomy is Model-free (Habitual) vs Model-based (Planning)
How can a little bit of model-based knowledge help with planning?
- this is important because in reality we have an infinity of choices and yet we only consider a small subset; how does that happen?
- consideration set: things we usually want to consider for this task, feasible options, but very restricted
- choice: the standard online value-based estimation, using a model to pick the best thing for the context
We usually mean practical when we say possible, not whether something is physically conceivable. They show some interesting experimental results indicating that immoral options are not immediately added to these consideration sets, and only become available under more deliberation. So we don't even consider those scenarios, which saves time.
One final result is that this indicates that
We finally wrapped up with a talk by Rich Sutton himself. He argues for a truly Integrated Science of Mind that applies equally to biological and machine minds.
- it reduces the importance of subgoals
- it seems to be something that can't change over time
- first we learn value functions based on expected future reward
- we learn policies
- learn about state - eg. state representations
- skills - eg. options
- models -
He thinks play is an important way to look at it.
Three key open questions about subproblems in RL:
- Q1 - what should subproblems be
- Q2 - where do they come from
- Q3 - how do subproblems help the main problem (ie. how to subgoals help the main reward maximization task)
- Learning to solve subproblems could help with shaping better state representations, behaviour patterns that are more coherent
- It also allows high level planning because now you have a model of what happens after you've achieved the subgoal
- subproblems carry a reward in themselves and may be terminal, at which point planning stops
- solving a subproblem can be done with an option, a separate subpolicy
That's All, it was fun. See you in two years at Brown!