Reinforcement Learning: An Introduction to Learning from Rewards
Discover reinforcement learning—the paradigm where agents learn to maximize expected rewards through experience, from robot locomotion to DeepRacer vehicles.
In our exploration of intelligent systems, we’ve examined approaches where the correct answer is known (supervised learning) and where patterns emerge from data without labels (unsupervised learning). Today, we explore a fundamentally different paradigm: reinforcement learning (RL), where an agent learns to make decisions by receiving feedback in the form of rewards.
Reinforcement learning powers some of the most impressive achievements in modern AI: from AlphaGo defeating world champions at Go, to robots learning to walk and manipulate objects, to the AWS DeepRacer autonomous vehicles. By the end of this post, you’ll understand the core ideas behind RL, how it differs from other machine learning paradigms, and the key approaches to learning from experience.
The Reinforcement Learning Framework
At its heart, reinforcement learning addresses a deceptively simple question: how should an agent behave to maximize long-term rewards?
Unlike supervised learning, where we provide the algorithm with correct examples, or unsupervised learning, where we seek hidden structure, RL involves an agent interacting with an environment over time. At each step:
- The agent observes the current state $s$ of the environment
- Based on this state, the agent selects an action $a$
- The environment transitions to a new state $s'$ and provides a reward $r$
graph TD
A[Agent] -->|Actions a| E[Environment]
E -->|State s| A
E -->|Reward r| A
style A fill:#bbdefb
style E fill:#fff3e0
The agent’s goal is to learn a policy $\pi(s)$—a mapping from states to actions—that maximizes the expected sum of discounted rewards over time. This is formalized through Markov Decision Processes (MDPs), which we assume have the following components:
- States $s \in S$: All possible situations the agent can encounter
- Actions $a \in A(s)$: Actions available in each state
- Transition model $T(s, a, s')$: Probability of reaching $s'$ from $s$ by taking $a$
- Reward function $R(s, a, s')$: Immediate reward received for that transition
- Discount factor $\gamma \in [0, 1)$: How much we value immediate vs. future rewards
The critical difference from our previous MDP discussions: in RL, we don’t know $T$ or $R$! We must learn them through experience.
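To make the interaction loop concrete, here is a minimal Python sketch of an agent acting in an environment whose dynamics are hidden from it. The `ToyEnv` class is a made-up stand-in (loosely modeled on the small gridworld used in the examples below), not a real library API; the point is that the agent only ever sees sampled states and rewards, never $T$ or $R$ themselves.

```python
import random

class ToyEnv:
    """A tiny episodic environment with hidden dynamics (hypothetical example).

    The agent never sees the transition probabilities or reward function;
    it only observes the (state, reward, done) samples returned by step().
    """
    def __init__(self):
        self.state = "B"

    def reset(self):
        self.state = "B"
        return self.state

    def step(self, action):
        # Hidden dynamics: from C, "east" reaches D 75% of the time, A otherwise.
        if self.state == "B" and action == "east":
            self.state = "C"
            return self.state, -1.0, False
        if self.state == "C" and action == "east":
            self.state = "D" if random.random() < 0.75 else "A"
            return self.state, -1.0, False
        # Any other state ends the episode: exiting from D pays +10, from A pays -10.
        reward = 10.0 if self.state == "D" else -10.0
        return "exit", reward, True

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = "east" if state in ("B", "C") else "exit"   # a fixed policy
    state, reward, done = env.step(action)
    total_reward += reward
print("Return for this episode:", total_reward)
```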
Learning to Walk: RL in Action
To make this concrete, consider the classic example of a robot learning to walk. Initially, the robot’s movements are clumsy—it stumbles, falls, takes erratic steps. But each fall provides negative feedback (or each successful step provides positive feedback), and over thousands of trials, the robot learns a coordinated walking pattern.
The same principle applies to the Crawler agent you might encounter in Project 3: starting from random movements, it learns to navigate obstacles and reach goals through trial and error.
Other impressive examples include Sidewinding, in which robots learn efficient snake-like locomotion, and DeepRacer, Amazon’s 1/18th-scale autonomous race car that learns to navigate tracks through RL:
DeepRacer demonstrates RL at scale—cars train in simulation, then deploy to physical devices that compete in global races.
Model-Based vs. Model-Free Learning
Reinforcement learning approaches fall into two broad categories:
Model-Based Learning
In model-based RL, the agent first learns an approximation of the environment’s dynamics (the transition model $T$ and reward function $R$), then uses this learned model to plan optimal behavior.
The approach:
- Learn the model: Observe many transitions $(s, a, s', r)$ and estimate $T(s, a, s')$ and $R(s, a, s')$ empirically
- Solve the MDP: Apply familiar algorithms (value iteration, policy iteration) to the learned model
Example: Model-Based Learning
Suppose we observe these episodes, where each transition is annotated with the reward received:
| Episode | Transitions Observed |
|---|---|
| 1 | $B \xrightarrow{\text{east}} C\,(-1),\; C \xrightarrow{\text{east}} D\,(-1),\; D \xrightarrow{\text{exit}} x\,(+10)$ |
| 2 | $B \xrightarrow{\text{east}} C\,(-1),\; C \xrightarrow{\text{east}} D\,(-1),\; D \xrightarrow{\text{exit}} x\,(+10)$ |
| 3 | $E \xrightarrow{\text{north}} C\,(-1),\; C \xrightarrow{\text{east}} A\,(-1),\; A \xrightarrow{\text{exit}} x\,(-10)$ |
| 4 | $E \xrightarrow{\text{north}} C\,(-1),\; C \xrightarrow{\text{east}} D\,(-1),\; D \xrightarrow{\text{exit}} x\,(+10)$ |
From these, we estimate:
- $T(B, east, C) = 1.0$ (always observed)
- $T(C, east, D) = 0.75$ (observed 3 out of 4 times)
- $T(C, east, A) = 0.25$ (observed 1 out of 4 times)
- $R(D, exit, x) = +10$
- $R(A, exit, x) = -10$
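As a rough illustration of the first step (learning the model), the snippet below builds those empirical estimates by counting observed transitions. The episode encoding and variable names (`T_hat`, `rewards`) are my own choices for this sketch, not a reference implementation.

```python
from collections import defaultdict

# Logged transitions (s, a, s', r) from the four episodes above; "x" is the terminal state.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
]

counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
rewards = {}                                    # (s, a, s') -> observed reward

for episode in episodes:
    for s, a, s_next, r in episode:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r

# Normalize counts into empirical transition probabilities T_hat(s, a, s').
T_hat = {
    (s, a, s_next): n / sum(successors.values())
    for (s, a), successors in counts.items()
    for s_next, n in successors.items()
}

print(T_hat[("C", "east", "D")])    # 0.75
print(T_hat[("C", "east", "A")])    # 0.25
print(rewards[("D", "exit", "x")])  # 10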
Solving this learned MDP (with $\gamma = 1$) then yields a consistent value for every state:
graph TD
subgraph "Learned Model"
B["B<br/>Value: +3"] -->|"east (100%)"| C["C<br/>Value: +4"]
C -->|"east (75%)"| D["D<br/>Value: +10"]
C -->|"east (25%)"| A["A<br/>Value: -10"]
D -->|"exit"| X["Exit<br/>+10"]
A -->|"exit"| Y["Exit<br/>-10"]
E["E<br/>Value: +3"] -->|"north"| C
end
style B fill:#bbdefb
style C fill:#bbdefb
style D fill:#c8e6c9
style A fill:#ffcdd2
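The second step, solving the learned MDP, uses the tools we already have. The sketch below hard-codes the estimated model from above and runs iterative policy evaluation with $\gamma = 1$ (equivalent to value iteration here, since each state has a single available action); it reproduces the values shown in the diagram. Variable names and the simplified reward keyed on $(s, a)$ are assumptions of this sketch.

```python
# Solve the learned MDP by repeated Bellman backups (gamma = 1, episodic task).
# "x" is the terminal state; rewards here depend only on (s, a), matching the estimates above.
T_hat = {
    ("B", "east"):  {"C": 1.0},
    ("E", "north"): {"C": 1.0},
    ("C", "east"):  {"D": 0.75, "A": 0.25},
    ("D", "exit"):  {"x": 1.0},
    ("A", "exit"):  {"x": 1.0},
}
R_hat = {("B", "east"): -1, ("E", "north"): -1, ("C", "east"): -1,
         ("D", "exit"): 10, ("A", "exit"): -10}
policy = {"B": "east", "E": "north", "C": "east", "D": "exit", "A": "exit"}

V = {s: 0.0 for s in policy}
V["x"] = 0.0  # the terminal state has value 0
for _ in range(100):  # this small chain converges after a few sweeps
    for s, a in policy.items():
        V[s] = sum(p * (R_hat[(s, a)] + V[s2]) for s2, p in T_hat[(s, a)].items())

print({s: round(V[s], 2) for s in "ABCDE"})
# {'A': -10.0, 'B': 3.0, 'C': 4.0, 'D': 10.0, 'E': 3.0}
```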
Model-Free Learning
In model-free RL, we skip learning the model entirely and directly learn values or policies from experience. This is often more practical when the state space is huge or the model would be too expensive to learn.
The key insight: we don’t need to know why a state is good or what actions do—we can learn directly from observed outcomes.
Passive Reinforcement Learning: Learning from a Fixed Policy
Let’s start with a simplified task: passive RL or policy evaluation. Here, we don’t choose actions—we’re given a fixed policy $\pi(s)$ and must learn the values of states under that policy.
Setup:
- Input: A fixed policy $\pi(s)$ telling us which action to take in each state
- Unknown: Transition model $T(s, a, s')$ and reward function $R(s, a, s')$
- Goal: Learn the state values $V^\pi(s)$
The direct evaluation approach:
- Execute the policy $\pi$ in the environment
- Every time we visit a state, record the total discounted reward from that point onward
- Average together all observed values for each state
Direct Evaluation Example
Given the same episodes as before, with $\gamma = 1$ (no discounting):
- Episode 1: $B \to C \to D \to \text{exit}$ with rewards $-1, -1, +10$, for a return of $+8$
- Episode 2: same as Episode 1, for a return of $+8$
- Episode 3: $E \to C \to A \to \text{exit}$ with rewards $-1, -1, -10$, for a return of $-12$
- Episode 4: $E \to C \to D \to \text{exit}$ with rewards $-1, -1, +10$, for a return of $+8$
Computing averages for each state, where each observed value is the total reward collected from that state onward within an episode:
| State | Observed Values | Average $V(s)$ |
|---|---|---|
| A | -10 | -10 |
| B | 8, 8 | +8 |
| C | 9, 9, -11, 9 | +4 |
| D | 10, 10, 10 | +10 |
| E | -12, 8 | -2 |
graph TD
A["A<br/>V = -10"] -->|"exit"| X1["Exit<br/>+10"]
B["B<br/>V = +8"] -->|"east"| C["C<br/>V = +4"]
C -->|"east (75%)"| D["D<br/>V = +10"]
C -->|"east (25%)"| A
D -->|"exit"| X2["Exit<br/>+10"]
E["E<br/>V = -2"] -->|"north"| C
style D fill:#c8e6c9
style B fill:#c8e6c9
style A fill:#ffcdd2
style E fill:#fff3e0
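Here is a compact sketch of direct evaluation on these four episodes. Encoding each episode as (state, reward received on leaving that state) pairs is an assumption made for brevity, and the discount is left as a parameter so the same code answers the $\gamma < 1$ exercise at the end of this post.

```python
from collections import defaultdict

# The four episodes above, as (state, reward received on the transition out of it) pairs.
episodes = [
    [("B", -1), ("C", -1), ("D", 10)],
    [("B", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("A", -10)],
    [("E", -1), ("C", -1), ("D", 10)],
]

def direct_evaluation(episodes, gamma=1.0):
    """Average, for every state, the discounted return observed from that state onward."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk each episode backwards so the return from every visit accumulates in one pass.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(direct_evaluation(episodes))
# {'D': 10.0, 'C': 4.0, 'B': 8.0, 'A': -10.0, 'E': -2.0}  -- matches the table above
```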
Direct evaluation is simple, but it has limitations: it can be slow to converge, and it estimates each state in isolation, so information is never shared between connected states. Notice that B and E both lead into C, yet direct evaluation assigns them very different values (+8 vs. -2) because it ignores that connection.
Temporal Difference Learning: Learning from Differences
A more powerful approach is Temporal Difference (TD) Learning, which combines ideas from dynamic programming and Monte Carlo methods. TD learning updates estimates based on other estimates, allowing for faster convergence than direct evaluation.
In TD learning, after observing a transition $(s, a, s', r)$, we move our current estimate of $V^\pi(s)$ a small step toward the one-step sample $r + \gamma V^\pi(s')$:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left[ r + \gamma V^\pi(s') - V^\pi(s) \right]$$

where $\alpha$ is a learning rate that controls how far each sample moves the estimate. This “bootstrapping” approach—learning from predictions to improve predictions—is the foundation of Q-learning, which we’ll explore in future lectures.
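Below is a small sketch of TD value updates replayed over the same logged transitions. The learning rate, number of passes, and transition encoding are illustrative choices for this example, not part of any standard API.

```python
from collections import defaultdict

def td_evaluation(transitions, num_passes=200, alpha=0.1, gamma=1.0):
    """Run TD(0) updates over a batch of (s, r, s_next) samples; s_next=None means terminal."""
    V = defaultdict(float)
    for _ in range(num_passes):
        for s, r, s_next in transitions:
            target = r + gamma * (V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])   # move V(s) toward the one-step sample
    return dict(V)

transitions = [
    ("B", -1, "C"), ("C", -1, "D"), ("D", 10, None),
    ("B", -1, "C"), ("C", -1, "D"), ("D", 10, None),
    ("E", -1, "C"), ("C", -1, "A"), ("A", -10, None),
    ("E", -1, "C"), ("C", -1, "D"), ("D", 10, None),
]

print({s: round(v, 1) for s, v in td_evaluation(transitions).items()})
# The estimates hover near the values obtained by solving the learned model
# (about B: +3, C: +4, D: +10, A: -10, E: +3); a decaying learning rate
# would remove the residual oscillation from the fixed alpha.
```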
Why Reinforcement Learning Matters
Reinforcement learning addresses problems where:
- Correct actions aren’t known: Unlike classification, there’s no “ground truth” answer
- Delayed consequences matter: Actions may have long-term effects not immediately visible
- Exploration is necessary: Agents must try actions they haven’t tried before
- Interaction is key: Learning happens through interaction with an environment
These characteristics make RL ideal for:
- Game playing: Learning optimal strategies through self-play
- Robotics: Learning motor control for manipulation and locomotion
- Recommendation systems: Learning what content users prefer over time
- Resource management: Allocating computing, energy, or other resources
- Language assistants: Learning to respond helpfully to user queries
Conclusion
You’ve now been introduced to the fundamentals of reinforcement learning. We covered:
- The agent-environment framework with states, actions, rewards
- The distinction between model-based and model-free learning
- Passive RL and direct evaluation for learning from a fixed policy
- How RL differs from supervised and unsupervised learning
In upcoming lectures, we’ll explore:
- Active RL: How to choose actions to learn more efficiently
- Q-learning: Model-free learning of optimal action values
- Deep RL: Scaling RL with neural networks
Practice suggestions: Try implementing direct evaluation on a simple MDP, and think about how you would modify the approach for discounted rewards ($\gamma < 1$).
External Resources
- Berkeley AI Research (BAIR) Blog - Original source of these lecture materials; excellent RL tutorials and research updates.
- OpenAI Gym - A toolkit for developing and comparing RL algorithms; perfect for hands-on experimentation.
- AWS DeepRacer - Amazon’s RL-powered racing platform; demonstrates RL in action.
- Spinning Up in Deep RL (OpenAI) - Educational resource for understanding deep reinforcement learning.