Q-Learning Pac-Man Agent
Teaching an agent to play Pac-Man using reinforcement learning
A reinforcement learning project that implements a Q-Learning agent to play the classic Pac-Man game, demonstrating fundamental RL concepts and successful policy learning.
Project Overview
This project implements a Q-Learning algorithm that learns to play Pac-Man through trial and error, gradually improving its strategy to maximize score while avoiding ghosts.
Motivation
Pac-Man provides an ideal environment for demonstrating reinforcement learning concepts:
- Clear reward structure: Points for pellets, penalties for getting caught
- Strategic depth: Balancing exploration (eating pellets) with safety (avoiding ghosts)
- Observable state space: Position, ghost locations, pellet distribution
- Classic appeal: Everyone understands the game mechanics
Technical Implementation
State Representation
The agent perceives the game state through several key features:
- Current position (x, y coordinates)
- Distance to nearest ghost
- Direction of nearest ghost
- Distance to nearest pellet
- Number of pellets remaining
- Ghost state (scared/normal)
```python
def get_features(state):
    """Extract a compact feature vector from the raw game state."""
    features = {
        'distance_to_ghost': min_distance_to_ghost(state),
        'distance_to_food': min_distance_to_food(state),
        'ghost_direction': get_ghost_direction(state),
        'food_count': len(state.food),
        'scared_ghost': any(ghost.scared for ghost in state.ghosts),
    }
    return features
```
Q-Learning Algorithm
The agent uses the classic Q-Learning update rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:
- $\alpha$ is the learning rate (0.2)
- $\gamma$ is the discount factor (0.9)
- $r$ is the immediate reward
- $s'$ is the next state
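As a quick sanity check, a single update can be worked through by hand. The numbers below are hypothetical (not taken from an actual game), using the project's $\alpha = 0.2$ and $\gamma = 0.9$:

```python
alpha, gamma = 0.2, 0.9   # learning rate and discount factor from the project
current_q = 0.0           # Q(s, a) before the update (hypothetical)
reward = 10.0             # immediate reward, e.g. eating a pellet (hypothetical)
max_next_q = 5.0          # max over a' of Q(s', a') (hypothetical)

# TD target minus current estimate, scaled by the learning rate:
# 0 + 0.2 * (10 + 0.9 * 5 - 0) = 0.2 * 14.5 = 2.9
new_q = current_q + alpha * (reward + gamma * max_next_q - current_q)
```

The Q-value moves only a fifth of the way toward the TD target, which is what keeps learning stable in a noisy environment.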
```python
class QLearningAgent:
    def __init__(self, alpha=0.2, gamma=0.9, epsilon=0.1):
        self.q_values = {}        # maps (state, action) -> learned Q-value
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration rate

    def get_q_value(self, state, action):
        # Unseen state-action pairs default to 0
        return self.q_values.get((state, action), 0.0)

    def update(self, state, action, reward, next_state):
        current_q = self.get_q_value(state, action)
        # default=0.0 handles terminal states with no legal actions
        max_next_q = max((self.get_q_value(next_state, a)
                          for a in self.legal_actions(next_state)),
                         default=0.0)
        new_q = current_q + self.alpha * (
            reward + self.gamma * max_next_q - current_q
        )
        self.q_values[(state, action)] = new_q
```
Exploration Strategy
The agent uses epsilon-greedy exploration:
- With probability $\epsilon$: choose random action (explore)
- With probability $1 - \epsilon$: choose best known action (exploit)
```python
def choose_action(self, state):
    if random.random() < self.epsilon:
        return random.choice(self.legal_actions(state))  # explore
    else:
        return self.get_best_action(state)               # exploit
```
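The greedy side of this policy needs a tie-breaking rule, since early in training many actions share the same (zero) Q-value. The helper below is a sketch of one common approach, not the project's verbatim code; `best_action` and its arguments are hypothetical names:

```python
import random

def best_action(q_values, legal_actions, state):
    """Pick the highest-Q legal action, breaking ties at random."""
    if not legal_actions:
        return None  # terminal state: nothing to do
    best_q = max(q_values.get((state, a), 0.0) for a in legal_actions)
    # Random tie-breaking prevents the agent from always drifting one way
    tied = [a for a in legal_actions if q_values.get((state, a), 0.0) == best_q]
    return random.choice(tied)
```

Without random tie-breaking, a fresh agent deterministically repeats the first legal action, which slows exploration even before epsilon is considered.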
Training Process
Episode Structure
Each training episode consists of:
- Initialize game state
- Agent chooses action
- Environment responds with reward and next state
- Agent updates Q-values
- Repeat until game ends (win/loss)
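The episode loop above can be sketched as a short training driver. The `env` interface here (`reset()`/`step()`) is an assumption for illustration; the project's actual game interface may differ:

```python
def train(agent, env, num_episodes=1000):
    """Run the episode loop: act, observe, update, repeat until the game ends."""
    scores = []
    for episode in range(num_episodes):
        state = env.reset()                       # 1. initialize game state
        total_reward, done = 0.0, False
        while not done:
            action = agent.choose_action(state)   # 2. agent chooses action
            next_state, reward, done = env.step(action)  # 3. reward + next state
            agent.update(state, action, reward, next_state)  # 4. Q-value update
            state = next_state
            total_reward += reward
        scores.append(total_reward)               # 5. game ended (win/loss)
    return scores
```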
Learning Curves
The agent shows clear learning progress over episodes:
- Early episodes (1-100): Random, frequently caught by ghosts
- Mid training (100-500): Learns basic ghost avoidance
- Late training (500-1000): Develops efficient pellet collection strategies

Results
Performance Metrics
After 1000 training episodes:
- Average score: 850 points (baseline: 200)
- Win rate: 75% (up from 10%)
- Average survival time: 3.5 minutes
- Pellets collected: 90% average
Learned Behaviors
The agent develops several intelligent strategies:
- Ghost avoidance: Maintains safe distance from normal ghosts
- Power pellet prioritization: Seeks power pellets when ghosts are close
- Efficient paths: Learns to clear pellets systematically
- Scared ghost chasing: Actively hunts scared ghosts for bonus points
Technical Challenges
Challenge 1: State Space Explosion
Problem: The raw game state (Pac-Man's position, every ghost's position, every pellet's status) yields far too many distinct states for an exact tabular Q-table.
Solution: Feature-based approximation using meaningful game features rather than raw positions.
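A common way to realize feature-based approximation (as in approximate Q-agents generally) is a linear combination of features with learned weights. The sketch below is illustrative, not the project's verbatim code:

```python
def approx_q_value(weights, features):
    """Q(s, a) ~ sum of w_i * f_i(s, a): a linear estimate over named features."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

def update_weights(weights, features, alpha, td_error):
    """Shift each weight along its feature, scaled by the TD error."""
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * td_error * value
    return weights
```

The appeal is generalization: states that were never visited still get sensible Q-values as long as their features resemble those of visited states.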
Challenge 2: Reward Shaping
Problem: Sparse rewards (only at pellets/ghosts) slow learning.
Solution: Added intermediate rewards:
- Small positive reward for moving toward pellets
- Negative reward for staying still
- Gradual penalty for time elapsed
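The three shaping terms above can be combined into a single reward function. The magnitudes below are illustrative placeholders, not the project's tuned values:

```python
def shaped_reward(base_reward, moved_toward_pellet, stayed_still, elapsed_steps):
    """Combine the game's sparse reward with dense shaping terms."""
    reward = base_reward
    if moved_toward_pellet:
        reward += 0.5               # small bonus for progress toward food
    if stayed_still:
        reward -= 1.0               # discourage idling in place
    reward -= 0.01 * elapsed_steps  # gradual penalty for time elapsed
    return reward
```

Shaping terms like these must stay small relative to the game's own rewards, or the agent learns to chase the shaping signal instead of the score.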
Challenge 3: Exploration-Exploitation Balance
Problem: Too much exploration wastes learning; too little gets stuck in local optima.
Solution: Epsilon-decay schedule:
```python
epsilon = max(0.01, 0.1 * (0.995 ** episode))
```
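Plugging a few episode numbers into this schedule shows how exploration tapers off without ever vanishing:

```python
def decayed_epsilon(episode):
    """Exponentially decayed exploration rate with a 0.01 floor."""
    return max(0.01, 0.1 * (0.995 ** episode))

# Early on the agent explores ~10% of the time; after a few hundred
# episodes it settles at the 1% floor, so it never stops exploring.
for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 4))
```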
Code Structure
```
qlearning-pacman/
├── agents/
│   ├── qlearning_agent.py
│   └── feature_extractor.py
├── game/
│   ├── pacman.py
│   └── game_state.py
├── utils/
│   ├── visualization.py
│   └── metrics.py
├── trained_models/
│   └── qvalues.pkl
├── tests/
│   └── test_agent.py
└── README.md
```
Future Improvements
Potential enhancements to explore:
- Deep Q-Learning: Neural network for function approximation
- Policy gradient methods: REINFORCE or Actor-Critic
- Multi-agent: Multiple Pac-Men or cooperative ghosts
- Transfer learning: Apply learned policy to variations
- Curriculum learning: Progressive difficulty increase
Educational Value
This project demonstrates:
- Core RL concepts (states, actions, rewards, policies)
- Temporal difference learning
- Function approximation techniques
- Trade-offs in algorithm design
- Practical debugging of RL systems
Acknowledgments
Inspired by UC Berkeley’s CS188 Pac-Man projects and the classic reinforcement learning literature.
Tech Stack: Python, NumPy, Matplotlib, Pickle
Development Time: 3 months
Status: Complete and open-sourced
Last Updated: November 2024
Outcomes & Results
- Agent learns effective navigation and survival strategies
- Demonstrates exploration vs. exploitation trade-off
- Open-sourced with comprehensive documentation