Q-Learning, Deep Q-Networks (DQN), and Their Role in NR-V2X

1. What Is Q‑Learning? (Simple)

Imagine a robot navigating a maze. At each position (state), it can take an action (move up/down/left/right). Some actions give rewards (like +10 for reaching the goal), others give penalties. The robot doesn’t know the best path at first — it must learn by trying actions and observing rewards.

Q-Learning helps the robot learn the value of taking each action in each state, stored in a Q-table:

  • Q(s,a) – the robot’s current estimate: “If I’m in state s and take action a, how good is it long-term?”
  • Q*(s,a) – the optimal total reward: “If I take action a in state s and then act optimally forever after, how much total reward could I get?”

Over many trials, Q(s,a) updates to approach Q*(s,a), helping the robot learn the best action in each situation.


Figure 1: Robot navigating a maze, showing states, actions, and rewards — the intuition behind Q(s,a) and Q*(s,a).

1b. Q-Learning Agent Algorithm

function Q-LEARNING-AGENT(percept) returns an action
inputs: percept, a percept indicating the current state s' and reward signal r'
persistent: Q, a table of action values indexed by state and action, initially zero
            Nsa, a table of frequencies for state–action pairs, initially zero
            s, a, r, the previous state, action, and reward, initially null

if TERMINAL?(s) then Q[s,None] ← r'
if s is not null then
    increment Nsa[s,a]
    Q[s,a] ← Q[s,a] + α(Nsa[s,a]) * (r + γ max_a' Q[s',a'] − Q[s,a])

s, a, r ← s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'
return a

Figure 2: An exploratory Q-learning agent. It learns Q(s,a) values and selects actions using an exploration function, without needing a full model of state transitions. (Source: Russell & Norvig, 2010, Artificial Intelligence: A Modern Approach, 3rd Edition)
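As a concrete, runnable counterpart to the pseudocode, here is a minimal Python sketch of tabular Q-learning on a hypothetical 4-state corridor (states 0–3, reaching state 3 pays +10 and ends the episode). It substitutes a simple ε-greedy rule for the book's exploration function f, and all parameter values are illustrative:

```python
import random
from collections import defaultdict

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor: states 0..3, actions -1 (left)
    and +1 (right); entering state 3 gives reward +10 and ends the episode.
    Epsilon-greedy exploration stands in for AIMA's exploration function f."""
    rng = random.Random(seed)
    actions = (-1, +1)
    Q = defaultdict(float)                      # Q[(state, action)], initially zero
    for _ in range(episodes):
        s = 0
        while s != 3:                           # state 3 is terminal
            if rng.random() < epsilon:
                a = rng.choice(actions)         # explore
            else:
                a = max(actions, key=lambda act: Q[(s, act)])  # exploit
            s2 = min(max(s + a, 0), 3)          # deterministic transition
            r = 10.0 if s2 == 3 else 0.0
            # Bellman-based target: reward now + discounted best future estimate
            future = 0.0 if s2 == 3 else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
            s = s2
    return Q
```

After training, the learned values approach the optimal ones: moving right from state 2 is worth about 10, from state 1 about γ·10 = 9, and from state 0 about γ²·10 = 8.1.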

2. The Bellman Equation — The Heart of Q‑Learning

The Bellman equation is a recursive rule stating:

Value of an action = Immediate reward + discounted estimate of future rewards

Formally, the optimal Q-value is:

Q*(s,a) = R(s,a) + γ * max_a' Q*(s',a')

Where:

  • s = current state
  • a = action taken
  • s' = next state
  • R(s,a) = immediate reward
  • γ = discount factor (0–1, how much future rewards matter)
  • a' = best action in the next state

Intuition: Q*(s,a) captures both the reward now and the best possible future rewards. Q(s,a) tries to estimate it using the update rule:

Q(s,a) ← Q(s,a) + α * [ r + γ * max_a' Q(s',a') - Q(s,a) ]
  • α = learning rate
  • r = reward received this step
  • r + γ * max_a' Q(s',a') = the target value, an estimate of Q*(s,a)
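Plugging numbers in makes one step of the update concrete. Suppose (hypothetically) the current estimate is Q(s,a) = 2.0, the agent receives r = 5, the best next-state estimate max_a' Q(s',a') is 4.0, γ = 0.9, and α = 0.1:

```python
alpha, gamma = 0.1, 0.9            # learning rate, discount factor
q_sa = 2.0                         # current estimate Q(s,a)
r = 5.0                            # reward received this step
max_q_next = 4.0                   # max_a' Q(s',a')

target = r + gamma * max_q_next    # 5 + 0.9 * 4 = 8.6
q_sa = q_sa + alpha * (target - q_sa)   # 2 + 0.1 * (8.6 - 2) ≈ 2.66
```

The estimate moves a fraction α of the way from 2.0 toward the target 8.6, landing at about 2.66; repeated over many visits, Q(s,a) converges toward Q*(s,a).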

3. Limitations of Q-Learning in Large Environments

Q-Learning works fine in small mazes or grids, but struggles with large or complex state spaces (like thousands of vehicle sensors, images, or dynamic V2X traffic scenarios). The Q-table needs one entry per state–action pair, so its size explodes with the state space and becomes impractical → neural networks come to the rescue.

4. What Is a Deep Q-Network (DQN)?

A DQN replaces the Q-table with a neural network that approximates the Q-function:

  • Input: current state (sensor readings, images, traffic info)
  • Output: predicted Q-values for all possible actions
  • Action Selection: pick the action with the highest predicted Q-value (or explore randomly)
  • Experience Replay: store transitions (s,a,r,s') and train on random samples for stable learning
  • Target Network: a lagged copy of the main network that computes stable targets from the Bellman equation

Target formula (based on Bellman):

target = r + γ * max_a' Q_target(s',a')

The network updates to minimize the difference between predicted Q and this target, allowing learning in huge, complex environments.
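The pieces above can be sketched end-to-end. The toy below (all names and numbers hypothetical) uses a linear Q-function in place of a deep network so it stays self-contained, but keeps the three DQN ingredients: the Bellman target, an experience-replay buffer, and a periodically synced target network, again on a 4-state corridor where reaching state 3 pays +10:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
gamma, lr = 0.9, 0.05

# "Network": linear Q-function, Q(s, a) = W[a] @ onehot(s); stands in for a deep net.
W = np.zeros((n_actions, n_states))
W_target = W.copy()                          # target network: a lagged copy of W

def onehot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

# Experience replay: transitions collected with a random behaviour policy.
buffer = []
for _ in range(2000):
    s = rng.integers(0, 3)                   # non-terminal states 0..2
    a = rng.integers(0, n_actions)           # 0 = left, 1 = right
    s2 = min(max(s + (1 if a == 1 else -1), 0), 3)
    r, done = (10.0, True) if s2 == 3 else (0.0, False)
    buffer.append((s, a, r, s2, done))

for step in range(4000):
    s, a, r, s2, done = buffer[rng.integers(len(buffer))]   # sample a replayed transition
    # Bellman target computed with the *target* network for stability
    target = r if done else r + gamma * (W_target @ onehot(s2)).max()
    td_error = target - W[a] @ onehot(s)
    W[a] += lr * td_error * onehot(s)        # gradient step on the squared TD error
    if step % 200 == 0:
        W_target = W.copy()                  # periodically sync the target network
```

Despite learning from randomly collected, replayed transitions, the off-policy Bellman target drives the estimates toward the optimal values (about 10, 9, and 8.1 for moving right from states 2, 1, and 0).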

5. Benefits of Q-Learning / DQN in NR-V2X

  • Dynamic Resource Allocation – Select channels/time slots for vehicles to reduce collisions and interference.
  • Adaptive Power Control – Adjust transmit power based on vehicle distance and traffic to save energy and improve links.
  • Intelligent Handover – Predict the best time and base station for handover to minimize dropped connections.
  • Latency & Throughput Optimization – Balance URLLC (safety) and high-throughput infotainment dynamically.
  • Robustness to Dynamic Environments – Adapt to changing traffic, interference, and mobility patterns.
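To make the first benefit concrete, here is a deliberately simplified sketch of learned resource selection (every number and name is hypothetical): a vehicle repeatedly picks one of three sidelink sub-channels, and a transmission succeeds (reward 1) with a fixed per-channel probability that stands in for real collision and interference dynamics. Because there is effectively a single state, γ = 0 and the Q-update reduces to a running average of rewards:

```python
import random

def learn_channel(success_prob=(0.3, 0.9, 0.5), episodes=3000,
                  alpha=0.1, epsilon=0.1, seed=1):
    """Toy channel-selection sketch: one Q-value per sub-channel, updated
    with the single-state (gamma = 0) Q-learning rule under epsilon-greedy
    exploration. success_prob is a stand-in for channel quality."""
    rng = random.Random(seed)
    Q = [0.0] * len(success_prob)
    for _ in range(episodes):
        if rng.random() < epsilon:
            a = rng.randrange(len(Q))                       # explore
        else:
            a = max(range(len(Q)), key=Q.__getitem__)       # exploit
        r = 1.0 if rng.random() < success_prob[a] else 0.0  # did the TX succeed?
        Q[a] += alpha * (r - Q[a])                          # gamma = 0 update
    return Q
```

After a few thousand transmissions the agent's Q-values rank the channels by success rate, so it settles on the least congested sub-channel without ever being told the probabilities.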

6. Student-Friendly Intuition

  • Q-Learning: memorizing scores for every move in every state.
  • Bellman Equation: updates scores by combining immediate reward and expected future reward (Q*(s,a)).
  • DQN: teaches a neural network to generalize “scoring moves” for huge environments.

Think of it as turning a giant lookup table into a smart brain that predicts the best actions anywhere.
