Q-Learning, Deep Q-Networks (DQN), and Their Role in NR-V2X

1. What Is Q‑Learning? (Simple)

Imagine a robot navigating a maze. At each position (state), it can take an action (move up/down/left/right). Some actions give rewards (like +10 for reaching the goal), others give penalties. The robot doesn’t know the best path at first — it must learn by trying actions and observing rewards.

Q-Learning helps the robot learn the value of taking each action in each state, stored in a Q-table:

  • Q(s,a) – the robot’s current estimate: “If I’m in state s and take action a, how good is it long-term?”
  • Q*(s,a) – the optimal total reward: “If I take action a in state s and then act optimally forever after, how much total reward could I get?”

Over many trials, Q(s,a) updates to approach Q*(s,a), helping the robot learn the best action in each situation.


Figure 1: Robot navigating a maze, showing states, actions, and rewards — the intuition behind Q(s,a) and Q*(s,a).

1b. Q-Learning Agent Algorithm

function Q-LEARNING-AGENT(percept) returns an action
inputs: percept, a percept indicating the current state s' and reward signal r'
persistent: Q, a table of action values indexed by state and action, initially zero
            Nsa, a table of frequencies for state–action pairs, initially zero
            s, a, r, the previous state, action, and reward, initially null

if TERMINAL?(s) then Q[s,None] ← r'
if s is not null then
    increment Nsa[s,a]
    Q[s,a] ← Q[s,a] + α(Nsa[s,a]) * (r + γ max_a' Q[s',a'] − Q[s,a])

s, a, r ← s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'
return a

Figure 2: An exploratory Q-learning agent. It learns Q(s,a) values and selects actions using an exploration function, without needing a full model of state transitions. (Source: Russell & Norvig, 2010, Artificial Intelligence: A Modern Approach, 3rd Edition)
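As a concrete, runnable counterpart to the pseudocode, here is a minimal Python sketch of tabular Q-learning on a hypothetical 4-state corridor (states 0–3, reaching state 3 pays +10 and ends the episode). It substitutes a simple ε-greedy rule for the book's exploration function f, and all parameter values are illustrative:

```python
import random
from collections import defaultdict

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor: states 0..3, actions -1 (left)
    and +1 (right); entering state 3 gives reward +10 and ends the episode.
    Epsilon-greedy exploration stands in for AIMA's exploration function f."""
    rng = random.Random(seed)
    actions = (-1, +1)
    Q = defaultdict(float)                      # Q[(state, action)], initially zero
    for _ in range(episodes):
        s = 0
        while s != 3:                           # state 3 is terminal
            if rng.random() < epsilon:
                a = rng.choice(actions)         # explore
            else:
                a = max(actions, key=lambda act: Q[(s, act)])  # exploit
            s2 = min(max(s + a, 0), 3)          # deterministic transition
            r = 10.0 if s2 == 3 else 0.0
            # Bellman-based target: reward now + discounted best future estimate
            future = 0.0 if s2 == 3 else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
            s = s2
    return Q
```

After training, the learned values approach the optimal ones: moving right from state 2 is worth about 10, from state 1 about γ·10 = 9, and from state 0 about γ²·10 = 8.1.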

2. The Bellman Equation — The Heart of Q‑Learning

The Bellman equation is a recursive rule stating:

Value of an action = Immediate reward + discounted estimate of future rewards

Formally, the optimal Q-value is:

Q*(s,a) = R(s,a) + γ * max_a' Q*(s',a')

Where:

  • s = current state
  • a = action taken
  • s' = next state
  • R(s,a) = immediate reward
  • γ = discount factor (0–1, how much future rewards matter)
  • a' = best action in the next state

Intuition: Q*(s,a) captures both the reward now and the best possible future rewards. Q(s,a) tries to estimate it using the update rule:

Q(s,a) ← Q(s,a) + α * [ r + γ * max_a' Q(s',a') - Q(s,a) ]
  • α = learning rate
  • r = reward received this step
  • r + γ * max_a' Q(s',a') = the target value, an estimate of Q*(s,a)
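Plugging numbers in makes one step of the update concrete. Suppose (hypothetically) the current estimate is Q(s,a) = 2.0, the agent receives r = 5, the best next-state estimate max_a' Q(s',a') is 4.0, γ = 0.9, and α = 0.1:

```python
alpha, gamma = 0.1, 0.9            # learning rate, discount factor
q_sa = 2.0                         # current estimate Q(s,a)
r = 5.0                            # reward received this step
max_q_next = 4.0                   # max_a' Q(s',a')

target = r + gamma * max_q_next    # 5 + 0.9 * 4 = 8.6
q_sa = q_sa + alpha * (target - q_sa)   # 2 + 0.1 * (8.6 - 2) ≈ 2.66
```

The estimate moves a fraction α of the way from 2.0 toward the target 8.6, landing at about 2.66; repeated over many visits, Q(s,a) converges toward Q*(s,a).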

3. Limitations of Q-Learning in Large Environments

Q-Learning works fine in small mazes or grids, but struggles with large or complex state spaces (like thousands of vehicle sensors, images, or dynamic V2X traffic scenarios). The Q-table needs one entry per state–action pair, so its size explodes with the state space and becomes impractical → neural networks come to the rescue.

4. What Is a Deep Q-Network (DQN)?

A DQN replaces the Q-table with a neural network that approximates the Q-function:

  • Input: current state (sensor readings, images, traffic info)
  • Output: predicted Q-values for all possible actions
  • Action Selection: pick the action with the highest predicted Q-value (or explore randomly)
  • Experience Replay: store transitions (s,a,r,s') and train on random samples for stable learning
  • Target Network: a lagged copy of the main network that computes stable targets from the Bellman equation

Target formula (based on Bellman):

target = r + γ * max_a' Q_target(s',a')

The network updates to minimize the difference between predicted Q and this target, allowing learning in huge, complex environments.
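The pieces above can be sketched end-to-end. The toy below (all names and numbers hypothetical) uses a linear Q-function in place of a deep network so it stays self-contained, but keeps the three DQN ingredients: the Bellman target, an experience-replay buffer, and a periodically synced target network, again on a 4-state corridor where reaching state 3 pays +10:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
gamma, lr = 0.9, 0.05

# "Network": linear Q-function, Q(s, a) = W[a] @ onehot(s); stands in for a deep net.
W = np.zeros((n_actions, n_states))
W_target = W.copy()                          # target network: a lagged copy of W

def onehot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

# Experience replay: transitions collected with a random behaviour policy.
buffer = []
for _ in range(2000):
    s = rng.integers(0, 3)                   # non-terminal states 0..2
    a = rng.integers(0, n_actions)           # 0 = left, 1 = right
    s2 = min(max(s + (1 if a == 1 else -1), 0), 3)
    r, done = (10.0, True) if s2 == 3 else (0.0, False)
    buffer.append((s, a, r, s2, done))

for step in range(4000):
    s, a, r, s2, done = buffer[rng.integers(len(buffer))]   # sample a replayed transition
    # Bellman target computed with the *target* network for stability
    target = r if done else r + gamma * (W_target @ onehot(s2)).max()
    td_error = target - W[a] @ onehot(s)
    W[a] += lr * td_error * onehot(s)        # gradient step on the squared TD error
    if step % 200 == 0:
        W_target = W.copy()                  # periodically sync the target network
```

Despite learning from randomly collected, replayed transitions, the off-policy Bellman target drives the estimates toward the optimal values (about 10, 9, and 8.1 for moving right from states 2, 1, and 0).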

5. Benefits of Q-Learning / DQN in NR-V2X

  • Dynamic Resource Allocation – Select channels/time slots for vehicles to reduce collisions and interference.
  • Adaptive Power Control – Adjust transmit power based on vehicle distance and traffic to save energy and improve links.
  • Intelligent Handover – Predict the best time and base station for handover to minimize dropped connections.
  • Latency & Throughput Optimization – Balance URLLC (safety) and high-throughput infotainment dynamically.
  • Robustness to Dynamic Environments – Adapt to changing traffic, interference, and mobility patterns.
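To make the first benefit concrete, here is a deliberately simplified sketch of learned resource selection (every number and name is hypothetical): a vehicle repeatedly picks one of three sidelink sub-channels, and a transmission succeeds (reward 1) with a fixed per-channel probability that stands in for real collision and interference dynamics. Because there is effectively a single state, γ = 0 and the Q-update reduces to a running average of rewards:

```python
import random

def learn_channel(success_prob=(0.3, 0.9, 0.5), episodes=3000,
                  alpha=0.1, epsilon=0.1, seed=1):
    """Toy channel-selection sketch: one Q-value per sub-channel, updated
    with the single-state (gamma = 0) Q-learning rule under epsilon-greedy
    exploration. success_prob is a stand-in for channel quality."""
    rng = random.Random(seed)
    Q = [0.0] * len(success_prob)
    for _ in range(episodes):
        if rng.random() < epsilon:
            a = rng.randrange(len(Q))                       # explore
        else:
            a = max(range(len(Q)), key=Q.__getitem__)       # exploit
        r = 1.0 if rng.random() < success_prob[a] else 0.0  # did the TX succeed?
        Q[a] += alpha * (r - Q[a])                          # gamma = 0 update
    return Q
```

After a few thousand transmissions the agent's Q-values rank the channels by success rate, so it settles on the least congested sub-channel without ever being told the probabilities.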

6. Student-Friendly Intuition

  • Q-Learning: memorizing scores for every move in every state.
  • Bellman Equation: updates scores by combining immediate reward and expected future reward (Q*(s,a)).
  • DQN: teaches a neural network to generalize “scoring moves” for huge environments.

Think of it as turning a giant lookup table into a smart brain that predicts the best actions anywhere.
