Q-Learning, Deep Q-Networks (DQN), and Their Role in NR-V2X
1. What Is Q‑Learning? (Simple)
Imagine a robot navigating a maze. At each position (state), it can take an action (move up/down/left/right). Some actions give rewards (like +10 for reaching the goal), others give penalties. The robot doesn’t know the best path at first — it must learn by trying actions and observing rewards.
Q-Learning helps the robot learn the value of taking each action in each state, stored in a Q-table:
- Q(s,a) – the robot’s current estimate: “If I’m in state s and take action a, how good is it long-term?”
- Q*(s,a) – the optimal total reward: “If I take action a in state s and then act optimally forever after, how much total reward could I get?”
Over many trials, Q(s,a) updates to approach Q*(s,a), helping the robot learn the best action in each situation.
Figure 1: Robot navigating a maze, showing states, actions, and rewards — the intuition behind Q(s,a) and Q*(s,a).
1b. Q-Learning Agent Algorithm
function Q-LEARNING-AGENT(percept) returns an action
    inputs: percept, a percept indicating the current state s' and reward signal r'
    persistent: Q, a table of action values indexed by state and action, initially zero
                Nsa, a table of frequencies for state–action pairs, initially zero
                s, a, r, the previous state, action, and reward, initially null
    if TERMINAL?(s) then Q[s, None] ← r'
    if s is not null then
        increment Nsa[s,a]
        Q[s,a] ← Q[s,a] + α(Nsa[s,a]) * (r + γ max_a' Q[s',a'] − Q[s,a])
    s, a, r ← s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'
    return a
Figure 2: An exploratory Q-learning agent. It learns Q(s,a) values and selects actions using an exploration function, without needing a full model of state transitions. (Source: Russell & Norvig, 2010, Artificial Intelligence: A Modern Approach, 3rd Edition)
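The agent above can be sketched in Python. This is a minimal tabular version on a made-up 1-D corridor environment, with a simple ε-greedy rule standing in for the exploration function f; the environment, constants, and hyperparameters are illustrative assumptions, not from the source:

```python
import random
from collections import defaultdict

# Toy environment (assumed for illustration): a corridor of 5 cells.
# Reaching cell 4 gives +10; every other step costs -1.
N_STATES, GOAL, ACTIONS = 5, 4, (-1, +1)   # actions: move left / move right

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)  # walls clamp the position
    reward = 10.0 if s2 == GOAL else -1.0
    return s2, reward, s2 == GOAL

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                 # Q[(state, action)], initially zero
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: usually exploit, occasionally explore
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            # Q-learning update toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s2, act)] for act in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = train()
# After training, the greedy action in every non-goal state is "move right" (+1).
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)  # → {0: 1, 1: 1, 2: 1, 3: 1}
```

The `defaultdict` plays the role of the Q-table: unseen state–action pairs start at zero, just as in the pseudocode.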
2. The Bellman Equation — The Heart of Q‑Learning
The Bellman equation is a recursive rule stating:
Value of an action = Immediate reward + discounted estimate of future rewards
Formally, the optimal Q-value is:
Q*(s,a) = R(s,a) + γ * max_a' Q*(s',a')
Where:
- s = current state
- a = action taken
- s' = next state
- R(s,a) = immediate reward
- γ = discount factor (0–1, how much future rewards matter)
- a' = best action in the next state
Intuition: Q*(s,a) captures both the reward now and the best possible future rewards. Q(s,a) tries to estimate it using the update rule:
Q(s,a) ← Q(s,a) + α * [ r + γ * max_a' Q(s',a') - Q(s,a) ]
- α = learning rate
- r = reward received this step
- r + γ * max_a' Q(s',a') = the target value, an estimate of Q*(s,a)
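One application of the update rule with made-up numbers shows how the estimate moves a fraction α of the way toward the target:

```python
# Single Q-learning update step; all values here are illustrative, not from the source.
alpha, gamma = 0.1, 0.9
q_sa = 2.0            # current estimate Q(s,a)
r = 1.0               # reward received this step
max_q_next = 5.0      # max_a' Q(s',a'), the best estimate for the next state

target = r + gamma * max_q_next      # 1.0 + 0.9 * 5.0 = 5.5
q_sa += alpha * (target - q_sa)      # 2.0 + 0.1 * (5.5 - 2.0) = 2.35
print(round(q_sa, 4))                # → 2.35
```

With α = 0.1 the estimate takes only a small step toward the target, which keeps learning stable when rewards are noisy.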
3. Limitations of Q-Learning in Large Environments
Q-Learning works fine in small mazes or grids, but it struggles with large or complex state spaces (such as thousands of vehicle sensor readings, camera images, or dynamic V2X traffic scenarios). The Q-table becomes impractically large to store and to fill with experience — so neural networks come to the rescue.
4. What Is a Deep Q-Network (DQN)?
A DQN replaces the Q-table with a neural network that approximates the Q-function:
- Input: current state (sensor readings, images, traffic info)
- Output: predicted Q-values for all possible actions
- Action Selection: pick action with highest predicted Q-value (or explore randomly)
- Experience Replay: stores transitions (s,a,r,s') for stable learning
- Target Network: computes stable targets from Bellman equation
Target formula (based on Bellman):
target = r + γ * max_a' Q_target(s',a')
The network updates to minimize the difference between predicted Q and this target, allowing learning in huge, complex environments.
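The target computation, replay buffer, and periodic target-network sync can be sketched as follows. To stay self-contained, this uses a tiny linear model in NumPy as a stand-in for a deep network, and random transitions in place of real environment interaction; dimensions, learning rate, and sync interval are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, GAMMA = 4, 3, 0.9

# A tiny linear "Q-network": Q(s) = s @ W (stand-in for a deep net).
W = rng.normal(size=(STATE_DIM, N_ACTIONS)) * 0.1
W_target = W.copy()          # target network starts as a frozen copy

def q_values(s, weights):
    return s @ weights       # one Q-value per action

def dqn_update(batch, lr=0.01):
    """One gradient step of the DQN loss on a sampled minibatch."""
    global W
    for s, a, r, s2, done in batch:
        # Bellman target, computed with the frozen target network
        target = r if done else r + GAMMA * np.max(q_values(s2, W_target))
        pred = q_values(s, W)[a]
        # gradient of 0.5*(pred - target)^2 w.r.t. W[:, a] is (pred - target)*s
        W[:, a] -= lr * (pred - target) * s

# Experience replay buffer of (s, a, r, s', done); random placeholders here
# stand in for transitions collected by driving the agent in an environment.
replay = []
for _ in range(100):
    s, s2 = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
    replay.append((s, int(rng.integers(N_ACTIONS)), float(rng.normal()), s2, False))

for step_i in range(200):
    idx = rng.integers(len(replay), size=32)       # sample a random minibatch
    dqn_update([replay[i] for i in idx])
    if step_i % 50 == 0:
        W_target = W.copy()                        # periodically sync target network
```

The two stabilizing tricks from the list above are both visible: updates are computed on randomly sampled past transitions (replay), and the target uses frozen weights that are refreshed only occasionally (target network).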
5. Benefits of Q-Learning / DQN in NR-V2X
- Dynamic Resource Allocation – Select channels/time slots for vehicles to reduce collisions and interference.
- Adaptive Power Control – Adjust transmit power based on vehicle distance and traffic to save energy and improve links.
- Intelligent Handover – Predict the best time and base station for handover to minimize dropped connections.
- Latency & Throughput Optimization – Balance URLLC (safety) and high-throughput infotainment dynamically.
- Robustness to Dynamic Environments – Adapt to changing traffic, interference, and mobility patterns.
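As a toy illustration of the first benefit, a single-state ("bandit-style") Q-learning loop can learn which sidelink channel to transmit on. The channel success probabilities and reward scheme below are entirely made up; real NR-V2X Mode 2 resource selection involves sensing and reservation procedures far richer than this:

```python
import random

# Illustrative only: a vehicle learns which of 3 channels gives the fewest collisions.
random.seed(1)
SUCCESS_PROB = [0.2, 0.9, 0.5]   # hypothetical per-channel delivery probabilities
Q = [0.0, 0.0, 0.0]              # one Q-value per channel (single-state case)
alpha, epsilon = 0.1, 0.1

for t in range(2000):
    # epsilon-greedy channel selection
    if random.random() < epsilon:
        ch = random.randrange(3)
    else:
        ch = max(range(3), key=Q.__getitem__)
    # reward: +1 for a delivered packet, -1 for a collision/loss
    r = 1.0 if random.random() < SUCCESS_PROB[ch] else -1.0
    Q[ch] += alpha * (r - Q[ch])  # no next state, so the target is just the reward

best = max(range(3), key=Q.__getitem__)
print(best)  # the most reliable channel (channel 1 under these assumptions)
```

With no successor state the Bellman target collapses to the immediate reward, which is why this degenerate case is often treated as a multi-armed bandit rather than full RL.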
6. Student-Friendly Intuition
- Q-Learning: memorizing scores for every move in every state.
- Bellman Equation: updates scores by combining immediate reward and expected future reward (Q*(s,a)).
- DQN: teaches a neural network to generalize “scoring moves” for huge environments.
Think of it as turning a giant lookup table into a smart brain that predicts the best actions anywhere.
References
- Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach (3rd Edition). Pearson.
- YouTube Video explaining Bellman equation & DQN: https://www.youtube.com/watch?v=kEGAMppyWkQ&t=296s