Project Q Star (Q*) has recently been in the spotlight. Led by OpenAI, it could be a significant leap towards **Artificial General Intelligence (AGI)**, a.k.a. achieving human-level intelligence. Recent speculation about Sam Altman's firing has raised questions about the potential dangers of artificial intelligence across several industries, including job displacement. It is therefore necessary to understand the complexities involved in the development of AGI. In this blog, we will cover in detail the foundations of Project Q*, the Q-learning algorithm, and why OpenAI adopted it to pursue AGI. Hang tight!

## Short Introduction: Reinforcement Learning

A reinforcement learning system has four main subelements: a policy (π), a reward signal (r), a value function (v), and, optionally, a model of the environment. A policy is a mapping from perceived states of the environment to actions to be taken in those states. A reward signal defines the goal of a reinforcement learning problem: the agent's sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines the good and bad events for the agent.

*s*: a state

*a*: an action

*S*: the set of all nonterminal states

*S⁺*: the set of all states, including the terminal state

*A(s)*: the set of actions possible in state *s*

*R*: the set of possible rewards

There are two foundational pathways through which the optimal policy *(π\*)* can be obtained: On-Policy Learning (within the Markov Decision Process, MDP, framework) and Off-Policy Learning (Q-Learning).

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action, denoted *π_t*. The agent's goal, roughly speaking, is to maximize the total reward it receives over the long run.

## On-Policy Learning: Markov Decision Process (Finite MDP)

If the environment's response at time *t + 1* depends only on the state and action representations at time *t*, the environment's dynamics can be defined by specifying only

p(s′, r | s, a) = Pr{R_{t+1} = r, S_{t+1} = s′ | S_t = s, A_t = a}

for all *r*, *s′*, *s*, and *a*. One can show that, by iterating this equation, one can predict all future states and expected rewards from knowledge of the current state alone, as well as would be possible given the complete history up to the current time, including the expected rewards for state–action pairs.
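To make the dynamics function concrete, here is a minimal sketch in Python; the two-state environment, its action names, and its probabilities are made-up assumptions for illustration:

```python
# A minimal sketch of finite-MDP dynamics p(s', r | s, a) as a lookup table.
# The two-state environment and its probabilities are illustrative only.

# p[(s, a)] maps each (next_state, reward) outcome to its probability.
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.0): 1.0},
    ("s1", "go"):   {("s0", -1.0): 1.0},
}

def expected_reward(s, a):
    """r(s, a) = sum over (s', r) of p(s', r | s, a) * r."""
    return sum(prob * r for (s2, r), prob in p[(s, a)].items())

print(expected_reward("s0", "go"))  # → 0.8
```

Everything the agent can be told about the environment lives in this one table, which is exactly what the equation above expresses.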

## Value Functions

Value functions are functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or to take a given action in a given state).

Similarly, we define the value of taking action *a* in state *s* under a policy *π*, denoted *qπ(s, a)*, as the expected return starting from *s*, taking the action *a*, and thereafter following policy *π*.

Recall that *π* is a mapping from each state, *s ∈ S*, and action, *a ∈ A(s)*, to the probability *π(a|s)* of taking action *a* when in state *s*. Informally, the value of a state *s* under a policy *π*, denoted *Vπ(s)*, is the expected return when starting in *s* and following *π* thereafter.
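In the standard notation of Sutton and Barto, these two value functions can be written as:

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]

q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s,\, A_t = a \right]
```

where *G_t* is the return (the discounted sum of future rewards) and *γ* is the discount factor.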

## Optimal Value Functions (V*) using Bellman’s Equation

A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy particular recursive relationships. For any policy π and any state s, the following consistency condition holds between the value of *s* and the values of its possible successor states: the Bellman equation for *Vπ*. It expresses a relationship between a state's value and the values of its successor states. Think of looking ahead from one state to its possible successor states.
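Written out in the standard form, the Bellman equation for *Vπ* is:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]
```

It averages over all actions the policy might take, and over all successor states and rewards the environment might produce, weighting each outcome by its probability.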

**Optional: Generalized policy iteration**

Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. Value (*V*) and policy (*π*) functions interact until they are optimal and thus consistent with each other.

## Off-Policy Learning: Q-Learning Algorithm

One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989).

In this case, the learned action-value function, Q, directly approximates q∗, the optimal action-value function, independent of the policy being followed. The policy still has an effect in that it determines which state-action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated.

## Steps involved in determining the optimal action value:

**Step One: Initializing the Q-table –** Create a table or matrix to store the Q-values for each state-action pair. Set all Q-values to initial values (e.g., 0 or small random values). Then choose an initial state: start the agent in an initial state within the environment.

**Step Two: Select an action –** Use an action selection strategy (e.g., epsilon-greedy) to choose an action for the current state. With probability epsilon, select a random action (exploration). With probability 1 − epsilon, select the action with the highest Q-value for the current state (exploitation).
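As a minimal sketch, an epsilon-greedy selector over a dictionary-based Q-table might look like the following; the table layout and the state/action names are assumptions for illustration:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
print(epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0))  # → right
```

With `epsilon=0.0` the selector is purely greedy; raising epsilon trades exploitation for exploration.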

**Step Three: Perform the selected action –** Execute the chosen action in the environment. Observe the next state and the reward received.

**Step Four: Update the Q-value –** Calculate the target Q-value using the Bellman equation:

Q(s, a) = Q(s, a) + α * [R(s, a) + γ * max(Q(s', a')) − Q(s, a)]

*Q(s, a)* is the current Q-value for the state-action pair. *α* is the learning rate, which determines the extent to which new information overrides old information. *R(s, a)* is the reward received for taking action *a* in state *s*. *γ* is the discount factor, which determines the importance of future rewards. *max(Q(s', a'))* is the maximum Q-value for the next state *s'* across all possible actions *a'*.

**Step Five: Transition to the next state –** Move the agent to the next state based on the action taken and the environment dynamics. Continue the process of selecting actions, performing actions, updating Q-values, and transitioning to new states. Repeat these steps for a specified number of episodes or until a termination condition is met.

**Step Six: Exploration and exploitation –** During the learning process, balance exploration and exploitation using the epsilon-greedy strategy. Gradually decrease the value of epsilon over time to shift from exploration to exploitation.

**Step Seven: Evaluate the learned policy –** After the learning process, evaluate the performance of the learned policy by running the agent in the environment without updating the Q-values. Measure the cumulative rewards or other relevant metrics to assess the quality of the learned policy.

**Final Step: Fine-tune and optimize –** Experiment with different hyperparameters, such as the learning rate, discount factor, and exploration rate, to improve the learning process and the quality of the learned policy. Consider techniques like experience replay, target networks, or double Q-learning to stabilize and enhance the learning process.
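The steps above can be sketched end-to-end in Python. The toy corridor environment, reward structure, and hyperparameters below are illustrative assumptions, not part of any OpenAI system:

```python
import random

random.seed(0)  # deterministic run for this sketch

# Toy environment: a 1-D corridor of 5 states (0..4). The agent starts at 0;
# reaching state 4 ends the episode with reward +1.
N_STATES = 5
ACTIONS = (1, -1)                     # move right or left
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def step(s, a):
    """Environment dynamics: clamp to the corridor; reward 1 at the goal."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

# Step One: initialize the Q-table to zeros.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s, done = 0, False
    while not done:
        # Step Two: epsilon-greedy action selection.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        # Step Three: perform the action, observe next state and reward.
        s2, r, done = step(s, a)
        # Step Four: Q-update, Q(s,a) += α [r + γ max_a' Q(s',a') - Q(s,a)].
        best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        # Step Five: transition to the next state.
        s = s2

# Step Seven: the learned greedy policy should move right (+1) everywhere.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)  # → {0: 1, 1: 1, 2: 1, 3: 1}
```

Note how the update uses `max` over the next state's Q-values regardless of which action the behavior policy actually takes next; that is precisely what makes Q-learning off-policy.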

Cheers! 🙂