The RL Process

[Figure: The Reinforcement Learning Process - FreeCodeCamp]


- Overview

In reinforcement learning (RL), developers devise a method of rewarding desired behaviors and punishing undesired ones. This method assigns positive values to desired actions to encourage the agent to take them, and negative values to undesired actions to discourage them. The agent is thereby driven to seek the maximum long-term overall reward and so arrive at an optimal solution.


- Reinforcement Learning Algorithms

Reinforcement learning (RL) can be implemented in three ways.

  • Value-based: The goal of value-based methods is to optimize the value function V(s), which estimates the long-run reward the agent can expect from state s when following its current policy.
  • Policy-based: Here you try to find a policy, i.e. a rule for which action to take in each state, that yields the greatest future reward.
  • Model-based: Here you build a virtual model of each environment, and the agent learns how to perform in that particular setting.
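
As a rough illustration of the first two approaches, the sketch below contrasts how an agent might pick an action in each case. The states, actions, and numbers are hypothetical and not taken from any real task. For convenience the value-based part stores one value per (state, action) pair rather than V(s) alone, which is an assumption of this sketch.

```python
import random

# Hypothetical toy problem: two states, two actions (not from the source).
ACTIONS = ["left", "right"]

# Value-based: keep an estimated long-run value per (state, action)
# and act greedily with respect to those estimates.
q_table = {
    ("s0", "left"): 0.1,
    ("s0", "right"): 0.7,
    ("s1", "left"): 0.4,
    ("s1", "right"): 0.2,
}

def greedy_action(state):
    """Pick the action with the highest estimated value in this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

# Policy-based: store action probabilities directly and sample from them.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(state, rng):
    """Sample an action according to the policy's probabilities."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

In a real agent these tables would be learned from experience rather than written by hand; the sketch only shows what each approach stores and how it is used to choose an action.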


- The RL Process

RL models learn to make a sequence of decisions. The agent must learn to achieve a goal in an uncertain and potentially complex environment. In reinforcement learning, the AI is placed in a game-like situation and uses trial and error to find a solution to the problem.

The AI is rewarded or penalized for the actions it takes, which steers it toward what its programmers want it to do. Its aim is to maximize the total reward.

Although the designer sets the reward policy (i.e. the rules of the game), they give the model no hints or suggestions on how to solve the game.

It is up to the model to figure out how to complete the task so as to maximize reward, starting with completely random trials and progressing to complex strategies and superhuman skills. By harnessing the power of search and many trials, reinforcement learning is currently one of the most effective ways to hint at machine creativity.

Unlike humans, an AI can gather experience from thousands of games played in parallel if the RL algorithm runs on sufficiently powerful computing infrastructure.

As a working example, let's imagine an agent learning to play Super Mario Bros., a platform game developed and published by Nintendo for the Nintendo Entertainment System. The reinforcement learning (RL) process can be modeled as a loop, which works as follows:

  • Our agent receives state S0 from the environment (in our case, the first frame of the game is the state and Super Mario Bros. is the environment)
  • Based on this state S0, the agent takes action A0 (our agent will move to the right)
  • The environment transitions to a new state S1 (new frame)
  • The environment gives the agent some reward R1 (not dead: +1)
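
The loop above can be sketched in a few lines of Python. `ToyEnv` is a hypothetical stand-in for the game: a one-dimensional level in which moving right five times reaches the end. It is not a real Super Mario Bros. interface, and the always-move-right policy is only for illustration.

```python
class ToyEnv:
    """Hypothetical stand-in for the game: a 1-D level of fixed length."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new game and return the initial state S0."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        if action == "right":
            self.position += 1
        reward = 1  # mirrors "not dead: +1" from the list above
        done = self.position >= self.length
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the state -> action -> reward -> next state loop once."""
    state = env.reset()                              # agent receives S0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # agent takes action A_t
        next_state, reward, done = env.step(action)  # env returns S_t+1, R_t+1
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


# A policy that always moves right, as in the example above.
trajectory = run_episode(ToyEnv(), lambda state: "right")
```

Each entry in `trajectory` is one pass through the loop: a state, the action taken, the reward received, and the next state.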


- The Central Idea of the Reward Hypothesis

Why is the agent's goal to maximize the expected cumulative reward? 

This RL loop outputs a sequence of states, actions, rewards, and next states. The agent's goal is to maximize its expected cumulative reward, called the expected return. RL is based on the reward hypothesis, which says that all goals can be described as the maximization of expected cumulative reward. This is why, in reinforcement learning, optimal behavior means maximizing the expected cumulative reward.
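
In symbols, and using common RL textbook notation rather than anything defined in this page, the cumulative reward from time step t in a task that ends at a final step T can be written as follows. The symbols G_t, T, and \pi are assumptions introduced here for illustration.

```latex
% Return: the sum of rewards collected after time step t, up to the final step T
G_t = R_{t+1} + R_{t+2} + \dots + R_T = \sum_{k=t+1}^{T} R_k

% The agent looks for a policy \pi that maximizes the expected return
\max_{\pi} \; \mathbb{E}_{\pi}\left[ G_t \right]
```

With the loop described earlier, G_0 is simply R1 plus every later reward up to the end of the game.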


- Episodic or Continuing Tasks

A task is an instance of an RL problem. We can have two types of tasks: episodic and continuing.

  • Episodic Tasks: In this case we have a starting point and an ending point (a terminal state). This creates an episode: a list of states, actions, rewards, and new states. For example, think of Super Mario Bros.: an episode begins at the launch of a new Mario and ends when you get killed or reach the end of the level.
  • Continuing Tasks: These are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions while simultaneously interacting with the environment. An example is an agent that does automated stock trading. For this task, there is no starting point and no terminal state. The agent keeps running until we decide to stop it.
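
The difference between the two task types shows up in how the interaction loop ends. The following is a minimal sketch, assuming hypothetical toy dynamics in which the state is a counter and every step earns a reward of +1. None of the function names come from a real library.

```python
def run_episodic(step, is_terminal, start_state, policy):
    """Episodic task: run until a terminal state is reached."""
    state, total = start_state, 0
    while not is_terminal(state):
        state, reward = step(state, policy(state))
        total += reward
    return total


def run_continuing(step, start_state, policy, should_stop):
    """Continuing task: no terminal state, so we stop it from outside."""
    state, total, t = start_state, 0, 0
    while not should_stop(t):
        state, reward = step(state, policy(state))
        total += reward
        t += 1
    return total


# Hypothetical toy dynamics: the state is a counter, each step earns +1.
def step(state, action):
    return state + 1, 1


# Episode ends when the counter reaches 3 (the "end of the level").
episodic_total = run_episodic(step, lambda s: s >= 3, 0, lambda s: "right")

# The continuing task never ends by itself; here we stop it after 10 steps.
continuing_total = run_continuing(step, 0, lambda s: "hold", lambda t: t >= 10)
```

In the episodic case the environment decides when the run is over; in the continuing case that decision comes from whoever operates the agent.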



[More to come ...]
