An Introduction to Deep Reinforcement Learning

Deep reinforcement learning is a promising combination of two artificial intelligence techniques: reinforcement learning, which uses sequential trial and error to learn the best action to take in every situation, and deep learning, which can evaluate complex inputs and select the best response.

There are frameworks and tools available for deep reinforcement learning, but while they are very successful in closed environments like video games, using them to learn and react to real-world situations is more challenging. We’ll explain the mechanics of reinforcement learning and deep reinforcement learning, and cover some real business problems it can solve.

What Is Reinforcement Learning?

Reinforcement learning is a goal-oriented algorithm that learns by trial and error. It is different from both supervised and unsupervised machine learning. While supervised learning can predict labels for complex inputs, and unsupervised learning can group together related items, reinforcement learning predicts the action that will yield the best result.

The “reinforcement” part of reinforcement learning means that algorithms are rewarded or punished for the actions they take. The algorithm attempts to maximize a function that evaluates the immediate and future rewards of taking one of several possible actions. Rewards are “discounted” as they extend into the future, so a reward received sooner counts for more than the same reward received later. This encourages the algorithm to favor actions that pay off quickly over those that only pay off in the long term.
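To make discounting concrete, here is a minimal sketch of how a sequence of rewards can be collapsed into a single discounted total. The function name and the discount factor of 0.9 are assumptions chosen for illustration, not values from any particular library.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# The same reward of 1.0, received immediately vs. three steps later.
soon = discounted_return([1.0, 0.0, 0.0, 0.0])
late = discounted_return([0.0, 0.0, 0.0, 1.0])
print(soon, late)
```

The immediate reward keeps its full value, while the delayed one is scaled by 0.9 three times over, so the earlier payoff scores higher.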

Reinforcement learning is a very general framework that can, in principle, be applied to a wide range of problems. Because of its generality and dynamic nature, it typically requires a simulation of a real environment in which to train and learn. It is also less well understood than other machine learning techniques, and it is only starting to be used in industry applications.

Deep Learning vs Reinforcement Learning

Deep learning analyzes a training set, identifies complex patterns, and applies them to new data. A classic application is computer vision, where Convolutional Neural Networks (CNNs) break down an image into features and analyze them to accurately classify the image. Reinforcement learning works sequentially in an unknown environment: taking an action, evaluating the rewards, and adjusting the following actions accordingly.
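The feature extraction that a CNN performs on an image can be illustrated with a small pure-Python sketch of one convolution followed by max pooling. The function names are hypothetical and the kernel is hand-picked for illustration; a real CNN learns its kernels from data and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core CNN operation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

# A 5x5 image that is dark (0) on the left and bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [[-1, 1], [-1, 1]]

features = convolve2d(image, edge_kernel)   # 4x4 feature map
pooled = max_pool2d(features)               # 2x2 summary
print(features[0], pooled)
```

The feature map is nonzero only where the edge sits, and pooling keeps that signal while shrinking the map.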

Deep learning and reinforcement learning complement each other:

Reinforcement learning algorithms manage the sequential process of taking an action, evaluating the result, and selecting the next best action. However, they need a good mechanism to select the best action based on previous interactions.
Deep learning can be that mechanism: it is one of the most powerful methods available today for learning the best outcome from previous data.

Deep Reinforcement Learning (DRL) is a technology that combines the two, creating a sequential reinforcement learning process, in which deep learning determines the action taken at every stage.

Reinforcement Learning Basic Concepts

The reinforcement learning framework provides a formal structure that defines how an agent decides which actions to take, and how it learns from its environment.
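The interaction between an agent and its environment can be sketched as a simple loop. The toy Corridor environment, its reward values, and the two policies below are all hypothetical, invented only to show the state, action, and reward cycle; no learning happens in this sketch.

```python
import random

class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the goal at cell 4."""
    GOAL = 4

    def reset(self):
        self.position = 0
        return self.position                      # initial state

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position.
        self.position = min(max(self.position + action, 0), self.GOAL)
        done = self.position == self.GOAL
        reward = 1.0 if done else -0.1            # small cost for every other step
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                    # the agent decides
        state, reward, done = env.step(action)    # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward

random.seed(0)
random_policy = lambda state: random.choice([-1, 1])
always_right = lambda state: 1

random_total = run_episode(Corridor(), random_policy)
direct_total = run_episode(Corridor(), always_right)
print(random_total, direct_total)
```

A learning agent would use the rewards it collects in this loop to shift from something like the random policy toward something like the direct one.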

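The Q-value, written Q(s, a), estimates the total discounted reward the agent can expect after taking action a in state s. The following equation shows how Q is evaluated in a reinforcement learning model: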
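A standard form of this equation, assuming the usual Q-learning formulation, is the Bellman optimality equation, shown below together with the update rule that Q-learning derives from it. The symbols are assumptions introduced for illustration: s and a are the current state and action, r is the immediate reward, s' is the next state, a' ranges over the actions available in s', \gamma is the discount factor, and \alpha is the learning rate.

```latex
% Bellman optimality equation for the action-value function
Q(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q(s', a') \;\middle|\; s, a \,\right]

% Q-learning update toward that target after observing (s, a, r, s')
Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```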

What Is Deep Reinforcement Learning: Value-Based and Policy-Based Learning?

In deep reinforcement learning, the state is often represented by an image. This could be, for example:

One frame in a video game, where the elements on the screen represent the state.
The current scene viewed by a robot.

Based on these images, which provide information about the agent’s context, the agent must select an action. In the video game, this would be moving up, down, left, right, etc. A robot can select where to extend its hand or where to move next.

The Deep Reinforcement Learning Process: Value-Based Method

Algorithms such as Deep Q-Network (DQN) use Convolutional Neural Networks (CNNs) to help the agent select the best action.

While these algorithms are very complex, these are typically the basic steps:

Take the image representing the state, convert it to grayscale, and crop unnecessary parts.
Run the image through a series of convolutions and pooling to extract the essential features that can help the agent make the decision.
Calculate the Q-value of each possible action.
Perform backpropagation to make the predicted Q-values more accurate.
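The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the DQN algorithm itself: the helper names are hypothetical, the frame and Q-values are made-up numbers standing in for a real screen and a real CNN's output, and components such as experience replay and the target network are omitted.

```python
def to_grayscale(rgb_image):
    """Average the R, G, B channels of each pixel (a simple grayscale rule)."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in rgb_image]

def crop(image, top, bottom, left, right):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in image[top:bottom]]

def td_target(reward, next_q_values, done, gamma=0.99):
    """Target for Q(s, a): r + gamma * max Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(q_values, action, target):
    """Loss that backpropagation would reduce for the chosen action."""
    return (target - q_values[action]) ** 2

# Step 1: preprocess a tiny 2x3 RGB "frame" (pixel values are made up).
frame = [
    [(30, 60, 90), (0, 0, 0), (255, 255, 255)],
    [(10, 20, 30), (90, 90, 90), (3, 6, 9)],
]
state = crop(to_grayscale(frame), top=0, bottom=2, left=1, right=3)

# Steps 2-3 would pass `state` through a CNN. Here the network's outputs are
# stand-in numbers, one Q-value per action (up, down, left, right).
q_values = [0.2, 1.5, -0.3, 0.9]
next_q_values = [0.1, 0.4, 2.0, 0.0]

# Pick the action with the highest predicted Q-value.
action = max(range(len(q_values)), key=lambda a: q_values[a])

# Step 4: compare the prediction with the target to get a training loss.
target = td_target(reward=1.0, next_q_values=next_q_values, done=False)
loss = squared_td_error(q_values, action, target)
print(state, action, target, loss)
```

In a real implementation, the gradient of this loss with respect to the network weights is what backpropagation computes.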

The Deep Reinforcement Learning Process: Policy-Based Method

In the real world, the number of possible actions can be very high or unknown. For example, a robot learning to walk on open terrain could have millions of possible actions within the space of a few minutes. In these environments, calculating Q-values for each action is not feasible.

Policy-based methods learn the policy function directly, without calculating a value function for each action. An example of a policy-based algorithm is Policy Gradient.

Policy Gradient, simplified, works as follows:

1. Takes in a state and gets the probability of each action based on previous experience

2. Samples an action according to those probabilities, so likelier actions are chosen more often

3. Repeats until the end of the game and evaluates the total rewards

4. Updates the parameters in the network, based on the rewards, using backpropagation

This way, the network allows the agent to play freely, but with every successive game, it provides better probabilities for actions that will lead the agent to a positive result.
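A minimal sketch of this idea is the REINFORCE update on a hypothetical one-step game with two actions. It assumes a softmax policy whose only parameters are one score (logit) per action, so the gradient can be written by hand instead of using backpropagation through a network. Note that the sketch samples an action in proportion to its probability, as REINFORCE does, rather than always taking the single most likely one; this is what lets the agent explore.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary scores into action probabilities that sum to 1."""
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, learning_rate=0.1):
    """Play one game, then nudge the policy toward rewarded actions."""
    probs = softmax(logits)
    # Sample an action in proportion to its probability.
    action = random.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # For a softmax policy: d log pi(action) / d logit_k = 1[k == action] - probs[k].
    return [
        logit + learning_rate * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]

# Hypothetical game: action 1 wins (reward 1), action 0 earns nothing.
reward_fn = lambda action: 1.0 if action == 1 else 0.0

random.seed(0)
logits = [0.0, 0.0]                         # start with a 50/50 policy
for _ in range(500):
    logits = reinforce_step(logits, reward_fn)

final_probs = softmax(logits)
print(final_probs)
```

After training, the probability of the winning action has moved above its 50% starting point, mirroring how the network gradually favors actions that lead to a positive result.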

Deep Reinforcement Learning Applications

Deep reinforcement learning has been very successful in closed environments like video games, but it is difficult to apply to real-world environments. Reinforcement learning is data-inefficient and may require millions of iterations to learn simple tasks. There are also major gaps between simulated and real environments, which make it difficult to train models in simulation and have them perform well in the real world. Some organizations opt for a deep learning platform to help them implement their DRL projects.

Here are a few examples of attempts to use DRL technology to solve business challenges:

Robotics

Google, in collaboration with UC Berkeley, published the Soft Actor-Critic algorithm, which helps robots use reinforcement learning to learn real-world tasks without requiring a large number of attempts, while safeguarding the robot from taking actions that could cause damage. The algorithm was successful in training an insect-like robot to walk, and in training a robot hand to carry out simple tasks, in a matter of hours.

Healthcare Applications

Reinforcement learning can be applied to historical medical data to see which treatments produced the best results, and to help predict the best treatment for current patients. For example, deep reinforcement learning has been used to predict drug doses for sepsis patients, to find optimal dose cycles for chemotherapy, and to select dynamic treatment regimes combining hundreds of possible medications based on medical registry data.

Chemistry

Deep reinforcement learning has been used to optimize chemical reactions. A reinforcement learning agent optimized a sequential chemical reaction, predicting at every stage of the experiment which action would generate the most desirable result. DRL outperformed a state-of-the-art algorithm used to conduct the same experiment.