# Currently, one of the most explored and favoured machine learning techniques by the biggest and brightest AI brains, Reinforcement Learning is a term known to practically everyone working in the field of AI.

Currently, one of the most explored and favoured machine learning techniques by the biggest and brightest AI brains, Reinforcement Learning is a term known to practically everyone working in the field of AI. The process of learning by reinforcement is by itself a strong sign of intelligence that we humans can easily relate to. We have already discussed reinforcement learning with a very popular algorithm called Thompson Sampling in one of our previous articles.

Meanwhile, feel free to check out our latest hackathon in Machinehack – Predict The Cost Of Used Cars – Hackathon By Imarticus. The hackathon is conducted in partnership with Imarticus Learning . Participate now and win exciting prizes.

In this article, we will explore another popular algorithm that implements reinforcement learning called the Upper Confidence Bound or UCB. What is UCB

Unlike Thompson sampling that we discussed in one of our previous articles which is a probabilistic algorithm meaning that the success rate distribution of the bandits was calculated based on the probability distribution. UCB is a deterministic algorithm meaning that there is no factor of uncertainty or probability.

We will use the same MultiArmed Bandit Problem to understand UCB. If you are not familiar with the Mult-Armed Bandit Problem(MABP), please go ahead and read through the article – The Intuition Behind Thompson Sampling Explained With Python Code .

UCB is a deterministic algorithm for Reinforcement Learning that focuses on exploration and exploitation based on a confidence boundary that the algorithm assigns to each machine on each round of exploration. (A round is when a player pulls the arm of a machine)

We will try to understand UCB as simple as possible. Consider there are 5 bandits or slot machines namely B1, B2, B3, B4 and B5.

### Thank you for reading this post, don't forget to subscribe to our AI NAVIGATOR!

 Email address First Name Last Name Subscription Preferences Featured News - Weekly Sunday AI News AI Venture Insights AI Navigators - Research Paper AI Online Events AI Advisory for Organisation

Given the 5 machines, using UCB we are going to devise a sequence of playing the machines in a way to maximize the yield or rewards from the machines.

Given below are the intuitive steps behind UCB for maximizing the rewards in a MABP:

Step 1: Each machine is assumed to have a uniform Confidence Interval and a success distribution. This Confidence Interval is a margin of success rate distributions which is the most certain to consist of the actual success rate distribution of each machine which we are unaware of in the beginning.

Step 2: A machine is randomly chosen to play, as initially, they have all the same confidence Intervals.

Step 3: Based on whether the machine gave a reward or not, the Confidence Interval shifts either towards or away from the actual success distribution and the also converges or shrinks as it has been explored thus resulting in the Upper bound value of the confidence Interval to also be reduced.[…]