Reinforcement Learning is currently one of the techniques most explored and favoured by the biggest and brightest brains, and the term is known to practically everyone working in the field. Learning by reinforcement is in itself a strong sign of intelligence that we humans can easily relate to. We have already discussed reinforcement learning with a very popular algorithm called Thompson Sampling in one of our previous articles.
In this article, we will explore another popular algorithm that implements reinforcement learning, called the Upper Confidence Bound or UCB.
What is UCB?
Thompson Sampling, which we discussed in one of our previous articles, is a probabilistic algorithm: it chooses a machine by sampling from a probability distribution over each bandit's success rate. UCB, in contrast, is a deterministic algorithm: given the same history of plays and rewards, it always chooses the same machine, so its decision rule involves no random sampling. The rewards themselves remain uncertain; only the selection rule is deterministic.
We will use the same Multi-Armed Bandit Problem (MABP) to understand UCB. If you are not familiar with the MABP, please go ahead and read through our article, The Intuition Behind Thompson Sampling Explained With Python Code.
UCB is a deterministic algorithm for Reinforcement Learning that balances exploration and exploitation using a confidence boundary that it assigns to each machine in each round. (A round is one pull of a machine's arm by the player.)
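To make the confidence boundary a little more concrete, here is a minimal sketch of how an upper bound could be computed for a single machine. It assumes the widely used UCB1 form, where the bound is the machine's average reward plus an exploration bonus of sqrt(c * ln(round) / pulls). The function name upper_confidence_bound, its parameters and the default c = 2 are illustrative assumptions, not something this article prescribes.

```python
import math


def upper_confidence_bound(total_reward, pulls, round_number, c=2.0):
    """Return an assumed UCB1-style upper bound for one machine.

    total_reward: sum of rewards this machine has paid out so far
    pulls:        number of times this machine has been played
    round_number: current round, counted from 1
    c:            exploration constant (assumed default, not from the article)
    """
    if pulls == 0:
        # A machine that was never played is treated as maximally uncertain.
        return math.inf
    average_reward = total_reward / pulls
    # The bonus shrinks as the machine is played more often.
    exploration_bonus = math.sqrt(c * math.log(round_number) / pulls)
    return average_reward + exploration_bonus
```

With the same average reward, a machine that has been played less often gets a larger bonus and therefore a higher upper bound, which is what pushes the algorithm to explore it.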
We will try to understand UCB as simply as possible. Consider 5 bandits, or slot machines, namely B1, B2, B3, B4 and B5.
Given the 5 machines, we will use UCB to devise a sequence for playing them that maximizes the yield, or rewards, from the machines.
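As a rough illustration of what such a playing sequence might look like, the sketch below simulates the five machines and lets a UCB1-style rule pick one machine per round. Everything specific here is an assumption made for the example: the success rates given to B1 to B5 are invented, each reward is 0 or 1, the bound uses the common sqrt(2 * ln(round) / pulls) bonus, and ties are broken by the lowest index rather than at random so that a run is reproducible.

```python
import math
import random


def run_ucb(success_rates, rounds, seed=0):
    """Simulate a UCB1-style player on machines that pay 1 or 0."""
    rng = random.Random(seed)  # randomness lives only in the simulated machines
    n_machines = len(success_rates)
    pulls = [0] * n_machines    # how often each machine was played
    rewards = [0] * n_machines  # total reward collected from each machine
    history = []                # index of the machine played in each round

    def bound(i, t):
        # Unplayed machines get an infinite bound so each is tried once.
        if pulls[i] == 0:
            return math.inf
        return rewards[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])

    for t in range(1, rounds + 1):
        # Deterministic choice: highest upper bound, lowest index on ties.
        chosen = max(range(n_machines), key=lambda i: bound(i, t))
        reward = 1 if rng.random() < success_rates[chosen] else 0
        pulls[chosen] += 1
        rewards[chosen] += reward
        history.append(chosen)
    return pulls, rewards, history


# Hypothetical success rates for B1..B5, unknown to the player.
rates = [0.10, 0.25, 0.60, 0.30, 0.15]
pulls, rewards, history = run_ucb(rates, rounds=2000)
for name, count in zip(["B1", "B2", "B3", "B4", "B5"], pulls):
    print(name, "was played", count, "times")
```

The player never sees the values in rates; it only sees the rewards. Over many rounds, the pull counts would typically concentrate on the machine with the highest success rate, while the other machines are still revisited occasionally.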
Given below are the intuitive steps behind UCB for maximizing the rewards in a MABP:
Step 1: Each machine is initially assumed to have the same Confidence Interval and the same success distribution. The Confidence Interval is a range of success rates that is very likely to contain the machine's actual success rate, which we do not know at the beginning.
Step 2: A machine is chosen at random to play, since initially all the machines have the same Confidence Interval.
Step 3: Depending on whether the machine gave a reward or not, its Confidence Interval shifts either towards or away from the actual success distribution. Because the machine has now been explored, the interval also converges, or shrinks, which in turn reduces the upper bound value of the Confidence Interval.[…]