Shuan Chen

PhD Student in KAIST CBE

Monte Carlo Method

What is Monte Carlo (MC)?

Before I introduce MC, note that you might have heard many terms containing "Monte Carlo", such as:

  1. Monte Carlo Method
  2. Monte Carlo Experiment
  3. Monte Carlo Simulation
  4. Monte Carlo Control
  5. Monte Carlo Tree Search (MCTS)
  6. Monte Carlo Algorithm

Although these terms look similar, only the first three refer to the same thing: a way to efficiently generate an ensemble of numerical results by random sampling.

Monte Carlo Control and Monte Carlo Tree Search are methods often used in reinforcement learning (an optimization framework that we will talk about in the next article), and the Monte Carlo Algorithm is a totally different thing that we will not cover here.

Monte Carlo Method

The goal of the MC method is to approximate a quantity of interest by randomly sampling from a given system.

Application and key concept

The MC method can be used for various kinds of problems, including π approximation in mathematics, Ising model energy approximation in statistical thermodynamics, or the approximation of any probability distribution. The key concept of the Monte Carlo method is fairly simple:

Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. - Wikipedia

For a given system, you randomly sample a value a certain number of times (let's say 10K times). The expected value of the target quantity can then be approximated by simply taking the mean of the obtained samples.
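
For instance, the π approximation mentioned above works exactly this way: sample random points in a unit square, count the fraction that lands inside the quarter circle of radius 1, and multiply that fraction by 4. Here is a minimal sketch (my own illustration, not part of the bandit examples below):

import numpy as np

n = 100000                         # number of random samples
x = np.random.random(n)            # x coordinates, uniform in [0, 1)
y = np.random.random(n)            # y coordinates, uniform in [0, 1)
inside = (x**2 + y**2) <= 1.0      # True for points inside the quarter circle
pi_estimate = 4 * inside.mean()    # fraction inside times 4 approximates pi
print ('Estimated pi: %s' % pi_estimate)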

Let's go through some hands-on Python examples to better understand how it works.

Example 1: One-Armed Bandit

Let's suppose you are playing a one-armed bandit machine, where you pay money each round and win a prize with a certain probability.
To decide whether the game is worth playing, you want to know the expected value of the money you earn per play.

First, we set up an armed bandit machine with

  • cost = 500 (pay 500 each time you play)
  • p = 0.2 (there is 20% probability that you win the game)
  • reward = 1000 (you get 1000 if you win the game)

Thus, the expected value of this machine is:

v = p x reward - cost = 0.2 x 1000 - 500 = -300

However, in practice we don't know the probability of winning the game, so we sample (play) the machine to estimate p (the fraction of plays that win) or, equivalently, v (the mean earnings per play).

import numpy as np
from matplotlib import pyplot as plt

class ArmedBandit:
    def __init__(self, p = 0.2, cost = 500, reward = 1000):
        self.p = p                      # probability of winning
        self.cost = cost                # cost of one play
        self.reward = reward            # payout when you win
        self.value = p*reward - cost    # real expected value per play

    def pull(self):
        # play once: win the reward with probability p, always pay the cost
        if np.random.random() <= self.p:
            reward = self.reward
        else:
            reward = 0
        return reward - self.cost

Now, we run the Monte Carlo simulation, that is, we sample (play) the armed bandit machine 1K times, starting with 100K of money.

Bandit = ArmedBandit()

# Monte Carlo Method
trial_n = 1000
init_money = 100000
total_earn = 0

remain_moneys = []
expected_values = []

remain_money = init_money
for i in range(1, trial_n + 1):
    earn = Bandit.pull()
    remain_money += earn
    remain_moneys.append(remain_money)
    total_earn += earn
    expected_value = total_earn / i        # running mean of the earnings so far
    expected_values.append(expected_value)

print ('Remain money: %s' % remain_money)
print ('Expected value of this bandit: %s' % expected_value)
print ('Real value of this bandit: %s' % Bandit.value)

plt.plot(remain_moneys)
plt.title('remaining money')
plt.show()

plt.plot(expected_values)
plt.title('expected value of armed bandit')
plt.show()

We see that we get a pretty good estimate of the expected value after 1K plays. However, it cost us almost 300K to get this result.
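
How close should we expect this estimate to be? The error of a Monte Carlo mean shrinks only like one over the square root of the number of samples, so extra accuracy gets expensive quickly. A rough back-of-the-envelope check for this particular machine (my own addition, assuming the parameters defined above):

import numpy as np

p, reward, cost = 0.2, 1000, 500
n = 1000

# a single play pays (reward - cost) with probability p and (-cost) otherwise,
# so its standard deviation is reward * sqrt(p * (1 - p)) = 400 for this machine
single_play_std = reward * np.sqrt(p * (1 - p))

# the standard error of the estimated value after n plays shrinks like 1/sqrt(n)
standard_error = single_play_std / np.sqrt(n)
print ('Standard error after %s plays: about %.1f' % (n, standard_error))   # about 12.6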

Example 2: Multi-Armed Bandits

In the previous game, you were playing an armed bandit with a negative expected value, which you don't want to keep playing since you only lose money.

Now, you are playing an advanced version of the game: multi-armed bandits. There are multiple arms you can play (here I set 10), and each arm has a different probability of winning money.
Here, your strategy is to:

  1. Approximate the value of each arm during the first 10K trials (so you play each arm 1K times) by the Monte Carlo method.
  2. After estimating the expected value of each arm, play the arm with the highest expected value for another 10K trials to win the most money.
class MultiArmedBandit:
    def __init__(self, k = 10, cost = 500, reward = 1000):
        # draw a random winning probability for each of the k arms
        probs = [p for p in np.random.random(k)]
        self.values = [p*reward - cost for p in probs]                 # real value of each arm
        self.Bandits = [ArmedBandit(p, cost, reward) for p in probs]   # one ArmedBandit per arm
Bandits = MultiArmedBandit()

# Monte Carlo
trial_n = 10000
init_money = 100000
k = 10
expect_values = []
remain_money = init_money
remain_moneys = []
for bandit in range(k):
    this_earn = 0
    Bandit = Bandits.Bandits[bandit]
    # play each arm trial_n / k times to estimate its value
    for trial in range(int(trial_n/k)):
        earn = Bandit.pull()
        this_earn += earn
        remain_money += earn
        remain_moneys.append(remain_money)
    expect_values.append(this_earn/int(trial_n/k))   # estimated value of this arm

print ('Remain money: %s' % remain_money)

plt.plot(remain_moneys)
plt.title('remaining money')
plt.show()

plt.bar(range(10), expect_values)
plt.title('expected value of each armed bandit')
plt.show()
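
Since the simulator stores the true value of each arm in Bandits.values, we can also put the real and estimated values side by side. This comparison plot is my own small addition, not part of the original code:

# compare the real values with the Monte Carlo estimates (extra check)
plt.bar(np.arange(10) - 0.2, Bandits.values, width = 0.4, label = 'real value')
plt.bar(np.arange(10) + 0.2, expect_values, width = 0.4, label = 'estimated value')
plt.legend()
plt.title('real vs estimated value of each armed bandit')
plt.show()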

After 10K plays, the estimated expected value of each armed bandit is pretty close to its real expected value (good job!), but the process cost us more than 400K.
Now, we only play the armed bandit with the highest estimated value (bandit 10 in my run) for 10K more trials to ensure we win the most money.

trial_n = 10000

# keep playing only the arm with the highest estimated value
Bandit = Bandits.Bandits[np.argmax(expect_values)]
for trial in range(trial_n):
    earn = Bandit.pull()
    remain_money += earn
    remain_moneys.append(remain_money)

print ('Remain money: %s' % remain_money)

plt.plot(remain_moneys)
plt.show()

We now earned around 3M in these 10K plays by knowing which bandit gives the best rewards!

With this approach, we earned about 3.1M over the whole 20K plays. Congrats!

Monte Carlo Control

However, there is a way to efficiently focus on the valuable arms while playing, without first estimating the expected value of each bandit separately. With this method, we earned nearly 4M after playing 20K times, without losing money along the way.

This method is called Monte Carlo Control (with a greedy policy), an advanced optimization method which we will talk about in the next article.
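
As a teaser, here is a minimal sketch of one common variant, an epsilon-greedy policy: keep a running value estimate for every arm, usually play the arm that currently looks best, and occasionally explore a random arm. This is my own illustration, not necessarily the exact policy behind the 4M result above; it reuses the MultiArmedBandit class from the previous example:

# epsilon-greedy sketch (illustrative): estimate arm values while mostly playing the best arm
Bandits = MultiArmedBandit()
k = 10
epsilon = 0.1                     # explore a random arm 10% of the time
counts = np.zeros(k)              # how many times each arm has been played
estimates = np.zeros(k)           # running value estimate of each arm

remain_money = 100000
for trial in range(20000):
    if np.random.random() < epsilon:
        arm = np.random.randint(k)          # explore: pick a random arm
    else:
        arm = int(np.argmax(estimates))     # exploit: pick the best arm so far
    earn = Bandits.Bandits[arm].pull()
    remain_money += earn
    counts[arm] += 1
    # incremental update of the running mean estimate for this arm
    estimates[arm] += (earn - estimates[arm]) / counts[arm]

print ('Remain money: %s' % remain_money)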