Lecture 19

Mora randomness. Mean and standard deviation. Simulations that use randomness.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# numpy array with uniform random numbers
np.random.uniform(low=0, high=1, size=9)
Out[2]:
array([ 0.06338181,  0.02569077,  0.46695246,  0.51746962,  0.41736873,
        0.76491016,  0.34586409,  0.79547829,  0.27020484])
In [3]:
# numpy array with Gaussian (normal) random numbers
np.random.normal(loc=0.0, scale=1.0, size=9)
Out[3]:
array([ 1.51366209, -1.13783313,  0.187717  ,  0.11261519,  0.02076459,
       -1.54907158, -0.70236171,  0.68117334,  0.43831202])
In [4]:
N = 5000
mu = 0.0
sigma = 1.0  # highers standard dev
X = np.random.normal(loc=mu, scale=sigma, size=N)
plt.axis([-6, 6, 0, N/6])
_,_,_ = plt.hist(X, 30)

Choice: chooses a random element of a list

In [5]:
import random
words = ["love", "hate", "tender", "care", "deep"]
random.choice(words)
Out[5]:
'hate'
In [6]:
# random poem:
poem = []
poem.extend([random.choice(words) for _ in range(3)])
poem.append("\n")
poem.extend([random.choice(words) for _ in range(3)])
poem.append("\n")
poem.extend([random.choice(words) for _ in range(4)])
poem.append("\n")
poem.extend([random.choice(words) for _ in range(3)])
print(" ".join(poem))
love care tender 
 tender hate care 
 hate love care hate 
 love care care

Mean and Standard Deviation of a Data-set

Say I am given a numpy array X full of numbers (e.g. grades).

In [7]:
X = np.array([67, 62, 78, 67, 64, 52, 50, 80, 50, 94, 77, 62, 78, 67, 44, 52, 70, 80, 50, 94, 100, 61, 59, 56, 30, 91, 60, 54, 34, 98])
In [8]:
print("shape:", X.shape)
# grade range:
min(X), max(X)
shape: (30,)
Out[8]:
(30, 100)
In [9]:
plt.scatter(X, np.zeros(30)+1)
Out[9]:
<matplotlib.collections.PathCollection at 0x105aadcc0>
In [10]:
_, _, _ = plt.hist(X,6)

Question: If we knew that these numbers came from a normal distribution $N(\mu, \sigma)$, what is the most likely normal distribution that this data would have come from?

Answer: It's the normal distribution $N(\mu, \sigma)$ where $\mu$ is the mean of the data, and $\sigma$ is the standard deviation of the data. Mean is the average of the data, standard deviation is the square root of the average of the square of the distance to the mean of the data... better to write down the formula:

$$\mu = \frac{1}{N}\sum_{i=1}^N x_i$$$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2}$$

In python, we can use np.mean and np.std to compute these.

In [11]:
mu, sigma = np.mean(X), np.std(X)
print("mean: ", mu)
print("standard deviation: ", sigma)
mean:  66.0333333333
standard deviation:  17.8390644996

Recall that the Gaussian distribution $N(\mu, \sigma)$ has density function:

$$ p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp({-\frac{1}{2}(\frac{x}{\sigma})^2})$$

We will now draw the bell curve:

In [12]:
from math import sqrt, pi, e

Z = np.linspace(0,100,300)
guassian_func = lambda mu, sigma: lambda x: 1/(sqrt(2*pi)*sigma) * e**(-0.5*(x - mu)*(x-mu)/(sigma * sigma))
plt.plot(Z, guassian_func(mu, sigma)(Z))
_, _, _ = plt.hist(X,6, normed=True)

Conclusion: For any data-set, the mean and standard deviation of the best-fitting Gaussian distribution are the mean and standard deviation of the data-set.

Random Simulation, modeling a real-life situation:

The bus from my street to work is scheduled at 8:00 A.M. I measured that the bus' arrival time has a Gaussian (normal) distribution with mean 8:05 and a standard deviation of 4 minutes.

If I miss the bus, the bus after that will come 5 minutes after the first bus (no delays).

When should I get to the bus stop to minimize my total expected waiting time? (I don't care how late I am. I just don't want to wait for the bus at the stop. I'm weird like that.)

Let's simulate

In [13]:
import random
time_till_next_bus = 50
mu = 0  # 8:05 is time 0
sigma = 4
N = 10000  # number of simulations

arrs = []

for my_arrival_at_stop in range(-20,20):
    sum_of_waiting_times = 0.0
    for i in range(N):
        bus_arrival = random.gauss(mu, sigma)
        if my_arrival_at_stop < bus_arrival:
            sum_of_waiting_times += bus_arrival - my_arrival_at_stop
        elif bus_arrival + time_till_next_bus - my_arrival_at_stop > 0:
            sum_of_waiting_times += bus_arrival + time_till_next_bus - my_arrival_at_stop
        # if i missed the next bus too, don't count that one.
            
    print("my arrival:", my_arrival_at_stop, "simulated total waiting time", sum_of_waiting_times / N)
    arrs.append(sum_of_waiting_times / N)
    
plt.plot(range(-20,20), arrs)
    
my arrival: -20 simulated total waiting time 19.99035262122956
my arrival: -19 simulated total waiting time 19.006380395960736
my arrival: -18 simulated total waiting time 17.96536688512
my arrival: -17 simulated total waiting time 17.04323664114401
my arrival: -16 simulated total waiting time 16.021217070209193
my arrival: -15 simulated total waiting time 15.016959985637016
my arrival: -14 simulated total waiting time 14.024623255701556
my arrival: -13 simulated total waiting time 13.04228621248957
my arrival: -12 simulated total waiting time 12.095693060793126
my arrival: -11 simulated total waiting time 11.150954594514037
my arrival: -10 simulated total waiting time 10.357221933491182
my arrival: -9 simulated total waiting time 9.572561593983623
my arrival: -8 simulated total waiting time 9.095218975098513
my arrival: -7 simulated total waiting time 8.920854581070959
my arrival: -6 simulated total waiting time 9.260533411024028
my arrival: -5 simulated total waiting time 10.237636843922752
my arrival: -4 simulated total waiting time 12.080221927039274
my arrival: -3 simulated total waiting time 14.612031800193757
my arrival: -2 simulated total waiting time 17.199965312692694
my arrival: -1 simulated total waiting time 21.327200927230617
my arrival: 0 simulated total waiting time 25.312694491475554
my arrival: 1 simulated total waiting time 28.951953184928552
my arrival: 2 simulated total waiting time 32.59868175679879
my arrival: 3 simulated total waiting time 35.959521876228514
my arrival: 4 simulated total waiting time 38.28838346528995
my arrival: 5 simulated total waiting time 39.94433006914123
my arrival: 6 simulated total waiting time 40.731475242737176
my arrival: 7 simulated total waiting time 41.092404160854194
my arrival: 8 simulated total waiting time 40.71247971701636
my arrival: 9 simulated total waiting time 40.380270536540195
my arrival: 10 simulated total waiting time 39.67148897265139
my arrival: 11 simulated total waiting time 38.82569531144243
my arrival: 12 simulated total waiting time 37.96997328930819
my arrival: 13 simulated total waiting time 36.95804822786831
my arrival: 14 simulated total waiting time 36.0102628024104
my arrival: 15 simulated total waiting time 35.024224712680095
my arrival: 16 simulated total waiting time 34.01024882501717
my arrival: 17 simulated total waiting time 32.92377365528235
my arrival: 18 simulated total waiting time 32.00770144318245
my arrival: 19 simulated total waiting time 30.990704378340446
Out[13]:
[<matplotlib.lines.Line2D at 0x1057e7780>]