Inverse Problems and MCMC

Long story short: I had to do some work using a modified version of Markov Chain Monte Carlo (MCMC) [1] called “Informed Sampling”. For the casual reader, I’d like to break down the basics of what type of problem you can solve with the overall approach [2]; maybe in a later post I will explain what makes informed sampling different from standard MCMC.

 

Setting the Scene

19th century London. You are a sleuth, and you are currently working to track down a smuggler of all sorts of goods. The goal is to find his base of operations, because then you can build a case against him and his collaborators.

You receive an anonymous tip that this individual is meeting someone at the city center in 10 mins. The center of the city is marked by a large monument and surrounded by a circular park. The question is: can you make it in time? It is exactly 1 kilometer to the center, and you know you can walk a kilometer in 9.5 minutes. Therefore, it will take you 9.5 mins to reach the center. You can make it.

 

The forward problem

This is what we call a forward problem. You know all the parameters of the world and you are interested in predicting an outcome. In this case, we knew the distance we wanted to travel and how fast we can walk. From this, we are able to predict the time it will take:

(1)   \begin{align*} t &= l \cdot v \\ &= 9.5, \end{align*}

where l is the 1 kilometer distance and v is your walking speed, measured here as 9.5 minutes per kilometer.
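In code, the forward problem is a one-line calculation (a minimal sketch using the numbers from the story):

pace = 9.5      # minutes per kilometer, your walking speed
distance = 1.0  # kilometers to the city center
t = pace * distance
print(t)        # 9.5 minutes: you can make it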

Once you reach the center of the park you can see the smuggler talking with another person! You strategically pick a place to sit nearby and eavesdrop on the conversation. You overhear him mention that it took him 10 mins to walk to this place. From the information provided, can we infer the location of his base of operations? We know that it took the smuggler about the same time as us to reach the same point. Does that mean this person is your neighbour?

 

The Inverse Problem

So, can we tell if this person is our neighbour? No.

Very often we don’t have enough information to completely solve inverse problems. We say they are ill-posed. In this case we don’t know how fast the person really walks and, even if we did, the solution is not unique.

However, if we want to find a solution to this incomplete problem, we need to leverage assumptions and prior knowledge. For example, let’s say we take an educated guess that, based on their build, this individual walks at about the same speed as us. We formulate this mathematically by saying they walk a kilometer in (9.5 ± 0.5) minutes.

 

Solving the Inverse Problem with MCMC

We define the forward problem as

(2)   \begin{align*} F(x_1, x_2, v) &= v \cdot \sqrt{x_1^2 + x_2^2} \\ &= t, \end{align*}

where (x1, x2) is a 2D starting location relative to the monument and v is the walking speed. We can then propose (randomly guess) candidate starting points and run them through the forward problem. We check whether each guess makes sense by comparing the predicted time to our observation, 10 min; if it matches, we save that candidate starting location as a plausible solution. We assume we are completely ignorant about their starting location, so we draw our random guesses from a uniform distribution across the entire city (±2 kilometers from the monument in each direction).
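As a rough sketch of this propose-and-check idea (not MCMC yet; the 100,000 guesses, the fixed 9.5 min/km pace, and the ±1 minute tolerance are assumptions purely for illustration):

import numpy as np

def F(x1, x2, v):
    # forward problem: predicted walking time in minutes
    return v * np.sqrt(x1**2 + x2**2)

rng = np.random.default_rng(0)
plausible = []
for _ in range(100_000):
    x1, x2 = rng.uniform(-2, 2, size=2)   # uniform guess anywhere in the city
    t = F(x1, x2, 9.5)                    # assume our own walking speed for now
    if abs(t - 10) < 1.0:                 # close enough to the observed 10 minutes?
        plausible.append((x1, x2))        # keep as a plausible starting location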

We’ve already estimated this person walks about as fast as us. So, for each proposed starting location, we’ll also have to generate a walking speed, which we’ll sample from a Normal distribution.

Lastly, we don’t know if it took them exactly 10 mins to walk. In conversation, people tend to round times to a number divisible by 5 (e.g. 13.2 minutes becomes 15 minutes), so we should incorporate this into our model as an “observational error” (below we approximate it with Gaussian noise with a standard deviation of 1.5 minutes).

All of this taken together is the foundation for our Bayesian model.

 

x1 ~ Uniform(-2, 2)         # sample starting location in x1
x2 ~ Uniform(-2, 2)         # sample starting location in x2
s ~ Normal(9.5, .5)         # sample walking speed
t = F(x1, x2, s)            # run forward problem to get predicted time
y ~ Normal(t, 3/2)          # predicted time considering observational error

To do anything useful with this model we use MCMC. In Python we have a lot of options: we can choose from several pre-built libraries or write our own sampler. For this little example we use numpyro and the NUTS MCMC sampler.

 

import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def F(x1, x2, s):
    # forward problem: predicted walking time in minutes
    return s * jnp.sqrt(x1**2 + x2**2)

def model(obs):
    x1 = numpyro.sample('x1', dist.Uniform(-2, 2))   # starting location, x1
    x2 = numpyro.sample('x2', dist.Uniform(-2, 2))   # starting location, x2
    s = numpyro.sample('s', dist.Normal(9.5, 0.5))   # walking speed prior
    t = F(x1, x2, s)                                 # predicted walking time
    return numpyro.sample('obs', dist.Normal(t, 1.5), obs=obs)

NUM_WARMUP, NUM_SAMPLES = 1000, 10000                # illustrative chain lengths
kernel = NUTS(model)
mcmc = MCMC(kernel, num_warmup=NUM_WARMUP, num_samples=NUM_SAMPLES)
mcmc.run(random.PRNGKey(0), obs=jnp.array([10.]))    # the smuggler's reported 10 minutes
mcmc.print_summary()

We run our model and we get the following results:

 

        mean       std    median      5.0%     95.0%     n_eff     r_hat
 S     19.48      0.50     19.48     18.68     20.32   8971.39      1.00
X1      0.00      0.73      0.00     -1.03      1.04   5148.41      1.00
X2     -0.02      0.73     -0.03     -1.05      1.02   5363.95      1.00

The output indicates that the average inferred starting location is (0, 0): basically where we are standing. How does this make sense? Let’s take a look at the actual samples to understand this more.

 

All plausible locations form a ring of radius ~1 kilometer; if we average all of them we get the center of the ring, (0, 0).
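We can check this directly from the posterior samples (a small sketch reusing the mcmc object from the run above):

import numpy as np

samples = mcmc.get_samples()
x1, x2 = np.asarray(samples['x1']), np.asarray(samples['x2'])
r = np.sqrt(x1**2 + x2**2)        # distance of each sampled location from the monument
print(r.mean(), r.std())          # radius of the ring, roughly 1 km
print(x1.mean(), x2.mean())       # yet the mean location is close to (0, 0)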

 

Obtaining more information

Obviously, with this information we still can’t tell whether this person is our neighbour or not – they could live anywhere along this ring.

As you’re eavesdropping some more on this conversation, you notice they are holding a sandwich bag from a deli you know: it’s a popular local spot. The people in the immediate neighbourhood go to it all the time, but only locals know of it. However, if someone from that neighbourhood moves across the city, they’ll still visit to get their fix. This is useful to us because it gives us more information about the position variables.

We can use this in our model by updating our prior guess on their starting location. We will assume a Laplace distribution based on what we know about the deli.

Our model is now:

 

x1 ~ Laplace(-1.5, .1)      # sample starting location in x1
x2 ~ Laplace(1, .1)         # sample starting location in x2
s ~ Normal(9.5, .5)         # sample walking speed
t = F(x1, x2, s)            # run forward problem to get predicted time
y ~ Normal(t, 3/2)          # predicted time considering observational error
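In numpyro, only the priors on the starting location change. A minimal sketch, reusing the imports, F, and chain settings from the earlier snippet (model_deli is just a name for this sketch):

def model_deli(obs):
    x1 = numpyro.sample('x1', dist.Laplace(-1.5, 0.1))  # prior centred on the deli's neighbourhood
    x2 = numpyro.sample('x2', dist.Laplace(1, 0.1))
    s = numpyro.sample('s', dist.Normal(9.5, 0.5))       # same walking speed prior as before
    t = F(x1, x2, s)
    return numpyro.sample('obs', dist.Normal(t, 1.5), obs=obs)

mcmc = MCMC(NUTS(model_deli), num_warmup=NUM_WARMUP, num_samples=NUM_SAMPLES)
mcmc.run(random.PRNGKey(0), obs=jnp.array([10.]))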

We run the model again and we get the following samples:

 

Drawing conclusions

You can qualitatively tell that more information is needed to improve our model, but for the sake of this example we’ll stop here. What is the take-away? What can we do with the results that we have?

The most probable location is not the average of these samples (remember how that led us astray in our first attempt?). Instead, we need to find the point with the highest density of sampled starting locations: the mode. To do this we’ll approximate the samples with a probability distribution using a Gaussian kernel density estimate. We can then find its maximum with a numerical optimizer.

 

from scipy.optimize import minimize
from scipy.stats import gaussian_kde
import numpy as np

# posterior samples of the starting location from the latest MCMC run
samples = mcmc.get_samples()
x1, x2 = np.asarray(samples['x1']), np.asarray(samples['x2'])

kde = gaussian_kde(np.vstack([x1, x2]))             # density estimate over the samples
res = minimize(lambda x: -kde(x)[0], x0=[-1, 0])    # maximise the density to find the mode
print(res.x)
# array([-0.74703862,  0.76423531])

We have now identified the most probable location of the smuggler’s den based on our assumptions.

 

What is the probability they live near you?

If we want to answer the original question, “Is this person our neighbour?”, we can estimate the probability from our MCMC samples. We’ll define “our neighbourhood” as ±0.5 km around our house, which for concreteness sits at (-1, 0), one kilometer from the monument.

We can compute this very easily from the samples obtained with our MCMC model: the probability is the number of samples inside our neighbourhood divided by the total number of samples.

 

P = np.mean((x1 < -0.5) & (x1 > -1.5) & (x2 < 0.5) & (x2 > -0.5))
print(f"Probability the smuggler's den is in our neighbourhood: {100*P:.2f}%")
# Probability the smuggler's den is in our neighbourhood: 22.68%

 

Decision Making

Okay, this was a lot of math and assumptions to try to answer our question, but how can we use it to inform our actions?

A viable way is to distribute your resources according to this information. For example, how much time should we spend looking for the base of operations in our own neighbourhood? Well, if we have 5 days to find him, we should spend only 22.68% of that time in our neighbourhood, or about 1 day and 3 hours.
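As a quick check on that arithmetic (reusing P from the snippet above):

budget_days = 5
hours_here = P * budget_days * 24    # 0.2268 * 5 days * 24 h
print(f"{hours_here:.1f} hours")     # ~27.2 hours, i.e. about 1 day and 3 hours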


References

  1. Ever wondered why they’re called “Monte Carlo” methods? Yep, it’s because of the famous casinos and gambling.
  2. Obviously this is a retake of the fantastic article by Ben Hammel; I couldn’t resist giving it my own simplified twist, but Ben did most of the original work and much more. You should check out his work!

Are Deep Neural Networks Dramatically Overfitted?

If you are like me, entering into the field of deep learning with experience in traditional machine learning, you may often ponder over this question: Since a typical deep neural network has so many parameters and training error can easily be perfect, it should surely suffer from substantial overfitting. How could it be ever generalized to out-of-sample data points?

Tremendous piece by Lilian Weng about deep learning. If you ever wondered why and how deep learning works without hopelessly overfitting, this article is for you. It includes a lot of references to interesting and current research.

Swift for TensorFlow

Swift for TensorFlow is a next-generation platform for machine learning, incorporating the latest research across: machine learning, compilers, differentiable programming, systems design, and beyond. This project is at version 0.2; it is neither feature-complete nor production-ready. But it is ready for pioneers to try it for your own projects, give us feedback, and help shape the future!

This is truly exciting, and sooner or later I have to brush up on my Swift skills. Take a look at the video above and see what the future of ML research holds. Automatic differentiation looks amazing.

Robots in Space

The term “Artificial Intelligence (AI)” comprises all techniques that enable computers to mimic intelligence, for example, computers that analyse data or the systems embedded in an autonomous vehicle. Usually, artificially intelligent systems are taught by humans — a process that involves writing an awful lot of complex computer code.

Nice little overview of the current initiatives and projects regarding Artificial Intelligence in Space at ESA. Lots of interesting stuff going on there. I am personally super excited about autonomous robots and using deep learning for navigation and docking of spacecraft.