PyTorch on Apple Silicon

In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac.

It’s about time. It has been 18 months since the first M1 chips shipped, and we finally get support for the advanced GPUs on these chips. Being able to leverage your own local machine for simple PyTorch tasks, or even just for local testing of bigger projects, is of tremendous value. One thing I am still puzzled about is that all of this technology still uses only the ‘standard’ GPU cores; there is still no way to access the chip’s custom Neural Engine, which would make on-device inference even better. But it’s up to Apple to open up their internal APIs.
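For anyone who wants to try it, here is a minimal sketch of what the new backend looks like in practice (assuming a PyTorch 1.12+ build with MPS support; the tiny model is just for illustration):

import torch

# use the Metal (MPS) backend if this Mac supports it, otherwise fall back to the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(128, 10).to(device)   # any nn.Module moves over as usual
x = torch.randn(64, 128, device=device)
print(model(x).shape)                         # the forward pass runs on the Apple GPU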

Designing a Poster Experience in the times of COVID-19

Different Times

I spent the last few months working as a research assistant in artificial intelligence at the Frankfurt Institute for Advanced Studies. With the global COVID-19 situation, almost all parts of the scientific workflow went digital. I was working from home, logging into the office computers and high-performance clusters via SSH, but all in all this was not much different from programming and logging in at the office. One thing that has changed significantly, though, is conferences.

As large gatherings of people are potential COVID hotspots, it makes total sense for conferences to go virtual, at least for this year. And there is tremendous potential in virtual events: talks get recorded more frequently, allowing for a more flexible schedule; pre-recorded talks make sure that everyone keeps within their time slot; it is more inclusive for people who otherwise could not afford to attend; and, last but definitely not least, it is better for the global climate when people don’t fly all over the world.

I firmly believe that the academic world should learn from this special situation and incorporate some of the upsides of virtual events into future conferences: maybe an alternating virtual/in-person schedule would be a good idea. But right now I feel everything is going to go back to the way it was before.

Why do I think this? Well, people enjoy being with each other, and we have not figured out how to do the networking part of conferences virtually yet. For me this becomes most apparent with poster sessions. Sure, you cannot drink a beer at the bar with your colleagues while you are stuck behind your webcam at your desk, but that was never going to happen anyway. A poster session, though, is partly social interaction with the presenter, and it fails badly at virtual conferences even though there is no real reason why it should.

Fixing Virtual Interaction

There might already be conferences that do what I suggest next, so I apologize beforehand, but the three conferences I attended this year either scrapped the poster session and made the posters downloadable, or turned them into short 5-minute pre-recorded talks. I understand that this is the easiest way, but I am kind of disappointed that we have not figured out something better.

In my opinion there are actually two sides to the problem: conceptualization and software.

You have to think about what makes a poster session a poster session. How does it work, and why does it work? Take inspiration from the real world and carry it over to the virtual one.

Start with a showcase. There needs to be a website where you can visually scroll through all of the posters. This is a no-brainer: don’t just list abstracts and links, and no, it doesn’t have to be a fancy 3D virtual floor. Just make it a long scrollable website with BIG images. People have spent time designing these posters; it is what catches our attention at an in-person poster session. Add little hearts or thumbs-up icons next to the images to indicate which posters create enthusiasm among the participants.

When you click on a poster you should be pulled into a virtual conference room, like a Zoom call, where the presenter talks about her poster and answers questions. Now here comes the software part: simple screen sharing is not enough; we need a simplified PowerPoint for posters and documents. Conceptually it should work like Prezi, because the poster is the whole canvas, but there should be simple tools that analyze the PDF and make it easy to highlight regions such as boxes and rounded rectangles, and of course you should be able to pan and zoom around. Moreover, it should not be a video stream! Something like this needs to be rendered client-side: these posters are smaller than your average HD video, and they are vector-based, making them effectively resolution-independent. With client-side rendering, participants can decide for themselves which parts of the canvas they want to focus on.

To end this little blogpost I want to showcase what I came up with for this year’s ESANN conference. Let me be clear: this was way more work than a simple poster. But given the right software, it should be feasible to transform any PDF into something you can zoom around in and highlight passages from in a similar manner [1].

On a little side-note I’m still on the lookout for a cool PhD position, so please get in touch if you want to work with me!

[1] If you liked that video, you can of course take a look at the corresponding paper:

Ernst M.R., Triesch J., Burwick T. (2020). Recurrent Feedback Improves Recognition of Partially Occluded Objects. In Proceedings of the 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN)

Inverse Problems and MCMC

Long story short: I had to do some work using a modified version of Markov chain Monte Carlo (MCMC) [1] called “informed sampling”. For the casual reader, I’d like to break down the basics of the type of problem you can solve with this overall approach [2]; maybe in a later post I will explain what makes informed sampling different from standard MCMC.

Setting the Scene

19th century London. You are a sleuth, and you are currently working to track down a smuggler of all sorts of goods. The goal is to find his base of operations, because then you can build a case against him and his collaborators.

You receive an anonymous tip that this individual is meeting someone at the city center in 10 minutes. The center of the city is marked by a large monument and surrounded by a circular park. The question is: can you make it in time? It is exactly 1 kilometer to the center, and you know you can walk a kilometer in 9.5 minutes. Therefore, it will take you 9.5 minutes to reach the center. You can make it.

The Forward Problem

This is what we call a forward problem: you know all the parameters of the world and you are interested in predicting an outcome. In this case, we knew the distance we want to travel and the pace at which we can walk. From this, we can predict the time it will take:

(1)    t = l · v = 9.5,

where l is the 1 kilometer distance and v is your walking pace, measured in minutes per kilometer.

Once you reach the center of the park, you can see the smuggler talking with another person! You strategically pick a nearby place to sit and eavesdrop on the conversation. You overhear him mention that it took him 10 minutes to walk to this place. From the information provided, can we infer the location of his base of operations? We know that it took the smuggler about the same time as us to reach the same point. Does that mean this person is your neighbour?

The Inverse Problem

So, can we tell if this person is our neighbour? No.

Very often we don’t have enough information to completely solve inverse problems; we say they are ill-posed. In this case we don’t know how fast the person really walks and, even if we did, the solution would not be unique.

However, if we want to find a solution to this incomplete problem, we need to leverage assumptions and prior knowledge. For example, let’s say we take an educated guess that, based on their build, this individual walks at about the same pace as we do. We formulate this mathematically by saying they walk a kilometer in (9.5 ± 0.5) minutes.

Solving the Inverse Problem with MCMC

We define the forward problem as

(2)    F(x1, x2, v) = v · √(x1² + x2²) = t,

where (x1, x2) is a given 2D starting location and v is the walking pace. We can then propose (randomly guess) a set of candidate starting points and run them through the forward problem. Then we check whether each solution makes sense by comparing the result to our observation, 10 minutes: if the result matches, we save that candidate starting location as a plausible solution. We assume we are completely ignorant about the smuggler’s starting location, so we pull our random guesses from a uniform distribution across the entire city (2 km × 2 km).

We’ve already estimated that this person walks about as fast as we do. So, for each proposed starting location, we’ll also have to generate a walking pace, which we’ll sample from a normal distribution.

Lastly, we don’t know if it really took them exactly 10 minutes to walk. In conversation, people tend to round times to a number divisible by 5 (e.g. 13.2 minutes becomes 15 minutes), so we should incorporate this into our model. This is an “observational error”.

All of this taken together is the foundation for our Bayesian model.

x1 ~ Uniform(-2, 2)         # sample starting location in x1
x2 ~ Uniform(-2, 2)         # sample starting location in x2
s ~ Normal(9.5, .5)         # sample walking pace (minutes per km)
t = F(x1, x2, s)            # run forward problem to get predicted time
y ~ Normal(t, 1.5)          # predicted time considering observational error
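Before reaching for a real MCMC library, it helps to see this propose-and-check recipe written out directly. Below is a minimal sketch of a naive rejection sampler in plain NumPy; the hard acceptance threshold tol is my simplification, standing in for the normal observational-error model above:

import numpy as np

rng = np.random.default_rng(0)
N, t_obs, tol = 1_000_000, 10.0, 1.0   # number of proposals, observed time, tolerance

x1 = rng.uniform(-2, 2, N)             # guess starting locations across the 2 km x 2 km city
x2 = rng.uniform(-2, 2, N)
s = rng.normal(9.5, 0.5, N)            # guess a walking pace (minutes per km)

t = s * np.sqrt(x1**2 + x2**2)         # run the forward problem
keep = np.abs(t - t_obs) < tol         # keep proposals that match the observation
print(f"kept {keep.sum()} plausible starting locations out of {N}")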

To do anything useful with this model we use MCMC. In Python we have a lot of options: we can choose from several pre-built libraries or build our own. For this little example we use numpyro and the NUTS MCMC sampler.

import numpy as np
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

NUM_WARMUP, NUM_SAMPLES = 1000, 10000

def F(x1, x2, s):
    # the forward problem (2): predicted travel time from (x1, x2) at pace s
    return s * jnp.sqrt(x1**2 + x2**2)

def model(obs):
    x1 = numpyro.sample('x1', dist.Uniform(-2, 2))
    x2 = numpyro.sample('x2', dist.Uniform(-2, 2))
    s = numpyro.sample('s', dist.Normal(9.5, .5))
    t = F(x1, x2, s)
    return numpyro.sample('obs', dist.Normal(t, 1.5), obs=obs)

rng_key = random.PRNGKey(0)
mcmc = MCMC(NUTS(model), num_warmup=NUM_WARMUP, num_samples=NUM_SAMPLES)
mcmc.run(rng_key, obs=np.array([10.]))
mcmc.print_summary()

We run our model and we get the following results:

        mean       std    median      5.0%     95.0%     n_eff     r_hat
 s      9.48      0.50      9.48      8.68     10.32   8971.39      1.00
x1      0.00      0.73      0.00     -1.03      1.04   5148.41      1.00
x2     -0.02      0.73     -0.03     -1.05      1.02   5363.95      1.00

The output indicates that the average inferred starting location is (0, 0): basically the spot where we are standing. How does this make sense? Let’s take a look at the actual samples to understand this better.
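To inspect them, we can pull the raw samples out of the sampler with numpyro’s get_samples and scatter-plot the accepted starting locations (a small sketch; the x1 and x2 arrays are reused in the snippets below):

import matplotlib.pyplot as plt

samples = mcmc.get_samples()                 # dict of posterior samples per variable
x1, x2 = np.asarray(samples['x1']), np.asarray(samples['x2'])

plt.scatter(x1, x2, s=2, alpha=.1)           # accepted starting locations
plt.gca().set_aspect('equal')
plt.xlabel('x1 [km]'); plt.ylabel('x2 [km]')
plt.show()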

All possible locations form a ring with a radius of roughly 1 kilometer; if we average all of these samples, we get the center of the ring, (0, 0).

Obtaining more information

Obviously, with this information we still can’t tell whether this person is our neighbour or not – they could live anywhere along this ring.

As you eavesdrop some more on the conversation, you notice they are holding a sandwich bag from a deli you know: a popular local spot. People in the immediate neighbourhood go there all the time, but only locals know of it. However, if someone from that neighbourhood moves across the city, they’ll still visit to get their fix. This is useful to us because it provides more information on the positional variables.

We can use this information in our model by updating our prior guess on their starting location: based on what we know about the deli, we will assume a Laplace distribution centered on its neighbourhood.

Our model is now:

x1 ~ Laplace(-1.5, .1)      # sample starting location in x1
x2 ~ Laplace(1, .1)         # sample starting location in x2
s ~ Normal(9.5, .5)         # sample walking pace (minutes per km)
t = F(x1, x2, s)            # run forward problem to get predicted time
y ~ Normal(t, 1.5)          # predicted time considering observational error
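In numpyro this amounts to swapping the two location priors (numpyro’s name for the distribution is Laplace); everything else stays as before:

def model(obs):
    x1 = numpyro.sample('x1', dist.Laplace(-1.5, .1))  # prior centered on the deli's neighbourhood
    x2 = numpyro.sample('x2', dist.Laplace(1, .1))
    s = numpyro.sample('s', dist.Normal(9.5, .5))      # same pace prior as before
    t = F(x1, x2, s)
    return numpyro.sample('obs', dist.Normal(t, 1.5), obs=obs)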

We run the model again and get a new set of samples.

Drawing conclusions

You can qualitatively tell that more information is needed to improve our model; but, for the sake of this example, we’ll stop here. What is the take-away? What can we do with the results that we have?

The most probable location is not the average of these samples (remember how that led us astray in our first attempt?). Instead, we need to find the point with the highest density of accepted starting locations: the mode. To do this, we approximate the samples with a probability distribution using a Gaussian kernel density estimate; we can then find its maximum with gradient descent.

from scipy.optimize import minimize
from scipy.stats import gaussian_kde

# approximate the samples with a smooth density (Gaussian KDE) and find its mode
kde = gaussian_kde(np.vstack([x1, x2]))
res = minimize(lambda x: -kde(x)[0], x0=[-1, 0])
print(res.x)
# array([-0.74703862,  0.76423531])

We have now identified the most probable location of the smuggler’s den, given our assumptions.

What is the probability they live near you?

If we want to answer the original question, “Is this person our neighbour?”, we can compute the probability directly from our MCMC samples. We’ll define “our neighbourhood” as ±0.5 km around our house, which sits at (-1, 0) on our map.

This is easy to compute with the samples obtained from our MCMC model: the probability is the number of samples that fall inside our neighbourhood, divided by the total number of samples.

P = np.mean((x1 > -1.5) & (x1 < -.5) & (x2 > -.5) & (x2 < .5))
print(f"Probability the smuggler's den is in our neighbourhood: {100*P:.2f}%")
# Probability the smuggler's den is in our neighbourhood: 22.68%

Decision Making

Okay, this was a lot of math and assumptions to answer our question, but how can we use the result to inform our actions?

A viable way is to distribute your resources according to this information. For example, how much time should we spend looking for the base of operations in our own neighbourhood? Well, if we have 5 days to find him, we should spend only 22.68% of that time in our neighbourhood, or about 1 day and 3 hours.
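As a quick check of that arithmetic (a trivial sketch):

share, days_available = 0.2268, 5
days = share * days_available               # portion of the search time for our neighbourhood
print(f"{days:.3f} days = {int(days)} day and {(days % 1) * 24:.1f} hours")
# 1.134 days = 1 day and 3.2 hours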


References

  1. Ever wondered why they are called “Monte Carlo” methods? Yep, it’s because of the famous casinos and gambling.
  2. Obviously this is a retake of the fantastic article by Ben Hammel; I couldn’t resist giving it my own simplified twist, but Ben did most of the original work and much more. You should check out his work!

Are Deep Neural Networks Dramatically Overfitted?

If you are like me, entering into the field of deep learning with experience in traditional machine learning, you may often ponder over this question: Since a typical deep neural network has so many parameters and training error can easily be perfect, it should surely suffer from substantial overfitting. How could it be ever generalized to out-of-sample data points?

Tremendous piece by Lilian Weng about deep learning. If you ever wondered why and how deep learning works without hopelessly overfitting, this article is for you. It includes a lot of references to interesting and current research.

Chaos is beautiful

During my time at the University of Toronto, I dipped my toes into the fascinating field of chaos theory. I always found it to be particularly impressive how small changes in the initial conditions of a dynamical system can lead to what scientists call “deterministic chaos”.

Recently I stumbled across an animation of a triple pendulum and then across this great blogpost by Jake VanderPlas. Jake managed to recreate the animation in Python and SymPy. Since the equations of motion for the triple pendulum tend to get messy, Kane’s Method is the way to go.

I don’t even want to take away too much. Just go to his website, download the IPython notebook Jake prepared, and play around with the parameters. As they say: the flap of a butterfly’s wing really does seem capable of causing a tornado.