How I Learned That Machine Learning Is a Lot Like Skiing

John Cook
May 25, 2020

So I used to be a very good skier; I could ski any run on my local mountain in northern Idaho. I love skiing, and I also love machine learning, though with machine learning I am just getting started. I have recently learned that the two have a lot in common. Except that machine learning is like skiing in higher dimensions, with math.

The first topic I want to cover is feature scaling. In skiing terms, this would be the equivalent of measuring how fast you are going, how long the runs are, or the elevation drop of each run: in other words, your data. You want to use the right unit of measurement for each quantity, and when you do calculations you have to convert everything to equivalent units or you cannot add things up. In skiing terms, it would be insane to measure altitude drop in miles and run length in feet, or speed in feet per hour; we use miles per hour instead. But when you want to do some math about how good a skier you are, you need to combine your data in the same units, so without going into detail you would convert the miles and miles per hour into feet. This is all fine and dandy in skiing, but in machine learning, when you do this you find that all of a sudden your "mountains" tend to be much longer than they are tall. We do not want that. In skiing I would love taller mountains, so long as they are skiable, for longer runs, but we have to take what we have. Not so in machine learning. In machine learning we can scale our data so that the height, the width, and any other dimensions are on the same scale. In skiing this would make for longer, more fun runs. You would also want to take multiple runs and collect robust data to figure out how good a skier you are; you could use your best run, but that would not be representative of what is likely to happen on race day. In machine learning, scaling the data gives you better behaved data, because no single dimension gets more attention than the others and the outliers do not skew things as much.
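
To make that concrete, here is a minimal sketch of one way to put every feature on the same scale. Min-max scaling is just one option among several, and the ski numbers below are invented for illustration:

```python
import numpy as np

# Made-up runs: miles skied, top speed in mph, vertical drop in feet.
runs = np.array([[2.1, 34.0, 1200.0],
                 [1.4, 22.0,  800.0],
                 [3.0, 41.0, 1600.0]])

# Min-max scaling: every column ends up between 0 and 1,
# so no single "mountain" dwarfs the others just because of its units.
runs_scaled = (runs - runs.min(axis=0)) / (runs.max(axis=0) - runs.min(axis=0))
print(runs_scaled)
```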

To achieve this in machine learning, we normalize the data. There is a whole process of cleaning and preprocessing the data, making it nice and pretty and tying it up with a bow for the neural network. I am not going to cover that process; I will admit I do not yet know good preprocessing techniques, and I will be learning them later. What I am talking about here is batch shuffling and matrix normalization. Matrix normalization means taking your inputs, finding the mean and standard deviation of each feature, then subtracting the mean from the data and dividing by the standard deviation. If my data were a picture, this would make the lights and darks a little more standardized, so that you do not have local hot spots or cool spots. Or, in skiing terms, so that you do not have ice or slush on your run. Next is shuffling: you can do this easily in Python with NumPy, using a reference index from np.random.permutation(X.shape[0]) to shuffle along axis 0. This is one place where the skiing analogy breaks down, but cards work: if you do not shuffle the deck, what's the point?
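
Here is a rough sketch of both steps, assuming X is a (samples, features) NumPy array and Y holds the matching labels; the dummy data is only there so the snippet runs on its own:

```python
import numpy as np

# Dummy data on wildly different scales: feet, miles, and mph.
X = np.random.rand(100, 3) * np.array([5280.0, 1.0, 60.0])
Y = np.random.randint(0, 2, size=100)

# Matrix normalization: subtract the mean and divide by the standard deviation,
# feature by feature, so there are no local hot spots or cool spots.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Shuffle along axis 0 with a reference index, so X and Y stay aligned.
perm = np.random.permutation(X.shape[0])
X_shuffled, Y_shuffled = X_norm[perm], Y[perm]
```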

Next I would like to discuss optimization algorithms. First, let's talk about mini-batch gradient descent. What it does is break the single batch of gradient descent into smaller, faster batches that fit into your system architecture, ideally the cache of your processor. Think of it as going from slow snowplowing (beginner turns, for you non-skiers) to quick parallel turns (intermediate turns). Or, in picture form, something like this (pardon my drawing skills):

As you can see, snowplow turns tend to be fairly regular, but slow, while parallel turns tend to be chaotic and fast. The same applies to gradient descent and mini-batch gradient descent, except that with mini-batch gradient descent there are more turns and they are faster; I failed to draw that here, as I am not an artist. The point is that the cost and accuracy of normal gradient descent are predictable and slow, while the cost and accuracy of each mini batch can vary. As with all things in life there are a few cons, the main one being that mini-batch size is very system specific, and you should know the architecture you are running on so as not to waste space.
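
Here is a small sketch of what carving one big batch into mini batches can look like; the function name and batch size are my own choices, not anything from a library:

```python
import numpy as np

def mini_batches(X, Y, batch_size=32):
    """Yield shuffled (X, Y) mini batches, one quick 'turn' at a time."""
    m = X.shape[0]
    perm = np.random.permutation(m)      # reshuffle at the start of each epoch
    X, Y = X[perm], Y[perm]
    for start in range(0, m, batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]

X = np.random.rand(100, 3)
Y = np.random.randint(0, 2, size=100)
for X_batch, Y_batch in mini_batches(X, Y):
    pass   # each mini batch would get its own forward pass and gradient step here
```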

Next is gradient descent with momentum, or momentum for short. As the name suggests, it adds momentum to your gradient descent so that you head toward the global minimum faster. The skiing analogy is that you are now following the terrain more closely and taking the fastest line down the run. This is like going from parallel turns to being an Olympic skier, who knows how to use momentum to make perfect turns. Gradient descent with momentum does the same thing: it speeds everything up. It can be implemented with the following TensorFlow training operation: tf.train.MomentumOptimizer(learning_rate=alpha, momentum=beta1).minimize(loss). I would recommend reading the docs on this. The main con of momentum is that it can overshoot the global minimum, as can the rest of the optimizers below.
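
For the curious, here is a hand-written sketch of the momentum update itself, following the convention from Andrew Ng's course; the names W, dW, v, alpha, and beta1 are just placeholders, and TensorFlow's optimizer handles all of this bookkeeping for you (its exact formula may differ slightly):

```python
import numpy as np

alpha, beta1 = 0.01, 0.9          # learning rate and momentum coefficient
W = np.random.randn(3)            # some weights
v = np.zeros_like(W)              # the "velocity" carried from turn to turn

def momentum_step(W, dW, v):
    v = beta1 * v + (1 - beta1) * dW   # exponentially weighted average of past gradients
    W = W - alpha * v                  # step in the smoothed, faster direction
    return W, v

dW = np.random.randn(3)           # stand-in for a gradient computed elsewhere
W, v = momentum_step(W, dW, v)
```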

Next is RMSProp, or root mean square propagation. If momentum is like knowing how to ski aggressively, then RMSProp is like knowing the run like the back of your hand. It keeps a running root mean square of the previous gradients and factors that into the current update, which means that after each run it improves faster than standard gradient descent. It is like learning, after each run, how to shave off a little more time. In TensorFlow it can be implemented with tf.train.RMSPropOptimizer(learning_rate=alpha, decay=beta2, epsilon=epsilon).minimize(loss), and the documentation is here.
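
Again, here is a hand-written sketch of the RMSProp update in the same course-style convention; tf.train.RMSPropOptimizer does all of this for you, so this is only to show the idea:

```python
import numpy as np

alpha, beta2, epsilon = 0.001, 0.9, 1e-8
W = np.random.randn(3)
s = np.zeros_like(W)              # running average of squared gradients

def rmsprop_step(W, dW, s):
    s = beta2 * s + (1 - beta2) * dW**2           # remember how bumpy each direction has been
    W = W - alpha * dW / (np.sqrt(s) + epsilon)   # take smaller steps in the bumpy directions
    return W, s

dW = np.random.randn(3)
W, s = rmsprop_step(W, dW, s)
```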

Next is Adam, or adaptive moment estimation. It is a combination of momentum and RMSProp and is very effective, so long as it does not overshoot the global minimum. When you are skiing fast, sometimes it is hard to stop at the lift; in machine learning, the lift is the global minimum. This can be mitigated with learning rate decay. Adam is, in my opinion, your best bet if you want fast training, but again, it can overshoot. Adam is implemented like so: tf.train.AdamOptimizer(learning_rate=alpha, beta1=beta1, beta2=beta2, epsilon=epsilon).minimize(loss), and again I would read the documentation, as it explains everything much better than I can.
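
And here is a sketch of Adam itself, which is essentially the two previous updates glued together with bias correction. As before, this is the course-style formula rather than TensorFlow's internals, and t counts how many updates you have taken:

```python
import numpy as np

alpha, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8
W = np.random.randn(3)
v, s = np.zeros_like(W), np.zeros_like(W)

def adam_step(W, dW, v, s, t):
    v = beta1 * v + (1 - beta1) * dW              # momentum half
    s = beta2 * s + (1 - beta2) * dW**2           # RMSProp half
    v_hat = v / (1 - beta1**t)                    # bias correction for the first few steps
    s_hat = s / (1 - beta2**t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + epsilon)
    return W, v, s

dW = np.random.randn(3)
W, v, s = adam_step(W, dW, v, s, t=1)
```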

Learning rate decay is like learning a new run in skiing: after each lap you know more and have less to learn, so you can ski more aggressively, with fewer, shorter, faster turns. In machine learning, it means shrinking your learning rate after each epoch so that you can do better on the next one without going too fast. The learning rate is like the frequency of your turns: smaller learning rates give you smaller, quicker steps, which can slow you down when the gradient is very steep (when it is steep you want to go straight), but you want quick little turns toward the end of the descent to avoid missing the global minimum. It is like approaching the ski lift: if you come in at full speed you are likely to hit someone, so you have to slow down and make many small course corrections. Here it is: tf.train.inverse_time_decay(learning_rate=alpha, global_step=global_step, decay_steps=decay_step, decay_rate=decay_rate, staircase=True), and here are the docs.
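
To see what inverse time decay actually does to the learning rate, here is the formula written out by hand, assuming staircase=True and one decay step per epoch (my own choice of numbers, not values from the docs):

```python
alpha0, decay_rate = 0.1, 1.0     # starting learning rate and decay rate (made-up values)

for epoch in range(5):
    alpha = alpha0 / (1 + decay_rate * epoch)   # the learning rate shrinks each epoch
    print(epoch, alpha)                         # 0.1, 0.05, 0.0333..., 0.025, 0.02
```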

I hope this short article has made machine learning a little easier to visualize, and I apologize to those who have not skied, since sports analogies are not always the best. Still, the mathematics of machine learning and skiing share a lot in common, as I have tried to demonstrate. I would highly recommend always reading the documentation for a given operation in machine learning, and I have made an attempt to link to it in the article and below.

Note:
The code in this article uses TensorFlow 1.12, whose documentation is in limbo; not all pages are still available.

Sources:
Andrew Ng

https://www.youtube.com/watch?v=mysLJsbj45w&t=299s

Special thanks to Holberton School for helping me get this far.
