Summary of ImageNet

John Cook
4 min read · Jul 4, 2020

Introduction:

I am going to summarize the paper “ImageNet Classification with Deep Convolutional Neural Networks” in the following blog post. I noticed that this paper was written nearly eight years ago, and it is both frightening and impressive how far things have come in that time. It seems to me that the methods and techniques used in this paper laid the groundwork for object detection in images. What really stands out to me is how much the authors were able to accomplish with such limited computational resources. Below is a history of GFLOPS (billions of floating-point operations per second, a measure of computational power) over time, with single and double precision, up until 2016. I would note that in the subsequent four years there have been major advances in GPU design, such as ray tracing.

Check out karlrupp.net for more information

Procedures:

I would argue that while most of the procedures introduced in this paper are still relevant, one is not: splitting the training across two GPUs. Thanks to hardware improvements, multi-GPU training, while still used, is now mostly handled at a low level, so I would argue the paper's specific split is not very relevant today. The following methods and techniques, though, I believe are still very relevant to this day. The ReLU nonlinearity used in the paper is still used today and is considered one of the best activation functions available, because it avoids the vanishing gradient problem and speeds up training. The fact that ReLU does not require normalization of its inputs is also a factor; in the paper, however, it was aided by Local Response Normalization, which loosely mimics the lateral inhibition found in real neurons. Another method used was overlapping pooling, which “summarize[s] the outputs of neighboring groups of neurons in the same kernel map,” reducing computation down the line while preserving the relevant data. A minimal sketch of these building blocks is shown below.
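To make these pieces concrete, here is a small sketch of my own in PyTorch (not the authors' original CUDA code) of one convolutional block that chains ReLU, Local Response Normalization, and overlapping pooling. The kernel count, LRN constants, and pooling window follow the values reported in the paper for the first layer, but treat the block as illustrative rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

# One conv block in the spirit of the paper's first layer:
# convolution -> ReLU -> local response normalization -> overlapping max pooling.
block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),                   # 96 kernels of 11x11x3, stride 4
    nn.ReLU(inplace=True),                                         # non-saturating activation
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),    # LRN constants from the paper
    nn.MaxPool2d(kernel_size=3, stride=2),                         # overlapping pooling: window 3, stride 2
)

x = torch.randn(1, 3, 224, 224)   # one dummy RGB image
print(block(x).shape)             # sanity check of the output feature map size
```

Because the pooling window (3) is larger than its stride (2), neighboring pooling regions overlap, which is exactly the trick the paper credits with a small but consistent drop in error.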

The overall architecture can be summed up in the image from the paper: the layers are split across the two GPUs, and there are eight layers in total, five convolutional and three fully connected. The authors were limited by their hardware and would have liked to go deeper, but were unable to at the time. Perhaps the most important method used was dropout. Dropout ignores outputs from the previous layer at a fixed rate during training in order to reduce the dependency of any neuron on a particular input, which makes the neurons throughout the network more robust and reduces overfitting. Another feature they used was data augmentation: taking random crops and horizontal flips of the training images (the paper also jitters the RGB intensities) so there are effectively more images to train on. This is computationally cheap and, when done right, can significantly improve performance; a small sketch of both ideas follows this paragraph. I find it interesting that the method built around the hardware is no longer relevant, while the methods that were hardware-independent are still widely used today.
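Here is a minimal sketch of both regularization ideas, again in PyTorch rather than the authors' own code. The dropout rate of 0.5 and the fully connected sizes (4096, 4096, 1000) are the values reported in the paper; the augmentation pipeline only covers the random crops and horizontal flips, and omits the paper's RGB intensity jitter.

```python
import torch.nn as nn
from torchvision import transforms

# Dropout in the fully connected layers: during training each activation is
# zeroed with probability 0.5, so no neuron can rely on one particular input.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),  # input size comes from the conv stack feeding it
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),         # 1000 ImageNet classes
)

# Data augmentation: random 224x224 crops of a 256x256 image plus horizontal
# flips give many slightly different training examples at almost no extra cost.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```

At test time dropout is switched off (in PyTorch, by calling `classifier.eval()`), which matches the paper's approach of using all neurons and scaling their outputs.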

Results:

The results of this architecture were revolutionary at the time. A top-1 error rate of 37.5% versus the next best of 47.1%, and a top-5 error rate of 17.0% versus 28.2%, on ILSVRC-2010 is impressive. I think this method was a paradigm shift in machine learning and set the stage for present-day networks. The overall result was a significant improvement in performance.

Conclusion:

ImageNet was revolutionary at the time: it set the stage for more recent innovations, and most of the hardware-independent methods are still used today. I would argue that the use of dropout layers might be the most significant breakthrough, as they go a long way toward preventing overfitting. I would also argue that this network proved that CNNs are viable and that they are one of the best supervised learning methods available for object detection.

Personal notes:

I would like to pat myself and my fellow gamers on the back for making GPUs commercially viable; it is my belief that we would not have commercially viable GPUs today if it were not for games. I might be a bit biased here. I am also shocked at how far GPUs have come: at this time my laptop can perform 11.2 TFLOPS at single precision, according to Nvidia (I have not tested this, but would like to one day), and it has 16 GB of video RAM. This is mind blowing, and that is only the hardware side of things. I have read from several sources that overall machine learning efficiency is improving significantly faster than Moore's law. While exciting, this is also concerning, as I do not know how we will be able to keep up if this trend continues, especially once the machines learn about machine learning. It will be interesting to see what the future holds.
