In today's world there is plenty of data, but finding good data for machine learning is challenging. Feeding real-time data to a machine learning algorithm is even more challenging! This is where automation comes in. In many cases a machine learning engineer will need more data, or more representation of less common parts of the data. Take handwriting, for example: the letter "Z" is not used much, yet it is important for a model to tell "Z"s apart from "S"s. So you may need to augment your data so that there are more "Z"s for the model to learn from. This can be done even with only a handful of "Z"s.
So how do we do this? The simplest way is to take the data that needs augmentation and slightly edit it to produce more. For images we can rotate, crop, shear, flip, and adjust the brightness and hue. There are comparable processes for other data types, such as sound, but let's stick with images for simplicity's sake. This gives the model multiple representations of the same thing in different forms, which is perfect for machine learning.
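As a minimal sketch of those edits, here is how a few of them look on a raw pixel array using only NumPy. The `augment` helper and the toy white-square image are my own illustrative inventions, not from any particular library; real pipelines would also resize the crop back to the original dimensions.

```python
import numpy as np

def augment(image, seed=0):
    """Return several lightly edited copies of one (H, W, 3) uint8 image.
    A hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    copies = []
    copies.append(np.fliplr(image))                      # horizontal flip
    copies.append(np.rot90(image))                       # 90-degree rotation
    factor = rng.uniform(0.8, 1.2)                       # random brightness shift
    copies.append(np.clip(image * factor, 0, 255).astype(np.uint8))
    copies.append(image[h // 8 : h - h // 8, w // 8 : w - w // 8])  # center crop
    return copies

# A toy 32x32 "image": a white square on a black background.
img = np.zeros((32, 32, 3), dtype=np.uint8)
img[8:24, 8:24] = 255

augmented = augment(img)
print(len(augmented))  # 4 new variants from one original
```

Each copy still shows "the same thing", just transformed, which is exactly the multiple-representations effect described above.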
Sounds simple enough, but when you have 1,000 images and want to perform a flip on all of them, it would take quite some time with an image editor. This is where the automation comes in. There are a few ways to do it: you could use TensorFlow, Keras, or OpenCV, or any number of other tools, so long as you are not doing it by hand. This automatic process can apply any of the operations above, and more, to augment the data. This leads to better performance and better representation of the parts of the data with few original examples. Done correctly, this process can also help avoid biases in the data.
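To make the 1,000-image point concrete, here is a sketch of flipping an entire batch in one vectorized NumPy call. The random batch is a stand-in for real image data; in practice you would reach for one of the tools named above (TensorFlow, Keras, or OpenCV) rather than hand-rolling each transform.

```python
import numpy as np

# A stand-in batch of 1,000 tiny grayscale "images" (random, for illustration).
images = np.random.default_rng(0).integers(0, 256, size=(1000, 28, 28), dtype=np.uint8)

# One vectorized slice flips every image horizontally at once --
# no image editor, no loop over files.
flipped = images[:, :, ::-1]

print(flipped.shape)  # (1000, 28, 28)
```

The same idea extends to rotations, crops, and brightness shifts: express the edit once and let the library apply it across the whole batch.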
So, step by step, the process is as follows. First, gather your data. Next, determine whether it, or part of it, needs augmentation. If so, load the data into a preprocessor and perform any number of the operations listed above. The preprocessing can be parallelized if you have a Graphics Processing Unit or a Tensor Processing Unit; parallelizing is like adding more lanes to a highway, so the traffic, or data, flows better. Once the augmentation is done, it is important to shuffle your data so there is not a run of one type of entry after another. Think of a deck of cards: you do not want all the aces on top, you want the deck to be random. Then split the data into training, development, and test sets. There are a few ways to do this, but a rule of thumb is a 60–20–20 split. You train the model on the augmented training data, verify the results with the dev set, and then run the test set to see whether the model is really working or has only figured out the dev set. This is considered a best practice in machine learning when dealing with data sets, although there are exceptions.
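The shuffle-and-split step above can be sketched in a few lines. The `shuffle_and_split` function and the toy arrays are hypothetical names for illustration; the 60–20–20 proportions follow the rule of thumb mentioned above.

```python
import numpy as np

def shuffle_and_split(data, labels, seed=0):
    """Shuffle data and labels together, then split 60/20/20
    into train, dev, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))   # one random order for both arrays,
    data, labels = data[order], labels[order]  # so pairs stay matched

    n = len(data)
    n_train = int(0.6 * n)
    n_dev = int(0.2 * n)
    train = (data[:n_train], labels[:n_train])
    dev = (data[n_train:n_train + n_dev], labels[n_train:n_train + n_dev])
    test = (data[n_train + n_dev:], labels[n_train + n_dev:])
    return train, dev, test

X = np.arange(100).reshape(100, 1)  # 100 toy examples
y = np.arange(100)                  # matching toy labels
train, dev, test = shuffle_and_split(X, y)
print(len(train[0]), len(dev[0]), len(test[0]))  # 60 20 20
```

Shuffling before the split is what keeps each set representative, which is the deck-of-cards point: no set should end up with all the aces.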
Data augmentation is a useful thing to know how to do. It can greatly increase accuracy and performance, and it can be used to reduce biases and fill gaps in the data. All of this leads to better results, which leads to better machine learning models that can do more.
Check out my GitHub project on this topic.