The New Wave of Analytics—Machine Learning

By Alexander Woods, Computer Science student, Atlanta GA

I once heard a professor say the key difference between machine learning and statistics was that machine learning predicts the future, and statistics analyzes the past. Statistics will often deal with things like how effective a certain drug was, or whether there is a significant difference between the way males and females use social media.

Machine learning, however, takes past data (ideally huge amounts), observes what has happened, and then makes predictions about new data. When you're typing a search into Google and it starts to autocomplete your search for you, that's machine learning. When you ask Siri a question, and she correctly predicts what you said, that's machine learning. Spam filters, recommendation engines, and many predictive models in business: all machine learning.

Most algorithms in machine learning are used for one of two things: classification or regression. There are other goals, such as clustering or simulation, but most problems fit one of these two. Even a lot of natural language processing can be thought of as an enormous classification problem.

Classification is used when the output is a category, and the model is trained to discriminate between the categories. You train it by feeding it training examples, say a picture of a dog, along with its features (gradients, distribution of colour, etc.) and a label saying that it is indeed a dog. Sometimes good features are particularly hard to derive, as with pictures. The algorithm uses these labels to learn by example, loosely as humans do: it picks up on the attributes that are associated with a dog, and those that aren't.

The algorithm is trained on many examples and then tested on data it hasn't seen before, to measure its accuracy. Performing well on a test set, especially a benchmark dataset that the field recognizes as valid, is a sign of a good algorithm.
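To make the train-then-test idea concrete, here is a minimal sketch using scikit-learn. The dataset (the small iris flower dataset that ships with scikit-learn), the logistic regression model, and the 25% hold-out size are all my own choices for illustration, not anything special; any labelled dataset and classifier would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small labelled dataset: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold back 25% of the examples so the model is scored on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the classifier on the training examples only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare predictions on the held-out examples with their true labels.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on unseen data: {test_accuracy:.2f}")
```

The important part is that `fit` only ever sees the training split. Scoring on the same data you trained on tells you very little about how the model will do on new examples.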

The other task, regression, is similar, except that the output of the algorithm is continuous rather than a category. Linear regression, as taught in a basic statistics class, is the easiest example. Like a Support Vector Machine (before the kernel trick!) and many forms of Naive Bayes, linear regression assumes a linear relationship. If you used a person's height and shoe size to predict their weight, that would be a regression problem.
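The height and shoe size example can be sketched in a few lines of numpy. Note that the data below is made up (I generate it from a formula plus random noise), and the variable names and the new "person" at the end are hypothetical; the point is only to show what fitting a linear relationship and predicting a continuous value looks like.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 200

# Made-up data, for illustration only.
height_cm = rng.normal(170, 10, size=n_people)
shoe_size = 0.25 * height_cm + rng.normal(0, 1, size=n_people)
weight_kg = 0.5 * height_cm + 1.0 * shoe_size - 60 + rng.normal(0, 3, size=n_people)

# Design matrix: one column per feature plus a column of ones for the intercept.
A = np.column_stack([height_cm, shoe_size, np.ones(n_people)])

# Ordinary least squares: find the weights that minimise the squared error.
coef, *_ = np.linalg.lstsq(A, weight_kg, rcond=None)

# Root-mean-square error of the fit on the data it was trained on.
rmse = float(np.sqrt(np.mean((A @ coef - weight_kg) ** 2)))

# Predict a continuous value for a new, hypothetical person.
predicted_weight = float(np.array([180.0, 45.0, 1.0]) @ coef)
print(f"predicted weight: {predicted_weight:.1f} kg (fit RMSE {rmse:.1f} kg)")
```

Unlike the classifier, the output here is a number on a continuous scale, and the error is measured as a distance from the true value instead of as right or wrong.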

There are other goals of machine learning, but usually the most challenging part is finding a good feature representation (the set of features that best represents the data). The algorithm predicts the label (a category or a numerical value) from those features, and some algorithms do part of the feature selection themselves. For example, a random forest trains many decision trees, each on a random sample of the data and a random subset of the features, combines their predictions, and can report which features contributed most to accurate splits.
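Here is a rough sketch of how a random forest can tell you which features matter, using scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The data is made up so that I know the right answer in advance: only the first two columns influence the label, and the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up data: 500 examples with 4 features.
X = rng.normal(size=(500, 4))
# Only the first two features determine the label; the last two are pure noise.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Train 100 decision trees, each on a bootstrap sample of the rows.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalised so that they sum to 1.
importances = forest.feature_importances_
for name, score in zip(["signal_a", "signal_b", "noise_a", "noise_b"], importances):
    print(f"{name}: {score:.2f}")
```

The two signal features should come out well ahead of the noise features. On real data the picture is rarely this clean, so treat the scores as a hint and not as proof.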

There is, however, an important distinction within machine learning. There are two different types of learning: supervised and unsupervised. Classification and regression are the classic supervised tasks, while clustering is the typical unsupervised one; the difference lies less in the mathematics than in whether the training data comes with labels. The vast majority of machine learning that currently takes place is supervised, because it's much easier to set up and to evaluate.


Supervised Learning

A supervised machine learning model is one trained on examples where the correct output is already known. In other words, you provide the labels. For example, in a dataset of tumours, you are given which are malignant and which are benign, or you know which of the hard drives in a dataset were defective and which were not. In industry, these are almost always the models that are used, because they don't need to be too complex if you're just trying to predict something simple, like how many customers you will have per day.
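To see what labelled data looks like in practice, here is a small sketch that loads the breast cancer dataset bundled with scikit-learn. Using this particular dataset is my own choice because it matches the tumour example; every row is a tumour described by measured features, and every row comes with a known label.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# One row per tumour, one column per measured feature.
print(X.shape)

# The labels are stored as integers; target_names maps them back to words.
print(list(data.target_names))

# Count how many examples carry each label.
label_counts = {
    str(name): int(count)
    for name, count in zip(data.target_names, np.bincount(y))
}
print(label_counts)
```

The array `y` is what makes this a supervised problem: without it, you would only have measurements and no answer to learn from.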


Unsupervised Learning

Unsupervised models are often harder to build and evaluate, but they can be very powerful. In this case there are no labels, so the algorithm has to infer structure in the data on its own, and some methods learn useful features instead of you designing them by hand. For example, you are given a set of news articles and have to work out what groups they fall into, without being told the groups in advance. Remember that creating a good feature representation is essentially the hardest part of the problem. So if you can get these methods to work well, and code them with relative ease (i.e. have kickass packages, written by brilliant people), then your business decisions and apps will be far more intelligent.
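The news article example can be sketched with scikit-learn by turning each text into TF-IDF word weights and grouping them with k-means. The six "headlines" below are invented for illustration, and choosing k-means with two clusters is an assumption on my part; in a real problem you usually don't know the right number of groups.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented headlines; note that no labels are supplied anywhere.
headlines = [
    "The team won the football match with a late goal",
    "The striker scored a goal and the team won the match",
    "Football fans cheered as the team scored a goal",
    "The stock market fell as investors sold shares",
    "Investors bought shares and the stock market rose",
    "The bank said the stock market and shares may fall",
]

# Turn each headline into a vector of word weights, ignoring common words.
vectors = TfidfVectorizer(stop_words="english").fit_transform(headlines)

# Ask k-means to split the vectors into two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(vectors)

for cluster_id, headline in zip(cluster_ids, headlines):
    print(cluster_id, headline)
```

The cluster numbers themselves are arbitrary. The algorithm only says which headlines belong together, and it's up to you to look at each group and decide that one is sport and the other is finance.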

As of now, you can usually get better results, with less effort, by using a lot of data and supervised learning, but that may change.

If you're interested in learning some of the math behind machine learning (I hope you're solid on your calculus and linear algebra), then it's best to take Andrew Ng's course on Coursera. A basic understanding of the internals of each algorithm is important, and I recommend you learn these things.

If you want to start making models immediately, the best tools are Python (scikit-learn, numpy, etc.) and R. Kaggle competitions are also a great way to get into machine learning, and you can even mess around with datasets such as MNIST, the dataset for handwriting recognition problems. I have a blog as well, with a decent number of tutorials using these tools to create machine learning models, and there are a lot of other examples out there.