Machine learning is a computing process where large amounts of data are aggregated and analyzed in order to “teach” a computer application to do something without having to explicitly program that behaviour.
Typically, machine learning is done with an outcome you would like your computer application to solve for. Let’s take this question as our example: “Based on looking at patient X’s data, do they have diabetes?”. Someone with domain knowledge about diabetes could program an algorithm manually to determine this, perhaps by looking for a particular diagnosis code or an A1C lab value above a certain percent.
If you wanted to teach a machine to do this type of diagnosis automatically, you could feed a machine learning algorithm a set of labeled data and the algorithm would determine which pieces of that data were useful in determining whether or not a given patient has diabetes.
The labeled data would be a “training set” and would include several data points on each patient (A1C, OGTT, and FPG lab values, for example). These data points are considered features and will determine how the algorithm learns its approach to solving for the outcome. Each patient would be labeled with an indication for whether or not they had diabetes.
The algorithm would use the data inputs that correlated to a positive outcome (“yes, the patient has diabetes”) to learn which pieces of data (A1C, OGTT, or FPG) were important in determining whether or not a new, unknown patient has diabetes or not.
Feature Engineering Is Key
Machine learning algorithms are fairly standardized at this point and it is up to data engineers to know when to use the correct approach given the question they are asking and the kind of data they are working with. Companies built around machine learning today are not creating novel algorithms.
Instead, these companies are looking for new ways to think about how to engineer the features, the amount of data they have access to, and how unique or relatively unavailable that data is. The more well-engineered the features are, the better the machine learning system will perform. The more data that is available, the more refined and accurate the model will be. And the more unique the data is, the harder it will be for another company to replicate the learned models.