A key aspect of data mining is understanding the available data. Not all attributes are equally useful, and some are far more informative than others about a given target variable. To build a successful model, you must select the data attributes that provide the most information. Determining which variables will be most useful in making classifications is the basis of information gain, defined as a “measure of how much an attribute improves (decreases) entropy over the whole segmentation it creates” (Provost and Fawcett 2013, 52). The overall goal is to find the variables that provide the most information gain and use those variables in your model.
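That definition can be written directly as code. Below is a minimal sketch of the two quantities in Python; the function names are my own illustration, not taken from Provost and Fawcett. Entropy measures the disorder of a set of labels, and information gain is the parent's entropy minus the weighted average entropy of the segments an attribute creates.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    """Parent entropy minus the weighted average entropy of the child segments."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

# A 50/50 mix of labels carries 1 bit of uncertainty; a split that
# separates the classes perfectly recovers that entire bit.
print(entropy(["yes", "yes", "no", "no"]))
print(information_gain(["yes", "yes", "no", "no"], [["yes", "yes"], ["no", "no"]]))
```

A gain of zero means the attribute tells us nothing about the target; the closer the gain is to the parent's entropy, the more the attribute improves our predictions.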
When working on multivariate datasets with supervised segmentation models, it can be difficult to explain to non-technical management what information gain is and how it relates to developing the model. Information gain attempts to quantify whether a specific attribute will increase our knowledge of the value of a target variable (Provost and Fawcett 2013, 53). The easiest way to convey the concept is to show how information gain methods are used in daily life.
An example that quickly comes to mind is diagnosing why a car won’t start. A car is a complex machine with many integrated processes that must work together for it to run properly. A conventional car, for example, requires electricity from the battery or alternator and fuel in the form of gasoline to start. Several other interconnected systems are involved in running the car, but electricity and fuel are the primary requirements. When thinking about why a car won’t start, we naturally check the attributes we know provide the most information about the problem. We turn the key to see whether the dash lights come on or any sound is made. If no lights come on, we know there is an issue with the electrical system, which likely indicates a battery problem. The attribute of whether the dash lights up with the key in the ignition gains us a lot of information toward solving the problem, whereas the attribute of whether the tire pressure is correct tells us essentially nothing about why the car won’t start. If the car appears to have power and the starter turns over without trouble, we may then check whether there is fuel in the tank, because the car may simply be out of gas. We understand these are the most important variables because, over time, enough people have run into these issues to develop a collective decision tree that resolves most cases before an expert needs to be called.
When working on data projects, we may not know which variables will provide the most information about the problem, and the variables we believe are most valuable may turn out not to be. It is therefore important to select and test variables for their ability to reduce uncertainty in the model. The equation for entropy lets us test whether a sub-group gives us better predictability than the parent group of data from the prior question. If the car doesn’t have proper electricity, is the battery or the alternator the more likely culprit? Batteries fail far more often than alternators, so battery status provides more information gain at that step of the decision tree. Like the term data mining, information gain can seem daunting when you look at its definition and mathematical derivation, but it is really just a name for something we do every day: determining which piece of information will help us the most in making a prediction.
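The car example can be made concrete with a small sketch. The breakdown records below are invented purely for illustration (they are not from Provost and Fawcett): each case pairs the candidate attribute (do the dash lights come on?) with the eventual cause of the no-start. Segmenting the cases on the attribute and comparing entropies shows how much uncertainty the dash-light check removes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical no-start records: (dash_lights, cause).
cases = [
    ("off", "battery"), ("off", "battery"), ("off", "battery"), ("off", "alternator"),
    ("on", "no fuel"), ("on", "no fuel"), ("on", "no fuel"), ("on", "starter"),
]

causes = [cause for _, cause in cases]

# Segment the cases by the candidate attribute.
groups = {}
for lights, cause in cases:
    groups.setdefault(lights, []).append(cause)

# Information gain: parent entropy minus weighted child entropy.
weighted_child_entropy = sum(
    len(g) / len(cases) * entropy(g) for g in groups.values()
)
gain = entropy(causes) - weighted_child_entropy
print(f"information gain of 'dash lights': {gain:.3f} bits")
```

In this toy data the dash-light check removes a full bit of uncertainty, immediately narrowing the diagnosis to the electrical causes or the fuel-side causes. An irrelevant attribute such as tire pressure would split the same causes essentially at random, leaving the child segments as disordered as the parent and yielding a gain near zero.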
Author: Logan Callen
Provost, Foster, and Tom Fawcett. 2013. Data Science for Business. Sebastopol, CA: O’Reilly Media, Inc.