Good data models generalize: they let us predict future observations. We use models to perceive patterns in data and use those patterns to provide predictive insight into new pieces of information. Overfit models are too specific to the dataset they were built from, which undermines their ability to generalize.
If we look hard enough, we will start to see patterns in data even when things are occurring randomly (Provost and Fawcett 2013, 110). When creating a model, we hope to build something that will predict future outcomes with high skill. However, that drive for accuracy can lead us to keep reducing model errors until we are left with a complex model that explains the historical data perfectly but predicts future outcomes poorly.
We want to avoid overfitting because it reduces a model's ability to generalize, and we need good generalization to accurately predict new observations as they occur. Since what came before doesn't necessarily indicate what will come next, a model should pick up on the major trends and patterns in the historical dataset rather than recreate that dataset perfectly. Balancing the need for accuracy against the need for generalization is not an easy task with a one-size-fits-all solution, but overfitting should be a primary consideration when building a data model.
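The trade-off above can be sketched with a small, hypothetical experiment (the data, seed, and polynomial degrees are illustrative assumptions, not from the text): a high-degree polynomial drives training error toward zero by chasing noise, while a simple model captures the underlying trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "historical" data: an underlying linear trend plus noise.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(scale=0.2, size=x.size)

# Hold out every other point to stand in for "future" observations.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# A simple model that tracks the major trend...
simple = np.polyfit(x_train, y_train, deg=1)
# ...versus a complex model with enough flexibility to chase the noise.
complex_fit = np.polyfit(x_train, y_train, deg=7)

# The complex model always fits the training data at least as well,
# because its basis contains the simple model's basis.
print("train MSE  simple:", mse(simple, x_train, y_train))
print("train MSE complex:", mse(complex_fit, x_train, y_train))
print("test  MSE  simple:", mse(simple, x_test, y_test))
print("test  MSE complex:", mse(complex_fit, x_test, y_test))
```

Comparing the held-out errors shows why a near-perfect fit to historical data is not the goal: the training error of the complex model is misleadingly small, and the held-out error is the more honest measure of predictive skill.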
Author: Logan Callen
Provost, Foster, and Tom Fawcett. 2013. Data Science for Business. 2nd ed. California: O'Reilly Media, Inc.