As a general rule of thumb, I think distinguishing classification from regression analysis based on whether the target attribute is categorical or numerical is useful. While there are cases where these terms can be confusing, it is a good generalization for visualizing and framing a problem. As with the concept of overfitting, we don't need to explain every edge case of these two types in order to use their general patterns to decide how a new problem should be categorized.
A good example of applying this rule to decide between a classification and a regression technique involves college choices. If we wanted to predict which college someone will attend, we could compile attributes such as their grades, willingness to move long distance, college location, available majors, and cost. The result of the analysis would be a college name, so we would be building a classification model with a categorical target. Even though some of the informational attributes, like cost, are numerical, the target attribute is qualitative. For those numerical costs, we could easily create categorical buckets to break the values up for analysis. If, however, we were trying to predict the future cost of college, we would need a regression model, because the result we are after is numerical.
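The bucketing idea mentioned above can be sketched in a few lines. This is a hypothetical illustration, not code from the text: the bucket names and dollar boundaries below are invented example values, chosen only to show how a numerical attribute like annual cost can be mapped to categorical levels.

```python
# Hypothetical sketch: converting a numeric attribute (annual cost in USD)
# into categorical buckets so it can be treated as a qualitative attribute.
# The boundary values (15,000 and 35,000) are made-up examples.

def cost_bucket(annual_cost):
    """Map a numeric annual cost to a categorical label."""
    if annual_cost < 15_000:
        return "low"
    elif annual_cost < 35_000:
        return "medium"
    else:
        return "high"

costs = [9_500, 22_000, 48_000]
print([cost_bucket(c) for c in costs])  # → ['low', 'medium', 'high']
```

In practice the cut points would be chosen from the data (for example, by quantiles) rather than hard-coded, but the idea is the same: the numeric values survive as ordered category labels.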
There are confusing examples, such as logistic regression, where the word "regression" appears even though the technique predicts a categorical target, but the general rule that regression is for numerical target attributes remains useful. Creating useful generalizations to make predictions is a core value of data modeling, so as long as we understand that this is a heuristic rather than a fixed rule, we can use it to simplify how we categorize future data projects.
Author: Logan Callen
Provost, Foster, and Tom Fawcett. 2013. Data Science for Business. 2nd Edition. California: O'Reilly Media, Inc.