Nearest neighbor analysis is a type of predictive modeling that differs from standard categorical or regression analysis. By plotting the attributes of each piece of information as a point, a nearest neighbor classification compares the distances between items in the dataset. Neighbor groups are simply the items in the dataset that lie the shortest distance from one another. The number of neighbors assigned per group depends on the dataset, but the fewer neighbors per group, the more complex and potentially overfit the model becomes. With a hundred data points, for example, a nearest neighbor analysis with 5 items per group would produce 20 groups, whereas a value of 2 would give 50 groups.
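The distance comparison and the group arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the sample points, the choice of Euclidean distance, and the helper names `euclidean`, `nearest_neighbors`, and `group_count` are assumptions made for the example, not something taken from the cited text.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points with the same number of attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(points, query, k):
    """Return the k points closest to query, nearest first."""
    return sorted(points, key=lambda p: euclidean(p, query))[:k]

def group_count(n_points, per_group):
    """Number of groups if n_points are split evenly into groups of per_group items."""
    return n_points // per_group

# Hypothetical two-attribute data, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5), (1.2, 0.8)]
closest = nearest_neighbors(points, (1.0, 1.2), k=2)

print(closest)              # the two items nearest to (1.0, 1.2)
print(group_count(100, 5))  # 20
print(group_count(100, 2))  # 50
```

A smaller `k` makes each neighborhood depend on fewer items, which is why small values tend to follow the noise in the data more closely.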
A new piece of information can then be plotted within the data space. If it falls within the boundaries of a specific nearest neighbor set, it is predicted to have attributes similar to those of the items in that cluster. Unlike regression and standard categorical models, the more attributes are analyzed, the harder the result is to understand or describe. However, because the method doesn’t require supervision or the definition of a target attribute upfront, it can often be easier to simply apply the distance functions to an entire dataset, so that more time is spent refining the understanding than building the model (Provost and Fawcett 2013, 164).
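As a rough sketch of how a new item might borrow attributes from its neighbors, the example below takes a majority vote among the `k` closest stored items. It assumes each stored item already carries a known attribute (the "low" and "high" labels here are hypothetical), and the function name `predict` and the data are invented for illustration.

```python
import math
from collections import Counter

def predict(labeled_points, query, k):
    """Predict an attribute for query by majority vote among its k nearest items."""
    # Rank stored items by Euclidean distance to the new point.
    ranked = sorted(labeled_points, key=lambda item: math.dist(item[0], query))
    # Count the attribute values carried by the k closest items.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical items: (attributes, known attribute value).
data = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 5.5), "high"),
    ((5.5, 6.0), "high"),
]

near_high = predict(data, (5.2, 5.1), k=3)
near_low = predict(data, (1.1, 1.0), k=3)
print(near_high, near_low)
```

Nothing is fitted ahead of time in this sketch: the stored items are the model, and the only work happens when a new point is compared against them.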
Author: Logan Callen
Provost, Foster and Tom Fawcett. 2013. Data Science for Business. 2nd Edition. California: O’Reilly Media, Inc.