# Item


Released

Journal Article

#### Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

##### External Resource

http://www.jmlr.org/papers/volume10/bubeck09a/bubeck09a.pdf

(Publisher version)


##### Fulltext (public)

There are no public fulltexts stored in PuRe

##### Supplementary Material (public)

There is no public supplementary material available

##### Citation

Bubeck, S., & von Luxburg, U. (2009). Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions. *The Journal of Machine Learning Research*, *10*, 657-698.

Cite as: https://hdl.handle.net/11858/00-001M-0000-0013-C591-8

##### Abstract

Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriately small function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce nearest neighbor clustering. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions.

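The idea sketched in the abstract — restrict the candidate partitions to those induced by labelling a small set of seed points and extending each label over that seed's nearest-neighbor cell, then minimize the chosen objective over this restricted class — can be illustrated with a brute-force toy implementation. This is only a minimal sketch under assumptions not taken from the paper (seed count, enumeration strategy, and the within-cluster sum of squares as the example objective); function names are illustrative, not from the authors' code.

```python
import itertools
import numpy as np

def nn_clustering(X, objective, n_clusters=2, seed_idx=None, n_seeds=6, rng=None):
    """Toy sketch of nearest neighbor clustering: candidate partitions are
    those obtained by labelling a small set of seed points and extending
    each label to the seed's nearest-neighbor cell; the objective is then
    minimized over this restricted class by brute-force enumeration."""
    if seed_idx is None:
        rng = np.random.default_rng(rng)
        seed_idx = rng.choice(len(X), size=n_seeds, replace=False)
    seed_idx = np.asarray(seed_idx)
    # index of the nearest seed for every data point
    d = np.linalg.norm(X[:, None, :] - X[seed_idx][None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    best_labels, best_val = None, np.inf
    # enumerate all labellings of the seeds (feasible only for few seeds)
    for seed_labels in itertools.product(range(n_clusters), repeat=len(seed_idx)):
        if len(set(seed_labels)) < n_clusters:
            continue  # require every cluster label to be used
        labels = np.asarray(seed_labels)[nearest]
        val = objective(X, labels)
        if val < best_val:
            best_val, best_labels = val, labels
    return best_labels, best_val

def within_cluster_ss(X, labels):
    """Within-cluster sum of squares (the k-means objective), used here
    only as one example of an arbitrary clustering objective."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in set(labels.tolist()))
```

Because the search runs over labellings of the seeds rather than over all partitions of the sample, the function class is small, which is exactly the restriction the paper argues is needed for consistency; any other objective function can be plugged in unchanged.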