org.apache.commons.math3.stat.clustering
Class KMeansPlusPlusClusterer<T extends Clusterable<T>>

java.lang.Object
  extended by org.apache.commons.math3.stat.clustering.KMeansPlusPlusClusterer<T>
Type Parameters:
T - type of the points to cluster

public class KMeansPlusPlusClusterer<T extends Clusterable<T>>
extends Object

Clustering algorithm based on David Arthur and Sergei Vassilvitski k-means++ algorithm.

Since:
2.0
Version:
$Id: KMeansPlusPlusClusterer.java 1244107 2012-02-14 16:17:55Z erans $
See Also:
K-means++ (wikipedia)

Nested Class Summary
static class KMeansPlusPlusClusterer.EmptyClusterStrategy
          Strategies to use for replacing an empty cluster.
 
Field Summary
private  KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy
          Selected strategy for empty clusters.
private  Random random
          Random generator for choosing initial centers.
 
Constructor Summary
KMeansPlusPlusClusterer(Random random)
          Build a clusterer.
KMeansPlusPlusClusterer(Random random, KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy)
          Build a clusterer.
 
Method Summary
private static
<T extends Clusterable<T>>
int
assignPointsToClusters(List<Cluster<T>> clusters, Collection<T> points, int[] assignments)
          Adds the given points to the closest Cluster.
private static
<T extends Clusterable<T>>
List<Cluster<T>>
chooseInitialCenters(Collection<T> points, int k, Random random)
          Use K-means++ to choose the initial centers.
 List<Cluster<T>> cluster(Collection<T> points, int k, int maxIterations)
          Runs the K-means++ clustering algorithm.
 List<Cluster<T>> cluster(Collection<T> points, int k, int numTrials, int maxIterationsPerTrial)
          Runs the K-means++ clustering algorithm.
private  T getFarthestPoint(Collection<Cluster<T>> clusters)
          Get the point farthest to its cluster center
private static
<T extends Clusterable<T>>
int
getNearestCluster(Collection<Cluster<T>> clusters, T point)
          Returns the nearest Cluster to the given point
private  T getPointFromLargestNumberCluster(Collection<Cluster<T>> clusters)
          Get a random point from the Cluster with the largest number of points
private  T getPointFromLargestVarianceCluster(Collection<Cluster<T>> clusters)
          Get a random point from the Cluster with the largest distance variance.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

random

private final Random random
Random generator for choosing initial centers.


emptyStrategy

private final KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy
Selected strategy for empty clusters.

Constructor Detail

KMeansPlusPlusClusterer

public KMeansPlusPlusClusterer(Random random)
Build a clusterer.

The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

Parameters:
random - random generator to use for choosing initial centers

KMeansPlusPlusClusterer

public KMeansPlusPlusClusterer(Random random,
                               KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy)
Build a clusterer.

Parameters:
random - random generator to use for choosing initial centers
emptyStrategy - strategy to use for handling empty clusters that may appear during algorithm iterations
Since:
2.2
Method Detail

cluster

public List<Cluster<T>> cluster(Collection<T> points,
                                int k,
                                int numTrials,
                                int maxIterationsPerTrial)
                                                throws MathIllegalArgumentException
Runs the K-means++ clustering algorithm.

Parameters:
points - the points to cluster
k - the number of clusters to split the data into
numTrials - number of trial runs
maxIterationsPerTrial - the maximum number of iterations to run the algorithm for at each trial run. If negative, no maximum will be used
Returns:
a list of clusters containing the points
Throws:
MathIllegalArgumentException - if the data points are null or the number of clusters is larger than the number of data points

cluster

public List<Cluster<T>> cluster(Collection<T> points,
                                int k,
                                int maxIterations)
                                                throws MathIllegalArgumentException
Runs the K-means++ clustering algorithm.

Parameters:
points - the points to cluster
k - the number of clusters to split the data into
maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used
Returns:
a list of clusters containing the points
Throws:
MathIllegalArgumentException - if the data points are null or the number of clusters is larger than the number of data points

assignPointsToClusters

private static <T extends Clusterable<T>> int assignPointsToClusters(List<Cluster<T>> clusters,
                                                                     Collection<T> points,
                                                                     int[] assignments)
Adds the given points to the closest Cluster.

Type Parameters:
T - type of the points to cluster
Parameters:
clusters - the Clusters to add the points to
points - the points to add to the given Clusters
assignments - points assignments to clusters
Returns:
the number of points assigned to different clusters as the iteration before

chooseInitialCenters

private static <T extends Clusterable<T>> List<Cluster<T>> chooseInitialCenters(Collection<T> points,
                                                                                int k,
                                                                                Random random)
Use K-means++ to choose the initial centers.

Type Parameters:
T - type of the points to cluster
Parameters:
points - the points to choose the initial centers from
k - the number of centers to choose
random - random generator to use
Returns:
the initial centers

getPointFromLargestVarianceCluster

private T getPointFromLargestVarianceCluster(Collection<Cluster<T>> clusters)
Get a random point from the Cluster with the largest distance variance.

Parameters:
clusters - the Clusters to search
Returns:
a random point from the selected cluster

getPointFromLargestNumberCluster

private T getPointFromLargestNumberCluster(Collection<Cluster<T>> clusters)
Get a random point from the Cluster with the largest number of points

Parameters:
clusters - the Clusters to search
Returns:
a random point from the selected cluster

getFarthestPoint

private T getFarthestPoint(Collection<Cluster<T>> clusters)
Get the point farthest to its cluster center

Parameters:
clusters - the Clusters to search
Returns:
point farthest to its cluster center

getNearestCluster

private static <T extends Clusterable<T>> int getNearestCluster(Collection<Cluster<T>> clusters,
                                                                T point)
Returns the nearest Cluster to the given point

Type Parameters:
T - type of the points to cluster
Parameters:
clusters - the Clusters to search
point - the point to find the nearest Cluster for
Returns:
the index of the nearest Cluster to the given point


Copyright (c) 2003-2013 Apache Software Foundation