Package org.carrot2.clustering.kmeans
Class BisectingKMeansClusteringAlgorithm
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm
- All Implemented Interfaces:
AcceptingVisitor, ClusteringAlgorithm
public class BisectingKMeansClusteringAlgorithm extends AttrComposite implements ClusteringAlgorithm
A very simple implementation of bisecting k-means clustering. Unlike other algorithms in Carrot2,
this one produces a hard clustering: each document belongs to exactly one cluster. On the other hand,
the clusters are labeled only with individual words, which may not fully describe every
document in the cluster.
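The general technique can be sketched in plain Java as follows. This is an illustrative sketch only, not Carrot2's implementation: the class `BisectingKMeansSketch`, its method names, the use of dense `double[]` vectors, the "split the largest cluster" rule and the seeding strategy are all assumptions made for the example. The `clusterCount` and `maxIterations` parameters are named after the attributes documented on this page, but their exact semantics in Carrot2 may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative bisecting k-means on dense vectors. Not Carrot2's implementation. */
final class BisectingKMeansSketch {

  /** Returns at most clusterCount disjoint groups; each point lands in exactly one group. */
  static List<List<double[]>> cluster(List<double[]> points, int clusterCount, int maxIterations) {
    List<List<double[]>> clusters = new ArrayList<>();
    clusters.add(new ArrayList<>(points));
    while (clusters.size() < clusterCount) {
      // Assumption: always split the cluster that currently holds the most points.
      int largest = 0;
      for (int i = 1; i < clusters.size(); i++) {
        if (clusters.get(i).size() > clusters.get(largest).size()) largest = i;
      }
      if (clusters.get(largest).size() < 2) break; // nothing left to split
      List<List<double[]>> halves = bisect(clusters.get(largest), maxIterations);
      if (halves.get(0).isEmpty() || halves.get(1).isEmpty()) break; // split failed
      clusters.remove(largest);
      clusters.addAll(halves);
    }
    return clusters;
  }

  /** Plain 2-means. Seeds: the first point and the point farthest from it. */
  static List<List<double[]>> bisect(List<double[]> points, int maxIterations) {
    double[] a = points.get(0).clone();
    double[] b = a;
    for (double[] p : points) {
      if (dist(p, a) > dist(b, a)) b = p;
    }
    b = b.clone();
    List<double[]> left = new ArrayList<>();
    List<double[]> right = new ArrayList<>();
    for (int iter = 0; iter < maxIterations; iter++) {
      List<double[]> newLeft = new ArrayList<>();
      List<double[]> newRight = new ArrayList<>();
      for (double[] p : points) {
        // Hard assignment: every point goes to exactly one of the two centroids.
        if (dist(p, a) <= dist(p, b)) newLeft.add(p); else newRight.add(p);
      }
      boolean unchanged = newLeft.equals(left) && newRight.equals(right);
      left = newLeft;
      right = newRight;
      if (unchanged || left.isEmpty() || right.isEmpty()) break;
      a = mean(left);
      b = mean(right);
    }
    List<List<double[]>> halves = new ArrayList<>();
    halves.add(left);
    halves.add(right);
    return halves;
  }

  /** Squared Euclidean distance. */
  static double dist(double[] x, double[] y) {
    double sum = 0;
    for (int d = 0; d < x.length; d++) sum += (x[d] - y[d]) * (x[d] - y[d]);
    return sum;
  }

  /** Component-wise mean of a non-empty list of vectors. */
  static double[] mean(List<double[]> points) {
    double[] m = new double[points.get(0).length];
    for (double[] p : points) {
      for (int d = 0; d < m.length; d++) m[d] += p[d] / points.size();
    }
    return m;
  }
}
```

Calling `BisectingKMeansSketch.cluster(points, 3, 10)` on six two-dimensional points that form three well-separated pairs yields three groups of two points each. When the points cannot be separated (for example, all vectors are identical), the sketch returns fewer groups than requested, which mirrors the "at most the specified number of clusters" wording used for `clusterCount`.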
-
Field Summary
Fields
Modifier and Type            Field                        Description
AttrInteger                  clusterCount                 The number of clusters to create.
AttrInteger                  labelCount                   Label count.
TermDocumentMatrixBuilder    matrixBuilder                Term-document matrix builder for the algorithm.
TermDocumentMatrixReducer    matrixReducer                Term-document matrix reducer for the algorithm.
AttrInteger                  maxIterations                The maximum number of k-means iterations to perform.
static String                NAME
AttrInteger                  partitionCount               Partition count.
BasicPreprocessingPipeline   preprocessing                A pipeline of components transforming input documents into a PreprocessingContext.
AttrString                   queryHint                    Query terms used to retrieve documents.
AttrBoolean                  useDimensionalityReduction   Use dimensionality reduction.
-
Constructor Summary
Constructors
Constructor                            Description
BisectingKMeansClusteringAlgorithm()
-
Method Summary
Modifier and Type                       Method                                                                          Description
<T extends Document> List<Cluster<T>>   cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Set<Class<?>>                           requiredLanguageComponents()
-
Field Details
-
NAME
- See Also:
- Constant Field Values
-
clusterCount
The number of clusters to create. The algorithm will create at most the specified number of clusters.
-
maxIterations
The maximum number of k-means iterations to perform.
-
partitionCount
Partition count. The number of partitions to create at each k-means clustering iteration.
-
labelCount
Label count. The minimum number of labels to return for each cluster.
-
queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
-
useDimensionalityReduction
Use dimensionality reduction. If true, k-means will be applied on the dimensionality-reduced term-document matrix with the number of dimensions being equal to twice the number of requested clusters. If the number of dimensions is lower than the number of input documents, reduction will not be performed. If false, k-means will be performed directly on the original term-document matrix.
-
matrixBuilder
Term-document matrix builder for the algorithm.
-
matrixReducer
Term-document matrix reducer for the algorithm.
-
preprocessing
A pipeline of components transforming input documents into a PreprocessingContext.
-
-
Constructor Details
-
BisectingKMeansClusteringAlgorithm
public BisectingKMeansClusteringAlgorithm()
-
-
Method Details
-
requiredLanguageComponents
- Specified by:
requiredLanguageComponents in interface ClusteringAlgorithm
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
- Specified by:
cluster in interface ClusteringAlgorithm
-