Package org.carrot2.clustering.kmeans
Class BisectingKMeansClusteringAlgorithm
- java.lang.Object
  - org.carrot2.attrs.AttrComposite
    - org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm
- All Implemented Interfaces: AcceptingVisitor, ClusteringAlgorithm
public class BisectingKMeansClusteringAlgorithm extends AttrComposite implements ClusteringAlgorithm
A very simple implementation of bisecting k-means clustering. Unlike other algorithms in Carrot2, this one produces a hard clustering (each document belongs to exactly one cluster). On the other hand, the clusters are labeled only with individual words, which may not always fully correspond to all documents in the cluster.
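To illustrate the general technique, the following is a minimal, self-contained sketch of bisecting k-means on dense numeric vectors. It is not Carrot2's implementation: the class `BisectingKMeansSketch`, its method names, the choice to always split the largest cluster, and the deterministic seeding are all assumptions made for illustration. It works on plain `double[]` points rather than a term-document matrix and produces no labels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the Carrot2 implementation.
final class BisectingKMeansSketch {

  /** Returns at most clusterCount disjoint groups of point indices (a hard clustering). */
  static List<List<Integer>> cluster(double[][] points, int clusterCount, int maxIterations) {
    List<List<Integer>> clusters = new ArrayList<>();
    List<Integer> all = new ArrayList<>();
    for (int i = 0; i < points.length; i++) all.add(i);
    clusters.add(all);

    while (clusters.size() < clusterCount) {
      // Assumption: always bisect the largest remaining cluster.
      int largest = 0;
      for (int i = 1; i < clusters.size(); i++) {
        if (clusters.get(i).size() > clusters.get(largest).size()) largest = i;
      }
      List<List<Integer>> halves = twoMeans(points, clusters.get(largest), maxIterations);
      if (halves.get(0).isEmpty() || halves.get(1).isEmpty()) break; // cannot split further
      clusters.remove(largest);
      clusters.addAll(halves);
    }
    return clusters;
  }

  /** Plain 2-means on the given members, with deterministic seeding. */
  private static List<List<Integer>> twoMeans(double[][] points, List<Integer> members, int maxIterations) {
    List<Integer> left = new ArrayList<>(), right = new ArrayList<>();
    if (members.size() < 2) return Arrays.asList(left, right);
    // Seed with the first member and the member farthest from it.
    double[] a = points[members.get(0)];
    double[] b = a;
    for (int m : members) {
      if (dist(points[m], a) > dist(b, a)) b = points[m];
    }
    for (int iter = 0; iter < maxIterations; iter++) {
      List<Integer> newLeft = new ArrayList<>(), newRight = new ArrayList<>();
      for (int m : members) {
        // Each point goes to exactly one side: the nearer centroid (ties go left).
        (dist(points[m], a) <= dist(points[m], b) ? newLeft : newRight).add(m);
      }
      boolean stable = newLeft.equals(left) && newRight.equals(right);
      left = newLeft;
      right = newRight;
      if (stable || left.isEmpty() || right.isEmpty()) break;
      a = mean(points, left);
      b = mean(points, right);
    }
    return Arrays.asList(left, right);
  }

  /** Squared Euclidean distance. */
  private static double dist(double[] x, double[] y) {
    double sum = 0;
    for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
    return sum;
  }

  /** Centroid of the given (non-empty) members. */
  private static double[] mean(double[][] points, List<Integer> members) {
    double[] c = new double[points[members.get(0)].length];
    for (int m : members) {
      for (int i = 0; i < c.length; i++) c[i] += points[m][i] / members.size();
    }
    return c;
  }

  public static void main(String[] args) {
    double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}};
    System.out.println(cluster(points, 2, 10));
  }
}
```

Because every point is assigned to exactly one side at each split, the resulting groups never overlap, which is the hard-clustering property described above. The loop stops early when a cluster cannot be split, so fewer groups than requested may be returned.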
-
-
Field Summary
Fields (modifier and type, field, description):
- AttrInteger clusterCount: Number of clusters to create.
- EphemeralDictionaries dictionaries: Per-request overrides of language components (dictionaries).
- AttrInteger labelCount: Minimum number of labels to return for each cluster.
- TermDocumentMatrixBuilder matrixBuilder: Configuration of the size and contents of the term-document matrix.
- TermDocumentMatrixReducer matrixReducer: Configuration of the matrix decomposition method to use for clustering.
- AttrInteger maxIterations: Maximum number of k-means iterations to perform.
- static String NAME
- AttrInteger partitionCount: Number of partitions to create at each k-means clustering iteration.
- BasicPreprocessingPipeline preprocessing: Configuration of the text preprocessing stage.
- AttrString queryHint: Query terms used to retrieve documents.
- AttrBoolean useDimensionalityReduction: If enabled, k-means will be applied on the dimensionality-reduced term-document matrix.
-
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
-
-
Constructor Summary
Constructors (constructor, description):
- BisectingKMeansClusteringAlgorithm()
-
Method Summary
All Methods, Instance Methods, Concrete Methods (modifier and type, method, description):
- <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents): Cluster a set of documents.
- Set<Class<?>> requiredLanguageComponents()
-
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.carrot2.attrs.AcceptingVisitor
accept
-
Methods inherited from interface org.carrot2.clustering.ClusteringAlgorithm
optionalLanguageComponents, supports
-
Field Detail
-
NAME
public static final String NAME
- See Also:
- Constant Field Values
-
clusterCount
public final AttrInteger clusterCount
Number of clusters to create. The algorithm will create at most the specified number of clusters.
-
maxIterations
public final AttrInteger maxIterations
Maximum number of k-means iterations to perform.
-
partitionCount
public final AttrInteger partitionCount
Number of partitions to create at each k-means clustering iteration.
-
labelCount
public final AttrInteger labelCount
Minimum number of labels to return for each cluster.
-
queryHint
public final AttrString queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
-
useDimensionalityReduction
public final AttrBoolean useDimensionalityReduction
If enabled, k-means will be applied to the dimensionality-reduced term-document matrix. The number of dimensions will be equal to twice the number of requested clusters. If that number of dimensions is not lower than the number of input documents, reduction will not be performed. If disabled, k-means will be performed directly on the original term-document matrix.
-
matrixBuilder
public TermDocumentMatrixBuilder matrixBuilder
Configuration of the size and contents of the term-document matrix.
-
matrixReducer
public TermDocumentMatrixReducer matrixReducer
Configuration of the matrix decomposition method to use for clustering.
-
preprocessing
public BasicPreprocessingPipeline preprocessing
Configuration of the text preprocessing stage.
-
dictionaries
public EphemeralDictionaries dictionaries
Per-request overrides of language components (dictionaries).
- Since:
- 4.1.0
-
-
Method Detail
-
requiredLanguageComponents
public Set<Class<?>> requiredLanguageComponents()
- Specified by: requiredLanguageComponents in interface ClusteringAlgorithm
- Returns: A set of classes required to be present in the LanguageComponents instance provided for clustering.
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Description copied from interface: ClusteringAlgorithm
Cluster a set of documents.
- Specified by: cluster in interface ClusteringAlgorithm
- Type Parameters:
  T - Any subclass of Document. Clusters of objects of the same type are returned.
- Parameters:
  docStream - A stream of documents for clustering.
  languageComponents - LanguageComponents with a set of suppliers for the required language-specific components.
- Returns: A list of top-level clusters (clusters can form a hierarchy via Cluster.getClusters()).
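Since the returned list holds only top-level clusters, callers that want every cluster need to descend into the sub-clusters. The sketch below shows one way to do that with a depth-first walk. `Node` is a hypothetical stand-in type invented for this example, not Carrot2's `Cluster`; in real code the nested list would presumably come from `Cluster.getClusters()`.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Node is a hypothetical stand-in, not a Carrot2 type.
final class ClusterTreeSketch {

  /** A label plus nested sub-clusters. */
  static final class Node {
    final String label;
    final List<Node> children = new ArrayList<>();

    Node(String label) {
      this.label = label;
    }

    Node add(Node child) {
      children.add(child);
      return this;
    }
  }

  /** Depth-first walk producing one line per cluster, indented by depth. */
  static List<String> outline(List<Node> topLevel) {
    List<String> lines = new ArrayList<>();
    collect(topLevel, 0, lines);
    return lines;
  }

  private static void collect(List<Node> nodes, int depth, List<String> out) {
    for (Node node : nodes) {
      StringBuilder line = new StringBuilder();
      for (int i = 0; i < depth; i++) line.append("  ");
      out.add(line.append(node.label).toString());
      collect(node.children, depth + 1, out); // recurse into sub-clusters
    }
  }
}
```

A flat result, where no cluster has sub-clusters, is simply the case where the recursion never goes deeper than the top level.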
-
-