Bisecting k-means algorithm attributes
You can tune various aspects of bisecting k-means clustering by changing some of the attributes of the algorithm.
Below is the list of algorithm attributes along with their default values. When pasting the JSON to your REST API requests, choose one of the available configuration variants where noted. Click on the property name to see the available documentation.
{
  "matrixBuilder": {
    "termWeighting": {
      // 3 configuration variants available, choose one
      "@type": "LinearTfIdfTermWeighting"
      "@type": "LogTfIdfTermWeighting"
      "@type": "TfTermWeighting"
    },
    "titleWordsBoost": 2.0
  },
  "matrixReducer": {
    "factorizationFactory": {
      // 5 configuration variants available, choose one
      "@type": "KMeansMatrixFactorizationFactory", "factorizationQuality": "HIGH"
      "@type": "LocalNonnegativeMatrixFactorizationFactory", "factorizationQuality": "HIGH"
      "@type": "NonnegativeMatrixFactorizationEDFactory", "factorizationQuality": "HIGH"
      "@type": "NonnegativeMatrixFactorizationKLFactory", "factorizationQuality": "HIGH"
      "@type": "PartialSingularValueDecompositionFactory"
    }
  },
  "useDimensionalityReduction": "true"
}
clusterCount
- Type
- Integer
- Default
- 25
- Constraints
- value >= 2
- Path
- clusterCount
- Java snippet
- algorithmInstance.clusterCount
The algorithm will create at most the specified number of clusters.
labelCount
- Type
- Integer
- Default
- 3
- Constraints
- value >= 1 and value <= 10
- Path
- labelCount
- Java snippet
- algorithmInstance.labelCount
The minimum number of labels to return for each cluster.
maxWordDf
- Type
- Double
- Default
- 0.9
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- matrixBuilder.maxWordDf
- Java snippet
- algorithmInstance.matrixBuilder.maxWordDf
The maximum document frequency allowed for words, expressed as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored.
For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.
This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.
Another useful application of this attribute is generating only very specific clusters, that is, clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, such as 0.1 or 0.05.
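The filtering rule described above can be illustrated with a minimal, self-contained sketch. It is not the Carrot2 implementation: the class MaxWordDfSketch, the method keptWords and the sample documents are hypothetical names and data chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Carrot2: drops words that occur in too many documents. */
final class MaxWordDfSketch {

  /** Returns, in alphabetical order, the words whose document frequency fraction is at most maxWordDf. */
  static List<String> keptWords(List<List<String>> documents, double maxWordDf) {
    // Count the number of documents each word occurs in (its document frequency).
    Map<String, Integer> documentFrequency = new TreeMap<>();
    for (List<String> document : documents) {
      for (String word : new HashSet<>(document)) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }
    // Keep only the words that appear in at most the allowed fraction of documents.
    List<String> kept = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      double fraction = entry.getValue() / (double) documents.size();
      if (fraction <= maxWordDf) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // "acme" stands for a company name repeated in every document.
    List<List<String>> documents = List.of(
        List.of("acme", "clustering", "search"),
        List.of("acme", "search", "index"),
        List.of("acme", "clustering", "labels"),
        List.of("acme", "index", "labels"),
        List.of("acme", "matrix"));
    System.out.println(keptWords(documents, 0.4));
    System.out.println(keptWords(documents, 1.0));
  }
}
```

In this sample, "acme" occurs in all five documents, so any maxWordDf below 1.0 removes it, while words occurring in two of five documents survive a threshold of 0.4.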
maximumMatrixSize
- Type
- Integer
- Default
- 37500
- Constraints
- value >= 5000
- Path
- matrixBuilder.maximumMatrixSize
- Java snippet
- algorithmInstance.matrixBuilder.maximumMatrixSize
Determines the maximum number of elements in the term-document matrix. A larger matrix makes clustering more accurate, but also more time- and memory-consuming.
termWeighting
- Type
- org.carrot2.text.vsm.TermWeighting
- Default
- LogTfIdfTermWeighting
- Path
- matrixBuilder.termWeighting
- Java snippet
- algorithmInstance.matrixBuilder.termWeighting
The method for calculating the weights of words in the term-document matrices.
Available implementations:
- LinearTfIdfTermWeighting
  Calculates term-document matrix element values based on Linear Inverse Term Frequency.
- LogTfIdfTermWeighting
  Calculates term-document matrix element values based on Log Inverse Term Frequency.
- TfTermWeighting
  Calculates term-document matrix element values based on Term Frequency.
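To give a feel for how the three weighting schemes differ, the sketch below uses common textbook formulas. These formulas are an assumption: the descriptions above do not spell out the exact expressions, and the Carrot2 implementations may differ. The class TermWeightingSketch and its method names are hypothetical.

```java
/** Hypothetical illustration of three term-weighting schemes; not the Carrot2 implementation. */
final class TermWeightingSketch {

  /** Plain term frequency: the weight is the number of occurrences of the term in the document. */
  static double tf(int termFrequency) {
    return termFrequency;
  }

  /** Assumed linear variant: term frequency scaled by documentCount / documentFrequency. */
  static double linearTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0) {
      return 0.0;
    }
    return termFrequency * (documentCount / (double) documentFrequency);
  }

  /** Assumed logarithmic variant: term frequency scaled by ln(documentCount / documentFrequency). */
  static double logTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0) {
      return 0.0;
    }
    return termFrequency * Math.log(documentCount / (double) documentFrequency);
  }

  public static void main(String[] args) {
    // A term occurring 3 times in a document and present in 10 of 100 documents.
    System.out.println("tf     = " + tf(3));
    System.out.println("linear = " + linearTfIdf(3, 10, 100));
    System.out.println("log    = " + logTfIdf(3, 10, 100));
  }
}
```

Under these assumed formulas, the logarithmic variant dampens the advantage of rare terms compared with the linear one, and a term present in every document receives a logarithmic weight of zero.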
titleWordsBoost
- Type
- Double
- Default
- 2.0
- Constraints
- value >= 0.0 and value <= 10.0
- Path
- matrixBuilder.titleWordsBoost
- Java snippet
- algorithmInstance.matrixBuilder.titleWordsBoost
Gives more weight to words that appear in title fields. The larger the value, the stronger the boost title words receive.
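One plausible reading of the boost is a multiplier applied to the weight of every word that also occurs in a title. The sketch below assumes exactly that; the class TitleBoostSketch and the method boostedWeight are hypothetical, and the way Carrot2 actually applies the boost is not stated in this section.

```java
import java.util.Set;

/** Hypothetical illustration of a multiplicative title boost; not the Carrot2 implementation. */
final class TitleBoostSketch {

  /** Multiplies the base weight by titleWordsBoost when the word occurs in a title (assumed rule). */
  static double boostedWeight(String word, double baseWeight, Set<String> titleWords, double titleWordsBoost) {
    return titleWords.contains(word) ? baseWeight * titleWordsBoost : baseWeight;
  }

  public static void main(String[] args) {
    Set<String> titleWords = Set.of("k-means", "clustering");
    // With the default boost of 2.0, a title word's weight doubles.
    System.out.println(boostedWeight("k-means", 1.5, titleWords, 2.0));
    // A word that occurs only in the document body keeps its weight.
    System.out.println(boostedWeight("matrix", 1.5, titleWords, 2.0));
  }
}
```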
factorizationFactory
- Type
- org.carrot2.math.matrix.MatrixFactorizationFactory
- Default
- NonnegativeMatrixFactorizationEDFactory
- Path
- matrixReducer.factorizationFactory
- Java snippet
- algorithmInstance.matrixReducer.factorizationFactory
The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels.
Available implementations:
- KMeansMatrixFactorizationFactory
  This kind of factorization is sometimes referred to as Concept Decomposition Factorization.
  factorizationQuality
  - Type
  - org.carrot2.math.matrix.FactorizationQuality
  - Default
  - HIGH
  - Constraints
  - value in [LOW, MEDIUM, HIGH]
  - Path
  - matrixReducer.factorizationFactory.factorizationQuality
  - Java snippet
  - ((org.carrot2.math.matrix.KMeansMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
  The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- LocalNonnegativeMatrixFactorizationFactory
  Performs matrix factorization using the Local Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.
  factorizationQuality
  - Type
  - org.carrot2.math.matrix.FactorizationQuality
  - Default
  - HIGH
  - Constraints
  - value in [LOW, MEDIUM, HIGH]
  - Path
  - matrixReducer.factorizationFactory.factorizationQuality
  - Java snippet
  - ((org.carrot2.math.matrix.LocalNonnegativeMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
  The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- NonnegativeMatrixFactorizationEDFactory
  Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of the Euclidean distance between A and UV' and multiplicative updating.
  factorizationQuality
  - Type
  - org.carrot2.math.matrix.FactorizationQuality
  - Default
  - HIGH
  - Constraints
  - value in [LOW, MEDIUM, HIGH]
  - Path
  - matrixReducer.factorizationFactory.factorizationQuality
  - Java snippet
  - ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationEDFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
  The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- NonnegativeMatrixFactorizationKLFactory
  Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.
  factorizationQuality
  - Type
  - org.carrot2.math.matrix.FactorizationQuality
  - Default
  - HIGH
  - Constraints
  - value in [LOW, MEDIUM, HIGH]
  - Path
  - matrixReducer.factorizationFactory.factorizationQuality
  - Java snippet
  - ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationKLFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
  The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- PartialSingularValueDecompositionFactory
  Performs matrix factorization using the Singular Value Decomposition algorithm.
maxIterations
- Type
- Integer
- Default
- 15
- Constraints
- value >= 1
- Path
- maxIterations
- Java snippet
- algorithmInstance.maxIterations
The maximum number of k-means iterations to perform.
partitionCount
- Type
- Integer
- Default
- 2
- Constraints
- value >= 2 and value <= 10
- Path
- partitionCount
- Java snippet
- algorithmInstance.partitionCount
The number of partitions to create at each k-means clustering iteration.
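The interplay of clusterCount, partitionCount and maxIterations can be illustrated with a minimal, self-contained sketch. It is not the Carrot2 implementation: it clusters one-dimensional points rather than a term-document matrix, and the class BisectingKMeansSketch, the deterministic seeding and the choice to always split the largest cluster are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

/** Hypothetical 1-D bisecting k-means sketch; not the Carrot2 implementation. */
final class BisectingKMeansSketch {

  /** Splits points into at most k non-empty groups using at most maxIterations k-means passes. */
  static List<double[]> kMeans(double[] points, int k, int maxIterations) {
    double[] sorted = points.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    // Deterministic seeding (an assumption): centroids spread evenly over the sorted points.
    double[] centroids = new double[k];
    for (int i = 0; i < k; i++) {
      centroids[i] = sorted[(2 * i + 1) * n / (2 * k)];
    }
    int[] assignment = new int[n];
    Arrays.fill(assignment, -1);
    for (int iteration = 0; iteration < maxIterations; iteration++) {
      boolean changed = false;
      // Assignment step: attach each point to its nearest centroid.
      for (int p = 0; p < n; p++) {
        int best = 0;
        for (int c = 1; c < k; c++) {
          if (Math.abs(sorted[p] - centroids[c]) < Math.abs(sorted[p] - centroids[best])) {
            best = c;
          }
        }
        if (assignment[p] != best) {
          assignment[p] = best;
          changed = true;
        }
      }
      if (!changed) {
        break; // converged before reaching maxIterations
      }
      // Update step: move each centroid to the mean of the points assigned to it.
      double[] sum = new double[k];
      int[] count = new int[k];
      for (int p = 0; p < n; p++) {
        sum[assignment[p]] += sorted[p];
        count[assignment[p]]++;
      }
      for (int c = 0; c < k; c++) {
        if (count[c] > 0) {
          centroids[c] = sum[c] / count[c];
        }
      }
    }
    // Collect the non-empty partitions.
    List<double[]> partitions = new ArrayList<>();
    for (int c = 0; c < k; c++) {
      final int cluster = c;
      double[] members = IntStream.range(0, n)
          .filter(p -> assignment[p] == cluster)
          .mapToDouble(p -> sorted[p])
          .toArray();
      if (members.length > 0) {
        partitions.add(members);
      }
    }
    return partitions;
  }

  /** Repeatedly splits the largest cluster until clusterCount clusters exist or no split is possible. */
  static List<double[]> cluster(double[] points, int clusterCount, int partitionCount, int maxIterations) {
    List<double[]> clusters = new ArrayList<>();
    clusters.add(points);
    while (clusters.size() < clusterCount) {
      int largest = 0;
      for (int i = 1; i < clusters.size(); i++) {
        if (clusters.get(i).length > clusters.get(largest).length) {
          largest = i;
        }
      }
      // Cap the split so the total never exceeds clusterCount, even when partitionCount > 2.
      int k = Math.min(partitionCount, clusterCount - clusters.size() + 1);
      if (clusters.get(largest).length < k) {
        break;
      }
      List<double[]> partitions = kMeans(clusters.get(largest), k, maxIterations);
      if (partitions.size() < 2) {
        break; // the largest cluster could not be split any further
      }
      clusters.remove(largest);
      clusters.addAll(partitions);
    }
    return clusters;
  }

  public static void main(String[] args) {
    double[] points = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 9.1, 8.8};
    for (double[] c : cluster(points, 3, 2, 15)) {
      System.out.println(Arrays.toString(c));
    }
  }
}
```

Because each round replaces one cluster with up to partitionCount smaller ones and the loop stops when no further split is possible, the result contains at most clusterCount clusters, which matches the "at most" wording of the clusterCount description.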
wordDfThreshold
- Type
- Integer
- Default
- 1
- Constraints
- value >= 1 and value <= 100
- Path
- preprocessing.wordDfThreshold
- Java snippet
- algorithmInstance.preprocessing.wordDfThreshold
Words appearing in fewer than wordDfThreshold documents will be ignored.
queryHint
- Type
- String
- Default
- undefined
- Path
- queryHint
- Java snippet
- algorithmInstance.queryHint
The query is used as a hint to avoid trivial clusters.
useDimensionalityReduction
- Type
- Boolean
- Default
- true
- Path
- useDimensionalityReduction
- Java snippet
- algorithmInstance.useDimensionalityReduction
If true, k-means will be applied to the dimensionality-reduced term-document matrix, with the number of dimensions equal to twice the number of requested clusters. If the number of input documents does not exceed that number of dimensions, reduction will not be performed. If false, k-means will be performed directly on the original term-document matrix.
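The decision of whether to reduce can be sketched as a small predicate. This sketch assumes that reduction is skipped whenever the target number of dimensions (twice clusterCount) is not smaller than the number of input documents, since reducing to that many dimensions would not shrink the matrix. The class DimensionalityReductionSketch and its methods are hypothetical and are not part of the Carrot2 API.

```java
/** Hypothetical illustration of the reduction decision; not the Carrot2 implementation. */
final class DimensionalityReductionSketch {

  /** The reduced matrix has twice as many dimensions as there are requested clusters. */
  static int targetDimensions(int clusterCount) {
    return 2 * clusterCount;
  }

  /** Assumed rule: reduce only when enabled and the target is smaller than the document count. */
  static boolean shouldReduce(boolean useDimensionalityReduction, int clusterCount, int documentCount) {
    return useDimensionalityReduction && targetDimensions(clusterCount) < documentCount;
  }

  public static void main(String[] args) {
    // Default clusterCount of 25 gives 50 target dimensions.
    System.out.println(shouldReduce(true, 25, 200));  // 50 dimensions, 200 documents
    System.out.println(shouldReduce(true, 25, 40));   // 50 dimensions, 40 documents
    System.out.println(shouldReduce(false, 25, 200)); // reduction switched off
  }
}
```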