Bisecting k-means algorithm attributes

You can tune various aspects of bisecting k-means clustering by changing some of the attributes of the algorithm.

Below is the list of algorithm attributes along with their default values. When pasting the JSON into your REST API requests, choose one of the available configuration variants where noted. Each attribute is described in detail in the sections that follow.

{
  "matrixBuilder": {
    "maxWordDf": 0.9,
    "termWeighting": {
      // 3 configuration variants available, choose one
      "@type": "LinearTfIdfTermWeighting",
      "@type": "LogTfIdfTermWeighting",
      "@type": "TfTermWeighting"
    }
  },
  "matrixReducer": {
    "factorizationFactory": {
      // 5 configuration variants available, choose one
      "@type": "KMeansMatrixFactorizationFactory",
      "@type": "LocalNonnegativeMatrixFactorizationFactory",
      "@type": "NonnegativeMatrixFactorizationEDFactory",
      "@type": "NonnegativeMatrixFactorizationKLFactory",
      "@type": "PartialSingularValueDecompositionFactory"
    }
  },
  "preprocessing": {},
  "queryHint": null
}

clusterCount

Type
Integer
Default
25
Constraints
value >= 2
Path
clusterCount
Java snippet
algorithmInstance.clusterCount

The algorithm will create at most the specified number of clusters.

labelCount

Type
Integer
Default
3
Constraints
value >= 1 and value <= 10
Path
labelCount
Java snippet
algorithmInstance.labelCount

The minimum number of labels to return for each cluster.

maxWordDf

Type
Double
Default
0.9
Constraints
value >= 0.0 and value <= 1.0
Path
matrixBuilder.maxWordDf
Java snippet
algorithmInstance.matrixBuilder.maxWordDf

The maximum document frequency allowed for words, as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, that is, clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values: 0.1 or 0.05.
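The effect of maxWordDf can be illustrated with a small, self-contained sketch. It is not the library's implementation: the class name MaxWordDfSketch and the method keptWords are hypothetical, and documents are assumed to be already split into lowercase words.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class MaxWordDfSketch {
    // Returns the words whose document frequency, expressed as a fraction
    // of all documents, does not exceed maxWordDf.
    static Set<String> keptWords(List<List<String>> documents, double maxWordDf) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            // Count each word at most once per document.
            for (String word : new HashSet<>(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            double fraction = entry.getValue() / (double) documents.size();
            if (fraction <= maxWordDf) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> documents = List.of(
            List.of("acme", "solar", "panel"),
            List.of("acme", "wind", "turbine"),
            List.of("acme", "solar", "inverter"),
            List.of("acme", "battery", "storage"),
            List.of("acme", "wind", "farm"));
        // "acme" appears in every document, so it is dropped at 0.4.
        System.out.println(keptWords(documents, 0.4));
        // At 1.0 every word is kept, including "acme".
        System.out.println(keptWords(documents, 1.0));
    }
}
```

In this toy input the word "acme" plays the role of the company name from a header or footer: lowering the threshold removes it while leaving the more specific words in place.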

maximumMatrixSize

Type
Integer
Default
37500
Constraints
value >= 5000
Path
matrixBuilder.maximumMatrixSize
Java snippet
algorithmInstance.matrixBuilder.maximumMatrixSize

Determines the maximum number of elements in the term-document matrix. A larger matrix makes clustering more accurate, but also more time- and memory-consuming.
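To get a feel for what the limit means in practice, the arithmetic below assumes the matrix has one row per term and one column per document, so its element count is terms multiplied by documents. The class and method names are hypothetical, and how the library chooses which terms to keep is not covered here.

```java
// Hypothetical helper, for illustration only.
class MatrixSizeSketch {
    // Largest number of term rows that fits within the element budget,
    // assuming one column per document.
    static int maxTermRows(int maximumMatrixSize, int documentCount) {
        if (documentCount <= 0) {
            throw new IllegalArgumentException("documentCount must be positive");
        }
        return maximumMatrixSize / documentCount;
    }

    public static void main(String[] args) {
        // With the default budget of 37500 elements:
        System.out.println(maxTermRows(37500, 100)); // 375 terms for 100 documents
        System.out.println(maxTermRows(37500, 500)); // 75 terms for 500 documents
    }
}
```

Under this assumption, the number of terms that can be represented shrinks as the number of input documents grows, which is one reason to raise the limit for larger document sets.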

termWeighting

Type
org.carrot2.text.vsm.TermWeighting
Default
LogTfIdfTermWeighting
Path
matrixBuilder.termWeighting
Java snippet
algorithmInstance.matrixBuilder.termWeighting

The method for calculating the weight of words in the term-document matrix.

Available implementations:

  1. LinearTfIdfTermWeighting

    Calculates term-document matrix element values based on term frequency scaled by a linear inverse document frequency factor.

  2. LogTfIdfTermWeighting

    Calculates term-document matrix element values based on term frequency scaled by a logarithmic inverse document frequency factor.

  3. TfTermWeighting

    Calculates term-document matrix element values based on Term Frequency.
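The three weighting schemes above can be compared with a short sketch. The formulas are common textbook formulations and are an assumption here; the library's exact expressions may differ. The class and method names are hypothetical.

```java
// Hypothetical helper, for illustration only. The formulas are common
// textbook variants, not necessarily the ones the library uses.
class TermWeightingSketch {
    // Plain term frequency: document frequency is ignored.
    static double tf(int termFrequency, int documentFrequency, int documentCount) {
        return termFrequency;
    }

    // Term frequency scaled by log(N / df).
    static double logTfIdf(int termFrequency, int documentFrequency, int documentCount) {
        if (documentFrequency <= 0) {
            return 0;
        }
        return termFrequency * Math.log(documentCount / (double) documentFrequency);
    }

    // Term frequency scaled linearly from 1 (df = 1) down to 0 (df = N).
    static double linearTfIdf(int termFrequency, int documentFrequency, int documentCount) {
        if (documentFrequency <= 0 || documentCount <= 1) {
            return 0;
        }
        return termFrequency * (documentCount - documentFrequency) / (double) (documentCount - 1);
    }

    public static void main(String[] args) {
        int documentCount = 100;
        int termFrequency = 3;
        for (int documentFrequency : new int[] {1, 10, 50, 100}) {
            System.out.printf("df=%d tf=%.3f log=%.3f linear=%.3f%n",
                documentFrequency,
                tf(termFrequency, documentFrequency, documentCount),
                logTfIdf(termFrequency, documentFrequency, documentCount),
                linearTfIdf(termFrequency, documentFrequency, documentCount));
        }
    }
}
```

In both inverse-document-frequency variants a word that occurs in every document receives a weight of zero, whereas plain term frequency keeps its full weight regardless of how common the word is.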

titleWordsBoost

Type
Double
Default
2.0
Constraints
value >= 0.0 and value <= 10.0
Path
matrixBuilder.titleWordsBoost
Java snippet
algorithmInstance.matrixBuilder.titleWordsBoost

Gives more weight to words that appear in title fields. The larger the value, the stronger the boost the title words will receive.

factorizationFactory

Type
org.carrot2.math.matrix.MatrixFactorizationFactory
Default
NonnegativeMatrixFactorizationEDFactory
Path
matrixReducer.factorizationFactory
Java snippet
algorithmInstance.matrixReducer.factorizationFactory

The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels.

Available implementations:

  1. KMeansMatrixFactorizationFactory

    This kind of factorization is sometimes referred to as Concept Decomposition Factorization.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.KMeansMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Controls the number of matrix factorization iterations to perform. The higher the required quality, the more time-consuming the clustering.

  2. LocalNonnegativeMatrixFactorizationFactory

    Performs matrix factorization using the Local Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.LocalNonnegativeMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Controls the number of matrix factorization iterations to perform. The higher the required quality, the more time-consuming the clustering.

  3. NonnegativeMatrixFactorizationEDFactory

    Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of Euclidean Distance between A and UV' and multiplicative updating.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationEDFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Controls the number of matrix factorization iterations to perform. The higher the required quality, the more time-consuming the clustering.

  4. NonnegativeMatrixFactorizationKLFactory

    Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationKLFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Controls the number of matrix factorization iterations to perform. The higher the required quality, the more time-consuming the clustering.

  5. PartialSingularValueDecompositionFactory

    Performs matrix factorization using the Singular Value Decomposition algorithm.
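To make the multiplicative updating mentioned for the Euclidean Distance variant more concrete, here is a minimal sketch of the classic multiplicative update rules for approximating A by UV'. It is not the library's code: the class NmfEdSketch, the fixed random seed and the explicit iteration count are assumptions, and the source does not say how LOW, MEDIUM and HIGH map to iteration counts.

```java
import java.util.Random;

// Hypothetical sketch of NMF with Euclidean distance and multiplicative updates.
class NmfEdSketch {
    // Matrix product a * b.
    static double[][] mul(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b[0].length; j++)
                for (int x = 0; x < b.length; x++)
                    c[i][j] += a[i][x] * b[x][j];
        return c;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    // Element-wise update x <- x .* num ./ den; the small constant avoids division by zero.
    static void update(double[][] x, double[][] num, double[][] den) {
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x[0].length; j++)
                x[i][j] *= num[i][j] / (den[i][j] + 1e-12);
    }

    // Squared Euclidean distance between A and UV'.
    static double error(double[][] a, double[][] u, double[][] v) {
        double[][] approx = mul(u, transpose(v));
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                sum += (a[i][j] - approx[i][j]) * (a[i][j] - approx[i][j]);
        return sum;
    }

    static double[][] randomMatrix(int rows, int cols, Random random) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i][j] = 0.1 + random.nextDouble(); // strictly positive start
        return m;
    }

    // Returns {U, V}, where A is m x n, U is m x k and V is n x k.
    static double[][][] factorize(double[][] a, int k, int iterations) {
        Random random = new Random(42); // fixed seed keeps the sketch repeatable
        double[][] u = randomMatrix(a.length, k, random);
        double[][] v = randomMatrix(a[0].length, k, random);
        for (int it = 0; it < iterations; it++) {
            // V <- V .* (A'U) ./ (V U'U)
            update(v, mul(transpose(a), u), mul(v, mul(transpose(u), u)));
            // U <- U .* (A V) ./ (U V'V)
            update(u, mul(a, v), mul(u, mul(transpose(v), v)));
        }
        return new double[][][] {u, v};
    }

    public static void main(String[] args) {
        double[][] a = {
            {3, 2, 0, 0},
            {2, 3, 0, 0},
            {0, 0, 4, 1},
            {0, 0, 1, 4}};
        for (int iterations : new int[] {0, 5, 50}) {
            double[][][] uv = factorize(a, 2, iterations);
            System.out.printf("iterations=%d error=%.4f%n", iterations, error(a, uv[0], uv[1]));
        }
    }
}
```

Because each update only multiplies by non-negative ratios, U and V stay non-negative throughout. More iterations bring UV' closer to A at the cost of more computation, which is the trade-off the factorizationQuality attribute exposes.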

maxIterations

Type
Integer
Default
15
Constraints
value >= 1
Path
maxIterations
Java snippet
algorithmInstance.maxIterations

The maximum number of k-means iterations to perform.

partitionCount

Type
Integer
Default
2
Constraints
value >= 2 and value <= 10
Path
partitionCount
Java snippet
algorithmInstance.partitionCount

The number of partitions to create at each k-means clustering iteration.
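The interplay between clusterCount, maxIterations and partitionCount can be sketched as follows. This is a simplified illustration, not the library's implementation. It makes three assumptions: the largest cluster is always the next one to be split, centroids are seeded deterministically from evenly spaced points, and the last split is narrowed so that the total never exceeds clusterCount. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of bisecting k-means on dense points.
class BisectingKMeansSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    // Plain k-means on a single cluster; returns its non-empty partitions.
    static List<List<double[]>> kMeans(List<double[]> points, int k, int maxIterations) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points.get(c * points.size() / k).clone(); // deterministic seeding
        }
        int[] assignment = new int[points.size()];
        for (int it = 0; it < maxIterations; it++) {
            boolean changed = false;
            for (int p = 0; p < points.size(); p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points.get(p), centroids[c])
                            < squaredDistance(points.get(p), centroids[best])) best = c;
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            if (!changed && it > 0) break; // converged before maxIterations
            for (int c = 0; c < k; c++) {
                double[] mean = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.size(); p++) {
                    if (assignment[p] != c) continue;
                    count++;
                    for (int d = 0; d < mean.length; d++) mean[d] += points.get(p)[d];
                }
                if (count == 0) continue; // keep the old centroid for an empty partition
                for (int d = 0; d < mean.length; d++) mean[d] /= count;
                centroids[c] = mean;
            }
        }
        List<List<double[]>> parts = new ArrayList<>();
        for (int c = 0; c < k; c++) {
            List<double[]> part = new ArrayList<>();
            for (int p = 0; p < points.size(); p++) if (assignment[p] == c) part.add(points.get(p));
            if (!part.isEmpty()) parts.add(part);
        }
        return parts;
    }

    static List<List<double[]>> cluster(List<double[]> points, int clusterCount,
                                        int partitionCount, int maxIterations) {
        List<List<double[]>> clusters = new ArrayList<>();
        clusters.add(new ArrayList<>(points));
        while (clusters.size() < clusterCount) {
            int largest = 0; // assumption: split the largest cluster next
            for (int i = 1; i < clusters.size(); i++) {
                if (clusters.get(i).size() > clusters.get(largest).size()) largest = i;
            }
            // Narrow the last split so the total stays at or below clusterCount.
            int k = Math.min(partitionCount, clusterCount - clusters.size() + 1);
            if (clusters.get(largest).size() < k) break;
            List<List<double[]>> parts = kMeans(clusters.get(largest), k, maxIterations);
            if (parts.size() < 2) break; // the split made no progress
            clusters.remove(largest);
            clusters.addAll(parts);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        for (double x : new double[] {0, 0.1, 0.2, 10, 10.1, 10.2, 20, 20.1}) {
            points.add(new double[] {x});
        }
        for (List<double[]> c : cluster(points, 3, 2, 15)) {
            System.out.println(c.size() + " points, first = " + c.get(0)[0]);
        }
    }
}
```

With partitionCount set to 2 each step is a true bisection. Larger values split a cluster into more parts per step, so the requested clusterCount is reached in fewer, coarser steps.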

wordDfThreshold

Type
Integer
Default
1
Constraints
value >= 1 and value <= 100
Path
preprocessing.wordDfThreshold
Java snippet
algorithmInstance.preprocessing.wordDfThreshold

Words appearing in fewer than wordDfThreshold documents will be ignored.
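Unlike maxWordDf, which is a fraction, this threshold is an absolute document count. A minimal sketch of the rule, using hypothetical names and assuming documents are already split into words, could look like this:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class WordDfThresholdSketch {
    // Keeps words that occur in at least wordDfThreshold documents.
    static Set<String> keptWords(List<List<String>> documents, int wordDfThreshold) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            // Repeated occurrences inside one document count once.
            for (String word : new HashSet<>(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        documentFrequency.forEach((word, count) -> {
            if (count >= wordDfThreshold) {
                kept.add(word);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> documents = List.of(
            List.of("solar", "panel", "panel"),
            List.of("solar", "inverter"),
            List.of("wind", "turbine"));
        System.out.println(keptWords(documents, 1)); // default: nothing is dropped
        System.out.println(keptWords(documents, 2)); // only "solar" occurs in two documents
    }
}
```

With the default value of 1 no word is removed. Raising the threshold discards rare words, which can reduce noise from typos and one-off terms.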

queryHint

Type
String
Default
undefined
Path
queryHint
Java snippet
algorithmInstance.queryHint

The query is used as a hint to avoid trivial clusters.

useDimensionalityReduction

Type
Boolean
Default
true
Path
useDimensionalityReduction
Java snippet
algorithmInstance.useDimensionalityReduction

If true, k-means will be applied to the dimensionality-reduced term-document matrix, with the number of dimensions equal to twice the number of requested clusters. If the number of input documents does not exceed that number of dimensions, reduction will not be performed. If false, k-means will be performed directly on the original term-document matrix.