k-means parameters

You can tune various aspects of bisecting k-means clustering by changing some of the parameters of the algorithm.

The listing below shows all algorithm parameters with their default values. When pasting the JSON into your REST API requests, choose one of the available configuration variants where noted. Each property is described in its own section further down.

{
  "clusterCount": 25,
  "dictionaries": {
    "labelFilters": [],
    "wordFilters": []
  },
  "labelCount": 3,
  "matrixBuilder": {
    "boostFields": [],
    "boostedFieldWeight": 2,
    "maxWordDf": 0.9,
    "maximumMatrixSize": 37500,
    "termWeighting": {
      // 3 configuration variants available, choose one
      "@type": "LinearTfIdfTermWeighting"
      // or
      "@type": "LogTfIdfTermWeighting"
      // or
      "@type": "TfTermWeighting"
    }
  },
  "matrixReducer": {
    "factorizationFactory": {
      // 5 configuration variants available, choose one
      "@type": "KMeansMatrixFactorizationFactory",
      "factorizationQuality": "HIGH"
      // or
      "@type": "LocalNonnegativeMatrixFactorizationFactory",
      "factorizationQuality": "HIGH"
      // or
      "@type": "NonnegativeMatrixFactorizationEDFactory",
      "factorizationQuality": "HIGH"
      // or
      "@type": "NonnegativeMatrixFactorizationKLFactory",
      "factorizationQuality": "HIGH"
      // or
      "@type": "PartialSingularValueDecompositionFactory"
    }
  },
  "maxIterations": 15,
  "partitionCount": 2,
  "preprocessing": {
    "wordDfThreshold": 1
  },
  "queryHint": null,
  "useDimensionalityReduction": true
}
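
Because the listing shows every variant side by side, it is not valid JSON as printed. The following is a minimal Python sketch of assembling a valid parameters object with one variant chosen for termWeighting and one for factorizationFactory. The variable names are illustrative, and how the object is embedded in the surrounding REST request is not covered here.

```python
import json

# Parameters object with exactly one variant chosen where the listing offers several.
parameters = {
    "clusterCount": 25,
    "dictionaries": {"labelFilters": [], "wordFilters": []},
    "labelCount": 3,
    "matrixBuilder": {
        "boostFields": [],
        "boostedFieldWeight": 2,
        "maxWordDf": 0.9,
        "maximumMatrixSize": 37500,
        # One of the three termWeighting variants.
        "termWeighting": {"@type": "LogTfIdfTermWeighting"},
    },
    "matrixReducer": {
        # One of the five factorizationFactory variants.
        "factorizationFactory": {
            "@type": "NonnegativeMatrixFactorizationEDFactory",
            "factorizationQuality": "HIGH",
        }
    },
    "maxIterations": 15,
    "partitionCount": 2,
    "preprocessing": {"wordDfThreshold": 1},
    "queryHint": None,
    "useDimensionalityReduction": True,
}

# Serialize to JSON text; None becomes null and True becomes true.
payload = json.dumps(parameters, indent=2)
print(payload)
```

The variants chosen here are the ones listed as defaults in the termWeighting and factorizationFactory sections.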

clusterCount

Type: Integer
Default: 25
Constraints: value >= 2
Path: clusterCount
Java snippet: algorithmInstance.clusterCount

Number of clusters to create. The algorithm will create at most the specified number of clusters.

dictionaries

Type: org.carrot2.language.EphemeralDictionaries
Default: EphemeralDictionaries
Path: dictionaries
Java snippet: algorithmInstance.dictionaries

Per-request overrides of language components (dictionaries).

{
  "labelFilters": [],
  "wordFilters": []
}

labelFilters

Type: org.carrot2.language.LabelFilterDictionary[]
Default: []
Path: dictionaries.labelFilters
Java snippet: algorithmInstance.dictionaries.labelFilters

Additional label filtering dictionaries. They supply filters for cluster labels that should be discarded from the output.

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"labelFilters": [{
   "exact": ["Cluster Label 1", "Foo Bar"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]
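
To illustrate the two matching modes, here is a small Python sketch of a filter that combines exact strings with regular expressions. It is not the library's implementation. The function name `build_label_filter`, the case-sensitive exact comparison and the whole-label regular expression match are all assumptions. The pattern is a hypothetical one written in Python's `re` syntax, which differs in places from Java regular expressions.

```python
import re


def build_label_filter(exact, regexp):
    """Return a predicate that is True for labels the dictionary would discard."""
    exact_set = set(exact)
    compiled = [re.compile(pattern) for pattern in regexp]

    def is_filtered(label):
        # Assumption: exact entries compare case-sensitively,
        # and each pattern must match the entire label.
        if label in exact_set:
            return True
        return any(pattern.fullmatch(label) for pattern in compiled)

    return is_filtered


is_filtered = build_label_filter(
    exact=["Cluster Label 1", "Foo Bar"],
    regexp=[r".+[0-9]{2}.+"],  # hypothetical pattern in Python syntax
)

labels = ["Foo Bar", "Release 42 notes", "Data Mining"]
kept = [label for label in labels if not is_filtered(label)]
print(kept)
```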

wordFilters

Type: org.carrot2.language.StopwordFilterDictionary[]
Default: []
Path: dictionaries.wordFilters
Java snippet: algorithmInstance.dictionaries.wordFilters

Additional stop word filtering dictionaries. They supply filters for words that should be discarded from the input.

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"wordFilters": [{
   "exact": ["word1", "word2"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]

labelCount

Type: Integer
Default: 3
Constraints: value >= 1 and value <= 10
Path: labelCount
Java snippet: algorithmInstance.labelCount

Minimum number of labels to return for each cluster.

matrixBuilder

Type: org.carrot2.text.vsm.TermDocumentMatrixBuilder
Default: TermDocumentMatrixBuilder
Path: matrixBuilder
Java snippet: algorithmInstance.matrixBuilder

Configuration of the size and contents of the term-document matrix.

{
  "boostFields": [],
  "boostedFieldWeight": 2,
  "maxWordDf": 0.9,
  "maximumMatrixSize": 37500,
  "termWeighting": {
    // 3 configuration variants available, choose one
    "@type": "LinearTfIdfTermWeighting"
    // or
    "@type": "LogTfIdfTermWeighting"
    // or
    "@type": "TfTermWeighting"
  }
}

boostFields

Type: String[]
Default: []
Path: matrixBuilder.boostFields
Java snippet: algorithmInstance.matrixBuilder.boostFields

A list of fields to which extra weight is applied. Content of the fields provided in this parameter can be given more weight during clustering. You may want to boost, for example, the title field on the assumption that it accurately summarizes the content of the whole document.

boostedFieldWeight

Type: Double
Default: 2
Constraints: value >= 0.0 and value <= 10.0
Path: matrixBuilder.boostedFieldWeight
Java snippet: algorithmInstance.matrixBuilder.boostedFieldWeight

The extra weight to apply to words that appeared in boosted fields. The larger the value, the stronger the boost.

maxWordDf

Type: Double
Default: 0.9
Constraints: value >= 0.0 and value <= 1.0
Path: matrixBuilder.maxWordDf
Java snippet: algorithmInstance.matrixBuilder.maxWordDf

Maximum document frequency allowed for words, as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.

This parameter may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.

Another useful application of this parameter is generating only very specific clusters, that is, clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, such as 0.1 or 0.05.
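
The effect of the threshold can be sketched in a few lines of Python. This is an illustration of the rule described above, not the library's code. The function name `filter_by_max_word_df` and the tokenized sample documents are made up for the example.

```python
def filter_by_max_word_df(documents, max_word_df):
    """Return the words whose document frequency fraction does not exceed max_word_df."""
    doc_count = len(documents)
    # Document frequency: the number of documents containing the word at least once.
    df = {}
    for words in documents:
        for word in set(words):
            df[word] = df.get(word, 0) + 1
    return {word for word, count in df.items() if count / doc_count <= max_word_df}


documents = [
    ["acme", "cluster", "search"],
    ["acme", "cluster", "label"],
    ["acme", "matrix", "search"],
    ["acme", "matrix", "term"],
    ["acme", "vector"],
]

# "acme" appears in 100% of documents and is dropped at 0.4;
# words appearing in exactly 40% of documents are still kept.
kept = filter_by_max_word_df(documents, 0.4)
print(sorted(kept))
```

With max_word_df set to 1.0, the same call keeps every word, including "acme".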

maximumMatrixSize

Type: Integer
Default: 37500
Constraints: value >= 5000
Path: matrixBuilder.maximumMatrixSize
Java snippet: algorithmInstance.matrixBuilder.maximumMatrixSize

Maximum number of elements the term-document matrix can have. The larger the allowed matrix size, the more accurate, but also the more time- and memory-consuming, the clustering.
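
As a rough illustration of the trade-off, the sketch below assumes the matrix has one row per term and one column per document. That layout is an assumption made for the example, as is the helper name `max_terms`. Under it, the element limit translates into a cap on the number of terms that fit.

```python
def max_terms(maximum_matrix_size, document_count):
    """Upper bound on term rows when the matrix has one column per document."""
    return maximum_matrix_size // document_count


# Default limit of 37500 elements with 100 and with 500 input documents.
print(max_terms(37500, 100))
print(max_terms(37500, 500))
```

Under this assumption, the more documents there are, the fewer terms fit within the same element budget.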

termWeighting

Type: org.carrot2.text.vsm.TermWeighting
Default: LogTfIdfTermWeighting
Path: matrixBuilder.termWeighting
Java snippet: algorithmInstance.matrixBuilder.termWeighting

Method for calculating weights of words in the term-document matrices.

Available implementations:

LinearTfIdfTermWeighting

Calculates term-document matrix element values based on Linear Inverse Term Frequency.

LogTfIdfTermWeighting

Calculates term-document matrix element values based on Log Inverse Term Frequency.

TfTermWeighting

Calculates term-document matrix element values based on Term Frequency.
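
To give a feel for how the weighting choice changes matrix values, the following Python sketch compares plain term frequency with a logarithmic TF-IDF weight. Both are textbook formulations and may differ in detail from the library's implementations. The linear variant is omitted because its exact formula is not given here. The function names are illustrative.

```python
import math


def tf_weight(tf, df, doc_count):
    """Plain term frequency: document frequency is ignored."""
    return float(tf)


def log_tf_idf_weight(tf, df, doc_count):
    """Term frequency scaled by the logarithm of inverse document frequency."""
    return tf * math.log(doc_count / df)


# A word occurring 3 times in a document and present in 10 of 100 documents.
print(tf_weight(3, 10, 100))
print(log_tf_idf_weight(3, 10, 100))

# The same word present in every document gets zero weight under log TF-IDF.
print(log_tf_idf_weight(3, 100, 100))
```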

matrixReducer

Type: org.carrot2.text.vsm.TermDocumentMatrixReducer
Default: TermDocumentMatrixReducer
Path: matrixReducer
Java snippet: algorithmInstance.matrixReducer

Configuration of the matrix decomposition method to use for clustering.

{
  "factorizationFactory": {
    // 5 configuration variants available, choose one
    "@type": "KMeansMatrixFactorizationFactory",
    "factorizationQuality": "HIGH"
    // or
    "@type": "LocalNonnegativeMatrixFactorizationFactory",
    "factorizationQuality": "HIGH"
    // or
    "@type": "NonnegativeMatrixFactorizationEDFactory",
    "factorizationQuality": "HIGH"
    // or
    "@type": "NonnegativeMatrixFactorizationKLFactory",
    "factorizationQuality": "HIGH"
    // or
    "@type": "PartialSingularValueDecompositionFactory"
  }
}

factorizationFactory

Type: org.carrot2.math.matrix.MatrixFactorizationFactory
Default: NonnegativeMatrixFactorizationEDFactory
Path: matrixReducer.factorizationFactory
Java snippet: algorithmInstance.matrixReducer.factorizationFactory

Factorization method. The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels.

Available implementations:

KMeansMatrixFactorizationFactory

Performs matrix factorization using the k-means clustering algorithm. This kind of factorization is sometimes referred to as Concept Decomposition Factorization.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.KMeansMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

LocalNonnegativeMatrixFactorizationFactory

Performs matrix factorization using the Local Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.LocalNonnegativeMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

NonnegativeMatrixFactorizationEDFactory

Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of the Euclidean distance between A and UV' and multiplicative updating.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationEDFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

NonnegativeMatrixFactorizationKLFactory

Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationKLFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

PartialSingularValueDecompositionFactory

Performs matrix factorization using the Singular Value Decomposition algorithm.

maxIterations

Type: Integer
Default: 15
Constraints: value >= 1
Path: maxIterations
Java snippet: algorithmInstance.maxIterations

Maximum number of k-means iterations to perform.

partitionCount

Type: Integer
Default: 2
Constraints: value >= 2 and value <= 10
Path: partitionCount
Java snippet: algorithmInstance.partitionCount

Number of partitions to create at each k-means clustering iteration.
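
The interplay of clusterCount, partitionCount and maxIterations can be sketched with a toy bisecting k-means in Python. This is a simplified illustration on plain numeric points, not the library's implementation. Splitting the largest cluster first, the deterministic choice of initial centroids, and the function names are all assumptions made for the example.

```python
def kmeans(points, k, max_iterations):
    """Plain k-means on tuples of numbers; returns the non-empty partitions."""
    # Deterministic initialization (an assumption for the sketch): evenly spaced points.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged before reaching max_iterations
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    parts = [[p for p, a in zip(points, assignment) if a == c] for c in range(k)]
    return [part for part in parts if part]


def bisecting_kmeans(points, cluster_count, partition_count=2, max_iterations=15):
    """Split the largest cluster repeatedly until cluster_count clusters exist."""
    clusters = [list(points)]
    while len(clusters) < cluster_count:
        clusters.sort(key=len, reverse=True)
        largest = clusters[0]
        # Never create more clusters than requested.
        k = min(partition_count, cluster_count - len(clusters) + 1)
        if len(largest) < k:
            break  # nothing left that can be split
        parts = kmeans(largest, k, max_iterations)
        if len(parts) < 2:
            break  # the split made no progress
        clusters = parts + clusters[1:]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
clusters = bisecting_kmeans(points, cluster_count=3)
for cluster in clusters:
    print(sorted(cluster))
```

In this sketch each split runs at most max_iterations rounds of k-means, and the total never exceeds cluster_count, which matches the "at most" wording used for clusterCount.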

preprocessing

Type: org.carrot2.text.preprocessing.BasicPreprocessingPipeline
Default: BasicPreprocessingPipeline
Path: preprocessing
Java snippet: algorithmInstance.preprocessing

Configuration of the text preprocessing stage.

{
  "wordDfThreshold": 1
}

wordDfThreshold

Type: Integer
Default: 1
Constraints: value >= 1 and value <= 100
Path: preprocessing.wordDfThreshold
Java snippet: algorithmInstance.preprocessing.wordDfThreshold

Word document frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
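
A minimal Python sketch of this rule follows, using made-up tokenized documents and an illustrative function name. It only demonstrates the counting logic described above and is not taken from the library.

```python
def filter_by_word_df_threshold(documents, word_df_threshold):
    """Return the words that occur in at least word_df_threshold documents."""
    # Document frequency: the number of documents containing the word at least once.
    df = {}
    for words in documents:
        for word in set(words):
            df[word] = df.get(word, 0) + 1
    return {word for word, count in df.items() if count >= word_df_threshold}


documents = [
    ["cluster", "search", "typo1"],
    ["cluster", "label"],
    ["search", "label", "cluster"],
]

# With the default of 1 every word is kept; with 2 the one-off "typo1" is dropped.
print(sorted(filter_by_word_df_threshold(documents, 1)))
print(sorted(filter_by_word_df_threshold(documents, 2)))
```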

queryHint

Type: String
Default: null
Path: queryHint
Java snippet: algorithmInstance.queryHint

Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.

useDimensionalityReduction

Type: Boolean
Default: true
Path: useDimensionalityReduction
Java snippet: algorithmInstance.useDimensionalityReduction

If enabled, k-means will be applied to the dimensionality-reduced term-document matrix. The number of dimensions will be equal to twice the number of requested clusters. If the number of dimensions is lower than the number of input documents, reduction will not be performed. If disabled, k-means will be performed directly on the original term-document matrix.
