Lingo parameters

You can tune various aspects of Lingo clustering by changing some of the parameters of the algorithm.

Below is the list of algorithm parameters along with their default values. Where a block lists several configuration variants, keep exactly one of them when pasting the JSON into your REST API requests. Each property is documented in the sections that follow.

{
  "clusterBuilder": {
    "clusterMergingThreshold": 0.7,
    "labelAssigner": {
      // 2 configuration variants available, choose one
      "@type": "SimpleLabelAssigner"
      "@type": "UniqueLabelAssigner"
    },
    "phraseLabelBoost": 1.5,
    "phraseLengthPenaltyStart": 8,
    "phraseLengthPenaltyStop": 8
  },
  "desiredClusterCount": 30,
  "dictionaries": {
    "labelFilters": [],
    "wordFilters": []
  },
  "matrixBuilder": {
    "boostFields": [],
    "boostedFieldWeight": 2,
    "maxWordDf": 0.9,
    "maximumMatrixSize": 37500,
    "termWeighting": {
      // 3 configuration variants available, choose one
      "@type": "LinearTfIdfTermWeighting"
      "@type": "LogTfIdfTermWeighting"
      "@type": "TfTermWeighting"
    }
  },
  "matrixReducer": {
    "factorizationFactory": {
      // 5 configuration variants available, choose one
      "@type": "KMeansMatrixFactorizationFactory",
      "factorizationQuality": "HIGH"

      "@type": "LocalNonnegativeMatrixFactorizationFactory",
      "factorizationQuality": "HIGH"

      "@type": "NonnegativeMatrixFactorizationEDFactory",
      "factorizationQuality": "HIGH"

      "@type": "NonnegativeMatrixFactorizationKLFactory",
      "factorizationQuality": "HIGH"

      "@type": "PartialSingularValueDecompositionFactory"
    }
  },
  "preprocessing": {
    "documentAssigner": {
      "exactPhraseAssignment": false,
      "minClusterSize": 2
    },
    "labelFilters": {
      "completeLabelFilter": {
        "enabled": true,
        "labelOverrideThreshold": 0.65
      },
      "genitiveLabelFilter": {
        "enabled": true
      },
      "minLengthLabelFilter": {
        "enabled": true,
        "minLength": 3
      },
      "numericLabelFilter": {
        "enabled": true
      },
      "queryLabelFilter": {
        "enabled": true
      },
      "stopLabelFilter": {
        "enabled": true
      },
      "stopWordLabelFilter": {
        "enabled": true
      }
    },
    "phraseDfThreshold": 1,
    "wordDfThreshold": 1
  },
  "queryHint": null,
  "scoreWeight": 0
}

clusterBuilder

Type: org.carrot2.clustering.lingo.ClusterBuilder
Default: ClusterBuilder
Path: clusterBuilder
Java snippet: algorithmInstance.clusterBuilder

Configuration of the structure and labels of clusters.

{
  "clusterMergingThreshold": 0.7,
  "labelAssigner": {
    // 2 configuration variants available, choose one
    "@type": "SimpleLabelAssigner"
    "@type": "UniqueLabelAssigner"
  },
  "phraseLabelBoost": 1.5,
  "phraseLengthPenaltyStart": 8,
  "phraseLengthPenaltyStop": 8
}

clusterMergingThreshold

Type: Double
Default: 0.7
Constraints: value >= 0.0 and value <= 1.0
Path: clusterBuilder.clusterMergingThreshold
Java snippet: algorithmInstance.clusterBuilder.clusterMergingThreshold

Degree of overlap (a fraction between 0.0 and 1.0) between two clusters' document sets at which the clusters are merged. Low values result in more aggressive merging, which may lead to irrelevant documents in clusters. High values result in fewer clusters being merged, which may lead to very similar or duplicated clusters.
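
To make the effect of the threshold concrete, here is a minimal sketch of an overlap test between two document sets. The overlap formula (intersection size divided by the size of the smaller set) and the class and method names are assumptions made for illustration; the formula Lingo actually uses may differ.

```java
import java.util.HashSet;
import java.util.Set;

class MergeOverlapSketch {
  /** Overlap as |A ∩ B| / min(|A|, |B|). Assumed definition, not necessarily Lingo's. */
  public static double overlap(Set<Integer> a, Set<Integer> b) {
    if (a.isEmpty() || b.isEmpty()) {
      return 0.0;
    }
    Set<Integer> common = new HashSet<>(a);
    common.retainAll(b);
    return (double) common.size() / Math.min(a.size(), b.size());
  }

  /** Two clusters are merged when their overlap reaches the threshold. */
  public static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double clusterMergingThreshold) {
    return overlap(a, b) >= clusterMergingThreshold;
  }

  public static void main(String[] args) {
    Set<Integer> first = Set.of(1, 2, 3, 4);   // document ids of the first cluster
    Set<Integer> second = Set.of(3, 4, 5, 6);  // document ids of the second cluster
    System.out.println(overlap(first, second));           // 0.5
    System.out.println(shouldMerge(first, second, 0.7));  // false
    System.out.println(shouldMerge(first, second, 0.4));  // true
  }
}
```

With the default of 0.7 the two clusters above stay separate; lowering the threshold to 0.4 merges them, which is the "more aggressive merging" the description refers to.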

labelAssigner

Type: org.carrot2.clustering.lingo.LabelAssigner
Default: UniqueLabelAssigner
Path: clusterBuilder.labelAssigner
Java snippet: algorithmInstance.clusterBuilder.labelAssigner

The method of assigning documents to labels when forming clusters.

Available implementations:

SimpleLabelAssigner

A simple and fast label assigner. For each base vector chooses the label that maximizes the base vector--label term vector cosine similarity. Different vectors can get the same label assigned, which means the number of final labels (after duplicate removal) may be smaller than the number of base vectors on input.

UniqueLabelAssigner

Assigns unique labels to each base vector using a greedy algorithm. For each base vector chooses the label that maximizes the base vector--label term vector cosine similarity and has not been previously selected. Once a label is selected, it will not be used to label any other vector. This algorithm does not create duplicate cluster labels, which usually means that this assignment method will create more clusters than org.carrot2.clustering.lingo.SimpleLabelAssigner. This method is slightly slower than org.carrot2.clustering.lingo.SimpleLabelAssigner.
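
The difference between the two assigners can be sketched on a small similarity matrix. This is an illustration of the two strategies as described above, not the library's code: the matrix values, class name, and method names are hypothetical, and the cosine similarities are assumed to be precomputed.

```java
import java.util.Arrays;

class LabelAssignerSketch {
  /** For each base vector (row), pick the label (column) with the highest similarity. */
  public static int[] assignSimple(double[][] similarity) {
    int[] chosen = new int[similarity.length];
    for (int v = 0; v < similarity.length; v++) {
      int best = 0;
      for (int l = 1; l < similarity[v].length; l++) {
        if (similarity[v][l] > similarity[v][best]) {
          best = l;
        }
      }
      chosen[v] = best;
    }
    return chosen;
  }

  /** Greedy variant: a label that was already selected is skipped. -1 means no label was left. */
  public static int[] assignUnique(double[][] similarity) {
    int[] chosen = new int[similarity.length];
    boolean[] used = new boolean[similarity.length == 0 ? 0 : similarity[0].length];
    for (int v = 0; v < similarity.length; v++) {
      int best = -1;
      for (int l = 0; l < similarity[v].length; l++) {
        if (!used[l] && (best < 0 || similarity[v][l] > similarity[v][best])) {
          best = l;
        }
      }
      chosen[v] = best;
      if (best >= 0) {
        used[best] = true;
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    // Rows are base vectors, columns are candidate labels.
    double[][] similarity = {
      {0.9, 0.2, 0.1},
      {0.8, 0.7, 0.3},
      {0.1, 0.2, 0.6}
    };
    System.out.println(Arrays.toString(assignSimple(similarity))); // [0, 0, 2]
    System.out.println(Arrays.toString(assignUnique(similarity))); // [0, 1, 2]
  }
}
```

In this example the simple strategy ends up with two distinct labels after duplicate removal, while the unique strategy keeps three, which is why it usually produces more clusters.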

phraseLabelBoost

Type: Double
Default: 1.5
Constraints: value >= 0.0 and value <= 10.0
Path: clusterBuilder.phraseLabelBoost
Java snippet: algorithmInstance.clusterBuilder.phraseLabelBoost

Weight of multi-word labels relative to one-word labels. Low values will result in more one-word labels being produced, higher values will favor multi-word labels.

phraseLengthPenaltyStart

Type: Integer
Default: 8
Constraints: value >= 2 and value <= 8
Path: clusterBuilder.phraseLengthPenaltyStart
Java snippet: algorithmInstance.clusterBuilder.phraseLengthPenaltyStart

Phrase length at which the overlong multi-word labels should start to be penalized. Phrases of length smaller than phraseLengthPenaltyStart will not be penalized.

phraseLengthPenaltyStop

Type: Integer
Default: 8
Constraints: value >= 2 and value <= 8
Path: clusterBuilder.phraseLengthPenaltyStop
Java snippet: algorithmInstance.clusterBuilder.phraseLengthPenaltyStop

Phrase length at which the overlong multi-word labels should be removed completely. Phrases of length larger than phraseLengthPenaltyStop will be removed.

desiredClusterCount

Type: Integer
Default: 30
Constraints: value >= 2 and value <= 100
Path: desiredClusterCount
Java snippet: algorithmInstance.desiredClusterCount

Determines the number of clusters to create. The larger the value, the more clusters will be created. The number of clusters actually created is roughly proportional to the value of this parameter, but need not be equal to it.

dictionaries

Type: org.carrot2.language.EphemeralDictionaries
Default: EphemeralDictionaries
Path: dictionaries
Java snippet: algorithmInstance.dictionaries

Per-request overrides of language components (dictionaries).

{
  "labelFilters": [],
  "wordFilters": []
}

labelFilters

Type: org.carrot2.language.LabelFilterDictionary[]
Default: []
Path: dictionaries.labelFilters
Java snippet: algorithmInstance.dictionaries.labelFilters

Additional label filtering dictionaries (supplying filters for cluster labels that should be discarded from the output).

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"labelFilters": [{
   "exact": ["Cluster Label 1", "Foo Bar"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]
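
The following sketch shows how a dictionary with exact entries and regular expressions could decide whether a label is discarded. It assumes case-sensitive exact matching and whole-label regular expression matching; both are assumptions, as is every name in the code, and the default implementation may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class LabelFilterSketch {
  private final Set<String> exact;
  private final List<Pattern> patterns = new ArrayList<>();

  public LabelFilterSketch(Set<String> exact, List<String> regexps) {
    this.exact = exact;
    for (String regexp : regexps) {
      patterns.add(Pattern.compile(regexp));
    }
  }

  /** A label is discarded if it equals an exact entry or fully matches any pattern. */
  public boolean isDiscarded(String label) {
    if (exact.contains(label)) {
      return true;
    }
    for (Pattern pattern : patterns) {
      if (pattern.matcher(label).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    LabelFilterSketch filter = new LabelFilterSketch(
        Set.of("Cluster Label 1", "Foo Bar"),
        List.of(".+pattern1.+", ".+[0-9]{2}.+"));
    System.out.println(filter.isDiscarded("Foo Bar"));             // true (exact entry)
    System.out.println(filter.isDiscarded("Report 2024 summary")); // true (two digits inside)
    System.out.println(filter.isDiscarded("Data Mining"));         // false
  }
}
```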

wordFilters

Type: org.carrot2.language.StopwordFilterDictionary[]
Default: []
Path: dictionaries.wordFilters
Java snippet: algorithmInstance.dictionaries.wordFilters

Additional stop word filtering dictionaries (supplying filters for words that should be discarded from the input).

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"wordFilters": [{
   "exact": ["word1", "word2"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]

matrixBuilder

Type: org.carrot2.text.vsm.TermDocumentMatrixBuilder
Default: TermDocumentMatrixBuilder
Path: matrixBuilder
Java snippet: algorithmInstance.matrixBuilder

Configuration of the size and contents of the term-document matrix.

{
  "boostFields": [],
  "boostedFieldWeight": 2,
  "maxWordDf": 0.9,
  "maximumMatrixSize": 37500,
  "termWeighting": {
    // 3 configuration variants available, choose one
    "@type": "LinearTfIdfTermWeighting"
    "@type": "LogTfIdfTermWeighting"
    "@type": "TfTermWeighting"
  }
}

boostFields

Type: String[]
Default: []
Path: matrixBuilder.boostFields
Java snippet: algorithmInstance.matrixBuilder.boostFields

A list of fields to which extra weight is applied. Content of the fields provided in this parameter can be given more weight during clustering. You may want to boost, for example, the title field with the assumption that it accurately summarizes the content of the whole document.

boostedFieldWeight

Type: Double
Default: 2
Constraints: value >= 0.0 and value <= 10.0
Path: matrixBuilder.boostedFieldWeight
Java snippet: algorithmInstance.matrixBuilder.boostedFieldWeight

The extra weight to apply to words that appeared in boosted fields. The larger the value, the stronger the boost.

maxWordDf

Type: Double
Default: 0.9
Constraints: value >= 0.0 and value <= 1.0
Path: matrixBuilder.maxWordDf
Java snippet: algorithmInstance.matrixBuilder.maxWordDf

Maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than org.carrot2.text.vsm.TermDocumentMatrixBuilder#maxWordDf will be ignored. For example, when org.carrot2.text.vsm.TermDocumentMatrixBuilder#maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear.

This parameter may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.

Another useful application of this parameter is when there is a need to generate only very specific clusters, that is clusters containing small numbers of documents. This can be achieved by setting org.carrot2.text.vsm.TermDocumentMatrixBuilder#maxWordDf to extremely low values: 0.1 or 0.05.
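
The filtering rule described above can be sketched as follows. The documents are modeled as plain sets of words and all names are hypothetical; the sketch only illustrates the "fraction of documents" cut-off, not the library's preprocessing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MaxWordDfSketch {
  /** Keeps the words whose document frequency, as a fraction of all documents, is at most maxWordDf. */
  public static Set<String> keptWords(List<Set<String>> documents, double maxWordDf) {
    Map<String, Integer> documentFrequency = new HashMap<>();
    for (Set<String> document : documents) {
      for (String word : document) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }
    Set<String> kept = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      double fraction = (double) entry.getValue() / documents.size();
      if (fraction <= maxWordDf) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // "acme" appears in every document, e.g. a company name from a page footer.
    List<Set<String>> documents = List.of(
        Set.of("acme", "router"),
        Set.of("acme", "router"),
        Set.of("acme", "switch"),
        Set.of("acme"));
    System.out.println(keptWords(documents, 0.9)); // [router, switch]
    System.out.println(keptWords(documents, 0.4)); // [switch]
    System.out.println(keptWords(documents, 1.0)); // [acme, router, switch]
  }
}
```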

maximumMatrixSize

Type: Integer
Default: 37500
Constraints: value >= 5000
Path: matrixBuilder.maximumMatrixSize
Java snippet: algorithmInstance.matrixBuilder.maximumMatrixSize

Maximum number of elements the term-document matrix can have. The larger the allowed matrix size, the more accurate the clustering, but also the more time and memory it consumes.

termWeighting

Type: org.carrot2.text.vsm.TermWeighting
Default: LogTfIdfTermWeighting
Path: matrixBuilder.termWeighting
Java snippet: algorithmInstance.matrixBuilder.termWeighting

Method for calculating weights of words in the term-document matrices.

Available implementations:

LinearTfIdfTermWeighting

Calculates term-document matrix element values based on Linear Inverse Term Frequency.

LogTfIdfTermWeighting

Calculates term-document matrix element values based on Log Inverse Term Frequency.

TfTermWeighting

Calculates term-document matrix element values based on Term Frequency.
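
As a rough illustration of how the three weighting schemes differ, the sketch below uses common textbook formulations: raw term frequency, term frequency scaled by a linear inverse document frequency, and term frequency scaled by a logarithmic one. The exact formulas and all names are assumptions; the library's implementations may use different normalizations.

```java
class TermWeightingSketch {
  /** Raw term frequency. */
  public static double tf(int termFrequency) {
    return termFrequency;
  }

  /** tf scaled by a linear inverse document frequency, (N - df) / N. Assumed form. */
  public static double linearTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0 || documentCount <= 0) {
      return 0.0;
    }
    return termFrequency * (documentCount - documentFrequency) / (double) documentCount;
  }

  /** tf scaled by a logarithmic inverse document frequency, ln(N / df). Assumed form. */
  public static double logTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0 || documentCount <= 0) {
      return 0.0;
    }
    return termFrequency * Math.log(documentCount / (double) documentFrequency);
  }

  public static void main(String[] args) {
    // A word occurring 3 times in a document and present in 2 of 8 documents.
    System.out.println(tf(3));                // 3.0
    System.out.println(linearTfIdf(3, 2, 8)); // 2.25
    System.out.println(logTfIdf(3, 2, 8));    // 3 * ln(4), roughly 4.16
    // A word present in every document carries no weight under either idf variant.
    System.out.println(linearTfIdf(3, 8, 8)); // 0.0
    System.out.println(logTfIdf(3, 8, 8));    // 0.0
  }
}
```

Under both inverse-frequency variants, words that occur in many documents contribute less to the matrix than under plain term frequency.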

matrixReducer

Type: org.carrot2.text.vsm.TermDocumentMatrixReducer
Default: TermDocumentMatrixReducer
Path: matrixReducer
Java snippet: algorithmInstance.matrixReducer

Configuration of the matrix decomposition method to use for clustering.

{
  "factorizationFactory": {
    // 5 configuration variants available, choose one
    "@type": "KMeansMatrixFactorizationFactory",
    "factorizationQuality": "HIGH"

    "@type": "LocalNonnegativeMatrixFactorizationFactory",
    "factorizationQuality": "HIGH"

    "@type": "NonnegativeMatrixFactorizationEDFactory",
    "factorizationQuality": "HIGH"

    "@type": "NonnegativeMatrixFactorizationKLFactory",
    "factorizationQuality": "HIGH"

    "@type": "PartialSingularValueDecompositionFactory"
  }
}

factorizationFactory

Type: org.carrot2.math.matrix.MatrixFactorizationFactory
Default: NonnegativeMatrixFactorizationEDFactory
Path: matrixReducer.factorizationFactory
Java snippet: algorithmInstance.matrixReducer.factorizationFactory

Factorization method. The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels.

Available implementations:

KMeansMatrixFactorizationFactory

Performs matrix factorization using the k-means clustering algorithm. This kind of factorization is sometimes referred to as Concept Decomposition Factorization.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.KMeansMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming clustering.

LocalNonnegativeMatrixFactorizationFactory

Performs matrix factorization using the Local Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.LocalNonnegativeMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming clustering.

NonnegativeMatrixFactorizationEDFactory

Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of Euclidean Distance between A and UV' and multiplicative updating.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationEDFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming clustering.

NonnegativeMatrixFactorizationKLFactory

Performs matrix factorization using the Non-negative Matrix Factorization by minimization of Kullback-Leibler divergence between A and UV' and multiplicative updating.
```
{
  "factorizationQuality": "HIGH"
}
```

factorizationQuality

Type: org.carrot2.math.matrix.FactorizationQuality
Default: HIGH
Constraints: value in [LOW, MEDIUM, HIGH]
Path: matrixReducer.factorizationFactory.factorizationQuality
Java snippet: ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationKLFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming clustering.

PartialSingularValueDecompositionFactory

Performs matrix factorization using the Singular Value Decomposition algorithm.

preprocessing

Type: org.carrot2.text.preprocessing.CompletePreprocessingPipeline
Default: CompletePreprocessingPipeline
Path: preprocessing
Java snippet: algorithmInstance.preprocessing

Configuration of the text preprocessing stage.

{
  "documentAssigner": {
    "exactPhraseAssignment": false,
    "minClusterSize": 2
  },
  "labelFilters": {
    "completeLabelFilter": {
      "enabled": true,
      "labelOverrideThreshold": 0.65
    },
    "genitiveLabelFilter": {
      "enabled": true
    },
    "minLengthLabelFilter": {
      "enabled": true,
      "minLength": 3
    },
    "numericLabelFilter": {
      "enabled": true
    },
    "queryLabelFilter": {
      "enabled": true
    },
    "stopLabelFilter": {
      "enabled": true
    },
    "stopWordLabelFilter": {
      "enabled": true
    }
  },
  "phraseDfThreshold": 1,
  "wordDfThreshold": 1
}

documentAssigner

Type: org.carrot2.text.preprocessing.DocumentAssigner
Default: DocumentAssigner
Path: preprocessing.documentAssigner
Java snippet: algorithmInstance.preprocessing.documentAssigner

Document assigner used by the algorithm; contains modifiable parameters.

{
  "exactPhraseAssignment": false,
  "minClusterSize": 2
}

exactPhraseAssignment

Type: Boolean
Default: false
Path: preprocessing.documentAssigner.exactPhraseAssignment
Java snippet: algorithmInstance.preprocessing.documentAssigner.exactPhraseAssignment

Only exact phrase assignments. When set to true, clusters will contain only the documents that contain the cluster's label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, increasing the precision of assignment, but also increasing the "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments.
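
The trade-off can be illustrated with a small matching sketch. It assumes that exact assignment requires the label's words to appear contiguously and in order, and that the relaxed mode only requires all of the label's words to be present somewhere in the document. The relaxed rule and all names are assumptions for illustration, not the library's implementation.

```java
import java.util.Collections;
import java.util.List;

class PhraseAssignmentSketch {
  /** Decides whether a document (as a word list) would be assigned to a cluster label. */
  public static boolean matches(
      List<String> documentWords, List<String> labelWords, boolean exactPhraseAssignment) {
    if (exactPhraseAssignment) {
      // The label must occur as a contiguous, ordered run of words.
      return Collections.indexOfSubList(documentWords, labelWords) >= 0;
    }
    // Relaxed mode (assumed): every label word occurs somewhere in the document.
    return documentWords.containsAll(labelWords);
  }

  public static void main(String[] args) {
    List<String> label = List.of("data", "mining");
    List<String> first = List.of("introduction", "to", "data", "mining");
    List<String> second = List.of("mining", "of", "large", "data", "sets");
    System.out.println(matches(first, label, true));   // true
    System.out.println(matches(second, label, true));  // false
    System.out.println(matches(second, label, false)); // true
  }
}
```

The second document is only assigned in the relaxed mode, which shows how disabling the option enlarges clusters at the cost of precision.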

minClusterSize

Type: Integer
Default: 2
Constraints: value >= 1 and value <= 100
Path: preprocessing.documentAssigner.minClusterSize
Java snippet: algorithmInstance.preprocessing.documentAssigner.minClusterSize

Minimum required number of documents in each cluster. Clusters containing fewer documents will not be created.

labelFilters

Type: org.carrot2.text.preprocessing.LabelFilterProcessor
Default: LabelFilterProcessor
Path: preprocessing.labelFilters
Java snippet: algorithmInstance.preprocessing.labelFilters

Label filtering is a composite of individual filters.

{
  "completeLabelFilter": {
    "enabled": true,
    "labelOverrideThreshold": 0.65
  },
  "genitiveLabelFilter": {
    "enabled": true
  },
  "minLengthLabelFilter": {
    "enabled": true,
    "minLength": 3
  },
  "numericLabelFilter": {
    "enabled": true
  },
  "queryLabelFilter": {
    "enabled": true
  },
  "stopLabelFilter": {
    "enabled": true
  },
  "stopWordLabelFilter": {
    "enabled": true
  }
}

completeLabelFilter

Type: org.carrot2.text.preprocessing.filter.CompleteLabelFilter
Default: CompleteLabelFilter
Path: preprocessing.labelFilters.completeLabelFilter
Java snippet: algorithmInstance.preprocessing.labelFilters.completeLabelFilter

Truncated phrase filter for this processor.

{
  "enabled": true,
  "labelOverrideThreshold": 0.65
}

enabled

Type: Boolean
Default: true
Path: preprocessing.labelFilters.completeLabelFilter.enabled
Java snippet: algorithmInstance.preprocessing.labelFilters.completeLabelFilter.enabled

Enables or disables this filter.

labelOverrideThreshold

Type: Double
Default: 0.65
Constraints: value >= 0.0 and value <= 1.0
Path: preprocessing.labelFilters.completeLabelFilter.labelOverrideThreshold
Java snippet: algorithmInstance.preprocessing.labelFilters.completeLabelFilter.labelOverrideThreshold

Determines the strength of the truncated label filter. The lowest value means the strongest elimination of truncated labels, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.

genitiveLabelFilter

Type: org.carrot2.text.preprocessing.filter.GenitiveLabelFilter
Default: GenitiveLabelFilter
Path: preprocessing.labelFilters.genitiveLabelFilter
Java snippet: algorithmInstance.preprocessing.labelFilters.genitiveLabelFilter

Genitive label filter.

{
  "enabled": true
}

enabled

Type: Boolean
Default: true
Path: preprocessing.labelFilters.genitiveLabelFilter.enabled
Java snippet: algorithmInstance.preprocessing.labelFilters.genitiveLabelFilter.enabled

Enables or disables this filter.

minLengthLabelFilter

Type: org.carrot2.text.preprocessing.filter.MinLengthLabelFilter
Default: MinLengthLabelFilter
Path: preprocessing.labelFilters.minLengthLabelFilter
Java snippet: algorithmInstance.preprocessing.labelFilters.minLengthLabelFilter

Minimum length label filter.

{
  "enabled": true,
  "minLength": 3
}

enabled

Type: Boolean
Default: true
Path: preprocessing.labelFilters.minLengthLabelFilter.enabled
Java snippet: algorithmInstance.preprocessing.labelFilters.minLengthLabelFilter.enabled

Enables or disables this filter.

minLength

Type: Integer
Default: 3
Path: preprocessing.labelFilters.minLengthLabelFilter.minLength
Java snippet: algorithmInstance.preprocessing.labelFilters.minLengthLabelFilter.minLength

Minimum required label length, in characters, inclusive.

numericLabelFilter

Type: org.carrot2.text.preprocessing.filter.NumericLabelFilter
Default: NumericLabelFilter
Path: preprocessing.labelFilters.numericLabelFilter
Java snippet: algorithmInstance.preprocessing.labelFilters.numericLabelFilter

Numeric label filter for this processor.

{
  "enabled": true
}

enabled

Type: Boolean
Default: true
Path: preprocessing.labelFilters.numericLabelFilter.enabled
Java snippet: algorithmInstance.preprocessing.labelFilters.numericLabelFilter.enabled

Enables or disables this filter.

queryLabelFilter

Type: org.carrot2.text.preprocessing.filter.QueryLabelFilter
Default: QueryLabelFilter
Path: preprocessing.labelFilters.queryLabelFilter
Java snippet: algorithmInstance.preprocessing.labelFilters.queryLabelFilter

Query word label filter for this processor.

{
  "enabled": true
}

enabled

Type: Boolean
Default: true
Path: preprocessing.labelFilters.queryLabelFilter.enabled
Java snippet: algorithmInstance.preprocessing.labelFilters.queryLabelFilter.enabled

Enables or disables this filter.

stopLabelFilter

Type: org.carrot2.text.preprocessing.filter.StopLabelFilter
Default: StopLabelFilter
Path: preprocessing.labelFilters.stopLabelFilter
Java snippet: algorithmInstance.preprocessing.labelFilters.stopLabelFilter

Stop label filter.

{
  "enabled": true
}

enabled

Type: Boolean
Default: true
Path: preprocessing.labelFilters.stopLabelFilter.enabled
Java snippet: algorithmInstance.preprocessing.labelFilters.stopLabelFilter.enabled

Enables or disables this filter.

stopWordLabelFilter

Type: org.carrot2.text.preprocessing.filter.StopWordLabelFilter
Default: StopWordLabelFilter
Path: preprocessing.labelFilters.stopWordLabelFilter
Java snippet: algorithmInstance.preprocessing.labelFilters.stopWordLabelFilter

Stop word label filter for this processor.

{
  "enabled": true
}

enabled

Type: Boolean
Default: true
Path: preprocessing.labelFilters.stopWordLabelFilter.enabled
Java snippet: algorithmInstance.preprocessing.labelFilters.stopWordLabelFilter.enabled

Enables or disables this filter.

phraseDfThreshold

Type: Integer
Default: 1
Constraints: value >= 1 and value <= 100
Path: preprocessing.phraseDfThreshold
Java snippet: algorithmInstance.preprocessing.phraseDfThreshold

Phrase Document Frequency cut-off threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.

wordDfThreshold

Type: Integer
Default: 1
Constraints: value >= 1 and value <= 100
Path: preprocessing.wordDfThreshold
Java snippet: algorithmInstance.preprocessing.wordDfThreshold

Word Document Frequency cut-off threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.

queryHint

Type: String
Default: null
Path: queryHint
Java snippet: algorithmInstance.queryHint

Query terms used to retrieve documents being clustered. The query is used as a hint to avoid creating trivial clusters consisting only of query words.

scoreWeight

Type: Double
Default: 0
Constraints: value >= 0.0 and value <= 1.0
Path: scoreWeight
Java snippet: algorithmInstance.scoreWeight

Balance between cluster score and size during cluster sorting. Value equal to 0.0 will cause Lingo to sort clusters based only on cluster size. Value equal to 1.0 will cause Lingo to sort clusters based only on cluster score.
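
One way to picture the balance is a combined sort key that reduces to the cluster size at 0.0 and to the cluster score at 1.0. The weighted geometric form below is an assumption chosen because it satisfies both end points described above; the class and method names are hypothetical and the formula Lingo uses may differ.

```java
class ScoreWeightSketch {
  /** Combined sort key: size^(1 - w) * score^w. Assumed form, not necessarily Lingo's. */
  public static double sortKey(int size, double score, double scoreWeight) {
    return Math.pow(size, 1.0 - scoreWeight) * Math.pow(score, scoreWeight);
  }

  public static void main(String[] args) {
    // Cluster A: 10 documents, score 2.0. Cluster B: 4 documents, score 9.0.
    for (double weight : new double[] {0.0, 0.5, 1.0}) {
      double a = sortKey(10, 2.0, weight);
      double b = sortKey(4, 9.0, weight);
      System.out.println("scoreWeight=" + weight + " -> " + (a >= b ? "A first" : "B first"));
    }
    // Prints "A first" for 0.0 and "B first" for 0.5 and 1.0.
  }
}
```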
