Lingo parameters
You can tune various aspects of Lingo clustering by changing some of the parameters of the algorithm.
Below is the list of algorithm parameters along with their default values. When pasting the JSON into your REST API requests, choose one of the available configuration variants where noted.
{
  "clusterBuilder": {
    "labelAssigner": {
      // 2 configuration variants available, choose one
      "@type": "SimpleLabelAssigner"
      "@type": "UniqueLabelAssigner"
    }
  },
  "matrixBuilder": {
    "termWeighting": {
      // 3 configuration variants available, choose one
      "@type": "LinearTfIdfTermWeighting"
      "@type": "LogTfIdfTermWeighting"
      "@type": "TfTermWeighting"
    }
  },
  "matrixReducer": {
    "factorizationFactory": {
      // 5 configuration variants available, choose one
      "@type": "KMeansMatrixFactorizationFactory", "factorizationQuality": "HIGH"
      "@type": "LocalNonnegativeMatrixFactorizationFactory", "factorizationQuality": "HIGH"
      "@type": "NonnegativeMatrixFactorizationEDFactory", "factorizationQuality": "HIGH"
      "@type": "NonnegativeMatrixFactorizationKLFactory", "factorizationQuality": "HIGH"
      "@type": "PartialSingularValueDecompositionFactory"
    }
  },
  "scoreWeight": 0
}
clusterBuilder
- Type
- org.carrot2.clustering.lingo.ClusterBuilder
- Default
- ClusterBuilder
- Path
- clusterBuilder
- Java snippet
- algorithmInstance.clusterBuilder
Configuration of the structure and labels of clusters.
{
  "labelAssigner": {
    // 2 configuration variants available, choose one
    "@type": "SimpleLabelAssigner"
    "@type": "UniqueLabelAssigner"
  }
}
clusterMergingThreshold
- Type
- Double
- Default
- 0.7
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- clusterBuilder.clusterMergingThreshold
- Java snippet
- algorithmInstance.clusterBuilder.clusterMergingThreshold
Fraction of overlap between two clusters' document sets at which the clusters are merged. Low values will result in more aggressive merging, which may lead to irrelevant documents in clusters. High values will result in fewer clusters being merged, which may lead to very similar or duplicated clusters.
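The effect of the threshold can be pictured with a minimal sketch. It assumes that overlap is measured as the number of shared documents divided by the size of the smaller cluster; the exact overlap measure used by Lingo is not stated here, and the class and method names (ClusterMergeSketch, overlap, shouldMerge) are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class ClusterMergeSketch {
  /**
   * Assumed overlap measure: shared documents divided by the size of the
   * smaller cluster. Lingo's actual measure may differ.
   */
  static double overlap(Set<Integer> a, Set<Integer> b) {
    if (a.isEmpty() || b.isEmpty()) {
      return 0.0;
    }
    Set<Integer> shared = new HashSet<>(a);
    shared.retainAll(b); // keep only document ids present in both clusters
    return (double) shared.size() / Math.min(a.size(), b.size());
  }

  /** Clusters are merged once their overlap reaches the threshold. */
  static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double clusterMergingThreshold) {
    return overlap(a, b) >= clusterMergingThreshold;
  }
}
```

Under this assumption, two four-document clusters sharing two documents have an overlap of 0.5 and stay separate at the default threshold of 0.7.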
labelAssigner
- Type
- org.carrot2.clustering.lingo.LabelAssigner
- Default
- UniqueLabelAssigner
- Path
- clusterBuilder.labelAssigner
- Java snippet
- algorithmInstance.clusterBuilder.labelAssigner
The method of assigning documents to labels when forming clusters.
Available implementations:
- SimpleLabelAssigner
A simple and fast label assigner. For each base vector chooses the label that maximizes the base vector--label term vector cosine similarity. Different vectors can get the same label assigned, which means the number of final labels (after duplicate removal) may be smaller than the number of base vectors on input.
- UniqueLabelAssigner
Assigns unique labels to each base vector using a greedy algorithm. For each base vector chooses the label that maximizes the base vector--label term vector cosine similarity and has not been previously selected. Once a label is selected, it will not be used to label any other vector. This algorithm does not create duplicate cluster labels, which usually means that this assignment method will create more clusters than org.carrot2.clustering.lingo.SimpleLabelAssigner. This method is slightly slower than org.carrot2.clustering.lingo.SimpleLabelAssigner.
phraseLabelBoost
- Type
- Double
- Default
- 1.5
- Constraints
- value >= 0.0 and value <= 10.0
- Path
- clusterBuilder.phraseLabelBoost
- Java snippet
- algorithmInstance.clusterBuilder.phraseLabelBoost
Weight of multi-word labels relative to one-word labels. Low values will result in more one-word labels being produced, higher values will favor multi-word labels.
phraseLengthPenaltyStart
- Type
- Integer
- Default
- 8
- Constraints
- value >= 2 and value <= 8
- Path
- clusterBuilder.phraseLengthPenaltyStart
- Java snippet
- algorithmInstance.clusterBuilder.phraseLengthPenaltyStart
Phrase length at which the overlong multi-word labels should start to be penalized. Phrases of length smaller than phraseLengthPenaltyStart will not be penalized.
phraseLengthPenaltyStop
- Type
- Integer
- Default
- 8
- Constraints
- value >= 2 and value <= 8
- Path
- clusterBuilder.phraseLengthPenaltyStop
- Java snippet
- algorithmInstance.clusterBuilder.phraseLengthPenaltyStop
Phrase length at which the overlong multi-word labels should be removed completely. Phrases of length larger than phraseLengthPenaltyStop will be removed.
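How phraseLengthPenaltyStart and phraseLengthPenaltyStop interact can be illustrated with a small sketch. The two bounds follow the descriptions above (no penalty below the start, removal above the stop); the linear ramp between them is purely an assumption, since the shape of the penalty curve is not stated here. PhraseLengthPenaltySketch and penaltyFactor are hypothetical names.

```java
final class PhraseLengthPenaltySketch {
  /**
   * Returns a multiplier for a label's score: 1.0 means no penalty,
   * 0.0 means the label is removed. The linear ramp between the two
   * bounds is an assumed shape, not Lingo's documented formula.
   */
  static double penaltyFactor(int phraseLength, int penaltyStart, int penaltyStop) {
    if (penaltyStart > penaltyStop) {
      throw new IllegalArgumentException("penaltyStart must not exceed penaltyStop");
    }
    if (phraseLength < penaltyStart) {
      return 1.0; // shorter than the start length: not penalized
    }
    if (phraseLength > penaltyStop) {
      return 0.0; // longer than the stop length: removed
    }
    // Linear interpolation that would reach 1.0 at (start - 1) and 0.0 at (stop + 1).
    return (double) (penaltyStop + 1 - phraseLength) / (penaltyStop + 2 - penaltyStart);
  }
}
```

With hypothetical settings of start 4 and stop 6, labels of up to 3 words keep their full score, labels of 4 to 6 words are scored progressively lower, and labels of 7 or more words are dropped.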
desiredClusterCount
- Type
- Integer
- Default
- 30
- Constraints
- value >= 2 and value <= 100
- Path
- desiredClusterCount
- Java snippet
- algorithmInstance.desiredClusterCount
Determines the number of clusters to create. The larger the value, the more clusters will be created. The number of clusters actually created will be roughly proportional to the value of this parameter, but may differ from it.
dictionaries
- Type
- org.carrot2.language.EphemeralDictionaries
- Default
- EphemeralDictionaries
- Path
- dictionaries
- Java snippet
- algorithmInstance.dictionaries
Per-request overrides of language components (dictionaries).
{}
labelFilters
- Type
- org.carrot2.language.LabelFilterDictionary[]
- Default
- []
- Path
- dictionaries.labelFilters
- Java snippet
- algorithmInstance.dictionaries.labelFilters
Additional label filtering dictionaries (supplying filters for cluster labels that should be discarded from the output).
One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.
REST-style example using the default implementation:
"labelFilters": [{
"exact": ["Cluster Label 1", "Foo Bar"],
"regexp": [
"(?).+pattern1.+",
"(?).+[0-9]{2}.+"
]
}]
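The two matching modes can be sketched in plain Java. This is an illustration only, not the code of DefaultDictionaryImpl: it assumes that "exact" entries are compared case-sensitively against the whole label and that "regexp" entries must match the whole label. The class LabelFilterSketch and its methods are hypothetical, and the patterns in the usage note are simplified stand-ins.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class LabelFilterSketch {
  private final Set<String> exact;
  private final List<Pattern> patterns;

  LabelFilterSketch(Set<String> exact, List<String> regexps) {
    this.exact = exact;
    // Compile each pattern once, up front.
    this.patterns = regexps.stream().map(Pattern::compile).collect(Collectors.toList());
  }

  /** Returns true if the label should be discarded from the output. */
  boolean rejects(String label) {
    if (exact.contains(label)) {
      return true;
    }
    for (Pattern pattern : patterns) {
      // Assumption: the pattern has to match the entire label.
      if (pattern.matcher(label).matches()) {
        return true;
      }
    }
    return false;
  }

  /** Returns the labels that survive filtering, in their original order. */
  List<String> apply(List<String> labels) {
    return labels.stream().filter(label -> !rejects(label)).collect(Collectors.toList());
  }
}
```

For instance, a filter built from the exact entry "Foo Bar" and the pattern ".+[0-9]{2}.+" would discard "Foo Bar" and "Top 10 Lists" but keep "Data Mining".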
wordFilters
- Type
- org.carrot2.language.StopwordFilterDictionary[]
- Default
- []
- Path
- dictionaries.wordFilters
- Java snippet
- algorithmInstance.dictionaries.wordFilters
Additional stop word filtering dictionaries (supplying filters for words that should be discarded from the input).
One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.
REST-style example using the default implementation:
"wordFilters": [{
"exact": ["word1", "word2"],
"regexp": [
"(?).+pattern1.+",
"(?).+[0-9]{2}.+"
]
}]
matrixBuilder
- Type
- org.carrot2.text.vsm.TermDocumentMatrixBuilder
- Default
- TermDocumentMatrixBuilder
- Path
- matrixBuilder
- Java snippet
- algorithmInstance.matrixBuilder
Configuration of the size and contents of the term-document matrix.
{
  "termWeighting": {
    // 3 configuration variants available, choose one
    "@type": "LinearTfIdfTermWeighting"
    "@type": "LogTfIdfTermWeighting"
    "@type": "TfTermWeighting"
  }
}
boostFields
- Type
- String[]
- Default
- []
- Path
- matrixBuilder.boostFields
- Java snippet
- algorithmInstance.matrixBuilder.boostFields
A list of fields to which extra weight is applied. Content of fields provided in this parameter can be given more weight during clustering. You may want to boost, for example, the title field with the assumption that it accurately summarizes the content of the whole document.
boostedFieldWeight
- Type
- Double
- Default
- 2
- Constraints
- value >= 0.0 and value <= 10.0
- Path
- matrixBuilder.boostedFieldWeight
- Java snippet
- algorithmInstance.matrixBuilder.boostedFieldWeight
The extra weight to apply to words that appeared in boosted fields. The larger the value, the stronger the boost.
maxWordDf
- Type
- Double
- Default
- 0.9
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- matrixBuilder.maxWordDf
- Java snippet
- algorithmInstance.matrixBuilder.maxWordDf
Maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than org.carrot2.text.vsm.TermDocumentMatrixBuilder#maxWordDf will be ignored. For example, when org.carrot2.text.vsm.TermDocumentMatrixBuilder#maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear.
This parameter may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such a case, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.
Another useful application of this parameter is when there is a need to generate only very specific clusters, that is clusters containing small numbers of documents. This can be achieved by setting org.carrot2.text.vsm.TermDocumentMatrixBuilder#maxWordDf to extremely low values: 0.1 or 0.05.
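The cut-off described above can be sketched as follows. This is a minimal illustration of the rule "ignore words whose document frequency fraction exceeds maxWordDf", not the matrix builder's actual code; MaxWordDfSketch and retainedWords are hypothetical names, and each document is assumed to be already reduced to its set of distinct words.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class MaxWordDfSketch {
  /** Returns the words whose document frequency fraction does not exceed maxWordDf. */
  static Set<String> retainedWords(List<Set<String>> documents, double maxWordDf) {
    // Count in how many documents each word occurs.
    Map<String, Integer> documentFrequency = new HashMap<>();
    for (Set<String> document : documents) {
      for (String word : document) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }
    Set<String> kept = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      double fraction = (double) entry.getValue() / documents.size();
      if (fraction <= maxWordDf) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }
}
```

If a word such as a company name occurs in 4 out of 5 documents (a fraction of 0.8), it is kept with maxWordDf set to 0.9 or 1.0 and ignored with maxWordDf set to 0.5.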
maximumMatrixSize
- Type
- Integer
- Default
- 37500
- Constraints
- value >= 5000
- Path
- matrixBuilder.maximumMatrixSize
- Java snippet
- algorithmInstance.matrixBuilder.maximumMatrixSize
Maximum number of elements the term-document matrix can have. The larger the allowed matrix size, the more accurate, but also the more time- and memory-consuming, the clustering.
termWeighting
- Type
- org.carrot2.text.vsm.TermWeighting
- Default
- LogTfIdfTermWeighting
- Path
- matrixBuilder.termWeighting
- Java snippet
- algorithmInstance.matrixBuilder.termWeighting
Method for calculating weights of words in the term-document matrices.
Available implementations:
- LinearTfIdfTermWeighting
Calculates term-document matrix element values based on Linear Inverse Term Frequency.
- LogTfIdfTermWeighting
Calculates term-document matrix element values based on Log Inverse Term Frequency.
- TfTermWeighting
Calculates term-document matrix element values based on Term Frequency.
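To make the difference between the three schemes concrete, here is a sketch using common formulations of each weighting. The formulas are assumptions for illustration and may not match the implementations exactly; TermWeightingSketch and its methods are hypothetical names. In each method, termFrequency is the number of occurrences of the word in a document, documentFrequency the number of documents containing the word, and documentCount the total number of documents.

```java
final class TermWeightingSketch {
  /** Plain term frequency: the raw occurrence count. */
  static double tf(int termFrequency) {
    return termFrequency;
  }

  /** Term frequency scaled by a linearly decreasing inverse document frequency (assumed form). */
  static double linearTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0 || documentCount <= 1) {
      return 0.0;
    }
    return termFrequency * (double) (documentCount - documentFrequency) / (documentCount - 1);
  }

  /** Term frequency scaled by a logarithmic inverse document frequency (assumed form). */
  static double logTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0) {
      return 0.0;
    }
    return termFrequency * Math.log((double) documentCount / documentFrequency);
  }
}
```

Under these assumed formulas, a word that occurs in every document receives a weight of zero from both inverse-frequency schemes, while plain term frequency still weights it by its occurrence count.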
matrixReducer
- Type
- org.carrot2.text.vsm.TermDocumentMatrixReducer
- Default
- TermDocumentMatrixReducer
- Path
- matrixReducer
- Java snippet
- algorithmInstance.matrixReducer
Configuration of the matrix decomposition method to use for clustering.
{
  "factorizationFactory": {
    // 5 configuration variants available, choose one
    "@type": "KMeansMatrixFactorizationFactory", "factorizationQuality": "HIGH"
    "@type": "LocalNonnegativeMatrixFactorizationFactory", "factorizationQuality": "HIGH"
    "@type": "NonnegativeMatrixFactorizationEDFactory", "factorizationQuality": "HIGH"
    "@type": "NonnegativeMatrixFactorizationKLFactory", "factorizationQuality": "HIGH"
    "@type": "PartialSingularValueDecompositionFactory"
  }
}
factorizationFactory
- Type
- org.carrot2.math.matrix.MatrixFactorizationFactory
- Default
- NonnegativeMatrixFactorizationEDFactory
- Path
- matrixReducer.factorizationFactory
- Java snippet
- algorithmInstance.matrixReducer.factorizationFactory
Factorization method. The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels.
Available implementations:
- KMeansMatrixFactorizationFactory
Performs matrix factorization using the k-means clustering algorithm. This kind of factorization is sometimes referred to as Concept Decomposition Factorization.
{"factorizationQuality": "HIGH"}
factorizationQuality
- Type
- org.carrot2.math.matrix.FactorizationQuality
- Default
- HIGH
- Constraints
- value in [LOW, MEDIUM, HIGH]
- Path
- matrixReducer.factorizationFactory.factorizationQuality
- Java snippet
- ((org.carrot2.math.matrix.KMeansMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- LocalNonnegativeMatrixFactorizationFactory
Performs matrix factorization using the Local Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.
{"factorizationQuality": "HIGH"}
factorizationQuality
- Type
- org.carrot2.math.matrix.FactorizationQuality
- Default
- HIGH
- Constraints
- value in [LOW, MEDIUM, HIGH]
- Path
- matrixReducer.factorizationFactory.factorizationQuality
- Java snippet
- ((org.carrot2.math.matrix.LocalNonnegativeMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- NonnegativeMatrixFactorizationEDFactory
Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of Euclidean Distance between A and UV' and multiplicative updating.
{"factorizationQuality": "HIGH"}
factorizationQuality
- Type
- org.carrot2.math.matrix.FactorizationQuality
- Default
- HIGH
- Constraints
- value in [LOW, MEDIUM, HIGH]
- Path
- matrixReducer.factorizationFactory.factorizationQuality
- Java snippet
- ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationEDFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- NonnegativeMatrixFactorizationKLFactory
Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.
{"factorizationQuality": "HIGH"}
factorizationQuality
- Type
- org.carrot2.math.matrix.FactorizationQuality
- Default
- HIGH
- Constraints
- value in [LOW, MEDIUM, HIGH]
- Path
- matrixReducer.factorizationFactory.factorizationQuality
- Java snippet
- ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationKLFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality
Number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
- PartialSingularValueDecompositionFactory
Performs matrix factorization using the Singular Value Decomposition algorithm.
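As an illustration of what "minimization of Euclidean Distance between A and UV' and multiplicative updating" means, the sketch below implements the textbook multiplicative update rules for non-negative matrix factorization. It is not the Carrot2 implementation: the class NmfEdSketch, the random initialization, and the explicit iterations argument (standing in for whatever iteration count a given factorizationQuality maps to) are all assumptions.

```java
import java.util.Random;

final class NmfEdSketch {
  private static final double EPS = 1e-9; // guards against division by zero

  /** Returns {U, V'} with U of size m x k and V' of size k x n, so that A is approximated by U * V'. */
  static double[][][] factorize(double[][] a, int k, int iterations, long seed) {
    Random random = new Random(seed);
    double[][] u = randomPositive(a.length, k, random);
    double[][] vt = randomPositive(k, a[0].length, random);
    for (int it = 0; it < iterations; it++) {
      // V' <- V' .* (U'A) ./ (U'U V')
      double[][] ut = transpose(u);
      scale(vt, multiply(ut, a), multiply(multiply(ut, u), vt));
      // U <- U .* (A V) ./ (U V'V)
      double[][] v = transpose(vt);
      scale(u, multiply(a, v), multiply(u, multiply(vt, v)));
    }
    return new double[][][] {u, vt};
  }

  /** Squared Euclidean distance between A and U * V'. */
  static double squaredError(double[][] a, double[][] u, double[][] vt) {
    double[][] approx = multiply(u, vt);
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < a[0].length; j++) {
        double d = a[i][j] - approx[i][j];
        sum += d * d;
      }
    }
    return sum;
  }

  /** Element-wise multiplicative update: target *= numerator / denominator. */
  private static void scale(double[][] target, double[][] num, double[][] den) {
    for (int i = 0; i < target.length; i++) {
      for (int j = 0; j < target[0].length; j++) {
        target[i][j] *= num[i][j] / (den[i][j] + EPS);
      }
    }
  }

  private static double[][] randomPositive(int rows, int cols, Random random) {
    double[][] r = new double[rows][cols];
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        r[i][j] = 0.1 + random.nextDouble(); // strictly positive start values
      }
    }
    return r;
  }

  private static double[][] multiply(double[][] x, double[][] y) {
    double[][] r = new double[x.length][y[0].length];
    for (int i = 0; i < x.length; i++) {
      for (int j = 0; j < y[0].length; j++) {
        for (int p = 0; p < y.length; p++) {
          r[i][j] += x[i][p] * y[p][j];
        }
      }
    }
    return r;
  }

  private static double[][] transpose(double[][] x) {
    double[][] r = new double[x[0].length][x.length];
    for (int i = 0; i < x.length; i++) {
      for (int j = 0; j < x[0].length; j++) {
        r[j][i] = x[i][j];
      }
    }
    return r;
  }
}
```

Because every update multiplies by a non-negative ratio, the factors stay non-negative, and running more iterations trades clustering time for a closer approximation, which is the trade-off factorizationQuality is described as controlling.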
preprocessing
- Type
- org.carrot2.text.preprocessing.CompletePreprocessingPipeline
- Default
- CompletePreprocessingPipeline
- Path
- preprocessing
- Java snippet
- algorithmInstance.preprocessing
Configuration of the text preprocessing stage.
{"wordDfThreshold": 1}
documentAssigner
- Type
- org.carrot2.text.preprocessing.DocumentAssigner
- Default
- DocumentAssigner
- Path
- preprocessing.documentAssigner
- Java snippet
- algorithmInstance.preprocessing.documentAssigner
Document assigner used by the algorithm; contains modifiable parameters.
{"minClusterSize": 2}
exactPhraseAssignment
- Type
- Boolean
- Default
- false
- Path
- preprocessing.documentAssigner.exactPhraseAssignment
- Java snippet
- algorithmInstance.preprocessing.documentAssigner.exactPhraseAssignment
Only exact phrase assignments. When set to true, clusters will contain only the documents that contain the cluster's label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, increasing the precision of assignment, but also increasing the "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments.
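The difference between the two modes can be sketched as follows. The exact mode follows the description above (the label's words must appear contiguously and in order). The relaxed mode is an assumption made for illustration: here it only requires that every word of the label occurs somewhere in the document. PhraseAssignmentSketch and its methods are hypothetical, and tokenization is simplified to lower-casing and splitting on whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PhraseAssignmentSketch {
  /** Simplified tokenization: lower-case, split on whitespace. */
  static List<String> tokens(String text) {
    return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
  }

  /** Decides whether a document would be assigned to a cluster with the given label. */
  static boolean matches(String document, String label, boolean exactPhraseAssignment) {
    List<String> documentTokens = tokens(document);
    List<String> labelTokens = tokens(label);
    if (exactPhraseAssignment) {
      // The label must occur as a contiguous, ordered run of words.
      return Collections.indexOfSubList(documentTokens, labelTokens) >= 0;
    }
    // Assumed relaxed rule: all label words present, in any order.
    return documentTokens.containsAll(labelTokens);
  }
}
```

With the document "open source data mining tools", the label "Data Mining" matches in both modes, whereas the label "Mining Data" matches only when exactPhraseAssignment is false.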
minClusterSize
- Type
- Integer
- Default
- 2
- Constraints
- value >= 1 and value <= 100
- Path
- preprocessing.documentAssigner.minClusterSize
- Java snippet
- algorithmInstance.preprocessing.documentAssigner.minClusterSize
Minimum required number of documents in each cluster. Clusters containing fewer documents will not be created.
labelFilters
- Type
- org.carrot2.text.preprocessing.LabelFilterProcessor
- Default
- LabelFilterProcessor
- Path
- preprocessing.labelFilters
- Java snippet
- algorithmInstance.preprocessing.labelFilters
Label filtering is a composite of individual filters.
{}
completeLabelFilter
- Type
- org.carrot2.text.preprocessing.filter.CompleteLabelFilter
- Default
- CompleteLabelFilter
- Path
- preprocessing.labelFilters.completeLabelFilter
- Java snippet
- algorithmInstance.preprocessing.labelFilters.completeLabelFilter
Truncated phrase filter for this processor.
{"labelOverrideThreshold": 0.65}
enabled
- Type
- Boolean
- Default
- true
- Path
- preprocessing.labelFilters.completeLabelFilter.enabled
- Java snippet
- algorithmInstance.preprocessing.labelFilters.completeLabelFilter.enabled
Enables or disables this filter.
labelOverrideThreshold
- Type
- Double
- Default
- 0.65
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- preprocessing.labelFilters.completeLabelFilter.labelOverrideThreshold
- Java snippet
- algorithmInstance.preprocessing.labelFilters.completeLabelFilter.labelOverrideThreshold
Determines the strength of the truncated label filter. The lowest value means the strongest elimination of truncated labels, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.
genitiveLabelFilter
- Type
- org.carrot2.text.preprocessing.filter.GenitiveLabelFilter
- Default
- GenitiveLabelFilter
- Path
- preprocessing.labelFilters.genitiveLabelFilter
- Java snippet
- algorithmInstance.preprocessing.labelFilters.genitiveLabelFilter
Genitive label filter.
{"enabled": true}
enabled
- Type
- Boolean
- Default
- true
- Path
- preprocessing.labelFilters.genitiveLabelFilter.enabled
- Java snippet
- algorithmInstance.preprocessing.labelFilters.genitiveLabelFilter.enabled
Enables or disables this filter.
minLengthLabelFilter
- Type
- org.carrot2.text.preprocessing.filter.MinLengthLabelFilter
- Default
- MinLengthLabelFilter
- Path
- preprocessing.labelFilters.minLengthLabelFilter
- Java snippet
- algorithmInstance.preprocessing.labelFilters.minLengthLabelFilter
Min length label filter.
enabled
- Type
- Boolean
- Default
- true
- Path
- preprocessing.labelFilters.minLengthLabelFilter.enabled
- Java snippet
- algorithmInstance.preprocessing.labelFilters.minLengthLabelFilter.enabled
Enables or disables this filter.
minLength
- Type
- Integer
- Default
- 3
- Path
- preprocessing.labelFilters.minLengthLabelFilter.minLength
- Java snippet
- algorithmInstance.preprocessing.labelFilters.minLengthLabelFilter.minLength
Minimum required label length, in characters, inclusive.
numericLabelFilter
- Type
- org.carrot2.text.preprocessing.filter.NumericLabelFilter
- Default
- NumericLabelFilter
- Path
- preprocessing.labelFilters.numericLabelFilter
- Java snippet
- algorithmInstance.preprocessing.labelFilters.numericLabelFilter
Numeric label filter for this processor.
{"enabled": true}
enabled
- Type
- Boolean
- Default
- true
- Path
- preprocessing.labelFilters.numericLabelFilter.enabled
- Java snippet
- algorithmInstance.preprocessing.labelFilters.numericLabelFilter.enabled
Enables or disables this filter.
queryLabelFilter
- Type
- org.carrot2.text.preprocessing.filter.QueryLabelFilter
- Default
- QueryLabelFilter
- Path
- preprocessing.labelFilters.queryLabelFilter
- Java snippet
- algorithmInstance.preprocessing.labelFilters.queryLabelFilter
Query word label filter for this processor.
{"enabled": true}
enabled
- Type
- Boolean
- Default
- true
- Path
- preprocessing.labelFilters.queryLabelFilter.enabled
- Java snippet
- algorithmInstance.preprocessing.labelFilters.queryLabelFilter.enabled
Enables or disables this filter.
stopLabelFilter
- Type
- org.carrot2.text.preprocessing.filter.StopLabelFilter
- Default
- StopLabelFilter
- Path
- preprocessing.labelFilters.stopLabelFilter
- Java snippet
- algorithmInstance.preprocessing.labelFilters.stopLabelFilter
Stop label filter.
{"enabled": true}
enabled
- Type
- Boolean
- Default
- true
- Path
- preprocessing.labelFilters.stopLabelFilter.enabled
- Java snippet
- algorithmInstance.preprocessing.labelFilters.stopLabelFilter.enabled
Enables or disables this filter.
stopWordLabelFilter
- Type
- org.carrot2.text.preprocessing.filter.StopWordLabelFilter
- Default
- StopWordLabelFilter
- Path
- preprocessing.labelFilters.stopWordLabelFilter
- Java snippet
- algorithmInstance.preprocessing.labelFilters.stopWordLabelFilter
Stop word label filter for this processor.
{"enabled": true}
enabled
- Type
- Boolean
- Default
- true
- Path
- preprocessing.labelFilters.stopWordLabelFilter.enabled
- Java snippet
- algorithmInstance.preprocessing.labelFilters.stopWordLabelFilter.enabled
Enables or disables this filter.
phraseDfThreshold
- Type
- Integer
- Default
- 1
- Constraints
- value >= 1 and value <= 100
- Path
- preprocessing.phraseDfThreshold
- Java snippet
- algorithmInstance.preprocessing.phraseDfThreshold
Phrase Document Frequency cut-off threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.
wordDfThreshold
- Type
- Integer
- Default
- 1
- Constraints
- value >= 1 and value <= 100
- Path
- preprocessing.wordDfThreshold
- Java snippet
- algorithmInstance.preprocessing.wordDfThreshold
Word Document Frequency cut-off threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
queryHint
- Type
- String
- Default
- null
- Path
- queryHint
- Java snippet
- algorithmInstance.queryHint
Query terms used to retrieve documents being clustered. The query is used as a hint to avoid creating trivial clusters consisting only of query words.
scoreWeight
- Type
- Double
- Default
- 0
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoreWeight
- Java snippet
- algorithmInstance.scoreWeight
Balance between cluster score and size during cluster sorting. Value equal to 0.0 will cause Lingo to sort clusters based only on cluster size. Value equal to 1.0 will cause Lingo to sort clusters based only on cluster score.
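One way to picture this balance is a single sort key that blends size and score. The sketch below assumes a geometric blend, size^(1 - scoreWeight) * score^scoreWeight, which reproduces the two documented endpoints (size only at 0.0, score only at 1.0); how Lingo combines the two values for intermediate weights is not stated here. ScoreWeightSketch, Cluster, sortKey and sortedLabels are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ScoreWeightSketch {
  /** Minimal stand-in for a cluster: a label, a document count and a score. */
  static final class Cluster {
    final String label;
    final int size;
    final double score;

    Cluster(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Assumed blend: reduces to the size at weight 0.0 and to the score at weight 1.0. */
  static double sortKey(Cluster cluster, double scoreWeight) {
    return Math.pow(cluster.size, 1.0 - scoreWeight) * Math.pow(cluster.score, scoreWeight);
  }

  /** Returns cluster labels ordered from the highest sort key to the lowest. */
  static List<String> sortedLabels(List<Cluster> clusters, double scoreWeight) {
    List<Cluster> sorted = new ArrayList<>(clusters);
    sorted.sort(Comparator.comparingDouble((Cluster c) -> sortKey(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (Cluster cluster : sorted) {
      labels.add(cluster.label);
    }
    return labels;
  }
}
```

For three hypothetical clusters A (10 documents, score 1.0), B (3 documents, score 9.0) and C (5 documents, score 4.0), a weight of 0.0 orders them A, C, B, and a weight of 1.0 orders them B, C, A.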