Lingo attributes

You can tune various aspects of Lingo clustering by changing some of the attributes of the algorithm.

Below is the list of algorithm attributes along with their default values. When pasting the JSON into your REST API requests, choose one of the available configuration variants where noted.

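{
  "clusterBuilder": {
    "clusterMergingThreshold": 0.7,
    "labelAssigner": {
      // 2 configuration variants available, choose one
      "@type": "SimpleLabelAssigner"
      "@type": "UniqueLabelAssigner"
    },
    "phraseLabelBoost": 1.5,
    "phraseLengthPenaltyStart": 8,
    "phraseLengthPenaltyStop": 8
  },
  "desiredClusterCount": 30,
  "matrixBuilder": {
    "boostFields": [],
    "boostedFieldWeight": 2,
    "maxWordDf": 0.9,
    "maximumMatrixSize": 37500,
    "termWeighting": {
      // 3 configuration variants available, choose one
      "@type": "LinearTfIdfTermWeighting"
      "@type": "LogTfIdfTermWeighting"
      "@type": "TfTermWeighting"
    }
  },
  "matrixReducer": {
    "factorizationFactory": {
      // 5 configuration variants available, choose one
      "@type": "KMeansMatrixFactorizationFactory",
      "factorizationQuality": "HIGH"

      "@type": "LocalNonnegativeMatrixFactorizationFactory",
      "factorizationQuality": "HIGH"

      "@type": "NonnegativeMatrixFactorizationEDFactory",
      "factorizationQuality": "HIGH"

      "@type": "NonnegativeMatrixFactorizationKLFactory",
      "factorizationQuality": "HIGH"

      "@type": "PartialSingularValueDecompositionFactory"
    }
  },
  "preprocessing": {
    "documentAssigner": {
      "exactPhraseAssignment": false,
      "minClusterSize": 2
    },
    "labelFilters": {
      "completeLabelFilter": {
        "labelOverrideThreshold": 0.65
      },
      "genitiveLabelFilter": {},
      "minLengthLabelFilter": {
        "minLength": 3
      },
      "numericLabelFilter": {},
      "queryLabelFilter": {},
      "stopLabelFilter": {},
      "stopWordLabelFilter": {}
    },
    "phraseDfThreshold": 1,
    "wordDfThreshold": 1
  },
  "queryHint": null,
  "scoreWeight": 0
}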

clusterMergingThreshold

Type
Double
Default
0.7
Constraints
value >= 0.0 and value <= 1.0
Path
clusterBuilder.clusterMergingThreshold
Java snippet
algorithmInstance.clusterBuilder.clusterMergingThreshold

Cluster merging threshold. The overlap between two clusters' documents, expressed as a fraction between 0.0 and 1.0, required for the clusters to be merged into one cluster. Low values will result in more aggressive merging, which may lead to irrelevant documents in clusters. High values will result in fewer clusters being merged, which may lead to very similar or duplicated clusters.
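
The merging rule can be illustrated with a minimal sketch. The class and method names below are hypothetical, and the way overlap is normalized (shared documents divided by the size of the larger cluster) is an assumption; this page does not state how Lingo computes it.

```java
import java.util.HashSet;
import java.util.Set;

final class ClusterMergeSketch {
  /** Fraction of shared documents, relative to the larger of the two clusters (assumed definition). */
  static double overlap(Set<Integer> first, Set<Integer> second) {
    if (first.isEmpty() || second.isEmpty()) {
      return 0.0;
    }
    Set<Integer> common = new HashSet<>(first);
    common.retainAll(second);
    return common.size() / (double) Math.max(first.size(), second.size());
  }

  /** Two clusters are merged when their overlap reaches the threshold. */
  static boolean shouldMerge(Set<Integer> first, Set<Integer> second, double clusterMergingThreshold) {
    return overlap(first, second) >= clusterMergingThreshold;
  }

  public static void main(String[] args) {
    Set<Integer> first = Set.of(1, 2, 3, 4);
    Set<Integer> second = Set.of(2, 3, 4, 5);
    System.out.println(overlap(first, second));          // 3 shared documents out of 4
    System.out.println(shouldMerge(first, second, 0.7)); // merged at the default threshold
    System.out.println(shouldMerge(first, second, 0.8)); // kept separate at a stricter threshold
  }
}
```

Under this assumed definition, lowering the threshold lets clusters that share fewer documents merge, which matches the trade-off described above.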

labelAssigner

Type
org.carrot2.clustering.lingo.LabelAssigner
Default
UniqueLabelAssigner
Path
clusterBuilder.labelAssigner
Java snippet
algorithmInstance.clusterBuilder.labelAssigner

Cluster label assignment method.

Available implementations:

  1. SimpleLabelAssigner

    A simple and fast label assigner. For each base vector, chooses the label whose term vector has the highest cosine similarity to that base vector. Different vectors can be assigned the same label, which means the number of final labels (after duplicate removal) may be smaller than the number of input base vectors.

  2. UniqueLabelAssigner

    Assigns unique labels to each base vector using a greedy algorithm. For each base vector, chooses the label whose term vector has the highest cosine similarity to that base vector and that has not been previously selected. Once a label is selected, it is not used to label any other vector. This algorithm does not create duplicate cluster labels, which usually means it creates more clusters than SimpleLabelAssigner. It is also slightly slower than SimpleLabelAssigner.
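
The difference between the two strategies can be sketched on a small similarity matrix. This is an illustration only: the class and method names are hypothetical, the similarity values are made up, and the base vectors are visited in input order, which is an assumption about the greedy procedure.

```java
import java.util.Arrays;

final class LabelAssignerSketch {
  /** Simple strategy: every base vector independently takes its most similar label. */
  static int[] assignSimple(double[][] similarity) {
    int[] labels = new int[similarity.length];
    for (int v = 0; v < similarity.length; v++) {
      labels[v] = argMax(similarity[v], new boolean[similarity[v].length]);
    }
    return labels;
  }

  /** Unique strategy: a label taken by an earlier base vector is no longer available. */
  static int[] assignUnique(double[][] similarity) {
    int[] labels = new int[similarity.length];
    boolean[] used = new boolean[similarity.length == 0 ? 0 : similarity[0].length];
    for (int v = 0; v < similarity.length; v++) {
      int best = argMax(similarity[v], used);
      labels[v] = best;
      if (best >= 0) {
        used[best] = true;
      }
    }
    return labels;
  }

  /** Index of the largest value among entries not excluded, or -1 if none is left. */
  static int argMax(double[] row, boolean[] excluded) {
    int best = -1;
    for (int i = 0; i < row.length; i++) {
      if (!excluded[i] && (best < 0 || row[i] > row[best])) {
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Rows are base vectors, columns are candidate labels (made-up cosine similarities).
    double[][] similarity = {
      {0.9, 0.4, 0.1},
      {0.8, 0.7, 0.2},
      {0.1, 0.2, 0.6}
    };
    System.out.println(Arrays.toString(assignSimple(similarity)));
    System.out.println(Arrays.toString(assignUnique(similarity)));
  }
}
```

In this example the simple strategy gives the first two base vectors the same label, so one cluster disappears after duplicate removal, while the unique strategy keeps three distinct labels.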

phraseLabelBoost

Type
Double
Default
1.5
Constraints
value >= 0.0 and value <= 10.0
Path
clusterBuilder.phraseLabelBoost
Java snippet
algorithmInstance.clusterBuilder.phraseLabelBoost

Phrase label boost. The weight of multi-word labels relative to one-word labels. Low values will result in more one-word labels being produced; higher values will favor multi-word labels.

phraseLengthPenaltyStart

Type
Integer
Default
8
Constraints
value >= 2 and value <= 8
Path
clusterBuilder.phraseLengthPenaltyStart
Java snippet
algorithmInstance.clusterBuilder.phraseLengthPenaltyStart

Phrase length penalty start. The phrase length, in words, at which overlong multi-word labels start to be penalized. Phrases shorter than phraseLengthPenaltyStart will not be penalized.

phraseLengthPenaltyStop

Type
Integer
Default
8
Constraints
value >= 2 and value <= 8
Path
clusterBuilder.phraseLengthPenaltyStop
Java snippet
algorithmInstance.clusterBuilder.phraseLengthPenaltyStop

Phrase length penalty stop. The phrase length, in words, beyond which overlong multi-word labels are removed completely. Phrases longer than phraseLengthPenaltyStop will be removed.
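
How the two penalty attributes interact can be sketched as a score multiplier. The two boundary rules come from the descriptions above; the linear ramp between them, and the class and method names, are assumptions made for illustration and may differ from what Lingo actually does.

```java
final class PhraseLengthPenaltySketch {
  /**
   * Returns a multiplier for a label's score: 1.0 means no penalty, 0.0 means the label is removed.
   * The linear ramp between the two thresholds is an assumed shape.
   */
  static double multiplier(int phraseLength, int penaltyStart, int penaltyStop) {
    if (phraseLength < penaltyStart) {
      return 1.0; // shorter than phraseLengthPenaltyStart: not penalized
    }
    if (phraseLength > penaltyStop) {
      return 0.0; // longer than phraseLengthPenaltyStop: removed
    }
    // Assumed ramp: falls linearly from 1.0 at (penaltyStart - 1) to 0.0 at (penaltyStop + 1).
    return (penaltyStop + 1 - phraseLength) / (double) (penaltyStop + 2 - penaltyStart);
  }

  public static void main(String[] args) {
    for (int length = 3; length <= 7; length++) {
      System.out.println(length + " words -> " + multiplier(length, 4, 6));
    }
  }
}
```

With both attributes at their default of 8, only the boundary rules matter in practice: labels of up to 7 words are unaffected and labels of 9 or more words are dropped.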

desiredClusterCount

Type
Integer
Default
30
Constraints
value >= 2 and value <= 100
Path
desiredClusterCount
Java snippet
algorithmInstance.desiredClusterCount

Desired cluster count. A factor used to calculate the number of clusters based on the number of documents on input. The larger the value, the more clusters will be created. The number of clusters created by the algorithm will be proportionally adjusted to the desired cluster count, but may be different.
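
As a rough illustration of a factor that scales with the number of input documents, the sketch below grows the estimate with the square root of the document count. The formula, the class name, and the method name are assumptions; this page does not give the exact calculation.

```java
final class ClusterCountSketch {
  /** Assumed estimate: proportional to desiredClusterCount and to sqrt(documentCount). */
  static int estimateClusterCount(int desiredClusterCount, int documentCount) {
    int estimate = (int) ((desiredClusterCount / 10.0) * Math.sqrt(documentCount));
    // There cannot be more clusters than documents.
    return Math.min(estimate, documentCount);
  }

  public static void main(String[] args) {
    System.out.println(estimateClusterCount(30, 100));
    System.out.println(estimateClusterCount(60, 100));
    System.out.println(estimateClusterCount(30, 400));
  }
}
```

Whatever the exact formula, the documented behavior is the same: a larger desiredClusterCount yields more clusters, and the final number may still differ from the estimate.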

boostFields

Type
String[]
Default
[]
Path
matrixBuilder.boostFields
Java snippet
algorithmInstance.matrixBuilder.boostFields

Specifies a list of field names that are boosted by the boostedFieldWeight attribute. Content of the fields listed in this attribute can be given more weight during clustering.

boostedFieldWeight

Type
Double
Default
2
Constraints
value >= 0.0 and value <= 10.0
Path
matrixBuilder.boostedFieldWeight
Java snippet
algorithmInstance.matrixBuilder.boostedFieldWeight

Gives more weight to words that appear in the fields listed in boostFields (for example, title fields). The larger the value, the stronger the boost these words will receive.

maxWordDf

Type
Double
Default
0.9
Constraints
value >= 0.0 and value <= 1.0
Path
matrixBuilder.maxWordDf
Java snippet
algorithmInstance.matrixBuilder.maxWordDf

Maximum word document frequency. The maximum document frequency allowed for words, as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such a case, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, that is, clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values: 0.1 or 0.05.
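
The filtering rule can be sketched as follows. The class and method names are hypothetical and the documents are made-up sets of words; the sketch only shows the "document frequency as a fraction of all documents" comparison described above.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class MaxWordDfSketch {
  /** Returns the words whose document frequency, as a fraction of all documents, is at most maxWordDf. */
  static Set<String> retainedWords(List<Set<String>> documents, double maxWordDf) {
    Map<String, Integer> documentFrequency = new TreeMap<>();
    for (Set<String> document : documents) {
      for (String word : document) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }
    Set<String> retained = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      double fraction = entry.getValue() / (double) documents.size();
      if (fraction <= maxWordDf) {
        retained.add(entry.getKey());
      }
    }
    return retained;
  }

  public static void main(String[] args) {
    List<Set<String>> documents = List.of(
        Set.of("acme", "cloud", "storage"),
        Set.of("acme", "cloud", "storage"),
        Set.of("acme", "storage"),
        Set.of("acme", "pricing"),
        Set.of("acme"));
    System.out.println(retainedWords(documents, 0.9)); // "acme" (in 100% of documents) is dropped
    System.out.println(retainedWords(documents, 0.4)); // only words in at most 40% of documents remain
  }
}
```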

maximumMatrixSize

Type
Integer
Default
37500
Constraints
value >= 5000
Path
matrixBuilder.maximumMatrixSize
Java snippet
algorithmInstance.matrixBuilder.maximumMatrixSize

Maximum term-document matrix size. Determines the maximum number of elements in the term-document matrix. The larger the size, the more accurate the clustering, but also the more time and memory it consumes.

termWeighting

Type
org.carrot2.text.vsm.TermWeighting
Default
LogTfIdfTermWeighting
Path
matrixBuilder.termWeighting
Java snippet
algorithmInstance.matrixBuilder.termWeighting

Term weighting. The method for calculating weight of words in the term-document matrices.

Available implementations:

  1. LinearTfIdfTermWeighting

    Calculates term-document matrix element values based on Linear Inverse Term Frequency.

  2. LogTfIdfTermWeighting

    Calculates term-document matrix element values based on Log Inverse Term Frequency.

  3. TfTermWeighting

    Calculates term-document matrix element values based on Term Frequency.
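
The three weighting schemes can be compared with a small sketch. The formulas below follow common TF-IDF conventions and are assumptions, not definitions taken from this page; the class and method names are hypothetical.

```java
final class TermWeightingSketch {
  /** Plain term frequency: the weight is the number of occurrences in the document. */
  static double tf(int termFrequency) {
    return termFrequency;
  }

  /** Assumed linear variant: the weight shrinks linearly as the term appears in more documents. */
  static double linearTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0 || documentCount <= 1) {
      return 0;
    }
    return termFrequency * ((documentCount - documentFrequency) / (double) (documentCount - 1));
  }

  /** Assumed logarithmic variant: the weight shrinks with the log of the document frequency ratio. */
  static double logTfIdf(int termFrequency, int documentFrequency, int documentCount) {
    if (documentFrequency <= 0) {
      return 0;
    }
    return termFrequency * Math.log(documentCount / (double) documentFrequency);
  }

  public static void main(String[] args) {
    int documentCount = 11;
    for (int df : new int[] {1, 6, 11}) {
      System.out.println("df=" + df
          + " tf=" + tf(2)
          + " linear=" + linearTfIdf(2, df, documentCount)
          + " log=" + logTfIdf(2, df, documentCount));
    }
  }
}
```

Under these assumed formulas, both TF-IDF variants give zero weight to a term that appears in every document, while plain term frequency ignores how common the term is across documents.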

factorizationFactory

Type
org.carrot2.math.matrix.MatrixFactorizationFactory
Default
NonnegativeMatrixFactorizationEDFactory
Path
matrixReducer.factorizationFactory
Java snippet
algorithmInstance.matrixReducer.factorizationFactory

Factorization method. The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels.

Available implementations:

  1. KMeansMatrixFactorizationFactory

    Performs matrix factorization using the k-means clustering algorithm. This kind of factorization is sometimes referred to as Concept Decomposition Factorization.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.KMeansMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

  2. LocalNonnegativeMatrixFactorizationFactory

    Performs matrix factorization using the Local Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.LocalNonnegativeMatrixFactorizationFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

  3. NonnegativeMatrixFactorizationEDFactory

    Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of Euclidean Distance between A and UV' and multiplicative updating.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationEDFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

  4. NonnegativeMatrixFactorizationKLFactory

    Performs matrix factorization using the Non-negative Matrix Factorization algorithm with minimization of the Kullback-Leibler divergence between A and UV' and multiplicative updating.

    factorizationQuality

    Type
    org.carrot2.math.matrix.FactorizationQuality
    Default
    HIGH
    Constraints
    value in [LOW, MEDIUM, HIGH]
    Path
    matrixReducer.factorizationFactory.factorizationQuality
    Java snippet
    ((org.carrot2.math.matrix.NonnegativeMatrixFactorizationKLFactory) algorithmInstance.matrixReducer.factorizationFactory).factorizationQuality

    Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.

  5. PartialSingularValueDecompositionFactory

    Performs matrix factorization using the Singular Value Decomposition algorithm.
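
To make the Euclidean Distance variant more concrete, here is a minimal, unoptimized sketch of non-negative matrix factorization with multiplicative updates in the style of Lee and Seung, approximating A by UV'. It is not Carrot2's implementation: the class and method names are hypothetical, and the iterations parameter stands in for whatever iteration count a given factorizationQuality level translates to, which this page does not specify.

```java
import java.util.Random;

final class NmfEdSketch {
  /** Squared Euclidean distance between A and U * V'. */
  static double error(double[][] a, double[][] u, double[][] v) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < a[i].length; j++) {
        double approx = 0;
        for (int c = 0; c < u[i].length; c++) {
          approx += u[i][c] * v[j][c];
        }
        sum += (a[i][j] - approx) * (a[i][j] - approx);
      }
    }
    return sum;
  }

  /** Returns X' * X, a k-by-k matrix for an input with k columns. */
  static double[][] gram(double[][] x) {
    int k = x[0].length;
    double[][] g = new double[k][k];
    for (double[] row : x) {
      for (int p = 0; p < k; p++) {
        for (int q = 0; q < k; q++) {
          g[p][q] += row[p] * row[q];
        }
      }
    }
    return g;
  }

  /**
   * In-place update: target <- target .* (M * other) ./ (target * other' * other),
   * where M is A' when transposed is true and A otherwise.
   */
  static void multiplicativeUpdate(double[][] a, double[][] target, double[][] other, boolean transposed) {
    int k = other[0].length;
    double[][] g = gram(other);
    for (int r = 0; r < target.length; r++) {
      double[] updated = new double[k];
      for (int c = 0; c < k; c++) {
        double numerator = 0;
        for (int s = 0; s < other.length; s++) {
          numerator += (transposed ? a[s][r] : a[r][s]) * other[s][c];
        }
        double denominator = 1e-9; // small constant to avoid division by zero
        for (int q = 0; q < k; q++) {
          denominator += target[r][q] * g[q][c];
        }
        updated[c] = target[r][c] * numerator / denominator;
      }
      target[r] = updated;
    }
  }

  static double[][] randomMatrix(int rows, int cols, Random random) {
    double[][] m = new double[rows][cols];
    for (double[] row : m) {
      for (int c = 0; c < cols; c++) {
        row[c] = 0.01 + random.nextDouble(); // strictly positive start
      }
    }
    return m;
  }

  /** Runs the given number of update rounds and returns the final approximation error. */
  static double factorize(double[][] a, int k, int iterations, long seed) {
    Random random = new Random(seed);
    double[][] u = randomMatrix(a.length, k, random);
    double[][] v = randomMatrix(a[0].length, k, random);
    for (int it = 0; it < iterations; it++) {
      multiplicativeUpdate(a, v, u, true);  // V <- V .* (A'U) ./ (V U'U)
      multiplicativeUpdate(a, u, v, false); // U <- U .* (AV) ./ (U V'V)
    }
    return error(a, u, v);
  }

  public static void main(String[] args) {
    // Made-up term-document matrix with two obvious topics.
    double[][] a = {{1, 1, 0, 0}, {1, 1, 0, 0}, {0, 0, 1, 1}, {0, 0, 1, 1}};
    for (int iterations : new int[] {0, 5, 50}) {
      System.out.println(iterations + " iterations, error " + factorize(a, 2, iterations, 42L));
    }
  }
}
```

More iterations generally give a closer approximation at a higher cost, which is the trade-off the factorizationQuality attribute exposes.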

exactPhraseAssignment

Type
Boolean
Default
false
Path
preprocessing.documentAssigner.exactPhraseAssignment
Java snippet
algorithmInstance.preprocessing.documentAssigner.exactPhraseAssignment

Only exact phrase assignments. When set to true, clusters will contain only the documents that contain the cluster's label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, increasing the precision of assignment, but also increasing the "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments.
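
The two assignment modes can be contrasted with a small sketch. The exact mode follows the description above (the label's words must appear in order). The relaxed mode shown here, where every label word must appear somewhere in the document, is an assumption about what happens when the option is disabled. Class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

final class PhraseAssignmentSketch {
  /** Exact mode: the label must occur as a contiguous, ordered sequence of words. */
  static boolean containsExactPhrase(List<String> documentWords, List<String> labelWords) {
    return Collections.indexOfSubList(documentWords, labelWords) >= 0;
  }

  /** Relaxed mode (assumed): all label words must occur, in any order and position. */
  static boolean containsAllWords(List<String> documentWords, List<String> labelWords) {
    return documentWords.containsAll(labelWords);
  }

  public static void main(String[] args) {
    List<String> label = List.of("text", "mining");
    List<String> first = List.of("text", "mining", "tools");
    List<String> second = List.of("mining", "of", "text", "data");

    System.out.println(containsExactPhrase(first, label));  // phrase present in order
    System.out.println(containsExactPhrase(second, label)); // words present, but not as a phrase
    System.out.println(containsAllWords(second, label));    // relaxed mode still matches
  }
}
```

In this example the second document would join the cluster only in the relaxed mode, and would otherwise be a candidate for the "Other Topics" group.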

minClusterSize

Type
Integer
Default
2
Constraints
value >= 1 and value <= 100
Path
preprocessing.documentAssigner.minClusterSize
Java snippet
algorithmInstance.preprocessing.documentAssigner.minClusterSize

Determines the minimum number of documents in each cluster.

labelOverrideThreshold

Type
Double
Default
0.65
Constraints
value >= 0.0 and value <= 1.0
Path
preprocessing.labelFilters.completeLabelFilter.labelOverrideThreshold
Java snippet
algorithmInstance.preprocessing.labelFilters.completeLabelFilter.labelOverrideThreshold

Truncated label threshold. Determines the strength of the truncated label filter. The lowest value means the strongest elimination of truncated labels, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.

minLength

Type
Integer
Default
3
Path
preprocessing.labelFilters.minLengthLabelFilter.minLength
Java snippet
algorithmInstance.preprocessing.labelFilters.minLengthLabelFilter.minLength

Minimum label length. Minimum label length, in words, inclusive.

phraseDfThreshold

Type
Integer
Default
1
Constraints
value >= 1 and value <= 100
Path
preprocessing.phraseDfThreshold
Java snippet
algorithmInstance.preprocessing.phraseDfThreshold

Phrase Document Frequency threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.

wordDfThreshold

Type
Integer
Default
1
Constraints
value >= 1 and value <= 100
Path
preprocessing.wordDfThreshold
Java snippet
algorithmInstance.preprocessing.wordDfThreshold

Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.

queryHint

Type
String
Default
null
Path
queryHint
Java snippet
algorithmInstance.queryHint

Query hint. Query terms used to retrieve documents being clustered. The query is used as a hint to avoid creating trivial clusters consisting only of query words.

scoreWeight

Type
Double
Default
0
Constraints
value >= 0.0 and value <= 1.0
Path
scoreWeight
Java snippet
algorithmInstance.scoreWeight

Balance between cluster score and size during cluster sorting. Value equal to 0.0 will cause Lingo to sort clusters based only on cluster size. Value equal to 1.0 will cause Lingo to sort clusters based only on cluster score.
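
One way to blend the two criteria is sketched below. Only the two end points (0.0 sorts by size, 1.0 sorts by score) come from the description above; the geometric blend used for values in between, and the class and method names, are assumptions.

```java
final class ClusterSortSketch {
  /** Assumed sort key: score^w * size^(1 - w); larger keys sort first. */
  static double sortKey(double score, int size, double scoreWeight) {
    return Math.pow(score, scoreWeight) * Math.pow(size, 1.0 - scoreWeight);
  }

  public static void main(String[] args) {
    // A large, low-scoring cluster against a small, high-scoring one (made-up values).
    double largeScore = 1.5;
    int largeSize = 40;
    double smallScore = 9.0;
    int smallSize = 5;

    for (double w : new double[] {0.0, 0.5, 1.0}) {
      boolean largeFirst = sortKey(largeScore, largeSize, w) > sortKey(smallScore, smallSize, w);
      System.out.println("scoreWeight=" + w + " -> " + (largeFirst ? "large" : "small") + " cluster first");
    }
  }
}
```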