STC parameters

You can tune various aspects of STC clustering by changing some of the parameters of the algorithm.

Below is the list of algorithm parameters along with their default values. Each parameter is documented in its own section further down.

{
  "dictionaries": {
    "labelFilters": [],
    "wordFilters": []
  },
  "documentCountBoost": 1,
  "ignoreWordIfInHigherDocsPercent": 0.9,
  "maxBaseClusters": 300,
  "maxClusters": 15,
  "maxPhraseOverlap": 0.6,
  "maxPhrasesPerLabel": 3,
  "maxWordsPerLabel": 4,
  "mergeStemEquivalentBaseClusters": true,
  "mergeThreshold": 0.6,
  "minBaseClusterScore": 2,
  "minBaseClusterSize": 2,
  "mostGeneralPhraseCoverage": 0.5,
  "optimalPhraseLength": 3,
  "optimalPhraseLengthDev": 2,
  "preprocessing": {
    "wordDfThreshold": 1
  },
  "queryHint": null,
  "scoreWeight": 0,
  "singleTermBoost": 0.5
}

dictionaries

Type: org.carrot2.language.EphemeralDictionaries
Default: EphemeralDictionaries
Path: dictionaries
Java snippet: algorithmInstance.dictionaries

Per-request overrides of language components (dictionaries).

{
  "labelFilters": [],
  "wordFilters": []
}

labelFilters

Type: org.carrot2.language.LabelFilterDictionary[]
Default: []
Path: dictionaries.labelFilters
Java snippet: algorithmInstance.dictionaries.labelFilters

Additional label filtering dictionaries (supplying filters for cluster labels that should be discarded from the output).

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"labelFilters": [{
   "exact": ["Cluster Label 1", "Foo Bar"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]
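
To make the two entry kinds concrete, here is a minimal Java sketch of a filter that combines an exact list with regular expressions. It is not the Carrot2 implementation: the class LabelFilterSketch and its method rejects are hypothetical, and the sketch assumes that exact entries are compared case-sensitively and that a regular expression must match the whole label. The same idea applies to wordFilters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a label filter; not the Carrot2 implementation. */
class LabelFilterSketch {
  private final Set<String> exact;
  private final List<Pattern> patterns = new ArrayList<>();

  LabelFilterSketch(Set<String> exact, List<String> regexps) {
    this.exact = exact;
    for (String regexp : regexps) {
      patterns.add(Pattern.compile(regexp));
    }
  }

  /** Returns true if the label should be discarded from the output. */
  boolean rejects(String label) {
    // Assumption: exact entries use plain, case-sensitive string equality.
    if (exact.contains(label)) {
      return true;
    }
    // Assumption: a pattern must match the entire label, not just a part of it.
    for (Pattern pattern : patterns) {
      if (pattern.matcher(label).matches()) {
        return true;
      }
    }
    return false;
  }
}
```

With the exact entry "Foo Bar" and the pattern ".+[0-9]{2}.+", this sketch rejects the labels "Foo Bar" and "Release 42 notes" and keeps "Data Mining".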

wordFilters

Type: org.carrot2.language.StopwordFilterDictionary[]
Default: []
Path: dictionaries.wordFilters
Java snippet: algorithmInstance.dictionaries.wordFilters

Additional stop word filtering dictionaries (supplying filters for words that should be discarded from the input).

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"wordFilters": [{
   "exact": ["word1", "word2"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]

documentCountBoost

Type: Double
Default: 1
Constraints: value >= 0.0
Path: documentCountBoost
Java snippet: algorithmInstance.documentCountBoost

Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.

ignoreWordIfInHigherDocsPercent

Type: Double
Default: 0.9
Constraints: value >= 0.0 and value <= 1.0
Path: ignoreWordIfInHigherDocsPercent
Java snippet: algorithmInstance.ignoreWordIfInHigherDocsPercent

Ignore words appearing in more than the provided fraction of documents. The value is a number between 0 and 1; a word that occurs in a larger fraction of snippets than this value is ignored.
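
The following Java sketch illustrates the rule under simple assumptions: documents are already tokenized into word lists, and a word is counted once per document. The class FrequentWordFilterSketch and the method tooFrequent are hypothetical names used only for illustration; they are not part of Carrot2.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only; not the Carrot2 implementation. */
class FrequentWordFilterSketch {
  /** Returns the words that occur in more than maxFraction of the documents. */
  static Set<String> tooFrequent(List<List<String>> documents, double maxFraction) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (List<String> doc : documents) {
      // Count each word at most once per document.
      for (String word : new HashSet<>(doc)) {
        docFreq.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
      if (entry.getValue() > maxFraction * documents.size()) {
        ignored.add(entry.getKey());
      }
    }
    return ignored;
  }
}
```

For four documents that all contain "data", the default of 0.9 marks "data" as ignored (4 is greater than 3.6), while a value of 1.0 ignores nothing.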

maxBaseClusters

Type: Integer
Default: 300
Constraints: value >= 2
Path: maxBaseClusters
Java snippet: algorithmInstance.maxBaseClusters

Maximum number of base clusters. Trims the base cluster array after the N-th position for the merging phase.

maxClusters

Type: Integer
Default: 15
Constraints: value >= 1
Path: maxClusters
Java snippet: algorithmInstance.maxClusters

Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded.

maxPhraseOverlap

Type: Double
Default: 0.6
Constraints: value >= 0.0 and value <= 1.0
Path: maxPhraseOverlap
Java snippet: algorithmInstance.maxPhraseOverlap

Maximum cluster phrase overlap.

maxPhrasesPerLabel

Type: Integer
Default: 3
Constraints: value >= 1
Path: maxPhrasesPerLabel
Java snippet: algorithmInstance.maxPhrasesPerLabel

Maximum number of phrases from base clusters to promote to the cluster's label.

maxWordsPerLabel

Type: Integer
Default: 4
Constraints: value >= 1
Path: maxWordsPerLabel
Java snippet: algorithmInstance.maxWordsPerLabel

Maximum allowed number of words per label. Base clusters formed by phrases with more words than this limit will be trimmed.

mergeStemEquivalentBaseClusters

Type: Boolean
Default: true
Path: mergeStemEquivalentBaseClusters
Java snippet: algorithmInstance.mergeStemEquivalentBaseClusters

Merge all stem-equivalent base clusters before running the merge phase.

mergeThreshold

Type: Double
Default: 0.6
Constraints: value >= 0.0 and value <= 1.0
Path: mergeThreshold
Java snippet: algorithmInstance.mergeThreshold

Base cluster merge threshold.
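
As an illustration of what such a threshold can control, the sketch below follows the merging criterion commonly described for Suffix Tree Clustering: two base clusters are merged when the documents they share make up a large enough part of each of them. This is an assumption about how the threshold is applied, not a statement of the Carrot2 implementation; the class MergeThresholdSketch, the method shouldMerge and the use of >= are all hypothetical.

```java
import java.util.BitSet;

/** Illustrative only; the exact criterion used by the algorithm may differ. */
class MergeThresholdSketch {
  /**
   * Each BitSet holds the indices of the documents in one base cluster.
   * Returns true if the shared documents cover at least mergeThreshold
   * of both clusters.
   */
  static boolean shouldMerge(BitSet a, BitSet b, double mergeThreshold) {
    BitSet common = (BitSet) a.clone();
    common.and(b);
    double shared = common.cardinality();
    return shared / a.cardinality() >= mergeThreshold
        && shared / b.cardinality() >= mergeThreshold;
  }
}
```

With the default of 0.6, clusters {0, 1, 2} and {1, 2, 3} would be merged in this sketch (two of three documents are shared on both sides), while {0, 1, 2, 3} and {2, 3, 4} would not (the shared documents cover only half of the first cluster).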

minBaseClusterScore

Type: Double
Default: 2
Constraints: value >= 0.0 and value <= 10.0
Path: minBaseClusterScore
Java snippet: algorithmInstance.minBaseClusterScore

Minimum base cluster score, before coverage merging.

minBaseClusterSize

Type: Integer
Default: 2
Constraints: value >= 2 and value <= 20
Path: minBaseClusterSize
Java snippet: algorithmInstance.minBaseClusterSize

Minimum required number of documents in a base cluster.

mostGeneralPhraseCoverage

Type: Double
Default: 0.5
Constraints: value >= 0.0 and value <= 1.0
Path: mostGeneralPhraseCoverage
Java snippet: algorithmInstance.mostGeneralPhraseCoverage

Minimum coverage required for a phrase to appear in cluster description.

optimalPhraseLength

Type: Integer
Default: 3
Constraints: value >= 1
Path: optimalPhraseLength
Java snippet: algorithmInstance.optimalPhraseLength

Optimal label length. A factor in calculation of the base cluster score.

optimalPhraseLengthDev

Type: Double
Default: 2
Constraints: value >= 0.5
Path: optimalPhraseLengthDev
Java snippet: algorithmInstance.optimalPhraseLengthDev

Optimal cluster label length's tolerance. A factor in calculation of the base cluster score.

preprocessing

Type: org.carrot2.text.preprocessing.BasicPreprocessingPipeline
Default: BasicPreprocessingPipeline
Path: preprocessing
Java snippet: algorithmInstance.preprocessing

Configuration of the text preprocessing stage.

{
  "wordDfThreshold": 1
}

wordDfThreshold

Type: Integer
Default: 1
Constraints: value >= 1 and value <= 100
Path: preprocessing.wordDfThreshold
Java snippet: algorithmInstance.preprocessing.wordDfThreshold

Word document frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
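
A minimal Java sketch of this lower bound, assuming each document has already been reduced to its set of distinct words. The class WordDfThresholdSketch and the method retained are hypothetical and not part of Carrot2.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative only; not the Carrot2 preprocessing pipeline. */
class WordDfThresholdSketch {
  /** Returns the document frequency of every word found in at least dfThreshold documents. */
  static Map<String, Integer> retained(List<Set<String>> documents, int dfThreshold) {
    Map<String, Integer> docFreq = new TreeMap<>();
    for (Set<String> doc : documents) {
      for (String word : doc) {
        docFreq.merge(word, 1, Integer::sum);
      }
    }
    // Drop words whose document frequency is below the threshold.
    docFreq.values().removeIf(count -> count < dfThreshold);
    return docFreq;
  }
}
```

With the default threshold of 1 every word is retained; raising it to 2 drops words that occur in a single document only.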

queryHint

Type: String
Default: null
Path: queryHint
Java snippet: algorithmInstance.queryHint

Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.

scoreWeight

Type: Double
Default: 0
Constraints: value >= 0.0 and value <= 1.0
Path: scoreWeight
Java snippet: algorithmInstance.scoreWeight

Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score.
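
One way to obtain this behavior is to sort by a weighted geometric blend of size and score. The Java sketch below assumes the sort key is size^(1 - scoreWeight) * score^scoreWeight, which reduces to the size at 0.0 and to the score at 1.0. That formula is an assumption made for illustration, and the names ScoreWeightSketch, ClusterInfo, weighted and sortedLabels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only; the blend formula is an assumption. */
class ScoreWeightSketch {
  /** Hypothetical cluster summary: label, number of documents, score. */
  static final class ClusterInfo {
    final String label;
    final int size;
    final double score;

    ClusterInfo(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Sort key: equals the size when scoreWeight is 0 and the score when it is 1. */
  static double weighted(ClusterInfo c, double scoreWeight) {
    return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
  }

  /** Returns cluster labels ordered from the highest to the lowest sort key. */
  static List<String> sortedLabels(List<ClusterInfo> clusters, double scoreWeight) {
    List<ClusterInfo> sorted = new ArrayList<>(clusters);
    sorted.sort(
        Comparator.comparingDouble((ClusterInfo c) -> weighted(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (ClusterInfo c : sorted) {
      labels.add(c.label);
    }
    return labels;
  }
}
```

For a large, low-scoring cluster A (10 documents, score 1.0) and a small, high-scoring cluster B (3 documents, score 8.0), this sketch puts A first at 0.0 and B first at 1.0.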

singleTermBoost

Type: Double
Default: 0.5
Constraints: value >= 0.0
Path: singleTermBoost
Java snippet: algorithmInstance.singleTermBoost

Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
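
The sketch below shows how documentCountBoost, optimalPhraseLength, optimalPhraseLengthDev and singleTermBoost could combine into a base cluster score. It rests on several assumptions: the penalty function is taken to be a Gaussian centered on optimalPhraseLength with deviation optimalPhraseLengthDev, singleTermBoost is taken to replace that penalty factor only, and the result is scaled by the document count times documentCountBoost. The class BaseClusterScoreSketch and the method score are hypothetical.

```java
/** Illustrative only; the scoring formula is an assumption, not the documented one. */
class BaseClusterScoreSketch {
  static double score(
      int phraseLength,
      int documentCount,
      double documentCountBoost,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double singleTermBoost) {
    double lengthFactor;
    if (phraseLength == 1 && singleTermBoost > 0) {
      // Single-term phrases bypass the length penalty.
      lengthFactor = singleTermBoost;
    } else {
      // Assumed Gaussian penalty: 1.0 at the optimal length, smaller further away.
      double distance = phraseLength - optimalPhraseLength;
      lengthFactor =
          Math.exp(-(distance * distance)
              / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    return lengthFactor * documentCount * documentCountBoost;
  }
}
```

Under these assumptions and the default values, a three-word phrase found in 10 documents scores 10.0, and a single-term phrase found in the same number of documents scores 5.0.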
