STC parameters

You can tune various aspects of STC clustering by changing some of the parameters of the algorithm.

Below is the list of algorithm parameters along with their default values.

dictionaries

Type
org.carrot2.language.EphemeralDictionaries
Default
EphemeralDictionaries
Path
dictionaries
Java snippet
algorithmInstance.dictionaries

Per-request overrides of language components (dictionaries).

labelFilters

Type
org.carrot2.language.LabelFilterDictionary[]
Default
[]
Path
dictionaries.labelFilters
Java snippet
algorithmInstance.dictionaries.labelFilters

Additional label filtering dictionaries (supplying filters that identify cluster labels to be discarded from the output).

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"labelFilters": [{
   "exact": ["Cluster Label 1", "Foo Bar"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]
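To make the two matching modes concrete, here is a minimal plain-Java sketch of a label filter that combines an exact-match set with regular expressions. It is an illustration only, not the Carrot2 implementation: the class `LabelFilterSketch` and its methods are hypothetical, and the choices to compare exact strings case-sensitively and to require each pattern to match the whole label are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the Carrot2 API.
class LabelFilterSketch {
  private final Set<String> exact;
  private final List<Pattern> patterns = new ArrayList<>();

  LabelFilterSketch(Set<String> exact, List<String> regexps) {
    this.exact = exact;
    for (String regexp : regexps) {
      patterns.add(Pattern.compile(regexp));
    }
  }

  // Returns true if the label should be discarded from the output.
  boolean discards(String label) {
    if (exact.contains(label)) {
      return true; // assumption: case-sensitive exact comparison
    }
    for (Pattern p : patterns) {
      if (p.matcher(label).matches()) {
        return true; // assumption: the pattern must match the entire label
      }
    }
    return false;
  }

  // Returns the labels that survive filtering, in their original order.
  List<String> keep(List<String> labels) {
    List<String> kept = new ArrayList<>();
    for (String label : labels) {
      if (!discards(label)) {
        kept.add(label);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    LabelFilterSketch filter =
        new LabelFilterSketch(Set.of("Foo Bar"), List.of(".+[0-9]{2}.+"));
    System.out.println(filter.keep(List.of("Foo Bar", "Release 42 notes", "Data Mining")));
  }
}
```

Under whole-label matching, a pattern such as `.+[0-9]{2}.+` only fires when the two digits have at least one character on each side, so a label like "Top 10" would be kept.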

wordFilters

Type
org.carrot2.language.StopwordFilterDictionary[]
Default
[]
Path
dictionaries.wordFilters
Java snippet
algorithmInstance.dictionaries.wordFilters

Additional stop word filtering dictionaries (supplying filters that identify words to be discarded from the input).

One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl supports exact string matching and regular expression patterns.

REST-style example using the default implementation:

"wordFilters": [{
   "exact": ["word1", "word2"],
   "regexp": [
     "(?).+pattern1.+",
     "(?).+[0-9]{2}.+"
   ]
 }]

documentCountBoost

Type
Double
Default
1
Constraints
value >= 0.0
Path
documentCountBoost
Java snippet
algorithmInstance.documentCountBoost

Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.

ignoreWordIfInHigherDocsPercent

Type
Double
Default
0.9
Constraints
value >= 0.0 and value <= 1.0
Path
ignoreWordIfInHigherDocsPercent
Java snippet
algorithmInstance.ignoreWordIfInHigherDocsPercent

Ignore words appearing in more than the provided fraction of documents. The value is a number between 0 and 1; a word that occurs in a larger fraction of snippets than this value is ignored.
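The fraction test can be sketched in plain Java as follows. This is a simplified illustration, not the Carrot2 code: the class `FrequentWordSketch`, the whitespace tokenization, and the lower-casing are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not part of the Carrot2 API.
class FrequentWordSketch {
  // Returns the words that occur in more than maxFraction of the documents.
  static Set<String> tooFrequent(List<String> docs, double maxFraction) {
    Map<String, Integer> documentFrequency = new HashMap<>();
    for (String doc : docs) {
      // Count each word at most once per document.
      Set<String> unique = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
      for (String word : unique) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
      if (e.getValue() > maxFraction * docs.size()) {
        ignored.add(e.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<String> docs = List.of("java clustering", "java search", "java index", "java api");
    System.out.println(tooFrequent(docs, 0.9));
  }
}
```

With the default of 0.9 and four documents, a word has to appear in all four (4 > 3.6) to be ignored; a value of 1.0 disables the filter in this sketch because no word can appear in more than all documents.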

maxBaseClusters

Type
Integer
Default
300
Constraints
value >= 2
Path
maxBaseClusters
Java snippet
algorithmInstance.maxBaseClusters

Maximum number of base clusters. Trims the base cluster array after the N-th position for the merging phase.

maxClusters

Type
Integer
Default
15
Constraints
value >= 1
Path
maxClusters
Java snippet
algorithmInstance.maxClusters

Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded.

maxPhraseOverlap

Type
Double
Default
0.6
Constraints
value >= 0.0 and value <= 1.0
Path
maxPhraseOverlap
Java snippet
algorithmInstance.maxPhraseOverlap

Maximum cluster phrase overlap.

maxPhrasesPerLabel

Type
Integer
Default
3
Constraints
value >= 1
Path
maxPhrasesPerLabel
Java snippet
algorithmInstance.maxPhrasesPerLabel

Maximum number of phrases from base clusters to promote to the cluster's label.

maxWordsPerLabel

Type
Integer
Default
4
Constraints
value >= 1
Path
maxWordsPerLabel
Java snippet
algorithmInstance.maxWordsPerLabel

Maximum allowed number of words per label. Base clusters formed by phrases with more words than this limit will be trimmed.

mergeStemEquivalentBaseClusters

Type
Boolean
Default
true
Path
mergeStemEquivalentBaseClusters
Java snippet
algorithmInstance.mergeStemEquivalentBaseClusters

Merge all stem-equivalent base clusters before running the merge phase.

mergeThreshold

Type
Double
Default
0.6
Constraints
value >= 0.0 and value <= 1.0
Path
mergeThreshold
Java snippet
algorithmInstance.mergeThreshold

Base cluster merge threshold.
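For intuition about what a merge threshold controls, the sketch below follows the classic Suffix Tree Clustering formulation, in which two base clusters are linked when their shared documents make up more than a given fraction of each cluster. Whether Carrot2 applies exactly this condition is an assumption; the class `MergeSketch` and the representation of a base cluster as a set of document ids are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not part of the Carrot2 API.
class MergeSketch {
  // Two base clusters (sets of document ids) are merge candidates when the
  // shared documents exceed mergeThreshold as a fraction of BOTH clusters.
  static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
    Set<Integer> common = new HashSet<>(a);
    common.retainAll(b);
    double overlapA = (double) common.size() / a.size();
    double overlapB = (double) common.size() / b.size();
    return overlapA > mergeThreshold && overlapB > mergeThreshold;
  }

  public static void main(String[] args) {
    Set<Integer> a = Set.of(1, 2, 3, 4);
    Set<Integer> b = Set.of(2, 3, 4, 5);
    System.out.println(shouldMerge(a, b, 0.6));
  }
}
```

Under this reading, raising the threshold toward 1.0 makes merging rarer and tends to yield more, smaller clusters, while lowering it merges base clusters more aggressively.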

minBaseClusterScore

Type
Double
Default
2
Constraints
value >= 0.0 and value <= 10.0
Path
minBaseClusterScore
Java snippet
algorithmInstance.minBaseClusterScore

Minimum base cluster score, before coverage merging.

minBaseClusterSize

Type
Integer
Default
2
Constraints
value >= 2 and value <= 20
Path
minBaseClusterSize
Java snippet
algorithmInstance.minBaseClusterSize

Minimum required number of documents in a base cluster.

mostGeneralPhraseCoverage

Type
Double
Default
0.5
Constraints
value >= 0.0 and value <= 1.0
Path
mostGeneralPhraseCoverage
Java snippet
algorithmInstance.mostGeneralPhraseCoverage

Minimum coverage required for a phrase to appear in the cluster description.

optimalPhraseLength

Type
Integer
Default
3
Constraints
value >= 1
Path
optimalPhraseLength
Java snippet
algorithmInstance.optimalPhraseLength

Optimal label length. A factor in calculation of the base cluster score.

optimalPhraseLengthDev

Type
Double
Default
2
Constraints
value >= 0.5
Path
optimalPhraseLengthDev
Java snippet
algorithmInstance.optimalPhraseLengthDev

Tolerance of the optimal cluster label length. A factor in calculation of the base cluster score.

preprocessing

Type
org.carrot2.text.preprocessing.BasicPreprocessingPipeline
Default
BasicPreprocessingPipeline
Path
preprocessing
Java snippet
algorithmInstance.preprocessing

Configuration of the text preprocessing stage.

wordDfThreshold

Type
Integer
Default
1
Constraints
value >= 1 and value <= 100
Path
preprocessing.wordDfThreshold
Java snippet
algorithmInstance.preprocessing.wordDfThreshold

Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.

queryHint

Type
String
Default
null
Path
queryHint
Java snippet
algorithmInstance.queryHint

Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.

scoreWeight

Type
Double
Default
0
Constraints
value >= 0.0 and value <= 1.0
Path
scoreWeight
Java snippet
algorithmInstance.scoreWeight

Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score.
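One way to obtain this behavior is to sort by a blended key that reduces to the size at 0.0 and to the score at 1.0. The sketch below uses a geometric blend, `size^(1 - scoreWeight) * score^scoreWeight`; this formula is an assumption chosen because it satisfies both endpoints, and the class `ClusterSortSketch` with its nested `C` record is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not part of the Carrot2 API.
class ClusterSortSketch {
  // Minimal stand-in for a cluster: a label, a document count and a score.
  static final class C {
    final String label;
    final int size;
    final double score;

    C(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  // Assumed blend: equals size when scoreWeight is 0.0, score when it is 1.0.
  static double sortKey(C c, double scoreWeight) {
    return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
  }

  // Returns cluster labels ordered from the highest sort key to the lowest.
  static List<String> sortedLabels(List<C> clusters, double scoreWeight) {
    List<C> copy = new ArrayList<>(clusters);
    copy.sort(Comparator.comparingDouble((C c) -> sortKey(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (C c : copy) {
      labels.add(c.label);
    }
    return labels;
  }

  public static void main(String[] args) {
    List<C> clusters = List.of(new C("big", 10, 1.0), new C("strong", 3, 9.0));
    System.out.println(sortedLabels(clusters, 0.0));
    System.out.println(sortedLabels(clusters, 1.0));
  }
}
```

Intermediate values trade the two criteria off against each other, so a small but high-scoring cluster can overtake a large, low-scoring one as the weight grows.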

singleTermBoost

Type
Double
Default
0.5
Constraints
value >= 0.0
Path
singleTermBoost
Java snippet
algorithmInstance.singleTermBoost

Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
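The sketch below shows how documentCountBoost, optimalPhraseLength, optimalPhraseLengthDev and singleTermBoost might interact in a base cluster score. The reference above only states that these are factors in the score, so the specifics are assumptions: a Gaussian-shaped length penalty centered on optimalPhraseLength, a document count scaled linearly by documentCountBoost, and singleTermBoost substituted for the length factor of one-word phrases. The class `BaseClusterScoreSketch` is hypothetical.

```java
// Hypothetical illustration; not part of the Carrot2 API.
class BaseClusterScoreSketch {
  static double score(
      int phraseWords,
      int documentCount,
      double documentCountBoost,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double singleTermBoost) {
    double lengthFactor;
    if (phraseWords == 1 && singleTermBoost > 0) {
      // Assumed override: single-term phrases bypass the penalty function.
      lengthFactor = singleTermBoost;
    } else {
      // Assumed penalty: Gaussian bell around the optimal phrase length.
      double d = phraseWords - optimalPhraseLength;
      lengthFactor = Math.exp(-(d * d) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    // Assumed combination: length factor times boosted document count.
    return lengthFactor * documentCount * documentCountBoost;
  }

  public static void main(String[] args) {
    // Defaults from this page: boost 1, optimal length 3, deviation 2, single-term boost 0.5.
    System.out.println(score(3, 5, 1.0, 3, 2.0, 0.5));
    System.out.println(score(1, 5, 1.0, 3, 2.0, 0.5));
  }
}
```

In this model a phrase of exactly optimalPhraseLength words receives no penalty, and a larger optimalPhraseLengthDev flattens the bell so that shorter or longer phrases lose less of their score.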