STC attributes

You can tune various aspects of STC clustering by changing some of the attributes of the algorithm.

Below is the list of algorithm attributes along with their default values.

documentCountBoost

Type
Double
Default
1
Constraints
value >= 0.0
Path
documentCountBoost
Java snippet
algorithmInstance.documentCountBoost

Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.

ignoreWordIfInHigherDocsPercent

Type
Double
Default
0.9
Constraints
value >= 0.0 and value <= 1.0
Path
ignoreWordIfInHigherDocsPercent
Java snippet
algorithmInstance.ignoreWordIfInHigherDocsPercent

Maximum word-document ratio. A number between 0 and 1. A word that occurs in a larger fraction of snippets than this ratio is ignored.
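The ratio test can be illustrated with a minimal sketch. The class `WordRatioFilter` and its whitespace tokenization are assumptions made for this example and are not part of the algorithm's API; the actual preprocessing may tokenize and count words differently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of the clustering library. */
final class WordRatioFilter {
  /** Returns the words that occur in a larger fraction of snippets than maxRatio. */
  static Set<String> wordsToIgnore(List<String> snippets, double maxRatio) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (String snippet : snippets) {
      // Count each word at most once per snippet.
      Set<String> seen = new HashSet<>();
      for (String word : snippet.toLowerCase(Locale.ROOT).split("\\s+")) {
        if (!word.isEmpty() && seen.add(word)) {
          docFreq.merge(word, 1, Integer::sum);
        }
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      double ratio = (double) e.getValue() / snippets.size();
      if (ratio > maxRatio) {
        ignored.add(e.getKey());
      }
    }
    return ignored;
  }
}
```

With the default of 0.9, a word present in every snippet (typically a query term) would be dropped, while a word present in only a quarter of the snippets would be kept.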

maxBaseClusters

Type
Integer
Default
300
Constraints
value >= 2
Path
maxBaseClusters
Java snippet
algorithmInstance.maxBaseClusters

Maximum base clusters count. Trims the base cluster array after the N-th position, so at most N base clusters enter the merging phase.
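A minimal sketch of this trimming step follows. It assumes the base clusters are ordered by descending score before the cut, which the description above does not state explicitly; `BaseClusterTrimmer` and `BaseCluster` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of the clustering library. */
final class BaseClusterTrimmer {
  /** Hypothetical base cluster: a phrase and its score. */
  static final class BaseCluster {
    final String phrase;
    final double score;

    BaseCluster(String phrase, double score) {
      this.phrase = phrase;
      this.score = score;
    }
  }

  /** Keeps at most maxBaseClusters entries, highest score first (assumed ordering). */
  static List<BaseCluster> topN(List<BaseCluster> clusters, int maxBaseClusters) {
    List<BaseCluster> sorted = new ArrayList<>(clusters);
    sorted.sort(Comparator.comparingDouble((BaseCluster c) -> c.score).reversed());
    int limit = Math.min(maxBaseClusters, sorted.size());
    return new ArrayList<>(sorted.subList(0, limit));
  }
}
```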

maxClusters

Type
Integer
Default
15
Constraints
value >= 1
Path
maxClusters
Java snippet
algorithmInstance.maxClusters

Maximum number of final clusters.

maxPhraseOverlap

Type
Double
Default
0.6
Constraints
value >= 0.0 and value <= 1.0
Path
maxPhraseOverlap
Java snippet
algorithmInstance.maxPhraseOverlap

Maximum cluster phrase overlap.

maxPhrasesPerLabel

Type
Integer
Default
3
Constraints
value >= 1
Path
maxPhrasesPerLabel
Java snippet
algorithmInstance.maxPhrasesPerLabel

Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.

maxWordsPerLabel

Type
Integer
Default
4
Constraints
value >= 1
Path
maxWordsPerLabel
Java snippet
algorithmInstance.maxWordsPerLabel

Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed.

mergeStemEquivalentBaseClusters

Type
Boolean
Default
true
Path
mergeStemEquivalentBaseClusters
Java snippet
algorithmInstance.mergeStemEquivalentBaseClusters

Merge all stem-equivalent base clusters before running the merge phase.

mergeThreshold

Type
Double
Default
0.6
Constraints
value >= 0.0 and value <= 1.0
Path
mergeThreshold
Java snippet
algorithmInstance.mergeThreshold

Base cluster merge threshold.
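The description above is terse, so the following is only a sketch of how such a threshold is commonly applied. It assumes the merge criterion from the original Suffix Tree Clustering formulation, in which two base clusters are merged when the documents they share make up more than the threshold fraction of each cluster. Whether this implementation uses exactly that test is an assumption, and `BaseClusterMerge` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of the clustering library. */
final class BaseClusterMerge {
  /**
   * Returns true when the shared documents exceed mergeThreshold
   * as a fraction of both document sets (assumed criterion).
   */
  static boolean shouldMerge(Set<Integer> docsA, Set<Integer> docsB, double mergeThreshold) {
    if (docsA.isEmpty() || docsB.isEmpty()) {
      return false;
    }
    Set<Integer> common = new HashSet<>(docsA);
    common.retainAll(docsB);
    double overlapA = (double) common.size() / docsA.size();
    double overlapB = (double) common.size() / docsB.size();
    return overlapA > mergeThreshold && overlapB > mergeThreshold;
  }
}
```

Under this reading, a higher threshold would produce fewer merges and therefore more, smaller clusters.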

minBaseClusterScore

Type
Double
Default
2
Constraints
value >= 0.0 and value <= 10.0
Path
minBaseClusterScore
Java snippet
algorithmInstance.minBaseClusterScore

Minimum base cluster score (before coverage merging).

minBaseClusterSize

Type
Integer
Default
2
Constraints
value >= 2 and value <= 20
Path
minBaseClusterSize
Java snippet
algorithmInstance.minBaseClusterSize

Minimum documents per base cluster.

mostGeneralPhraseCoverage

Type
Double
Default
0.5
Constraints
value >= 0.0 and value <= 1.0
Path
mostGeneralPhraseCoverage
Java snippet
algorithmInstance.mostGeneralPhraseCoverage

Minimum general phrase coverage. The minimum coverage a phrase must reach to appear in the cluster description.

optimalPhraseLength

Type
Integer
Default
3
Constraints
value >= 1
Path
optimalPhraseLength
Java snippet
algorithmInstance.optimalPhraseLength

Optimal label length. A factor in calculation of the base cluster score.

optimalPhraseLengthDev

Type
Double
Default
2
Constraints
value >= 0.5
Path
optimalPhraseLengthDev
Java snippet
algorithmInstance.optimalPhraseLengthDev

Phrase length tolerance. A factor in calculation of the base cluster score.

wordDfThreshold

Type
Integer
Default
1
Constraints
value >= 1 and value <= 100
Path
preprocessing.wordDfThreshold
Java snippet
algorithmInstance.preprocessing.wordDfThreshold

Word document frequency threshold. Words appearing in fewer than wordDfThreshold documents are ignored.
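The effect of the threshold can be sketched on a precomputed table of document frequencies. `DfThresholdFilter` is a hypothetical helper introduced for this example; the real preprocessing pipeline computes and applies these counts internally.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of the clustering library. */
final class DfThresholdFilter {
  /** Keeps only words whose document frequency is at least wordDfThreshold. */
  static Map<String, Integer> retain(Map<String, Integer> docFreq, int wordDfThreshold) {
    Map<String, Integer> kept = new TreeMap<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      if (e.getValue() >= wordDfThreshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}
```

With the default value of 1, every word that occurs at all is kept, so the filter only takes effect once the threshold is raised to 2 or more.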

queryHint

Type
String
Default
null
Path
queryHint
Java snippet
algorithmInstance.queryHint

Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.

scoreWeight

Type
Double
Default
0
Constraints
value >= 0.0 and value <= 1.0
Path
scoreWeight
Java snippet
algorithmInstance.scoreWeight

Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only. A value of 1.0 sorts clusters by score only.
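One way to obtain the two endpoint behaviours is a weighted geometric combination of size and score, sketched below. The formula is an assumption chosen because it reduces to the size at 0.0 and to the score at 1.0; the formula actually used by the algorithm is not given here and may differ. `ClusterOrdering` and `Cluster` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of the clustering library. */
final class ClusterOrdering {
  /** Hypothetical cluster summary: label, document count, and score. */
  static final class Cluster {
    final String label;
    final int size;
    final double score;

    Cluster(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Assumed sort key: equals size at weight 0.0 and score at weight 1.0. */
  static double sortKey(Cluster c, double scoreWeight) {
    return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
  }

  /** Returns cluster labels ordered by descending sort key. */
  static List<String> sortedLabels(List<Cluster> clusters, double scoreWeight) {
    List<Cluster> sorted = new ArrayList<>(clusters);
    sorted.sort(Comparator.comparingDouble((Cluster c) -> sortKey(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (Cluster c : sorted) {
      labels.add(c.label);
    }
    return labels;
  }
}
```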

singleTermBoost

Type
Double
Default
0.5
Constraints
value >= 0.0
Path
singleTermBoost
Java snippet
algorithmInstance.singleTermBoost

Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
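Several attributes on this page (documentCountBoost, optimalPhraseLength, optimalPhraseLengthDev, and singleTermBoost) are described as factors in the base cluster score. The sketch below shows one plausible way they could combine. The Gaussian-shaped penalty and the final product are assumptions made for illustration only; the actual scoring function is not given on this page. `BaseClusterScore` is a hypothetical name.

```java
/** Hypothetical helper, not part of the clustering library. */
final class BaseClusterScore {
  /**
   * Phrase length factor. Single-term phrases get singleTermBoost when it is
   * greater than zero; otherwise an assumed Gaussian penalty centred on
   * optimalPhraseLength with width optimalPhraseLengthDev is applied.
   */
  static double phraseLengthFactor(
      int phraseLength,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double singleTermBoost) {
    if (phraseLength == 1 && singleTermBoost > 0) {
      return singleTermBoost;
    }
    double d = phraseLength - optimalPhraseLength;
    return Math.exp(-(d * d) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
  }

  /** Assumed combination: length factor times boosted document count. */
  static double score(
      int documentCount,
      int phraseLength,
      double documentCountBoost,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double singleTermBoost) {
    double factor =
        phraseLengthFactor(
            phraseLength, optimalPhraseLength, optimalPhraseLengthDev, singleTermBoost);
    return factor * documentCount * documentCountBoost;
  }
}
```

In this sketch a phrase of exactly the optimal length receives a factor of 1.0, and the factor falls off as the phrase length moves away from the optimum, more slowly for larger values of optimalPhraseLengthDev.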