STC attributes
You can tune various aspects of STC clustering by changing some of the attributes of the algorithm.
Below is the list of algorithm attributes along with their default values, followed by the documentation for each attribute.
{"singleTermBoost": 0.5}
documentCountBoost
- Type
- Double
- Default
- 1.0
- Constraints
- value >= 0.0
- Path
- documentCountBoost
- Java snippet
- algorithmInstance.documentCountBoost
A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
ignoreWordIfInHigherDocsPercent
- Type
- Double
- Default
- 0.9
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- ignoreWordIfInHigherDocsPercent
- Java snippet
- algorithmInstance.ignoreWordIfInHigherDocsPercent
A ratio between 0 and 1. A word that occurs in more than this fraction of the snippets is ignored.
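The ratio test described above can be sketched in a few lines. This is an illustration only, not the library's code: the class and method names are hypothetical, and whitespace tokenization with lowercasing is an assumption standing in for whatever preprocessing the algorithm actually applies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the clustering library.
final class FrequentWordFilter {
  /** Returns the words that occur in more than maxRatio of the snippets. */
  static Set<String> ignoredWords(List<String> snippets, double maxRatio) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (String snippet : snippets) {
      // Count each word once per snippet (document frequency, not term frequency).
      Set<String> unique = new HashSet<>(Arrays.asList(snippet.toLowerCase().split("\\s+")));
      for (String word : unique) {
        docFreq.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      // Strictly greater than the ratio, matching "in more snippets than this ratio".
      if ((double) e.getValue() / snippets.size() > maxRatio) {
        ignored.add(e.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<String> snippets = Arrays.asList(
        "data mining tools", "data clustering", "web data search", "text clustering");
    // "data" occurs in 3 of 4 snippets (0.75), so it is ignored at 0.7 but kept at 0.9.
    System.out.println(ignoredWords(snippets, 0.7));
    System.out.println(ignoredWords(snippets, 0.9));
  }
}
```

With the default of 0.9, only words present in nearly every snippet are dropped; lowering the value removes more of the common vocabulary.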
maxBaseClusters
- Type
- Integer
- Default
- 300
- Constraints
- value >= 2
- Path
- maxBaseClusters
- Java snippet
- algorithmInstance.maxBaseClusters
Trims the base cluster array after the N-th position, where N is the value of this attribute, so that at most N base clusters enter the merging phase.
maxClusters
- Type
- Integer
- Default
- 15
- Constraints
- value >= 1
- Path
- maxClusters
- Java snippet
- algorithmInstance.maxClusters
Maximum number of final clusters.
maxPhraseOverlap
- Type
- Double
- Default
- 0.6
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- maxPhraseOverlap
- Java snippet
- algorithmInstance.maxPhraseOverlap
Maximum cluster phrase overlap.
maxPhrasesPerLabel
- Type
- Integer
- Default
- 3
- Constraints
- value >= 1
- Path
- maxPhrasesPerLabel
- Java snippet
- algorithmInstance.maxPhrasesPerLabel
Maximum number of phrases from base clusters promoted to the cluster's label.
maxWordsPerLabel
- Type
- Integer
- Default
- 4
- Constraints
- value >= 1
- Path
- maxWordsPerLabel
- Java snippet
- algorithmInstance.maxWordsPerLabel
Base clusters formed by phrases with more words than this limit are trimmed.
mergeStemEquivalentBaseClusters
- Type
- Boolean
- Default
- true
- Path
- mergeStemEquivalentBaseClusters
- Java snippet
- algorithmInstance.mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
mergeThreshold
- Type
- Double
- Default
- 0.6
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- mergeThreshold
- Java snippet
- algorithmInstance.mergeThreshold
Base cluster merge threshold.
minBaseClusterScore
- Type
- Double
- Default
- 2.0
- Constraints
- value >= 0.0 and value <= 10.0
- Path
- minBaseClusterScore
- Java snippet
- algorithmInstance.minBaseClusterScore
Minimum base cluster score (before coverage merging).
minBaseClusterSize
- Type
- Integer
- Default
- 2
- Constraints
- value >= 2 and value <= 20
- Path
- minBaseClusterSize
- Java snippet
- algorithmInstance.minBaseClusterSize
Minimum documents per base cluster.
mostGeneralPhraseCoverage
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- mostGeneralPhraseCoverage
- Java snippet
- algorithmInstance.mostGeneralPhraseCoverage
Minimum phrase coverage to appear in cluster description.
optimalPhraseLength
- Type
- Integer
- Default
- 3
- Constraints
- value >= 1
- Path
- optimalPhraseLength
- Java snippet
- algorithmInstance.optimalPhraseLength
A factor in calculation of the base cluster score.
optimalPhraseLengthDev
- Type
- Double
- Default
- 2.0
- Constraints
- value >= 0.5
- Path
- optimalPhraseLengthDev
- Java snippet
- algorithmInstance.optimalPhraseLengthDev
A factor in calculation of the base cluster score.
wordDfThreshold
- Type
- Integer
- Default
- 1
- Constraints
- value >= 1 and value <= 100
- Path
- preprocessing.wordDfThreshold
- Java snippet
- algorithmInstance.preprocessing.wordDfThreshold
Words appearing in fewer than this number of documents will be ignored.
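Unlike the other attributes on this page, this one has a dotted path (`preprocessing.wordDfThreshold`). The sketch below shows one way to expand dotted paths into a nested structure. It assumes that a dot separates a parent object from its child, as it does in the Java snippet; the page does not state how paths map onto a configuration document, and the `AttributePaths` helper is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the clustering library.
final class AttributePaths {
  /** Expands keys such as "a.b" into nested maps: {"a.b": v} becomes {"a": {"b": v}}. */
  @SuppressWarnings("unchecked")
  static Map<String, Object> nest(Map<String, Object> flat) {
    Map<String, Object> root = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : flat.entrySet()) {
      String[] parts = e.getKey().split("\\.");
      Map<String, Object> node = root;
      // Walk or create the intermediate objects. A sketch only: it does not
      // handle a path that collides with an existing scalar value.
      for (int i = 0; i < parts.length - 1; i++) {
        node = (Map<String, Object>) node.computeIfAbsent(
            parts[i], k -> new LinkedHashMap<String, Object>());
      }
      node.put(parts[parts.length - 1], e.getValue());
    }
    return root;
  }

  public static void main(String[] args) {
    Map<String, Object> flat = new LinkedHashMap<>();
    flat.put("maxClusters", 10);
    flat.put("preprocessing.wordDfThreshold", 2);
    // prints {maxClusters=10, preprocessing={wordDfThreshold=2}}
    System.out.println(nest(flat));
  }
}
```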
queryHint
- Type
- String
- Default
- undefined
- Path
- queryHint
- Java snippet
- algorithmInstance.queryHint
The query is used as a hint to avoid trivial clusters.
scoreWeight
- Type
- Double
- Default
- 0.0
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoreWeight
- Java snippet
- algorithmInstance.scoreWeight
A value of 0.0 sorts clusters by cluster size only. A value of 1.0 sorts clusters by cluster score only.
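The description fixes the behavior at the two endpoints but does not say how intermediate values combine size and score. The sketch below assumes a geometric blend, `size^(1 - w) * score^w`, which reduces to size at 0.0 and to score at 1.0. The blend and all names in it are assumptions made for illustration, not the library's documented formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical types, not part of the clustering library.
final class ClusterOrdering {
  static final class Candidate {
    final String label;
    final int size;
    final double score;

    Candidate(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Assumed blend: equals size when scoreWeight is 0.0 and score when it is 1.0. */
  static double rank(Candidate c, double scoreWeight) {
    return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
  }

  /** Returns labels ordered from the highest rank to the lowest. */
  static List<String> order(List<Candidate> candidates, double scoreWeight) {
    List<Candidate> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparingDouble((Candidate c) -> rank(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (Candidate c : sorted) {
      labels.add(c.label);
    }
    return labels;
  }

  public static void main(String[] args) {
    List<Candidate> candidates = Arrays.asList(
        new Candidate("large", 10, 2.0),
        new Candidate("strong", 4, 9.0));
    System.out.println(order(candidates, 0.0)); // size only: [large, strong]
    System.out.println(order(candidates, 1.0)); // score only: [strong, large]
  }
}
```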
singleTermBoost
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0
- Path
- singleTermBoost
- Java snippet
- algorithmInstance.singleTermBoost
A factor in the calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
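Several attributes on this page (`documentCountBoost`, `optimalPhraseLength`, `optimalPhraseLengthDev`, `singleTermBoost`) are described only as factors in the base cluster score. The sketch below shows one way they could fit together. The page does not give the formula, so the Gaussian-shaped penalty around the optimal phrase length and the multiplication by document count are assumptions, and the class is hypothetical. Only the single-term override follows directly from the description above.

```java
// Hypothetical sketch, not the library's implementation.
final class BaseClusterScoreSketch {
  // Defaults copied from the attribute list on this page.
  double documentCountBoost = 1.0;
  int optimalPhraseLength = 3;
  double optimalPhraseLengthDev = 2.0;
  double singleTermBoost = 0.5;

  double score(int phraseLengthInWords, int documentCount) {
    double boost;
    if (phraseLengthInWords == 1 && singleTermBoost > 0) {
      // Single-term clusters bypass the penalty function entirely.
      boost = singleTermBoost;
    } else {
      // Assumed penalty: 1.0 at the optimal length, decaying with distance from it.
      double d = phraseLengthInWords - optimalPhraseLength;
      boost = Math.exp(-(d * d) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    // Assumed: the score grows linearly with the number of documents.
    return boost * documentCount * documentCountBoost;
  }

  public static void main(String[] args) {
    BaseClusterScoreSketch sketch = new BaseClusterScoreSketch();
    System.out.println(sketch.score(1, 4)); // single term: 0.5 * 4 * 1.0 = 2.0
    System.out.println(sketch.score(3, 4)); // optimal length: 1.0 * 4 * 1.0 = 4.0
  }
}
```

Under these assumptions, setting `singleTermBoost` to 0.0 sends single-term clusters through the same penalty as every other phrase length.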