STC attributes
You can tune various aspects of STC clustering by changing some of the attributes of the algorithm.
Below is the list of algorithm attributes along with their default values.
{"singleTermBoost": 0.5}
documentCountBoost
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0
- Path
- documentCountBoost
- Java snippet
- algorithmInstance.documentCountBoost
Document count boost. A factor in the calculation of the base cluster score that boosts the score according to the number of documents in the base cluster.
ignoreWordIfInHigherDocsPercent
- Type
- Double
- Default
- 0.9
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- ignoreWordIfInHigherDocsPercent
- Java snippet
- algorithmInstance.ignoreWordIfInHigherDocsPercent
Maximum word-document ratio. A number between 0 and 1. A word that occurs in a larger fraction of snippets than this ratio is ignored.
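The filter described above can be sketched in plain Java. This is an illustration only, not the STC implementation: the class `WordRatioFilterSketch` and the method `retainedWords` are hypothetical names, and documents are assumed to be already tokenized into word lists.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the STC implementation.
final class WordRatioFilterSketch {
  /** Returns the words whose document ratio does not exceed maxRatio. */
  static Set<String> retainedWords(List<List<String>> documents, double maxRatio) {
    // Document frequency: the number of documents each word occurs in.
    Map<String, Integer> docFreq = new HashMap<>();
    for (List<String> doc : documents) {
      for (String word : new HashSet<>(doc)) { // count each word once per document
        docFreq.merge(word, 1, Integer::sum);
      }
    }
    Set<String> retained = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      double ratio = (double) e.getValue() / documents.size();
      if (ratio <= maxRatio) { // words above the ratio are ignored
        retained.add(e.getKey());
      }
    }
    return retained;
  }
}
```

With four documents and a ratio of 0.7, a word present in three of them (ratio 0.75) would be dropped, while the default of 0.9 would keep it.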
maxBaseClusters
- Type
- Integer
- Default
- 300
- Constraints
- value >= 2
- Path
- maxBaseClusters
- Java snippet
- algorithmInstance.maxBaseClusters
Maximum base clusters count. Trims the base cluster array after the N-th position for the merging phase, where N is the value of this attribute.
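A minimal sketch of the trimming step, assuming the base cluster array is ordered by descending score before it is cut (the ordering is an assumption, as are the names `BaseClusterTrimSketch` and `topScores`). Base clusters are represented by their scores alone to keep the example short.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the STC implementation.
final class BaseClusterTrimSketch {
  /** Keeps at most maxBaseClusters of the highest scores. */
  static List<Double> topScores(List<Double> scores, int maxBaseClusters) {
    List<Double> sorted = new ArrayList<>(scores);
    sorted.sort(Comparator.reverseOrder()); // assumed ordering: best score first
    int limit = Math.min(maxBaseClusters, sorted.size());
    return new ArrayList<>(sorted.subList(0, limit)); // drop everything after the N-th position
  }
}
```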
maxClusters
- Type
- Integer
- Default
- 15
- Constraints
- value >= 1
- Path
- maxClusters
- Java snippet
- algorithmInstance.maxClusters
Maximum number of final clusters.
maxPhraseOverlap
- Type
- Double
- Default
- 0.6
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- maxPhraseOverlap
- Java snippet
- algorithmInstance.maxPhraseOverlap
Maximum cluster phrase overlap.
maxPhrasesPerLabel
- Type
- Integer
- Default
- 3
- Constraints
- value >= 1
- Path
- maxPhrasesPerLabel
- Java snippet
- algorithmInstance.maxPhrasesPerLabel
Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.
maxWordsPerLabel
- Type
- Integer
- Default
- 4
- Constraints
- value >= 1
- Path
- maxWordsPerLabel
- Java snippet
- algorithmInstance.maxWordsPerLabel
Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed.
mergeStemEquivalentBaseClusters
- Type
- Boolean
- Default
- true
- Path
- mergeStemEquivalentBaseClusters
- Java snippet
- algorithmInstance.mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
mergeThreshold
- Type
- Double
- Default
- 0.6
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- mergeThreshold
- Java snippet
- algorithmInstance.mergeThreshold
Base cluster merge threshold.
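In the classic STC formulation, two base clusters are merged when their shared documents make up more than a fixed fraction of each cluster. The sketch below assumes `mergeThreshold` plays the role of that fraction. The class `MergeThresholdSketch` and the method `shouldMerge` are hypothetical, and the actual merge test used by the algorithm may differ in detail.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the STC implementation.
final class MergeThresholdSketch {
  /** Base clusters are given as sets of document identifiers. */
  static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
    if (a.isEmpty() || b.isEmpty()) {
      return false;
    }
    Set<Integer> common = new HashSet<>(a);
    common.retainAll(b); // documents present in both base clusters
    double overlapA = (double) common.size() / a.size();
    double overlapB = (double) common.size() / b.size();
    // Assumption: the overlap must exceed the threshold on both sides.
    return overlapA > mergeThreshold && overlapB > mergeThreshold;
  }
}
```

Under this assumption, raising the threshold towards 1.0 would merge only nearly identical base clusters, and lowering it would merge more aggressively.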
minBaseClusterScore
- Type
- Double
- Default
- 2
- Constraints
- value >= 0.0 and value <= 10.0
- Path
- minBaseClusterScore
- Java snippet
- algorithmInstance.minBaseClusterScore
Minimum base cluster score (before coverage merging).
minBaseClusterSize
- Type
- Integer
- Default
- 2
- Constraints
- value >= 2 and value <= 20
- Path
- minBaseClusterSize
- Java snippet
- algorithmInstance.minBaseClusterSize
Minimum documents per base cluster.
mostGeneralPhraseCoverage
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- mostGeneralPhraseCoverage
- Java snippet
- algorithmInstance.mostGeneralPhraseCoverage
Minimum general phrase coverage. The minimum coverage a phrase must have to appear in the cluster description.
optimalPhraseLength
- Type
- Integer
- Default
- 3
- Constraints
- value >= 1
- Path
- optimalPhraseLength
- Java snippet
- algorithmInstance.optimalPhraseLength
Optimal label length. A factor in the calculation of the base cluster score.
optimalPhraseLengthDev
- Type
- Double
- Default
- 2
- Constraints
- value >= 0.5
- Path
- optimalPhraseLengthDev
- Java snippet
- algorithmInstance.optimalPhraseLengthDev
Phrase length tolerance. A factor in the calculation of the base cluster score.
wordDfThreshold
- Type
- Integer
- Default
- 1
- Constraints
- value >= 1 and value <= 100
- Path
- preprocessing.wordDfThreshold
- Java snippet
- algorithmInstance.preprocessing.wordDfThreshold
Word document frequency threshold. Words appearing in fewer documents than this threshold are ignored.
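The threshold can be illustrated with a small document-frequency filter. The names `WordDfThresholdSketch` and `retainedWords` are hypothetical, and documents are assumed to be already tokenized. With the default value of 1, every word that occurs at all is kept.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the STC implementation.
final class WordDfThresholdSketch {
  /** Returns the words that occur in at least wordDfThreshold documents. */
  static Set<String> retainedWords(List<List<String>> documents, int wordDfThreshold) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (List<String> doc : documents) {
      for (String word : new HashSet<>(doc)) { // count each word once per document
        docFreq.merge(word, 1, Integer::sum);
      }
    }
    Set<String> retained = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      if (e.getValue() >= wordDfThreshold) { // fewer occurrences than the threshold: ignored
        retained.add(e.getKey());
      }
    }
    return retained;
  }
}
```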
queryHint
- Type
- String
- Default
- null
- Path
- queryHint
- Java snippet
- algorithmInstance.queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
scoreWeight
- Type
- Double
- Default
- 0
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoreWeight
- Java snippet
- algorithmInstance.scoreWeight
Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only. A value of 1.0 sorts clusters by score only.
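One blend that is consistent with the two endpoints described above is a weighted geometric mean of size and score. The sketch below uses that form as an assumption; the formula actually used during sorting may differ. The names `ScoreWeightSketch`, `ClusterInfo`, `sortKey` and `sortedLabels` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the STC implementation.
final class ScoreWeightSketch {
  static final class ClusterInfo {
    final String label;
    final int size;
    final double score;

    ClusterInfo(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Assumed blend: size^(1 - w) * score^w, so w = 0 gives size and w = 1 gives score. */
  static double sortKey(ClusterInfo c, double scoreWeight) {
    return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
  }

  /** Returns cluster labels ordered by descending sort key. */
  static List<String> sortedLabels(List<ClusterInfo> clusters, double scoreWeight) {
    List<ClusterInfo> sorted = new ArrayList<>(clusters);
    sorted.sort(Comparator.comparingDouble((ClusterInfo c) -> sortKey(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (ClusterInfo c : sorted) {
      labels.add(c.label);
    }
    return labels;
  }
}
```

For a large, low-scoring cluster and a small, high-scoring one, a weight of 0.0 would put the large cluster first and a weight of 1.0 would put the high-scoring cluster first.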
singleTermBoost
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0
- Path
- singleTermBoost
- Java snippet
- algorithmInstance.singleTermBoost
Single term boost. A factor in the calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
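The interplay of `singleTermBoost`, `optimalPhraseLength`, `optimalPhraseLengthDev` and `documentCountBoost` can be sketched as follows. The Gaussian shape of the penalty function and the multiplication by the document count are assumptions made for illustration; this page does not state the exact formula. The class `BaseClusterScoreSketch` and its method are hypothetical.

```java
// Hypothetical helper for illustration; not part of the STC implementation.
final class BaseClusterScoreSketch {
  static double score(
      int phraseLength,
      int documentCount,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double singleTermBoost,
      double documentCountBoost) {
    double lengthFactor;
    if (phraseLength == 1 && singleTermBoost > 0) {
      // Single-term phrases bypass the penalty function and get the boost directly.
      lengthFactor = singleTermBoost;
    } else {
      // Assumed penalty: a Gaussian centred on the optimal phrase length,
      // with the tolerance acting as the standard deviation.
      double d = phraseLength - optimalPhraseLength;
      lengthFactor = Math.exp(-(d * d) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    // Assumed document count contribution: linear, scaled by documentCountBoost.
    return lengthFactor * documentCount * documentCountBoost;
  }
}
```

Under these assumptions and the default values, a three-word phrase found in four documents would score 4.0, and a single-term phrase found in the same four documents would score 2.0.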