STC attributes

You can tune various aspects of STC clustering by changing some of the attributes of the algorithm.

The algorithm's attributes are listed below along with their default values. Each attribute is documented in its own section further down.

{
  "documentCountBoost": 1,
  "ignoreWordIfInHigherDocsPercent": 0.9,
  "maxBaseClusters": 300,
  "maxClusters": 15,
  "maxPhraseOverlap": 0.6,
  "maxPhrasesPerLabel": 3,
  "maxWordsPerLabel": 4,
  "mergeStemEquivalentBaseClusters": true,
  "mergeThreshold": 0.6,
  "minBaseClusterScore": 2,
  "minBaseClusterSize": 2,
  "mostGeneralPhraseCoverage": 0.5,
  "optimalPhraseLength": 3,
  "optimalPhraseLengthDev": 2,
  "preprocessing": {
    "wordDfThreshold": 1
  },
  "queryHint": null,
  "scoreWeight": 0,
  "singleTermBoost": 0.5
}
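
Each attribute below lists a Path, which addresses the attribute inside this JSON structure; nested attributes use a dot, as in preprocessing.wordDfThreshold. The following is a minimal sketch of how such a dotted path maps onto a nested structure. The AttributePaths class and its methods are hypothetical helpers written for this illustration and are not part of the clustering API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (an assumption for illustration only): shows how a
// dotted attribute path addresses a value in a nested, JSON-like map.
final class AttributePaths {
    // Sets the value at a dotted path, creating intermediate maps as needed.
    @SuppressWarnings("unchecked")
    static void set(Map<String, Object> root, String path, Object value) {
        String[] parts = path.split("\\.");
        Map<String, Object> node = root;
        for (int i = 0; i < parts.length - 1; i++) {
            node = (Map<String, Object>) node.computeIfAbsent(
                parts[i], k -> new LinkedHashMap<String, Object>());
        }
        node.put(parts[parts.length - 1], value);
    }

    // Returns the value at a dotted path, or null when a segment is missing.
    static Object get(Map<String, Object> root, String path) {
        Object node = root;
        for (String part : path.split("\\.")) {
            if (!(node instanceof Map)) {
                return null;
            }
            node = ((Map<?, ?>) node).get(part);
        }
        return node;
    }
}
```

With this sketch, setting "preprocessing.wordDfThreshold" creates the nested "preprocessing" object automatically, mirroring the layout of the defaults shown above.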

documentCountBoost

Type: Double
Default: 1
Constraints: value >= 0.0
Path: documentCountBoost
Java snippet: algorithmInstance.documentCountBoost

Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.

ignoreWordIfInHigherDocsPercent

Type: Double
Default: 0.9
Constraints: value >= 0.0 and value <= 1.0
Path: ignoreWordIfInHigherDocsPercent
Java snippet: algorithmInstance.ignoreWordIfInHigherDocsPercent

Maximum word-document ratio. A number between 0 and 1. A word that occurs in a larger fraction of snippets than this ratio is ignored.
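
The effect of this ratio can be sketched as a simple filter. The code below is an illustration under stated assumptions, not the algorithm's actual implementation: the WordRatioFilter class is hypothetical, and each document is assumed to be already reduced to its set of distinct words.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a maximum word-document ratio filter.
final class WordRatioFilter {
    // Returns the words that occur in a larger fraction of documents
    // than maxRatio. Each document is a set of its distinct words.
    static Set<String> ignoredWords(List<Set<String>> docs, double maxRatio) {
        Map<String, Integer> docCount = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String word : doc) {
                docCount.merge(word, 1, Integer::sum);
            }
        }
        Set<String> ignored = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docCount.entrySet()) {
            double ratio = (double) e.getValue() / docs.size();
            if (ratio > maxRatio) {
                ignored.add(e.getKey());
            }
        }
        return ignored;
    }
}
```

For example, a word found in 3 of 4 snippets has a ratio of 0.75, so it would be ignored at a threshold of 0.7 but kept at the default of 0.9.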

maxBaseClusters

Type: Integer
Default: 300
Constraints: value >= 2
Path: maxBaseClusters
Java snippet: algorithmInstance.maxBaseClusters

Maximum base clusters count. Base clusters beyond the N-th position are discarded before the merging phase.

maxClusters

Type: Integer
Default: 15
Constraints: value >= 1
Path: maxClusters
Java snippet: algorithmInstance.maxClusters

Maximum number of final clusters.

maxPhraseOverlap

Type: Double
Default: 0.6
Constraints: value >= 0.0 and value <= 1.0
Path: maxPhraseOverlap
Java snippet: algorithmInstance.maxPhraseOverlap

Maximum cluster phrase overlap.

maxPhrasesPerLabel

Type: Integer
Default: 3
Constraints: value >= 1
Path: maxPhrasesPerLabel
Java snippet: algorithmInstance.maxPhrasesPerLabel

Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.

maxWordsPerLabel

Type: Integer
Default: 4
Constraints: value >= 1
Path: maxWordsPerLabel
Java snippet: algorithmInstance.maxWordsPerLabel

Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed.

mergeStemEquivalentBaseClusters

Type: Boolean
Default: true
Path: mergeStemEquivalentBaseClusters
Java snippet: algorithmInstance.mergeStemEquivalentBaseClusters

Merge all stem-equivalent base clusters before running the merge phase.

mergeThreshold

Type: Double
Default: 0.6
Constraints: value >= 0.0 and value <= 1.0
Path: mergeThreshold
Java snippet: algorithmInstance.mergeThreshold

Base cluster merge threshold.
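
The description above does not say how the threshold is applied. In the original STC formulation by Zamir and Etzioni, two base clusters are linked when the documents they share make up more than a threshold fraction of each cluster. The sketch below assumes mergeThreshold plays that role here; that is an assumption, and the BaseClusterMerge class is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of an overlap-based merge test, assuming the
// threshold is applied as in the classic STC formulation.
final class BaseClusterMerge {
    // Returns true when the shared documents exceed mergeThreshold as a
    // fraction of both base clusters. Documents are identified by integer ids.
    static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
        if (a.isEmpty() || b.isEmpty()) {
            return false;
        }
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        double shared = common.size();
        return shared / a.size() > mergeThreshold
            && shared / b.size() > mergeThreshold;
    }
}
```

Under this reading, a higher threshold requires base clusters to share more of their documents before they are merged, which tends to produce more, smaller clusters.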

minBaseClusterScore

Type: Double
Default: 2
Constraints: value >= 0.0 and value <= 10.0
Path: minBaseClusterScore
Java snippet: algorithmInstance.minBaseClusterScore

Minimum base cluster score (before coverage merging).

minBaseClusterSize

Type: Integer
Default: 2
Constraints: value >= 2 and value <= 20
Path: minBaseClusterSize
Java snippet: algorithmInstance.minBaseClusterSize

Minimum documents per base cluster.
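
Constraints such as the one listed for this attribute (value >= 2 and value <= 20) can be checked before a value is handed to the algorithm. The following is a minimal sketch of such a check; the AttributeConstraints class is a hypothetical helper, and how the library itself reports constraint violations is not covered here.

```java
// Hypothetical helper (an assumption for illustration only): validates an
// integer attribute value against an inclusive range before it is used.
final class AttributeConstraints {
    // Returns the value unchanged when it lies within [min, max],
    // otherwise throws IllegalArgumentException naming the attribute.
    static int checkRange(String name, int value, int min, int max) {
        if (value < min || value > max) {
            throw new IllegalArgumentException(
                name + " must be between " + min + " and " + max + ", got " + value);
        }
        return value;
    }
}
```

A call such as checkRange("minBaseClusterSize", 1, 2, 20) would be rejected, because a base cluster must contain at least two documents.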

mostGeneralPhraseCoverage

Type: Double
Default: 0.5
Constraints: value >= 0.0 and value <= 1.0
Path: mostGeneralPhraseCoverage
Java snippet: algorithmInstance.mostGeneralPhraseCoverage

Minimum general phrase coverage. Minimum phrase coverage to appear in cluster description.

optimalPhraseLength

Type: Integer
Default: 3
Constraints: value >= 1
Path: optimalPhraseLength
Java snippet: algorithmInstance.optimalPhraseLength

Optimal label length. A factor in calculation of the base cluster score.

optimalPhraseLengthDev

Type: Double
Default: 2
Constraints: value >= 0.5
Path: optimalPhraseLengthDev
Java snippet: algorithmInstance.optimalPhraseLengthDev

Phrase length tolerance. A factor in calculation of the base cluster score.

wordDfThreshold

Type: Integer
Default: 1
Constraints: value >= 1 and value <= 100
Path: preprocessing.wordDfThreshold
Java snippet: algorithmInstance.preprocessing.wordDfThreshold

Word document frequency threshold. Words appearing in fewer than wordDfThreshold documents are ignored.
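
A document frequency threshold of this kind can be sketched as follows. This is an illustration only: the DocumentFrequencyFilter class is hypothetical, documents are assumed to be pre-tokenized lists of words, and the real preprocessing pipeline may differ in details such as case folding or stemming.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a word document frequency threshold.
final class DocumentFrequencyFilter {
    // Returns the words that appear in at least wordDfThreshold documents.
    // A word repeated within one document is counted once for that document.
    static Set<String> retainedWords(List<List<String>> docs, int wordDfThreshold) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> doc : docs) {
            for (String word : new HashSet<>(doc)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
            if (e.getValue() >= wordDfThreshold) {
                retained.add(e.getKey());
            }
        }
        return retained;
    }
}
```

With the default threshold of 1 every word is retained; raising it to 2 drops words that occur in a single document only.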

queryHint

Type: String
Default: null
Path: queryHint
Java snippet: algorithmInstance.queryHint

Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.

scoreWeight

Type: Double
Default: 0
Constraints: value >= 0.0 and value <= 1.0
Path: scoreWeight
Java snippet: algorithmInstance.scoreWeight

Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score.
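
One way to realize such a balance is a geometric blend of size and score, sketched below. The formula size^(1 - scoreWeight) * score^scoreWeight is an assumption chosen because it reproduces the two endpoints described above; the text does not confirm the exact formula the algorithm uses. The ClusterOrdering and Candidate classes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of sorting clusters by a blend of size and score.
final class ClusterOrdering {
    // Minimal stand-in for a cluster: a label, a document count and a score.
    static final class Candidate {
        final String label;
        final int size;
        final double score;

        Candidate(String label, int size, double score) {
            this.label = label;
            this.size = size;
            this.score = score;
        }
    }

    // Assumed blend: equals size when scoreWeight is 0.0 and
    // equals score when scoreWeight is 1.0.
    static double weight(Candidate c, double scoreWeight) {
        return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
    }

    // Returns cluster labels ordered from the highest blended weight to the lowest.
    static List<String> sortedLabels(List<Candidate> clusters, double scoreWeight) {
        List<Candidate> copy = new ArrayList<>(clusters);
        copy.sort(Comparator.comparingDouble(
            (Candidate c) -> weight(c, scoreWeight)).reversed());
        List<String> labels = new ArrayList<>();
        for (Candidate c : copy) {
            labels.add(c.label);
        }
        return labels;
    }
}
```

Given a large, low-scoring cluster and a small, high-scoring one, a weight of 0.0 places the large cluster first and a weight of 1.0 places the high-scoring cluster first.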

singleTermBoost

Type: Double
Default: 0.5
Constraints: value >= 0.0
Path: singleTermBoost
Java snippet: algorithmInstance.singleTermBoost

Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
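
To show how documentCountBoost, optimalPhraseLength, optimalPhraseLengthDev and singleTermBoost might interact, here is a rough sketch of a base cluster score. It assumes a Gaussian penalty centered on the optimal phrase length and a score proportional to the document count. Both are assumptions made for illustration; the text above only states that these attributes are factors in the score and that a penalty function exists. The BaseClusterScore class is hypothetical.

```java
// Hypothetical illustration of a base cluster score; the Gaussian penalty
// and the multiplicative form are assumptions, not the documented formula.
final class BaseClusterScore {
    static double score(int phraseLength,
                        int documentCount,
                        double documentCountBoost,
                        int optimalPhraseLength,
                        double optimalPhraseLengthDev,
                        double singleTermBoost) {
        double lengthFactor;
        if (phraseLength == 1 && singleTermBoost > 0) {
            // Single-term phrases take the fixed boost instead of the penalty.
            lengthFactor = singleTermBoost;
        } else {
            // Assumed penalty: 1.0 at the optimal length, decaying with distance.
            double distance = phraseLength - optimalPhraseLength;
            lengthFactor = Math.exp(-(distance * distance)
                / (2.0 * optimalPhraseLengthDev * optimalPhraseLengthDev));
        }
        return lengthFactor * documentCount * documentCountBoost;
    }
}
```

Under these assumptions and the default values, a three-word phrase covering four documents scores 4.0, while a single-term phrase covering the same documents scores 2.0. A larger optimalPhraseLengthDev flattens the penalty, so phrases far from the optimal length lose less score.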
