STC parameters
You can tune various aspects of STC clustering by changing some of the parameters of the algorithm.
Below is the list of algorithm parameters along with their default values.
{"singleTermBoost": 0.5}
dictionaries
- Type
- org.carrot2.language.EphemeralDictionaries
- Default
- EphemeralDictionaries
- Path
- dictionaries
- Java snippet
- algorithmInstance.dictionaries
Per-request overrides of language components (dictionaries).
{}
labelFilters
- Type
- org.carrot2.language.LabelFilterDictionary[]
- Default
- []
- Path
- dictionaries.labelFilters
- Java snippet
- algorithmInstance.dictionaries.labelFilters
Additional label filtering dictionaries. They supply filters that match cluster labels to be discarded from the output.
One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl
supports exact string matching and regular expression patterns.
REST-style example using the default implementation:
"labelFilters": [{
"exact": ["Cluster Label 1", "Foo Bar"],
"regexp": [
"(?).+pattern1.+",
"(?).+[0-9]{2}.+"
]
}]
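As a rough illustration of how a filter with "exact" and "regexp" entries could be applied, the following self-contained Java sketch rejects a label when it equals one of the exact entries or is fully matched by one of the patterns. The class LabelFilterSketch and its rejects method are hypothetical and not part of Carrot2. Case-sensitive comparison and whole-string pattern matching are assumptions; the actual behavior of DefaultDictionaryImpl may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not a Carrot2 class.
final class LabelFilterSketch {
  private final Set<String> exact;
  private final List<Pattern> patterns = new ArrayList<>();

  LabelFilterSketch(Set<String> exact, List<String> regexps) {
    this.exact = exact;
    // Compile each pattern once so it can be reused for every label.
    for (String regexp : regexps) {
      patterns.add(Pattern.compile(regexp));
    }
  }

  /** Returns true if the label should be discarded from the output. */
  boolean rejects(String label) {
    if (exact.contains(label)) {
      return true; // assumption: exact entries are compared case-sensitively
    }
    for (Pattern p : patterns) {
      if (p.matcher(label).matches()) {
        return true; // assumption: a pattern must match the whole label
      }
    }
    return false;
  }
}
```

With the exact entry "Foo Bar" and the pattern ".+[0-9]{2}.+", this sketch rejects "Foo Bar" and any label containing two consecutive digits surrounded by other characters, and keeps everything else.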
wordFilters
- Type
- org.carrot2.language.StopwordFilterDictionary[]
- Default
- []
- Path
- dictionaries.wordFilters
- Java snippet
- algorithmInstance.dictionaries.wordFilters
Additional stop word filtering dictionaries. They supply filters that match words to be discarded from the input.
One or more dictionaries can be supplied. The default implementation in org.carrot2.language.DefaultDictionaryImpl
supports exact string matching and regular expression patterns.
REST-style example using the default implementation:
"wordFilters": [{
"exact": ["word1", "word2"],
"regexp": [
"(?).+pattern1.+",
"(?).+[0-9]{2}.+"
]
}]
documentCountBoost
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0
- Path
- documentCountBoost
- Java snippet
- algorithmInstance.documentCountBoost
Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
ignoreWordIfInHigherDocsPercent
- Type
- Double
- Default
- 0.9
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- ignoreWordIfInHigherDocsPercent
- Java snippet
- algorithmInstance.ignoreWordIfInHigherDocsPercent
Ignore words appearing in more than the provided fraction of documents. The value is a number between 0 and 1; a word that occurs in a larger fraction of snippets than this value is ignored.
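The effect of this threshold can be sketched in a few lines of self-contained Java. The class FrequentWordFilterSketch and its method are hypothetical and only illustrate the idea; they are not the Carrot2 implementation. Documents are assumed to be already tokenized into lists of words.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Carrot2 class.
final class FrequentWordFilterSketch {
  /** Returns the words that occur in more than maxFraction of the documents. */
  static Set<String> tooFrequentWords(List<List<String>> docs, double maxFraction) {
    // Document frequency: in how many documents does each word occur?
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      for (String word : new HashSet<>(doc)) { // count a word once per document
        df.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> e : df.entrySet()) {
      if (e.getValue() > maxFraction * docs.size()) {
        ignored.add(e.getKey());
      }
    }
    return ignored;
  }
}
```

For four documents and the default value of 0.9, a word has to occur in all four documents to be ignored, because 0.9 * 4 = 3.6.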
maxBaseClusters
- Type
- Integer
- Default
- 300
- Constraints
- value >= 2
- Path
- maxBaseClusters
- Java snippet
- algorithmInstance.maxBaseClusters
Maximum number of base clusters. The base cluster array is trimmed after the N-th position, N being this value, for the merging phase.
maxClusters
- Type
- Integer
- Default
- 15
- Constraints
- value >= 1
- Path
- maxClusters
- Java snippet
- algorithmInstance.maxClusters
Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded.
maxPhraseOverlap
- Type
- Double
- Default
- 0.6
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- maxPhraseOverlap
- Java snippet
- algorithmInstance.maxPhraseOverlap
Maximum cluster phrase overlap.
maxPhrasesPerLabel
- Type
- Integer
- Default
- 3
- Constraints
- value >= 1
- Path
- maxPhrasesPerLabel
- Java snippet
- algorithmInstance.maxPhrasesPerLabel
Maximum number of phrases from base clusters to promote to the cluster's label.
maxWordsPerLabel
- Type
- Integer
- Default
- 4
- Constraints
- value >= 1
- Path
- maxWordsPerLabel
- Java snippet
- algorithmInstance.maxWordsPerLabel
Maximum allowed number of words per label. Base clusters formed by phrases with more words than this value will be trimmed.
mergeStemEquivalentBaseClusters
- Type
- Boolean
- Default
- true
- Path
- mergeStemEquivalentBaseClusters
- Java snippet
- algorithmInstance.mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
mergeThreshold
- Type
- Double
- Default
- 0.6
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- mergeThreshold
- Java snippet
- algorithmInstance.mergeThreshold
Base cluster merge threshold.
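This page does not spell out how the threshold is applied. In the classic formulation of Suffix Tree Clustering, two base clusters are connected when the documents they share make up more than a given fraction of each cluster. The Java sketch below illustrates that rule under the assumption that mergeThreshold plays the role of this fraction. The class MergeThresholdSketch is hypothetical, and whether Carrot2 uses a strict or non-strict comparison is not confirmed here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Carrot2 class.
final class MergeThresholdSketch {
  /** Base clusters are represented by the sets of document ids they contain. */
  static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
    if (a.isEmpty() || b.isEmpty()) {
      return false;
    }
    Set<Integer> common = new HashSet<>(a);
    common.retainAll(b); // documents present in both base clusters
    double overlap = common.size();
    // Assumption: the overlap must exceed the threshold relative to BOTH clusters.
    return overlap / a.size() > mergeThreshold && overlap / b.size() > mergeThreshold;
  }
}
```

Under this reading, a small base cluster fully contained in a much larger one is not merged, because the overlap is small relative to the larger cluster.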
minBaseClusterScore
- Type
- Double
- Default
- 2
- Constraints
- value >= 0.0 and value <= 10.0
- Path
- minBaseClusterScore
- Java snippet
- algorithmInstance.minBaseClusterScore
Minimum base cluster score, before coverage merging.
minBaseClusterSize
- Type
- Integer
- Default
- 2
- Constraints
- value >= 2 and value <= 20
- Path
- minBaseClusterSize
- Java snippet
- algorithmInstance.minBaseClusterSize
Minimum required number of documents in a base cluster.
mostGeneralPhraseCoverage
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- mostGeneralPhraseCoverage
- Java snippet
- algorithmInstance.mostGeneralPhraseCoverage
Minimum coverage required for a phrase to appear in the cluster description.
optimalPhraseLength
- Type
- Integer
- Default
- 3
- Constraints
- value >= 1
- Path
- optimalPhraseLength
- Java snippet
- algorithmInstance.optimalPhraseLength
Optimal label length. A factor in calculation of the base cluster score.
optimalPhraseLengthDev
- Type
- Double
- Default
- 2
- Constraints
- value >= 0.5
- Path
- optimalPhraseLengthDev
- Java snippet
- algorithmInstance.optimalPhraseLengthDev
Tolerance of the optimal cluster label length. A factor in calculation of the base cluster score.
preprocessing
- Type
- org.carrot2.text.preprocessing.BasicPreprocessingPipeline
- Default
- BasicPreprocessingPipeline
- Path
- preprocessing
- Java snippet
- algorithmInstance.preprocessing
Configuration of the text preprocessing stage.
{"wordDfThreshold": 1}
wordDfThreshold
- Type
- Integer
- Default
- 1
- Constraints
- value >= 1 and value <= 100
- Path
- preprocessing.wordDfThreshold
- Java snippet
- algorithmInstance.preprocessing.wordDfThreshold
Word document frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
queryHint
- Type
- String
- Default
- null
- Path
- queryHint
- Java snippet
- algorithmInstance.queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
scoreWeight
- Type
- Double
- Default
- 0
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoreWeight
- Java snippet
- algorithmInstance.scoreWeight
Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only; a value of 1.0 sorts them by score only.
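One way to obtain this behavior is a geometric blend of size and score, sketched below in Java. The formula and the class ScoreWeightSketch are assumptions made for illustration; this page does not state the exact expression Carrot2 uses. The sketch only demonstrates the two documented end points.

```java
// Hypothetical helper, not a Carrot2 class.
final class ScoreWeightSketch {
  /**
   * Sort key for a cluster; larger keys would be ranked first.
   * Assumed formula: size^(1 - scoreWeight) * score^scoreWeight.
   */
  static double sortKey(int size, double score, double scoreWeight) {
    return Math.pow(size, 1.0 - scoreWeight) * Math.pow(score, scoreWeight);
  }
}
```

With scoreWeight set to 0.0 the key reduces to the cluster size, and with 1.0 it reduces to the cluster score. Values in between trade one against the other.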
singleTermBoost
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0
- Path
- singleTermBoost
- Java snippet
- algorithmInstance.singleTermBoost
Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
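Several parameters on this page (documentCountBoost, optimalPhraseLength, optimalPhraseLengthDev and singleTermBoost) are described as factors of the base cluster score. The Java sketch below shows one way they could fit together. The bell-shaped length penalty and the way the factors are multiplied are assumptions, and the class BaseClusterScoreSketch is hypothetical; treat this as an aid for reasoning about the parameters, not as the algorithm's actual code.

```java
// Hypothetical helper, not a Carrot2 class.
final class BaseClusterScoreSketch {
  static double score(
      int phraseLength,
      int documentCount,
      double documentCountBoost,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double singleTermBoost) {
    double lengthFactor;
    if (phraseLength == 1 && singleTermBoost > 0) {
      // Single-term clusters get the override instead of the penalty function.
      lengthFactor = singleTermBoost;
    } else {
      // Assumed penalty: Gaussian centered at the optimal phrase length.
      double d = phraseLength - optimalPhraseLength;
      lengthFactor =
          Math.exp(-(d * d) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    // Assumed: the score grows linearly with the boosted document count.
    return lengthFactor * documentCount * documentCountBoost;
  }
}
```

With the defaults listed on this page (documentCountBoost 1, optimalPhraseLength 3, optimalPhraseLengthDev 2, singleTermBoost 0.5), a three-word phrase found in 10 documents would score 10 in this sketch, and a single-word phrase found in the same number of documents would score 5.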