Package org.carrot2.clustering.stc
Class STCClusteringAlgorithm
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.clustering.stc.STCClusteringAlgorithm
- All Implemented Interfaces:
AcceptingVisitor, ClusteringAlgorithm
public final class STCClusteringAlgorithm extends AttrComposite implements ClusteringAlgorithm
Suffix Tree Clustering (STC) algorithm. Pretty much as described in: Oren Zamir, Oren Etzioni,
Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. Some liberties were
taken wherever STC's description was not clear enough or where we thought some improvements could
be made.
-
Field Summary
Fields
Modifier and Type           Field                            Description
AttrDouble                  documentCountBoost               Document count boost.
AttrDouble                  ignoreWordIfInHigherDocsPercent  Maximum word-document ratio.
AttrInteger                 maxBaseClusters                  Maximum base clusters count.
AttrInteger                 maxClusters                      Maximum number of final clusters.
AttrDouble                  maxPhraseOverlap                 Maximum cluster phrase overlap.
AttrInteger                 maxPhrasesPerLabel               Maximum phrases per label.
AttrInteger                 maxWordsPerLabel                 Maximum words per label.
AttrBoolean                 mergeStemEquivalentBaseClusters  Merge all stem-equivalent base clusters before running the merge phase.
AttrDouble                  mergeThreshold                   Base cluster merge threshold.
AttrDouble                  minBaseClusterScore              Minimum base cluster score (before coverage merging).
AttrInteger                 minBaseClusterSize               Minimum documents per base cluster.
AttrDouble                  mostGeneralPhraseCoverage        Minimum general phrase coverage.
static String               NAME
AttrInteger                 optimalPhraseLength              Optimal label length.
AttrDouble                  optimalPhraseLengthDev           Phrase length tolerance.
BasicPreprocessingPipeline  preprocessing                    Preprocessing pipeline.
AttrString                  queryHint                        Query terms used to retrieve documents.
AttrDouble                  scoreWeight                      Balance between cluster score and size during cluster sorting.
AttrDouble                  singleTermBoost                  Single term boost.
-
Constructor Summary
Constructors
Constructor                 Description
STCClusteringAlgorithm()
-
Method Summary
Modifier and Type                      Method                                                                         Description
<T extends Document> List<Cluster<T>>  cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)  Performs STC clustering of documents.
Set<Class<?>>                          requiredLanguageComponents()
-
Field Details
-
NAME
- See Also:
- Constant Field Values
-
queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters. -
ignoreWordIfInHigherDocsPercent
Maximum word-document ratio. A number between 0 and 1. A word that occurs in more than this fraction of snippets is ignored. -
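The ratio test described above can be sketched in plain Java. This is an illustration only, not the library's implementation: the class `WordDocumentRatioSketch`, the method `ignoredWords`, and the whitespace tokenization are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Carrot2.
final class WordDocumentRatioSketch {
  /** Returns the words that occur in more than maxRatio of the snippets. */
  static Set<String> ignoredWords(List<String> snippets, double maxRatio) {
    Map<String, Integer> documentFrequency = new HashMap<>();
    for (String snippet : snippets) {
      // Count each word at most once per snippet (assumed tokenization: whitespace).
      Set<String> unique = new HashSet<>();
      for (String word : snippet.toLowerCase().split("\\s+")) {
        if (!word.isEmpty()) {
          unique.add(word);
        }
      }
      for (String word : unique) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }

    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      double ratio = (double) entry.getValue() / snippets.size();
      if (ratio > maxRatio) {
        ignored.add(entry.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<String> snippets =
        List.of("data mining tools", "data clustering", "web data search", "suffix tree");
    System.out.println(ignoredWords(snippets, 0.7));
  }
}
```

In this sample, "data" appears in three of four snippets, so it is ignored at a ratio of 0.7 but kept at 0.75, because the comparison is strict.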
minBaseClusterScore
Minimum base cluster score (before coverage merging). -
minBaseClusterSize
Minimum documents per base cluster. -
maxBaseClusters
Maximum base clusters count. Trims the base cluster array after N-th position for the merging phase. -
maxClusters
Maximum number of final clusters. -
mergeThreshold
Base cluster merge threshold. -
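For orientation, the STC paper cited in the class description merges two base clusters when their shared documents exceed a threshold fraction of each cluster (0.5 in the paper). The sketch below follows that criterion and assumes, without confirmation from this page, that mergeThreshold plays the same role. `BaseClusterMergeSketch`, `shouldMerge`, and `docs` are hypothetical names.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Carrot2.
final class BaseClusterMergeSketch {
  /**
   * Paper-style merge test: the overlap must exceed the threshold
   * relative to the size of both base clusters.
   */
  static boolean shouldMerge(BitSet docsA, BitSet docsB, double mergeThreshold) {
    if (docsA.isEmpty() || docsB.isEmpty()) {
      return false;
    }
    BitSet common = (BitSet) docsA.clone();
    common.and(docsB); // documents present in both base clusters
    double overlap = common.cardinality();
    return overlap / docsA.cardinality() > mergeThreshold
        && overlap / docsB.cardinality() > mergeThreshold;
  }

  /** Builds a document set from document identifiers. */
  static BitSet docs(int... ids) {
    BitSet set = new BitSet();
    for (int id : ids) {
      set.set(id);
    }
    return set;
  }

  public static void main(String[] args) {
    System.out.println(shouldMerge(docs(1, 2, 3, 4), docs(2, 3, 4, 5), 0.5));
    System.out.println(shouldMerge(docs(1, 2, 3, 4), docs(4), 0.5));
  }
}
```

Requiring the overlap in both directions keeps a small base cluster from being absorbed merely because it is a subset of a much larger one that barely overlaps it.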
maxPhraseOverlap
Maximum cluster phrase overlap. -
mostGeneralPhraseCoverage
Minimum general phrase coverage. Minimum phrase coverage to appear in cluster description. -
maxWordsPerLabel
Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed. -
maxPhrasesPerLabel
Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label. -
singleTermBoost
Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function. -
optimalPhraseLength
Optimal label length. A factor in calculation of the base cluster score. -
optimalPhraseLengthDev
Phrase length tolerance. A factor in calculation of the base cluster score. -
documentCountBoost
Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster. -
scoreWeight
Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score. -
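One way to get the two endpoint behaviors described above is a weighted geometric blend of size and score. The sketch below is an assumption chosen to match the documented endpoints, not a statement of how the library computes the sort key. `ScoreWeightSortSketch`, `Item`, `weighted`, and `sortedLabels` are hypothetical names, and scores are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Carrot2.
final class ScoreWeightSortSketch {
  /** Stand-in for a cluster: a label, a score and a document count. */
  static final class Item {
    final String label;
    final double score;
    final int size;

    Item(String label, double score, int size) {
      this.label = label;
      this.score = score;
      this.size = size;
    }
  }

  /** Equals size when scoreWeight is 0.0 and score when scoreWeight is 1.0. */
  static double weighted(double score, int size, double scoreWeight) {
    return Math.pow(size, 1.0 - scoreWeight) * Math.pow(score, scoreWeight);
  }

  /** Returns labels ordered from the highest to the lowest weighted value. */
  static List<String> sortedLabels(List<Item> items, double scoreWeight) {
    List<Item> sorted = new ArrayList<>(items);
    sorted.sort(
        Comparator.comparingDouble((Item i) -> weighted(i.score, i.size, scoreWeight))
            .reversed());
    List<String> labels = new ArrayList<>();
    for (Item item : sorted) {
      labels.add(item.label);
    }
    return labels;
  }

  public static void main(String[] args) {
    List<Item> items = List.of(new Item("A", 10.0, 2), new Item("B", 1.0, 8));
    System.out.println(sortedLabels(items, 0.0));
    System.out.println(sortedLabels(items, 1.0));
  }
}
```

With these two sample items, a weight of 0.0 puts the larger cluster B first and a weight of 1.0 puts the higher-scoring cluster A first.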
mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
- See Also:
- http://issues.carrot2.org/browse/CARROT-1008
-
preprocessing
Preprocessing pipeline.
-
-
Constructor Details
-
STCClusteringAlgorithm
public STCClusteringAlgorithm()
-
-
Method Details
-
requiredLanguageComponents
- Specified by:
requiredLanguageComponents in interface ClusteringAlgorithm
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Performs STC clustering of documents.
- Specified by:
cluster in interface ClusteringAlgorithm
-