Package org.carrot2.clustering.stc
Class STCClusteringAlgorithm
java.lang.Object
  org.carrot2.attrs.AttrComposite
    org.carrot2.clustering.stc.STCClusteringAlgorithm
- All Implemented Interfaces:
AcceptingVisitor, ClusteringAlgorithm
public final class STCClusteringAlgorithm extends AttrComposite implements ClusteringAlgorithm
Suffix Tree Clustering (STC) algorithm, implemented largely as described in: Oren Zamir, Oren Etzioni,
Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. Some liberties were
taken wherever the description of STC was not clear enough or where we thought improvements could
be made.
-
Field Summary
Fields
Modifier and Type           Field                            Description
AttrDouble                  documentCountBoost               Document count boost.
AttrDouble                  ignoreWordIfInHigherDocsPercent  Maximum word-document ratio.
AttrInteger                 maxBaseClusters                  Maximum base clusters count.
AttrInteger                 maxClusters                      Maximum number of final clusters.
AttrDouble                  maxPhraseOverlap                 Maximum cluster phrase overlap.
AttrInteger                 maxPhrasesPerLabel               Maximum phrases per label.
AttrInteger                 maxWordsPerLabel                 Maximum words per label.
AttrBoolean                 mergeStemEquivalentBaseClusters  Merge all stem-equivalent base clusters before running the merge phase.
AttrDouble                  mergeThreshold                   Base cluster merge threshold.
AttrDouble                  minBaseClusterScore              Minimum base cluster score (before coverage merging).
AttrInteger                 minBaseClusterSize               Minimum documents per base cluster.
AttrDouble                  mostGeneralPhraseCoverage        Minimum general phrase coverage.
static String               NAME
AttrInteger                 optimalPhraseLength              Optimal label length.
AttrDouble                  optimalPhraseLengthDev           Phrase length tolerance.
BasicPreprocessingPipeline  preprocessing                    Preprocessing pipeline.
AttrString                  queryHint                        Query terms used to retrieve documents.
AttrDouble                  scoreWeight                      Balance between cluster score and size during cluster sorting.
AttrDouble                  singleTermBoost                  Single term boost.
-
Constructor Summary
Constructors
Constructor                 Description
STCClusteringAlgorithm()
-
Method Summary
Modifier and Type                      Method                                                                         Description
<T extends Document> List<Cluster<T>>  cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)  Performs STC clustering of documents.
Set<Class<?>>                          requiredLanguageComponents()
-
Field Details
-
NAME
- See Also:
- Constant Field Values
-
queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters. -
ignoreWordIfInHigherDocsPercent
Maximum word-document ratio. A number between 0 and 1. If a word occurs in more than this fraction of the snippets, it is ignored. -
minBaseClusterScore
Minimum base cluster score (before coverage merging). -
minBaseClusterSize
Minimum documents per base cluster. -
maxBaseClusters
Maximum base clusters count. Trims the base cluster array after the N-th position before the merging phase. -
maxClusters
Maximum number of final clusters. -
mergeThreshold
Base cluster merge threshold. -
maxPhraseOverlap
Maximum cluster phrase overlap. -
mostGeneralPhraseCoverage
Minimum general phrase coverage. Minimum coverage a phrase needs to appear in the cluster description. -
maxWordsPerLabel
Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed. -
maxPhrasesPerLabel
Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label. -
singleTermBoost
Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function. -
optimalPhraseLength
Optimal label length. A factor in calculation of the base cluster score. -
optimalPhraseLengthDev
Phrase length tolerance. A factor in calculation of the base cluster score. -
documentCountBoost
Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster. -
scoreWeight
Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by cluster size only. A value of 1.0 sorts clusters by cluster score only. -
mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
- See Also:
- "http://issues.carrot2.org/browse/CARROT-1008"
-
preprocessing
Preprocessing pipeline.
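Two of the attributes above describe behavior concrete enough to sketch: ignoreWordIfInHigherDocsPercent (words occurring in more than the given fraction of snippets are ignored) and scoreWeight (0.0 sorts by size only, 1.0 by score only). The following is a minimal, self-contained illustration and not the Carrot2 implementation. The class StcAttributeSketch, the ScoredCluster type and both method names are hypothetical. Whitespace tokenization stands in for the real preprocessing pipeline, and the linear blend of normalized score and size is an assumption chosen only because it matches the two documented endpoints; this page does not state the actual formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; none of these types belong to Carrot2. */
final class StcAttributeSketch {

  /** Hypothetical stand-in for a cluster: a label, a document count and a score. */
  static final class ScoredCluster {
    final String label;
    final int size;
    final double score;

    ScoredCluster(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /**
   * Returns the words that occur in more than maxRatio of the snippets,
   * which is how the ignoreWordIfInHigherDocsPercent description reads.
   * Assumption: words are split on whitespace and lower-cased.
   */
  static Set<String> wordsToIgnore(List<String> snippets, double maxRatio) {
    Map<String, Integer> docFrequency = new HashMap<>();
    for (String snippet : snippets) {
      // Count each word at most once per snippet (document frequency).
      Set<String> seen = new HashSet<>();
      for (String word : snippet.toLowerCase().split("\\s+")) {
        if (!word.isEmpty() && seen.add(word)) {
          docFrequency.merge(word, 1, Integer::sum);
        }
      }
    }
    Set<String> ignored = new HashSet<>();
    for (Map.Entry<String, Integer> entry : docFrequency.entrySet()) {
      double ratio = (double) entry.getValue() / snippets.size();
      if (ratio > maxRatio) {
        ignored.add(entry.getKey());
      }
    }
    return ignored;
  }

  /**
   * Returns a copy of the clusters in descending order of an assumed key:
   * scoreWeight * normalizedScore + (1 - scoreWeight) * normalizedSize.
   * With scoreWeight = 0.0 only size matters; with 1.0 only score matters.
   */
  static List<ScoredCluster> sortClusters(List<ScoredCluster> clusters, double scoreWeight) {
    double sizeMax = 0.0;
    double scoreMax = 0.0;
    for (ScoredCluster c : clusters) {
      sizeMax = Math.max(sizeMax, c.size);
      scoreMax = Math.max(scoreMax, c.score);
    }
    // Guard against division by zero when all sizes or scores are zero.
    final double sizeNorm = sizeMax > 0.0 ? sizeMax : 1.0;
    final double scoreNorm = scoreMax > 0.0 ? scoreMax : 1.0;

    List<ScoredCluster> sorted = new ArrayList<>(clusters);
    sorted.sort(Comparator.comparingDouble(
        (ScoredCluster c) -> scoreWeight * (c.score / scoreNorm)
            + (1.0 - scoreWeight) * (c.size / sizeNorm)).reversed());
    return sorted;
  }
}
```

In this sketch, a word found in exactly the threshold fraction of snippets is kept, since the description says "more than" the ratio. Intermediate scoreWeight values trade size against score; how the library weighs them in between may differ from the linear blend used here.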
-
-
Constructor Details
-
STCClusteringAlgorithm
public STCClusteringAlgorithm()
-
-
Method Details
-
requiredLanguageComponents
- Specified by:
requiredLanguageComponents
in interface ClusteringAlgorithm
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Performs STC clustering of documents.
- Specified by:
cluster
in interface ClusteringAlgorithm
-