Package org.carrot2.clustering.stc
Class STCClusteringAlgorithm
java.lang.Object
  org.carrot2.attrs.AttrComposite
    org.carrot2.clustering.stc.STCClusteringAlgorithm
All Implemented Interfaces: AcceptingVisitor, ClusteringAlgorithm
public final class STCClusteringAlgorithm extends AttrComposite implements ClusteringAlgorithm
Suffix Tree Clustering (STC) algorithm, implemented largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation departs from the paper where the description of STC was not clear enough or where improvements seemed possible.
-
Field Summary
Modifier and Type           Field                            Description
AttrDouble                  documentCountBoost               Document count boost.
AttrDouble                  ignoreWordIfInHigherDocsPercent  Maximum word-document ratio.
AttrInteger                 maxBaseClusters                  Maximum base clusters count.
AttrInteger                 maxClusters                      Maximum number of final clusters.
AttrDouble                  maxPhraseOverlap                 Maximum cluster phrase overlap.
AttrInteger                 maxPhrasesPerLabel               Maximum phrases per label.
AttrInteger                 maxWordsPerLabel                 Maximum words per label.
AttrBoolean                 mergeStemEquivalentBaseClusters  Merge all stem-equivalent base clusters before running the merge phase.
AttrDouble                  mergeThreshold                   Base cluster merge threshold.
AttrDouble                  minBaseClusterScore              Minimum base cluster score (before coverage merging).
AttrInteger                 minBaseClusterSize               Minimum documents per base cluster.
AttrDouble                  mostGeneralPhraseCoverage        Minimum general phrase coverage.
static String               NAME
AttrInteger                 optimalPhraseLength              Optimal label length.
AttrDouble                  optimalPhraseLengthDev           Phrase length tolerance.
BasicPreprocessingPipeline  preprocessing                    Preprocessing pipeline.
AttrString                  queryHint                        Query terms used to retrieve documents.
AttrDouble                  scoreWeight                      Balance between cluster score and size during cluster sorting.
AttrDouble                  singleTermBoost                  Single term boost.
-
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
-
Constructor Summary
Constructor                 Description
STCClusteringAlgorithm()
-
Method Summary
Modifier and Type                      Method                                                                         Description
<T extends Document> List<Cluster<T>>  cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)  Performs STC clustering of documents.
boolean                                supports(LanguageComponents languageComponents)
-
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.carrot2.attrs.AcceptingVisitor
accept
-
Field Detail
-
NAME
public static final String NAME
- See Also:
- Constant Field Values
-
queryHint
public final AttrString queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
-
ignoreWordIfInHigherDocsPercent
public AttrDouble ignoreWordIfInHigherDocsPercent
Maximum word-document ratio. A number between 0 and 1. If a word occurs in a larger fraction of snippets than this ratio, it is ignored.
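The ratio test described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class FrequentWordFilter, the method wordsToIgnore, and the assumption that snippets arrive as already tokenized word lists are all hypothetical and exist only for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Carrot2.
final class FrequentWordFilter {
  /**
   * Returns the words that occur in a larger fraction of snippets than maxRatio.
   * Assumption: each snippet is a list of already tokenized, normalized words.
   */
  static Set<String> wordsToIgnore(List<List<String>> snippets, double maxRatio) {
    Map<String, Integer> documentFrequency = new HashMap<>();
    for (List<String> snippet : snippets) {
      // Count each word once per snippet, however often it repeats inside it.
      for (String word : new HashSet<>(snippet)) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }

    Set<String> ignored = new HashSet<>();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      double ratio = (double) entry.getValue() / snippets.size();
      // Strictly "more than" the ratio, following the wording of the attribute description.
      if (ratio > maxRatio) {
        ignored.add(entry.getKey());
      }
    }
    return ignored;
  }
}
```

With four snippets that all contain "data" and a ratio of 0.5, only "data" is ignored; a word present in exactly half of the snippets is kept because the comparison is strict.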
-
minBaseClusterScore
public AttrDouble minBaseClusterScore
Minimum base cluster score (before coverage merging).
-
minBaseClusterSize
public AttrInteger minBaseClusterSize
Minimum documents per base cluster.
-
maxBaseClusters
public AttrInteger maxBaseClusters
Maximum base clusters count. Trims the base cluster array after the N-th position before the merging phase.
-
maxClusters
public AttrInteger maxClusters
Maximum number of final clusters.
-
mergeThreshold
public AttrDouble mergeThreshold
Base cluster merge threshold.
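For orientation, the cited Zamir and Etzioni paper connects two base clusters when the documents they share make up more than a fixed fraction of each cluster. The sketch below shows that criterion with the fraction as a parameter. Whether this class applies the threshold in exactly this way is an assumption, and BaseClusterOverlap and shouldMerge are hypothetical names used only here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Carrot2.
final class BaseClusterOverlap {
  /**
   * Returns true if the shared documents exceed mergeThreshold as a fraction
   * of BOTH base clusters. Documents are represented by integer identifiers.
   */
  static boolean shouldMerge(Set<Integer> docsA, Set<Integer> docsB, double mergeThreshold) {
    if (docsA.isEmpty() || docsB.isEmpty()) {
      return false; // Nothing to compare, and avoids division by zero.
    }
    Set<Integer> shared = new HashSet<>(docsA);
    shared.retainAll(docsB); // Intersection of the two document sets.

    double overlapA = (double) shared.size() / docsA.size();
    double overlapB = (double) shared.size() / docsB.size();
    return overlapA > mergeThreshold && overlapB > mergeThreshold;
  }
}
```

Requiring the overlap in both directions keeps a small base cluster from being absorbed by a much larger one that merely contains it.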
-
maxPhraseOverlap
public AttrDouble maxPhraseOverlap
Maximum cluster phrase overlap.
-
mostGeneralPhraseCoverage
public AttrDouble mostGeneralPhraseCoverage
Minimum general phrase coverage. The minimum coverage a phrase must reach to appear in the cluster description.
-
maxWordsPerLabel
public AttrInteger maxWordsPerLabel
Maximum words per label. Base clusters formed by phrases with more words than this value are trimmed.
-
maxPhrasesPerLabel
public AttrInteger maxPhrasesPerLabel
Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.
-
singleTermBoost
public AttrDouble singleTermBoost
Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
-
optimalPhraseLength
public AttrInteger optimalPhraseLength
Optimal label length. A factor in calculation of the base cluster score.
-
optimalPhraseLengthDev
public AttrDouble optimalPhraseLengthDev
Phrase length tolerance. A factor in calculation of the base cluster score.
-
documentCountBoost
public AttrDouble documentCountBoost
Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
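The four scoring attributes (singleTermBoost, optimalPhraseLength, optimalPhraseLengthDev, documentCountBoost) can be pictured together with a small sketch. The documentation only says that each is "a factor" in the base cluster score, so the Gaussian-shaped length penalty and the way the factors are multiplied here are assumptions made for illustration, not the library's formula. BaseClusterScoreSketch and score are hypothetical names.

```java
// Hypothetical illustration, not the Carrot2 scoring code.
final class BaseClusterScoreSketch {
  /**
   * Combines a phrase-length factor with a document-count factor.
   * Assumed shape: a bell curve centered on optimalPhraseLength whose width
   * is controlled by optimalPhraseLengthDev.
   */
  static double score(
      int phraseLength,
      int documentCount,
      double singleTermBoost,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double documentCountBoost) {
    double phraseFactor;
    if (phraseLength == 1 && singleTermBoost > 0) {
      // Single-term phrases take the boost value and bypass the penalty function.
      phraseFactor = singleTermBoost;
    } else {
      double distance = phraseLength - optimalPhraseLength;
      phraseFactor =
          Math.exp(-(distance * distance) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    // More documents in the base cluster raise the score proportionally.
    return phraseFactor * documentCount * documentCountBoost;
  }
}
```

Under these assumptions a phrase of exactly the optimal length receives a phrase factor of 1, and the factor decays as the length moves away from the optimum in either direction.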
-
scoreWeight
public AttrDouble scoreWeight
Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only. A value of 1.0 sorts clusters by score only.
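One way to read the two endpoints is as a linear blend of size and score. The sketch below assumes exactly that; the blend formula, the lack of any normalization between size and score, and the names ClusterOrdering, Candidate, sortKey and sortedLabels are all hypothetical and not taken from the library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not the Carrot2 sorting code.
final class ClusterOrdering {
  /** Minimal stand-in for a cluster: a label, a document count and a score. */
  static final class Candidate {
    final String label;
    final int size;
    final double score;

    Candidate(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Assumed blend: 0.0 uses size only, 1.0 uses score only. */
  static double sortKey(Candidate c, double scoreWeight) {
    return scoreWeight * c.score + (1.0 - scoreWeight) * c.size;
  }

  /** Returns labels ordered from the highest sort key to the lowest. */
  static List<String> sortedLabels(List<Candidate> candidates, double scoreWeight) {
    List<Candidate> copy = new ArrayList<>(candidates);
    copy.sort(Comparator.comparingDouble((Candidate c) -> sortKey(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (Candidate c : copy) {
      labels.add(c.label);
    }
    return labels;
  }
}
```

With a large low-scoring candidate and a small high-scoring one, a weight of 0.0 puts the large candidate first and a weight of 1.0 puts the high-scoring one first.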
-
mergeStemEquivalentBaseClusters
public AttrBoolean mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
- See Also:
- "http://issues.carrot2.org/browse/CARROT-1008"
-
preprocessing
public BasicPreprocessingPipeline preprocessing
Preprocessing pipeline.
-
Method Detail
-
supports
public boolean supports(LanguageComponents languageComponents)
- Specified by:
supports in interface ClusteringAlgorithm
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Performs STC clustering of documents.
- Specified by:
cluster in interface ClusteringAlgorithm