Package org.carrot2.clustering.stc
Class STCClusteringAlgorithm
java.lang.Object
  org.carrot2.attrs.AttrComposite
    org.carrot2.clustering.stc.STCClusteringAlgorithm
- All Implemented Interfaces:
AcceptingVisitor, ClusteringAlgorithm
public final class STCClusteringAlgorithm extends AttrComposite implements ClusteringAlgorithm
Suffix Tree Clustering (STC) algorithm, largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation departs from the paper wherever STC's description was not clear enough or where an improvement seemed possible.
-
-
Field Summary
Fields
Modifier and Type | Field | Description
EphemeralDictionaries | dictionaries | Per-request overrides of language components (dictionaries).
AttrDouble | documentCountBoost | Base cluster document count boost.
AttrDouble | ignoreWordIfInHigherDocsPercent | Ignore words appearing in more than the provided fraction of documents.
AttrInteger | maxBaseClusters | Maximum number of base clusters.
AttrInteger | maxClusters | Maximum number of final clusters to keep.
AttrDouble | maxPhraseOverlap | Maximum cluster phrase overlap.
AttrInteger | maxPhrasesPerLabel | Maximum number of phrases from base clusters to promote to the cluster's label.
AttrInteger | maxWordsPerLabel | Maximum allowed number of words per label.
AttrBoolean | mergeStemEquivalentBaseClusters | Merge all stem-equivalent base clusters before running the merge phase.
AttrDouble | mergeThreshold | Base cluster merge threshold.
AttrDouble | minBaseClusterScore | Minimum base cluster score, before coverage merging.
AttrInteger | minBaseClusterSize | Minimum required number of documents in a base cluster.
AttrDouble | mostGeneralPhraseCoverage | Minimum coverage required for a phrase to appear in cluster description.
static String | NAME |
AttrInteger | optimalPhraseLength | Optimal label length.
AttrDouble | optimalPhraseLengthDev | Optimal cluster label length's tolerance.
BasicPreprocessingPipeline | preprocessing | Configuration of the text preprocessing stage.
AttrString | queryHint | Query terms used to retrieve documents.
AttrDouble | scoreWeight | Balance between cluster score and size during cluster sorting.
AttrDouble | singleTermBoost | Base cluster score override for single-term clusters.
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
-
-
Constructor Summary
Constructors
Constructor | Description
STCClusteringAlgorithm() |
-
Method Summary
Modifier and Type | Method | Description
<T extends Document> List<Cluster<T>> | cluster(Stream<? extends T> docStream, LanguageComponents languageComponents) | Performs STC clustering of documents.
Set<Class<?>> | requiredLanguageComponents() |
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.carrot2.attrs.AcceptingVisitor
accept
-
Methods inherited from interface org.carrot2.clustering.ClusteringAlgorithm
supports
-
-
-
-
Field Detail
-
NAME
public static final String NAME
- See Also:
- Constant Field Values
-
queryHint
public final AttrString queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
-
ignoreWordIfInHigherDocsPercent
public AttrDouble ignoreWordIfInHigherDocsPercent
Ignore words appearing in more than the provided fraction of documents. The value is a number between 0 and 1; a word that occurs in a larger fraction of snippets than this value is ignored.
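The filtering rule described above can be illustrated with a small stand-alone sketch. It is not Carrot2's implementation: the class name DocumentFrequencyFilterSketch, the method ignoredWords, and the representation of documents as lists of already tokenized words are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Carrot2 API.
final class DocumentFrequencyFilterSketch {
  /** Returns the words that occur in more than maxDocFraction of the documents. */
  static Set<String> ignoredWords(List<List<String>> docs, double maxDocFraction) {
    Map<String, Integer> docFrequency = new HashMap<>();
    for (List<String> doc : docs) {
      // Count each word once per document, however often it repeats inside it.
      for (String word : new HashSet<>(doc)) {
        docFrequency.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : docFrequency.entrySet()) {
      // Strictly "more than" the fraction, following the field description.
      if (entry.getValue() > maxDocFraction * docs.size()) {
        ignored.add(entry.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("data", "mining", "tools"),
        List.of("data", "clustering"),
        List.of("data", "mining"),
        List.of("web", "data"));
    // "data" occurs in 4 of 4 documents, "mining" in 2 of 4.
    System.out.println(ignoredWords(docs, 0.7));
  }
}
```

With the sample documents in main, a threshold of 0.7 drops only the word present in every document, while a lower threshold such as 0.4 would also drop "mining".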
-
minBaseClusterScore
public AttrDouble minBaseClusterScore
Minimum base cluster score, before coverage merging.
-
minBaseClusterSize
public AttrInteger minBaseClusterSize
Minimum required number of documents in a base cluster.
-
maxBaseClusters
public AttrInteger maxBaseClusters
Maximum number of base clusters. Trims the base cluster array after the N-th position before the merging phase.
-
maxClusters
public AttrInteger maxClusters
Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded.
-
mergeThreshold
public AttrDouble mergeThreshold
Base cluster merge threshold.
-
maxPhraseOverlap
public AttrDouble maxPhraseOverlap
Maximum cluster phrase overlap.
-
mostGeneralPhraseCoverage
public AttrDouble mostGeneralPhraseCoverage
Minimum coverage required for a phrase to appear in cluster description.
-
maxWordsPerLabel
public AttrInteger maxWordsPerLabel
Maximum allowed number of words per label. Base clusters formed by phrases with more words than this limit will be trimmed.
-
maxPhrasesPerLabel
public AttrInteger maxPhrasesPerLabel
Maximum number of phrases from base clusters to promote to the cluster's label.
-
singleTermBoost
public AttrDouble singleTermBoost
Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
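The override rule can be sketched as a single conditional. This is only an illustration: the class SingleTermBoostSketch, the method baseScore, and the caller-supplied penalty function are hypothetical, and the placeholder penalty used in main is not Carrot2's scoring formula.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical illustration only; not part of the Carrot2 API.
final class SingleTermBoostSketch {
  /**
   * Returns the score for a base cluster whose phrase has the given word count.
   * The penalty function is supplied by the caller as a stand-in for the real one.
   */
  static double baseScore(int phraseWords, double singleTermBoost, IntToDoubleFunction penalty) {
    if (phraseWords == 1 && singleTermBoost > 0) {
      // The override applies: the penalty function is not consulted at all.
      return singleTermBoost;
    }
    return penalty.applyAsDouble(phraseWords);
  }

  public static void main(String[] args) {
    // Placeholder penalty, chosen only so that the example produces numbers.
    IntToDoubleFunction placeholderPenalty = words -> 1.0 / words;
    System.out.println(baseScore(1, 0.5, placeholderPenalty)); // override active
    System.out.println(baseScore(1, 0.0, placeholderPenalty)); // override disabled
    System.out.println(baseScore(2, 0.5, placeholderPenalty)); // not a single-term phrase
  }
}
```

Setting the boost to zero leaves single-term clusters to the penalty function, which is how the sketch reads "if greater than zero" in the description.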
-
optimalPhraseLength
public AttrInteger optimalPhraseLength
Optimal label length. A factor in calculation of the base cluster score.
-
optimalPhraseLengthDev
public AttrDouble optimalPhraseLengthDev
Optimal cluster label length's tolerance. A factor in calculation of the base cluster score.
-
documentCountBoost
public AttrDouble documentCountBoost
Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
-
scoreWeight
public AttrDouble scoreWeight
Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score.
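One way to picture the two endpoints is a linear blend of score and size used as the sort key. The sketch below assumes such a blend; the description only fixes the behavior at 0.0 and 1.0, so the formula for intermediate values, the class ScoreWeightSortSketch, and its methods are assumptions rather than Carrot2's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of the Carrot2 API.
final class ScoreWeightSortSketch {
  /** Assumed linear blend: 0.0 yields the size alone, 1.0 yields the score alone. */
  static double sortKey(double score, int size, double scoreWeight) {
    return scoreWeight * score + (1.0 - scoreWeight) * size;
  }

  /** Returns cluster indices ordered by descending sort key. */
  static List<Integer> order(double[] scores, int[] sizes, double scoreWeight) {
    List<Integer> indices = new ArrayList<>();
    for (int i = 0; i < scores.length; i++) {
      indices.add(i);
    }
    // Negate the key so that the largest key sorts first.
    indices.sort(Comparator.comparingDouble(
        (Integer i) -> -sortKey(scores[i], sizes[i], scoreWeight)));
    return indices;
  }

  public static void main(String[] args) {
    double[] scores = {9.0, 2.0}; // cluster 0 has the higher score
    int[] sizes = {3, 12};        // cluster 1 has more documents
    System.out.println(order(scores, sizes, 0.0)); // size only
    System.out.println(order(scores, sizes, 1.0)); // score only
  }
}
```

In this sketch a weight of 0.0 places the larger cluster first and a weight of 1.0 places the higher-scoring cluster first, matching the two cases named in the description.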
-
mergeStemEquivalentBaseClusters
public AttrBoolean mergeStemEquivalentBaseClusters
Merge all stem-equivalent base clusters before running the merge phase.
- See Also:
- "http://issues.carrot2.org/browse/CARROT-1008"
-
preprocessing
public BasicPreprocessingPipeline preprocessing
Configuration of the text preprocessing stage.
-
dictionaries
public EphemeralDictionaries dictionaries
Per-request overrides of language components (dictionaries).
- Since:
- 4.1.0
-
-
Method Detail
-
requiredLanguageComponents
public Set<Class<?>> requiredLanguageComponents()
- Specified by:
requiredLanguageComponents in interface ClusteringAlgorithm
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Performs STC clustering of documents.
- Specified by:
cluster in interface ClusteringAlgorithm
-
-