org.carrot2.clustering.stc.STCClusteringAlgorithm

All Implemented Interfaces: AcceptingVisitor, ClusteringAlgorithm

public final class STCClusteringAlgorithm
extends AttrComposite
implements ClusteringAlgorithm

Suffix Tree Clustering (STC) algorithm, implemented largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation departs from the paper wherever the description of STC was not clear enough or where an improvement seemed possible.

Field Summary

Fields
Modifier and Type	Field	Description
`AttrDouble`	`documentCountBoost`	Document count boost.
`AttrDouble`	`ignoreWordIfInHigherDocsPercent`	Maximum word-document ratio.
`AttrInteger`	`maxBaseClusters`	Maximum base clusters count.
`AttrInteger`	`maxClusters`	Maximum number of final clusters.
`AttrDouble`	`maxPhraseOverlap`	Maximum cluster phrase overlap.
`AttrInteger`	`maxPhrasesPerLabel`	Maximum phrases per label.
`AttrInteger`	`maxWordsPerLabel`	Maximum words per label.
`AttrBoolean`	`mergeStemEquivalentBaseClusters`	Merge all stem-equivalent base clusters before running the merge phase.
`AttrDouble`	`mergeThreshold`	Base cluster merge threshold.
`AttrDouble`	`minBaseClusterScore`	Minimum base cluster score (before coverage merging).
`AttrInteger`	`minBaseClusterSize`	Minimum documents per base cluster.
`AttrDouble`	`mostGeneralPhraseCoverage`	Minimum general phrase coverage.
`static String`	`NAME`
`AttrInteger`	`optimalPhraseLength`	Optimal label length.
`AttrDouble`	`optimalPhraseLengthDev`	Phrase length tolerance.
`BasicPreprocessingPipeline`	`preprocessing`	Preprocessing pipeline.
`AttrString`	`queryHint`	Query terms used to retrieve documents.
`AttrDouble`	`scoreWeight`	Balance between cluster score and size during cluster sorting.
`AttrDouble`	`singleTermBoost`	Single term boost.

Fields inherited from class org.carrot2.attrs.AttrComposite

attributes

Constructor Summary

Constructors
Constructor	Description
`STCClusteringAlgorithm()`

Method Summary

Modifier and Type	Method	Description
`<T extends Document> List<Cluster<T>>`	`cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)`	Performs STC clustering of documents.
`Set<Class<?>>`	`requiredLanguageComponents()`

Methods inherited from class org.carrot2.attrs.AttrComposite

accept

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.carrot2.attrs.AcceptingVisitor

accept

Methods inherited from interface org.carrot2.clustering.ClusteringAlgorithm

supports

Field Details
- NAME
  
  public static final String NAME
  
  See Also:
  
  Constant Field Values
- queryHint
  
  public final AttrString queryHint
  
  Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
- ignoreWordIfInHigherDocsPercent
  
  public AttrDouble ignoreWordIfInHigherDocsPercent
  
  Maximum word-document ratio. A number between 0 and 1. If a word occurs in a larger fraction of snippets than this ratio, it is ignored.
- minBaseClusterScore
  
  public AttrDouble minBaseClusterScore
  
  Minimum base cluster score (before coverage merging).
- minBaseClusterSize
  
  public AttrInteger minBaseClusterSize
  
  Minimum documents per base cluster.
- maxBaseClusters
  
  public AttrInteger maxBaseClusters
  
  Maximum base clusters count. Trims the base cluster array after N-th position for the merging phase.
- maxClusters
  
  public AttrInteger maxClusters
  
  Maximum number of final clusters.
- mergeThreshold
  
  public AttrDouble mergeThreshold
  
  Base cluster merge threshold.
- maxPhraseOverlap
  
  public AttrDouble maxPhraseOverlap
  
  Maximum cluster phrase overlap.
- mostGeneralPhraseCoverage
  
  public AttrDouble mostGeneralPhraseCoverage
  
  Minimum general phrase coverage. Minimum phrase coverage to appear in cluster description.
- maxWordsPerLabel
  
  public AttrInteger maxWordsPerLabel
  
  Maximum words per label. Base clusters formed by phrases with more words than this value are trimmed.
- maxPhrasesPerLabel
  
  public AttrInteger maxPhrasesPerLabel
  
  Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.
- singleTermBoost
  
  public AttrDouble singleTermBoost
  
  Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
- optimalPhraseLength
  
  public AttrInteger optimalPhraseLength
  
  Optimal label length. A factor in calculation of the base cluster score.
- optimalPhraseLengthDev
  
  public AttrDouble optimalPhraseLengthDev
  
  Phrase length tolerance. A factor in calculation of the base cluster score.
- documentCountBoost
  
  public AttrDouble documentCountBoost
  
  Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
- scoreWeight
  
  public AttrDouble scoreWeight
  
  Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score.
- mergeStemEquivalentBaseClusters
  
  public AttrBoolean mergeStemEquivalentBaseClusters
  
  Merge all stem-equivalent base clusters before running the merge phase.
  
  See Also:
  
  "http://issues.carrot2.org/browse/CARROT-1008"
- preprocessing
  
  public BasicPreprocessingPipeline preprocessing
  
  Preprocessing pipeline.
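
Several fields above (singleTermBoost, optimalPhraseLength, optimalPhraseLengthDev, documentCountBoost) are each described as a factor in the base cluster score, but this page does not state the formula. The sketch below shows one plausible way they could combine, assuming a Gaussian penalty centered on the optimal phrase length. The class name `BaseClusterScoreSketch`, the method `score`, the sample values, and the formula itself are assumptions made for illustration; the sketch does not call the Carrot2 API and should not be read as the library's actual computation.

```java
final class BaseClusterScoreSketch {
    /**
     * Hypothetical base cluster score.
     *
     * Assumption: phrases whose length differs from optimalPhraseLength are
     * penalized by a Gaussian with standard deviation optimalPhraseLengthDev.
     * As the field description says, a single-term phrase receives
     * singleTermBoost instead of the penalty when that boost is greater than zero.
     */
    static double score(int phraseLength, int documentCount,
                        double singleTermBoost, int optimalPhraseLength,
                        double optimalPhraseLengthDev, double documentCountBoost) {
        final double lengthFactor;
        if (phraseLength == 1 && singleTermBoost > 0) {
            // Single-term base cluster: fixed boost, penalty function skipped.
            lengthFactor = singleTermBoost;
        } else {
            // Assumed Gaussian penalty on the distance from the optimal length.
            double d = phraseLength - optimalPhraseLength;
            lengthFactor = Math.exp(-(d * d)
                / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
        }
        // Assumed: more documents in the base cluster scale the score linearly.
        return lengthFactor * documentCount * documentCountBoost;
    }

    public static void main(String[] args) {
        // Illustrative values only; these are not the library defaults.
        double atOptimum = score(3, 10, 0.5, 3, 2.0, 1.0);
        double singleTerm = score(1, 10, 0.5, 3, 2.0, 1.0);
        double longPhrase = score(7, 10, 0.5, 3, 2.0, 1.0);
        System.out.println("optimal length: " + atOptimum);
        System.out.println("single term:    " + singleTerm);
        System.out.println("long phrase:    " + longPhrase);
    }
}
```

Under these assumptions, a phrase at the optimal length keeps its full document-count contribution, while longer or shorter phrases are discounted more steeply as optimalPhraseLengthDev shrinks.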
Constructor Details
- STCClusteringAlgorithm
  
  public STCClusteringAlgorithm()
Method Details
- requiredLanguageComponents
  
  public Set<Class<?>> requiredLanguageComponents()
  
  Specified by:
  
  requiredLanguageComponents in interface ClusteringAlgorithm
- cluster
  
  public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
  
  Performs STC clustering of documents.
  
  Specified by:
  
  cluster in interface ClusteringAlgorithm
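
The scoreWeight field controls how final clusters are ordered, but only its two extremes are documented: 0.0 sorts by size alone and 1.0 sorts by score alone. The sketch below assumes a simple linear blend for values in between. The names `ScoreWeightSortSketch`, `ClusterInfo`, `sortKey`, and `sortedLabels` are hypothetical, the sample clusters are made up, and the sketch is independent of the Carrot2 API; the library may normalize or combine score and size differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ScoreWeightSortSketch {
    /** Hypothetical stand-in for a final cluster: label, score, document count. */
    static final class ClusterInfo {
        final String label;
        final double score;
        final int size;

        ClusterInfo(String label, double score, int size) {
            this.label = label;
            this.score = score;
            this.size = size;
        }
    }

    /** Assumed linear blend: 0.0 uses size only, 1.0 uses score only. */
    static double sortKey(ClusterInfo c, double scoreWeight) {
        return scoreWeight * c.score + (1.0 - scoreWeight) * c.size;
    }

    /** Returns cluster labels ordered from highest to lowest blended key. */
    static List<String> sortedLabels(List<ClusterInfo> clusters, double scoreWeight) {
        List<ClusterInfo> copy = new ArrayList<>(clusters);
        copy.sort(Comparator
            .comparingDouble((ClusterInfo c) -> sortKey(c, scoreWeight))
            .reversed());
        List<String> labels = new ArrayList<>();
        for (ClusterInfo c : copy) {
            labels.add(c.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<ClusterInfo> clusters = List.of(
            new ClusterInfo("data mining", 9.0, 4),
            new ClusterInfo("software", 2.0, 12));
        // Size only: the larger cluster comes first.
        System.out.println(sortedLabels(clusters, 0.0)); // [software, data mining]
        // Score only: the higher-scoring cluster comes first.
        System.out.println(sortedLabels(clusters, 1.0)); // [data mining, software]
    }
}
```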
