java.lang.Object
- org.carrot2.attrs.AttrComposite
- - org.carrot2.clustering.stc.STCClusteringAlgorithm

All Implemented Interfaces:

AcceptingVisitor, ClusteringAlgorithm
```
public final class STCClusteringAlgorithm
extends AttrComposite
implements ClusteringAlgorithm
```
Suffix Tree Clustering (STC) algorithm, largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation departs from the paper where the description of STC was not clear enough or where improvements seemed possible.

Field Summary

Fields
Modifier and Type	Field	Description
`EphemeralDictionaries`	`dictionaries`	Per-request overrides of language components (dictionaries).
`AttrDouble`	`documentCountBoost`	Base cluster document count boost.
`AttrDouble`	`ignoreWordIfInHigherDocsPercent`	Ignore words appearing in more than the provided fraction of documents.
`AttrInteger`	`maxBaseClusters`	Maximum number of base clusters.
`AttrInteger`	`maxClusters`	Maximum number of final clusters to keep.
`AttrDouble`	`maxPhraseOverlap`	Maximum cluster phrase overlap.
`AttrInteger`	`maxPhrasesPerLabel`	Maximum number of phrases from base clusters to promote to the cluster's label.
`AttrInteger`	`maxWordsPerLabel`	Maximum allowed number of words per label.
`AttrBoolean`	`mergeStemEquivalentBaseClusters`	Merge all stem-equivalent base clusters before running the merge phase.
`AttrDouble`	`mergeThreshold`	Base cluster merge threshold.
`AttrDouble`	`minBaseClusterScore`	Minimum base cluster score, before coverage merging.
`AttrInteger`	`minBaseClusterSize`	Minimum required number of documents in a base cluster.
`AttrDouble`	`mostGeneralPhraseCoverage`	Minimum coverage required for a phrase to appear in cluster description.
`static String`	`NAME`
`AttrInteger`	`optimalPhraseLength`	Optimal label length.
`AttrDouble`	`optimalPhraseLengthDev`	Optimal cluster label length's tolerance.
`BasicPreprocessingPipeline`	`preprocessing`	Configuration of the text preprocessing stage.
`AttrString`	`queryHint`	Query terms used to retrieve documents.
`AttrDouble`	`scoreWeight`	Balance between cluster score and size during cluster sorting.
`AttrDouble`	`singleTermBoost`	Base cluster score override for single-term clusters.

Fields inherited from class org.carrot2.attrs.AttrComposite
attributes

Constructor Summary

Constructors
Constructor	Description
`STCClusteringAlgorithm()`

Method Summary

Methods
Modifier and Type	Method	Description
`<T extends Document> List<Cluster<T>>`	`cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)`	Performs STC clustering of documents.
`Set<Class<?>>`	`requiredLanguageComponents()`

Methods inherited from class org.carrot2.attrs.AttrComposite
accept

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.carrot2.attrs.AcceptingVisitor
accept

Methods inherited from interface org.carrot2.clustering.ClusteringAlgorithm
supports

- Field Detail
  - NAME
```
public static final String NAME
```
    See Also:
    
    Constant Field Values
  - queryHint
```
public final AttrString queryHint
```
    Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
  - ignoreWordIfInHigherDocsPercent
```
public AttrDouble ignoreWordIfInHigherDocsPercent
```
    Ignore words appearing in more than the provided fraction of documents. The value is a number between 0 and 1; a word that occurs in a larger fraction of snippets than this value is ignored.
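    The threshold can be pictured with a small, self-contained sketch. The class `FrequentWordFilterSketch` and its method are hypothetical names invented for illustration; they are not part of Carrot2, and the tokenization is assumed to have happened already.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of Carrot2. */
final class FrequentWordFilterSketch {
  /**
   * Returns the words that occur in more than maxDocRatio of the documents.
   * Each document is a list of already tokenized words (an assumption).
   */
  static Set<String> ignoredWords(List<List<String>> documents, double maxDocRatio) {
    Map<String, Integer> docFrequency = new HashMap<>();
    for (List<String> document : documents) {
      // Count each word at most once per document.
      for (String word : new HashSet<>(document)) {
        docFrequency.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : docFrequency.entrySet()) {
      double ratio = (double) entry.getValue() / documents.size();
      if (ratio > maxDocRatio) {
        ignored.add(entry.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("data", "mining", "tools"),
        List.of("data", "clustering"),
        List.of("data", "mining"),
        List.of("data", "search"));
    // "data" is in every document, "mining" in exactly half of them.
    System.out.println(ignoredWords(docs, 0.5));
  }
}
```

    With a ratio of 0.5, a word present in exactly half of the documents is kept in this sketch, because the comparison is strict. Whether the library uses a strict comparison is not stated here.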
  - minBaseClusterScore
```
public AttrDouble minBaseClusterScore
```
    Minimum base cluster score, before coverage merging.
  - minBaseClusterSize
```
public AttrInteger minBaseClusterSize
```
    Minimum required number of documents in a base cluster.
  - maxBaseClusters
```
public AttrInteger maxBaseClusters
```
    Maximum number of base clusters. Trims the base cluster array after the N-th position for the merging phase.
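    A minimal sketch of the trimming step, assuming base clusters are ordered by descending score before the cut (the ordering is an assumption, as are the hypothetical names `BaseClusterTrimSketch` and `keepTop`). Base clusters are represented only by their scores to keep the example short.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; not part of Carrot2. */
final class BaseClusterTrimSketch {
  /**
   * Keeps at most maxBaseClusters scores, highest first.
   * Assumes maxBaseClusters is not negative.
   */
  static List<Double> keepTop(List<Double> baseClusterScores, int maxBaseClusters) {
    List<Double> sorted = new ArrayList<>(baseClusterScores);
    // Assumed ordering: best-scoring base clusters come first.
    sorted.sort(Comparator.reverseOrder());
    int limit = Math.min(maxBaseClusters, sorted.size());
    // Everything after the N-th position is dropped before merging.
    return new ArrayList<>(sorted.subList(0, limit));
  }

  public static void main(String[] args) {
    System.out.println(keepTop(List.of(1.0, 4.0, 2.5, 3.0), 2));
  }
}
```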
  - maxClusters
```
public AttrInteger maxClusters
```
    Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded.
  - mergeThreshold
```
public AttrDouble mergeThreshold
```
    Base cluster merge threshold.
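    In the Zamir and Etzioni formulation of STC, two base clusters are merged when the documents they share exceed a fixed fraction of each cluster. The sketch below assumes this attribute plays the role of that fraction; the class `MergeThresholdSketch` is a hypothetical name, and the way Carrot2 applies the threshold internally may differ.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; not part of Carrot2. */
final class MergeThresholdSketch {
  /**
   * Returns true when the shared documents make up more than
   * mergeThreshold of both base clusters (documents given by id).
   */
  static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
    if (a.isEmpty() || b.isEmpty()) {
      return false;
    }
    Set<Integer> common = new HashSet<>(a);
    common.retainAll(b);
    double overlapA = (double) common.size() / a.size();
    double overlapB = (double) common.size() / b.size();
    // Both directions must pass, so a tiny cluster is not swallowed
    // by a large one merely because it is a subset of it.
    return overlapA > mergeThreshold && overlapB > mergeThreshold;
  }

  public static void main(String[] args) {
    Set<Integer> first = Set.of(1, 2, 3);
    Set<Integer> second = Set.of(2, 3, 4);
    // Two of three documents are shared in each direction.
    System.out.println(shouldMerge(first, second, 0.5));
    System.out.println(shouldMerge(first, second, 0.7));
  }
}
```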
  - maxPhraseOverlap
```
public AttrDouble maxPhraseOverlap
```
    Maximum cluster phrase overlap.
  - mostGeneralPhraseCoverage
```
public AttrDouble mostGeneralPhraseCoverage
```
    Minimum coverage required for a phrase to appear in cluster description.
  - maxWordsPerLabel
```
public AttrInteger maxWordsPerLabel
```
    Maximum allowed number of words per label. Base clusters formed by phrases with more words than this limit will be trimmed.
  - maxPhrasesPerLabel
```
public AttrInteger maxPhrasesPerLabel
```
    Maximum number of phrases from base clusters to promote to the cluster's label.
  - singleTermBoost
```
public AttrDouble singleTermBoost
```
    Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
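    The override rule can be sketched as follows. The penalty function is passed in as a parameter because its exact form is not given here; `SingleTermBoostSketch` and `phraseScoreFactor` are hypothetical names, not Carrot2 API.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical illustration only; not part of Carrot2. */
final class SingleTermBoostSketch {
  /**
   * Returns singleTermBoost for one-word phrases when the boost is
   * positive; otherwise falls back to the supplied penalty function.
   */
  static double phraseScoreFactor(
      int wordCount, double singleTermBoost, IntToDoubleFunction penalty) {
    if (wordCount == 1 && singleTermBoost > 0) {
      // The override bypasses the penalty function entirely.
      return singleTermBoost;
    }
    return penalty.applyAsDouble(wordCount);
  }

  public static void main(String[] args) {
    // Placeholder penalty: any function of the phrase length would do.
    IntToDoubleFunction penalty = words -> 1.0 / words;
    System.out.println(phraseScoreFactor(1, 0.5, penalty));
    System.out.println(phraseScoreFactor(1, 0.0, penalty));
    System.out.println(phraseScoreFactor(2, 0.5, penalty));
  }
}
```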
  - optimalPhraseLength
```
public AttrInteger optimalPhraseLength
```
    Optimal label length. A factor in calculation of the base cluster score.
  - optimalPhraseLengthDev
```
public AttrDouble optimalPhraseLengthDev
```
    Optimal cluster label length's tolerance. A factor in calculation of the base cluster score.
  - documentCountBoost
```
public AttrDouble documentCountBoost
```
    Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
  - scoreWeight
```
public AttrDouble scoreWeight
```
    Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score.
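    One way to picture the two extremes is a linear blend of score and size. The blend itself is an assumption made for illustration (the formula Carrot2 uses is not stated here), and `ScoreWeightSketch` and `Candidate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; not part of Carrot2. */
final class ScoreWeightSketch {
  /** Minimal stand-in for a cluster: label, score and document count. */
  static final class Candidate {
    final String label;
    final double score;
    final int size;

    Candidate(String label, double score, int size) {
      this.label = label;
      this.score = score;
      this.size = size;
    }
  }

  /** Assumed linear blend: 0.0 uses size only, 1.0 uses score only. */
  static double sortKey(Candidate c, double scoreWeight) {
    return scoreWeight * c.score + (1.0 - scoreWeight) * c.size;
  }

  /** Returns cluster labels ordered by descending sort key. */
  static List<String> sortedLabels(List<Candidate> clusters, double scoreWeight) {
    List<Candidate> sorted = new ArrayList<>(clusters);
    sorted.sort(
        Comparator.comparingDouble((Candidate c) -> sortKey(c, scoreWeight)).reversed());
    List<String> labels = new ArrayList<>();
    for (Candidate c : sorted) {
      labels.add(c.label);
    }
    return labels;
  }

  public static void main(String[] args) {
    List<Candidate> clusters = List.of(
        new Candidate("high score, few documents", 9.0, 2),
        new Candidate("low score, many documents", 1.0, 10));
    System.out.println(sortedLabels(clusters, 0.0)); // size decides
    System.out.println(sortedLabels(clusters, 1.0)); // score decides
  }
}
```

    Intermediate values mix both criteria; in a real setting the score and the size would likely need to be on comparable scales for the blend to be meaningful.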
  - mergeStemEquivalentBaseClusters
```
public AttrBoolean mergeStemEquivalentBaseClusters
```
    Merge all stem-equivalent base clusters before running the merge phase.
    
    See Also:
    
    "http://issues.carrot2.org/browse/CARROT-1008"
  - preprocessing
```
public BasicPreprocessingPipeline preprocessing
```
    Configuration of the text preprocessing stage.
  - dictionaries
```
public EphemeralDictionaries dictionaries
```
    Per-request overrides of language components (dictionaries).
    
    Since:
    
    4.1.0
- Constructor Detail
  - STCClusteringAlgorithm
```
public STCClusteringAlgorithm()
```
- Method Detail
  - requiredLanguageComponents
```
public Set<Class<?>> requiredLanguageComponents()
```
    Specified by:
    
    requiredLanguageComponents in interface ClusteringAlgorithm
  - cluster
```
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream,
                                                     LanguageComponents languageComponents)
```
    Performs STC clustering of documents.
    
    Specified by:
    
    cluster in interface ClusteringAlgorithm
