Class STCClusteringAlgorithm

  • All Implemented Interfaces:
    AcceptingVisitor, ClusteringAlgorithm

    public final class STCClusteringAlgorithm
    extends AttrComposite
    implements ClusteringAlgorithm
    Suffix Tree Clustering (STC) algorithm, largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation departs from the paper wherever STC's description was not clear enough or where improvements seemed possible.
    • Field Detail

      • queryHint

        public final AttrString queryHint
        Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
      • ignoreWordIfInHigherDocsPercent

        public AttrDouble ignoreWordIfInHigherDocsPercent
        Maximum word-document ratio. A number between 0 and 1. If a word exists in more snippets than this ratio, it is ignored.
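
        The filtering rule above can be sketched as a plain document-frequency check. This is an illustration only, not this class's actual code: the class name DocumentFrequencyFilterSketch, the method wordsToIgnore, and the representation of snippets as lists of words are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the documented API.
final class DocumentFrequencyFilterSketch {

  /** Returns the words that occur in more than maxRatio of the snippets. */
  static Set<String> wordsToIgnore(List<List<String>> snippets, double maxRatio) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (List<String> snippet : snippets) {
      // Count each word at most once per snippet.
      for (String word : new HashSet<>(snippet)) {
        docFreq.merge(word, 1, Integer::sum);
      }
    }
    Set<String> ignored = new HashSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      double ratio = (double) e.getValue() / snippets.size();
      if (ratio > maxRatio) {
        ignored.add(e.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<List<String>> snippets = List.of(
        List.of("data", "mining", "tools"),
        List.of("data", "clustering"),
        List.of("data", "mining", "course"),
        List.of("web", "data"));
    // "data" occurs in 4 of 4 snippets, which exceeds the 0.7 ratio.
    System.out.println(wordsToIgnore(snippets, 0.7));
  }
}
```

        With a ratio of 0.7, only words present in more than 70% of the snippets are dropped; lowering the ratio filters out more words.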
      • minBaseClusterScore

        public AttrDouble minBaseClusterScore
        Minimum base cluster score (before coverage merging).
      • minBaseClusterSize

        public AttrInteger minBaseClusterSize
        Minimum documents per base cluster.
      • maxBaseClusters

        public AttrInteger maxBaseClusters
        Maximum number of base clusters. Only the first N base clusters are kept for the merging phase; the rest of the base cluster array is trimmed.
      • maxClusters

        public AttrInteger maxClusters
        Maximum number of final clusters.
      • mergeThreshold

        public AttrDouble mergeThreshold
        Base cluster merge threshold.
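
        For orientation, the merge criterion in the Zamir and Etzioni paper connects two base clusters when their shared documents exceed a threshold fraction of each cluster. The sketch below follows that definition; whether this class applies exactly the same comparison is an assumption, and the class name MergeCriterionSketch and the method shouldMerge are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the paper's overlap criterion.
final class MergeCriterionSketch {

  /**
   * Returns true when the documents shared by a and b make up more than
   * mergeThreshold of a and more than mergeThreshold of b.
   */
  static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
    if (a.isEmpty() || b.isEmpty()) {
      return false;
    }
    Set<Integer> common = new HashSet<>(a);
    common.retainAll(b);
    double overlap = common.size();
    return overlap / a.size() > mergeThreshold
        && overlap / b.size() > mergeThreshold;
  }

  public static void main(String[] args) {
    Set<Integer> a = Set.of(1, 2, 3, 4);
    Set<Integer> b = Set.of(2, 3, 4, 5);
    Set<Integer> c = Set.of(4, 5);
    // a and b share 3 of 4 documents each; a and c share only 1.
    System.out.println(shouldMerge(a, b, 0.6));
    System.out.println(shouldMerge(a, c, 0.6));
  }
}
```

        Under this reading, a higher threshold demands more overlap before base clusters are merged, which tends to produce more, smaller final clusters.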
      • maxPhraseOverlap

        public AttrDouble maxPhraseOverlap
        Maximum cluster phrase overlap.
      • mostGeneralPhraseCoverage

        public AttrDouble mostGeneralPhraseCoverage
        Minimum general phrase coverage. The minimum coverage a phrase needs in order to appear in the cluster description.
      • maxWordsPerLabel

        public AttrInteger maxWordsPerLabel
        Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed.
      • maxPhrasesPerLabel

        public AttrInteger maxPhrasesPerLabel
        Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.
      • singleTermBoost

        public AttrDouble singleTermBoost
        Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
      • optimalPhraseLength

        public AttrInteger optimalPhraseLength
        Optimal label length. A factor in calculation of the base cluster score.
      • optimalPhraseLengthDev

        public AttrDouble optimalPhraseLengthDev
        Phrase length tolerance. A factor in calculation of the base cluster score.
      • documentCountBoost

        public AttrDouble documentCountBoost
        Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
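
        The four scoring attributes above (singleTermBoost, optimalPhraseLength, optimalPhraseLengthDev, documentCountBoost) can be pictured as combining into one base cluster score. The sketch below assumes a Gaussian-shaped length penalty centered on the optimal phrase length and a linear document count boost; the exact formula is not stated in this documentation, so treat the shape of the function, the class name BaseClusterScoreSketch, and the method score as assumptions.

```java
// Hypothetical scoring helper; the penalty shape is an assumption.
final class BaseClusterScoreSketch {

  static double score(
      int phraseLength,
      int documentCount,
      double singleTermBoost,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double documentCountBoost) {
    double lengthFactor;
    if (phraseLength == 1 && singleTermBoost > 0) {
      // Single-term phrases bypass the penalty function.
      lengthFactor = singleTermBoost;
    } else {
      // Assumed penalty: 1.0 at the optimal length, falling off with distance.
      double d = phraseLength - optimalPhraseLength;
      lengthFactor =
          Math.exp(-(d * d) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    // Assumed boost: the score grows linearly with the document count.
    return lengthFactor * documentCount * documentCountBoost;
  }

  public static void main(String[] args) {
    // A phrase of optimal length (3 words) found in 10 documents.
    System.out.println(score(3, 10, 0.5, 3, 2.0, 1.0));
    // A single-term phrase with the boost enabled.
    System.out.println(score(1, 10, 0.5, 3, 2.0, 1.0));
  }
}
```

        In this sketch a larger optimalPhraseLengthDev flattens the penalty, so phrases far from the optimal length lose less of their score.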
      • scoreWeight

        public AttrDouble scoreWeight
        Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only; a value of 1.0 sorts clusters by score only.
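
        One way to picture this balance is a sort key that blends the two quantities. The sketch below uses a linear blend, which matches the documented behavior at 0.0 and 1.0; how intermediate values are actually combined is not stated here, so the blend itself, the class ScoreWeightSortSketch, and the Candidate type are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sorting helper; the linear blend is an assumption.
final class ScoreWeightSortSketch {

  static final class Candidate {
    final String label;
    final int size;
    final double score;

    Candidate(String label, int size, double score) {
      this.label = label;
      this.size = size;
      this.score = score;
    }
  }

  /** Returns a copy of the clusters sorted by descending blended key. */
  static List<Candidate> sort(List<Candidate> clusters, double scoreWeight) {
    List<Candidate> sorted = new ArrayList<>(clusters);
    sorted.sort(
        Comparator.comparingDouble(
                (Candidate c) -> scoreWeight * c.score + (1.0 - scoreWeight) * c.size)
            .reversed());
    return sorted;
  }

  public static void main(String[] args) {
    List<Candidate> clusters = List.of(
        new Candidate("large, low score", 10, 1.0),
        new Candidate("small, high score", 2, 50.0));
    // Size only: the larger cluster comes first.
    System.out.println(sort(clusters, 0.0).get(0).label);
    // Score only: the higher-scoring cluster comes first.
    System.out.println(sort(clusters, 1.0).get(0).label);
  }
}
```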
      • mergeStemEquivalentBaseClusters

        public AttrBoolean mergeStemEquivalentBaseClusters
        Merge all stem-equivalent base clusters before running the merge phase.
        See Also:
        "http://issues.carrot2.org/browse/CARROT-1008"
    • Constructor Detail

      • STCClusteringAlgorithm

        public STCClusteringAlgorithm()