Class STCClusteringAlgorithm

  • All Implemented Interfaces:
    AcceptingVisitor, ClusteringAlgorithm

    public final class STCClusteringAlgorithm
    extends AttrComposite
    implements ClusteringAlgorithm
    Suffix Tree Clustering (STC) algorithm, largely as described in: Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, 1999. The implementation takes some liberties where the description of STC was not clear enough or where improvements seemed possible.
    • Field Detail

      • queryHint

        public final AttrString queryHint
        Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
      • ignoreWordIfInHigherDocsPercent

        public AttrDouble ignoreWordIfInHigherDocsPercent
        Ignore words appearing in more than the provided fraction of documents. The value is a number between 0 and 1. If a word occurs in a larger fraction of snippets than this value, it is ignored.
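        The filtering rule above can be illustrated with a minimal, standalone sketch. It does not use the Carrot2 API: the class DocFrequencyFilterSketch and the helper wordsToIgnore are hypothetical names, and whitespace tokenization with lowercasing is an assumption made only to keep the example short.

```java
import java.util.*;

// Hypothetical illustration, not Carrot2 code: finds words that occur in
// more than the given fraction of snippets.
final class DocFrequencyFilterSketch {
  static Set<String> wordsToIgnore(List<String> snippets, double maxDocFraction) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (String snippet : snippets) {
      // Count each word at most once per snippet (document frequency).
      Set<String> unique =
          new HashSet<>(Arrays.asList(snippet.toLowerCase(Locale.ROOT).split("\\s+")));
      for (String word : unique) {
        if (!word.isEmpty()) {
          docFreq.merge(word, 1, Integer::sum);
        }
      }
    }
    Set<String> ignored = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      // Strictly "more than" the fraction, following the wording of the field description.
      if (e.getValue() > maxDocFraction * snippets.size()) {
        ignored.add(e.getKey());
      }
    }
    return ignored;
  }

  public static void main(String[] args) {
    List<String> snippets =
        List.of("java suffix tree", "java clustering", "java search results", "web search");
    // "java" occurs in 3 of 4 snippets (0.75 > 0.5); "search" occurs in exactly half.
    System.out.println(wordsToIgnore(snippets, 0.5)); // prints [java]
  }
}
```

        With a fraction of 0.5, a word present in exactly half of the snippets is kept, because the rule is phrased as "more than".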
      • minBaseClusterScore

        public AttrDouble minBaseClusterScore
        Minimum base cluster score, before coverage merging.
      • minBaseClusterSize

        public AttrInteger minBaseClusterSize
        Minimum required number of documents in a base cluster.
      • maxBaseClusters

        public AttrInteger maxBaseClusters
        Maximum number of base clusters. Trims the base cluster array after the N-th position for the merging phase.
      • maxClusters

        public AttrInteger maxClusters
        Maximum number of final clusters to keep. Clusters beyond the maximum will be discarded.
      • mergeThreshold

        public AttrDouble mergeThreshold
        Base cluster merge threshold.
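        For orientation: in the STC formulation of the paper cited above, two base clusters are connected when the documents they share make up more than a fixed fraction (0.5 in the paper) of each cluster. The sketch below assumes that mergeThreshold plays the role of that fraction. Whether this implementation compares strictly or non-strictly is not stated on this page, and the names MergeThresholdSketch, shouldMerge and docs are hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration, not Carrot2 code: mutual-overlap test between
// two base clusters represented as sets of document indices.
final class MergeThresholdSketch {
  static boolean shouldMerge(BitSet a, BitSet b, double mergeThreshold) {
    if (a.isEmpty() || b.isEmpty()) {
      return false;
    }
    BitSet common = (BitSet) a.clone();
    common.and(b); // documents present in both base clusters
    double shared = common.cardinality();
    // Assumption: strict comparison, applied in both directions.
    return shared / a.cardinality() > mergeThreshold
        && shared / b.cardinality() > mergeThreshold;
  }

  // Small helper to build a document set from indices.
  static BitSet docs(int... ids) {
    BitSet set = new BitSet();
    for (int id : ids) {
      set.set(id);
    }
    return set;
  }

  public static void main(String[] args) {
    // Three shared documents out of four on each side: 0.75 > 0.6.
    System.out.println(shouldMerge(docs(0, 1, 2, 3), docs(1, 2, 3, 4), 0.6)); // prints true
    // One shared document: 0.25 and 0.33 are both below 0.6.
    System.out.println(shouldMerge(docs(0, 1, 2, 3), docs(3, 4, 5), 0.6)); // prints false
  }
}
```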
      • maxPhraseOverlap

        public AttrDouble maxPhraseOverlap
        Maximum cluster phrase overlap.
      • mostGeneralPhraseCoverage

        public AttrDouble mostGeneralPhraseCoverage
        Minimum coverage required for a phrase to appear in cluster description.
      • maxWordsPerLabel

        public AttrInteger maxWordsPerLabel
        Maximum allowed number of words per label. Base clusters formed by phrases with more words than this limit will be trimmed.
      • maxPhrasesPerLabel

        public AttrInteger maxPhrasesPerLabel
        Maximum number of phrases from base clusters to promote to the cluster's label.
      • singleTermBoost

        public AttrDouble singleTermBoost
        Base cluster score override for single-term clusters. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
      • optimalPhraseLength

        public AttrInteger optimalPhraseLength
        Optimal label length. A factor in calculation of the base cluster score.
      • optimalPhraseLengthDev

        public AttrDouble optimalPhraseLengthDev
        Tolerance around the optimal cluster label length. A factor in calculation of the base cluster score.
      • documentCountBoost

        public AttrDouble documentCountBoost
        Base cluster document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
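        This page does not give the scoring formula, so the following is only a sketch of how singleTermBoost, optimalPhraseLength, optimalPhraseLengthDev and documentCountBoost could interact. It assumes a Gaussian-shaped length penalty centered on the optimal length, a linear document count boost, and that the single-term override replaces the length penalty. The class BaseClusterScoreSketch and its score method are hypothetical.

```java
// Hypothetical illustration, not Carrot2 code: one plausible shape for a
// base cluster score built from the four attributes described above.
final class BaseClusterScoreSketch {
  static double score(
      int phraseLength,
      int documentCount,
      double singleTermBoost,
      int optimalPhraseLength,
      double optimalPhraseLengthDev,
      double documentCountBoost) {
    final double lengthFactor;
    if (phraseLength == 1 && singleTermBoost > 0) {
      // Single-term override: bypasses the penalty function.
      lengthFactor = singleTermBoost;
    } else {
      // Assumed Gaussian penalty: 1.0 at the optimal length, decaying with distance.
      double d = phraseLength - optimalPhraseLength;
      lengthFactor = Math.exp(-(d * d) / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    }
    // Assumed linear boost by the number of documents in the base cluster.
    return lengthFactor * documentCount * documentCountBoost;
  }

  public static void main(String[] args) {
    // Phrase at the optimal length: no penalty.
    System.out.println(score(3, 10, 0.5, 3, 2.0, 1.0)); // prints 10.0
    // Single-term phrase with a positive override.
    System.out.println(score(1, 10, 0.5, 3, 2.0, 1.0)); // prints 5.0
    // Single-term phrase with the override disabled: the penalty applies.
    System.out.println(score(1, 10, 0.0, 3, 2.0, 1.0));
  }
}
```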
      • scoreWeight

        public AttrDouble scoreWeight
        Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only. A value of 1.0 sorts clusters by score only.
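        The two endpoints described above can be reproduced with a simple linear blend of score and size. The blend for intermediate values is an assumption, not something this page specifies, and the names ScoreWeightSortSketch, Candidate and sortedLabels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not Carrot2 code: orders clusters by a linear
// blend of score and size controlled by scoreWeight.
final class ScoreWeightSortSketch {
  static final class Candidate {
    final String label;
    final double score;
    final int size;

    Candidate(String label, double score, int size) {
      this.label = label;
      this.score = score;
      this.size = size;
    }
  }

  static List<String> sortedLabels(List<Candidate> clusters, double scoreWeight) {
    List<Candidate> copy = new ArrayList<>(clusters);
    // Assumed key: scoreWeight * score + (1 - scoreWeight) * size, highest first.
    copy.sort(
        Comparator.comparingDouble(
                (Candidate c) -> scoreWeight * c.score + (1 - scoreWeight) * c.size)
            .reversed());
    List<String> labels = new ArrayList<>();
    for (Candidate c : copy) {
      labels.add(c.label);
    }
    return labels;
  }

  public static void main(String[] args) {
    List<Candidate> clusters =
        List.of(new Candidate("A", 9.0, 2), new Candidate("B", 1.0, 8), new Candidate("C", 5.0, 5));
    System.out.println(sortedLabels(clusters, 0.0)); // size only: prints [B, C, A]
    System.out.println(sortedLabels(clusters, 1.0)); // score only: prints [A, C, B]
  }
}
```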
      • mergeStemEquivalentBaseClusters

        public AttrBoolean mergeStemEquivalentBaseClusters
        Merge all stem-equivalent base clusters before running the merge phase.
        See Also:
        "http://issues.carrot2.org/browse/CARROT-1008"
      • dictionaries

        public EphemeralDictionaries dictionaries
        Per-request overrides of language components (dictionaries).
        Since:
        4.1.0
    • Constructor Detail

      • STCClusteringAlgorithm

        public STCClusteringAlgorithm()