Class DocumentAssigner

java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.DocumentAssigner
All Implemented Interfaces:
AcceptingVisitor

public class DocumentAssigner
extends AttrComposite
Assigns documents to label candidates. For each label candidate from PreprocessingContext.AllLabels.featureIndex, a BitSet with the assigned documents is constructed. The assignment algorithm is simple: to be assigned to a label, a document must contain at least one occurrence of each non-stop word from the label.
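
The assignment rule described above can be illustrated with a minimal, self-contained sketch. It is not the Carrot2 implementation: the class AssignmentSketch, the method assign, and the representation of documents and labels as lists of strings are all assumptions made for illustration, whereas the real class works on the integer-coded data held in the PreprocessingContext.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this class is not part of Carrot2.
class AssignmentSketch {

  /**
   * Returns a BitSet in which bit i is set when document i contains
   * at least one occurrence of every non-stop word of the label.
   */
  static BitSet assign(
      List<List<String>> documents, List<String> labelWords, Set<String> stopWords) {
    BitSet assigned = new BitSet(documents.size());
    for (int i = 0; i < documents.size(); i++) {
      Set<String> docWords = new HashSet<>(documents.get(i));
      boolean containsAll = true;
      for (String word : labelWords) {
        // Stop words in the label are ignored; every other word must be present.
        if (!stopWords.contains(word) && !docWords.contains(word)) {
          containsAll = false;
          break;
        }
      }
      if (containsAll) {
        assigned.set(i);
      }
    }
    return assigned;
  }

  public static void main(String[] args) {
    List<List<String>> documents = Arrays.asList(
        Arrays.asList("data", "mining", "of", "text"),
        Arrays.asList("mining", "for", "gold"),
        Arrays.asList("data", "from", "mining"));
    List<String> label = Arrays.asList("mining", "of", "data");
    Set<String> stopWords = new HashSet<>(Arrays.asList("of", "for", "from"));

    // Documents 0 and 2 contain both "mining" and "data"; word order is ignored.
    System.out.println(assign(documents, label, stopWords)); // prints {0, 2}
  }
}
```

Note that in this sketch word order does not matter and a label consisting only of stop words would match every document; how Carrot2 treats such edge cases is not stated here.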

This class saves the following results to the PreprocessingContext:

This class requires that InputTokenizer, CaseNormalizer, StopListMarker, PhraseExtractor and LabelFilterProcessor be invoked first.

  • Field Details

    • exactPhraseAssignment

      public AttrBoolean exactPhraseAssignment
      Only exact phrase assignments. When set to true, clusters will contain only the documents that contain the cluster's label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, which increases the precision of assignment but also enlarges the "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which makes the "Other Topics" group smaller but also lowers the precision of cluster-document assignments.
    • minClusterSize

      public AttrInteger minClusterSize
      Determines the minimum number of documents in each cluster.
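
The effect of the two fields above can be sketched in plain Java. This is an illustration under stated assumptions, not Carrot2 code: FieldEffectSketch, assignExact, and meetsMinClusterSize are hypothetical names, documents and labels are modeled as lists of strings, and what the library actually does with a label that falls below the minimum size is not specified on this page.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this class is not part of Carrot2.
class FieldEffectSketch {

  /**
   * Exact phrase assignment: document i is assigned only when the label's
   * words occur in it contiguously and in their original order.
   */
  static BitSet assignExact(List<List<String>> documents, List<String> labelWords) {
    BitSet assigned = new BitSet(documents.size());
    for (int i = 0; i < documents.size(); i++) {
      // indexOfSubList returns -1 when the label is not a contiguous sublist.
      if (Collections.indexOfSubList(documents.get(i), labelWords) >= 0) {
        assigned.set(i);
      }
    }
    return assigned;
  }

  /** Minimum cluster size: a label qualifies only with enough assigned documents. */
  static boolean meetsMinClusterSize(BitSet assigned, int minClusterSize) {
    return assigned.cardinality() >= minClusterSize;
  }
}
```

For example, with the label "data mining", a document containing "mining data" is assigned under the non-exact rule but not by assignExact, so the same label may pass a minClusterSize of 2 with exact phrase assignment disabled and fail it when enabled.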
  • Constructor Details