Package org.carrot2.text.preprocessing
Class DocumentAssigner
java.lang.Object
  org.carrot2.attrs.AttrComposite
    org.carrot2.text.preprocessing.DocumentAssigner

All Implemented Interfaces: AcceptingVisitor
public class DocumentAssigner extends AttrComposite
Assigns documents to label candidates. For each label candidate from PreprocessingContext.AllLabels.featureIndex, a BitSet with the assigned documents is constructed. The assignment algorithm is rather simple: in order to be assigned to a label, a document must contain at least one occurrence of each non-stop word from the label.

This class saves its results to the PreprocessingContext.

This class requires that InputTokenizer, CaseNormalizer, StopListMarker, PhraseExtractor and LabelFilterProcessor be invoked first.
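The assignment rule described above can be illustrated with a minimal, self-contained sketch. It is not the Carrot2 implementation: the class `AssignmentSketch`, the method `assign`, and the whitespace tokenization are hypothetical simplifications, and documents are plain strings rather than entries in a PreprocessingContext.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Carrot2.
public class AssignmentSketch {

    /**
     * Returns a BitSet in which bit d is set when document d contains
     * at least one occurrence of every non-stop word of the label.
     * Label words and stop words are assumed to be lower case already.
     */
    public static BitSet assign(List<String> labelWords,
                                Set<String> stopWords,
                                List<String> documents) {
        BitSet assigned = new BitSet(documents.size());
        for (int d = 0; d < documents.size(); d++) {
            // Assumption: naive whitespace tokenization and lower-casing.
            Set<String> docWords = new HashSet<>(
                Arrays.asList(documents.get(d).toLowerCase().split("\\s+")));
            boolean containsAll = true;
            for (String word : labelWords) {
                if (stopWords.contains(word)) {
                    continue; // stop words in the label are ignored
                }
                if (!docWords.contains(word)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                assigned.set(d);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "data mining of the web",
            "mining gold",
            "web data is ready for mining");
        BitSet result = assign(List.of("mining", "of", "data"),
                               Set.of("of", "the"), docs);
        System.out.println(result); // prints {0, 2}
    }
}
```

In this sketch the second document is rejected because it lacks "data", while word order and the stop word "of" play no role in the decision.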
Field Summary
Fields
Modifier and Type    Field                    Description
AttrBoolean          exactPhraseAssignment    Only exact phrase assignments.
AttrInteger          minClusterSize           Minimum required number of documents in each cluster.

Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
Constructor Summary
Constructors
Constructor           Description
DocumentAssigner()
Field Detail
exactPhraseAssignment
public AttrBoolean exactPhraseAssignment
Only exact phrase assignments. When set to true, clusters will contain only the documents that contain the cluster's label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, increasing the precision of assignment, but also increasing the "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments.
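The difference between the two modes can be sketched as follows. This is a hypothetical illustration, not Carrot2 code: the class `ExactPhraseSketch` and both methods are invented for the example, documents and labels are assumed to be pre-tokenized word lists, and stop-word handling is left out for brevity.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; none of these names come from Carrot2.
public class ExactPhraseSketch {

    /** Relaxed mode: every label word occurs somewhere in the document. */
    public static boolean containsAllWords(List<String> doc, List<String> label) {
        return new HashSet<>(doc).containsAll(label);
    }

    /** Exact mode: the label occurs as a contiguous run, in the same word order. */
    public static boolean containsExactPhrase(List<String> doc, List<String> label) {
        return Collections.indexOfSubList(doc, label) >= 0;
    }

    public static void main(String[] args) {
        List<String> label = List.of("data", "mining");
        List<String> reordered = List.of("mining", "of", "data");
        List<String> verbatim = List.of("data", "mining", "tools");

        System.out.println(containsAllWords(reordered, label));    // prints true
        System.out.println(containsExactPhrase(reordered, label)); // prints false
        System.out.println(containsExactPhrase(verbatim, label));  // prints true
    }
}
```

The reordered document matches only in the relaxed mode, which is why exact matching yields fewer but more precise assignments.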
minClusterSize
public AttrInteger minClusterSize
Minimum required number of documents in each cluster. Clusters containing fewer documents will not be created.
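As a rough sketch of the effect of this threshold, the following example drops label candidates whose document BitSet has too few bits set. The class `MinClusterSizeSketch`, its methods, and the use of a map keyed by label text are assumptions made for illustration and do not reflect how Carrot2 stores labels.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Carrot2.
public class MinClusterSizeSketch {

    /** Keeps only the candidates with at least minClusterSize assigned documents. */
    public static Map<String, BitSet> filter(Map<String, BitSet> candidates,
                                             int minClusterSize) {
        Map<String, BitSet> kept = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : candidates.entrySet()) {
            // cardinality() is the number of documents assigned to the label
            if (e.getValue().cardinality() >= minClusterSize) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Helper that builds a BitSet from document indices. */
    public static BitSet bits(int... indices) {
        BitSet b = new BitSet();
        for (int i : indices) {
            b.set(i);
        }
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> candidates = new LinkedHashMap<>();
        candidates.put("data mining", bits(0, 2, 5));
        candidates.put("gold", bits(1));

        System.out.println(filter(candidates, 2).keySet()); // prints [data mining]
    }
}
```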