Class BasicPreprocessingPipeline

  • All Implemented Interfaces:
    AcceptingVisitor, ContextPreprocessor

    public class BasicPreprocessingPipeline
    extends AttrComposite
    implements ContextPreprocessor
    Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps:
    1. InputTokenizer
    2. CaseNormalizer
    3. LanguageModelStemmer
    4. StopListMarker
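
The four steps above can be illustrated with a minimal, self-contained sketch. It does not use the library's classes: `PipelineSketch`, `Token`, the suffix-stripping `stem` method, and the caller-supplied stop list are all hypothetical stand-ins, and the actual components may behave differently (for example in how they normalize case or choose a stemmer). It only shows the order in which the stages are chained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the library's implementation.
final class PipelineSketch {
    /** One token with the result of each preprocessing stage. */
    static final class Token {
        final String image;       // text as it appeared in the input
        final String normalized;  // case-normalized form
        final String stem;        // stemmed form
        final boolean stopWord;   // true if the normalized form is on the stop list

        Token(String image, String normalized, String stem, boolean stopWord) {
            this.image = image;
            this.normalized = normalized;
            this.stem = stem;
            this.stopWord = stopWord;
        }
    }

    // Step 1: split the input on runs of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Step 2: toy case normalization (assumption: plain lowercasing).
    static String normalizeCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Step 3: toy stemmer (assumption: strip one trailing "s" from words longer than 3 characters).
    static String stem(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Step 4: mark tokens whose normalized form is on the stop list.
    static boolean isStopWord(String word, Set<String> stopList) {
        return stopList.contains(word);
    }

    /** Runs the four stages in order for every token in the text. */
    static List<Token> preprocess(String text, Set<String> stopList) {
        List<Token> result = new ArrayList<>();
        for (String image : tokenize(text)) {
            String normalized = normalizeCase(image);
            result.add(new Token(image, normalized, stem(normalized),
                    isStopWord(normalized, stopList)));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopList = new HashSet<>(Arrays.asList("the"));
        for (Token t : preprocess("The Cats sat", stopList)) {
            System.out.println(t.image + " -> " + t.stem + " (stop word: " + t.stopWord + ")");
        }
    }
}
```

Stop words are marked rather than removed in this sketch, so that later stages can still see every token position.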
    • Field Detail

      • wordDfThreshold

        public final AttrInteger wordDfThreshold
        Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
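
As a rough illustration of what a document frequency threshold means, the sketch below counts the number of documents each word occurs in and keeps only the words that reach the threshold. `DfThresholdSketch` and its methods are hypothetical names written for this example and are not part of the library. The sketch assumes the documents are already tokenized and normalized.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not the library's implementation.
final class DfThresholdSketch {
    /** Counts, for each word, the number of documents that contain it at least once. */
    static Map<String, Integer> documentFrequency(List<List<String>> documents) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> document : documents) {
            // De-duplicate within a document so repeated words count once per document.
            for (String word : new HashSet<>(document)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns the words that appear in at least {@code threshold} documents. */
    static Set<String> retainedWords(List<List<String>> documents, int threshold) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency(documents).entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With a threshold of 1 every word is retained. Raising the threshold drops words that occur in only a few documents.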
      • caseNormalizer

        protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
        Case normalizer used by the algorithm.
      • stemming

        protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
        Stemmer used by the algorithm.
      • stopListMarker

        protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
        Stop list marker used by the algorithm. Contains bindable attributes.
      • tokenizer

        protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
        Tokenizer used by the algorithm.
    • Constructor Detail

      • BasicPreprocessingPipeline

        public BasicPreprocessingPipeline()