Class BasicPreprocessingPipeline

java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.BasicPreprocessingPipeline
All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor

public class BasicPreprocessingPipeline
extends AttrComposite
implements ContextPreprocessor
Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps:
  1. InputTokenizer
  2. CaseNormalizer
  3. LanguageModelStemmer
  4. StopListMarker
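The four steps above can be pictured with a minimal, self-contained sketch. It does not use Carrot2 classes: `PreprocessingSketch`, `Token`, the toy stop list, and the naive plural-stripping stemmer are all hypothetical stand-ins, chosen only to show the order of operations (tokenize, normalize case, stem, mark stop words). The actual components may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names are Carrot2 APIs.
class PreprocessingSketch {
    // Toy stop list (assumption); a real pipeline would use language-specific resources.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    // One token with the results of each preprocessing step.
    static final class Token {
        final String image;       // original text of the token
        final String normalized;  // case-normalized form
        final String stem;        // stemmed form
        final boolean stopWord;   // true if the normalized form is on the stop list

        Token(String image, String normalized, String stem, boolean stopWord) {
            this.image = image;
            this.normalized = normalized;
            this.stem = stem;
            this.stopWord = stopWord;
        }

        @Override
        public String toString() {
            return image + " -> " + stem + (stopWord ? " [stop]" : "");
        }
    }

    // Step 1: split the input on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: normalize case by lowercasing.
    static String normalizeCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Step 3: naive stemming that strips a trailing plural "s" from longer words.
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Runs steps 1-4 in order and returns one Token per input token.
    static List<Token> preprocess(String text) {
        List<Token> result = new ArrayList<>();
        for (String image : tokenize(text)) {
            String normalized = normalizeCase(image);
            String stemmed = stem(normalized);
            // Step 4: mark stop words based on the normalized form.
            boolean stop = STOP_WORDS.contains(normalized);
            result.add(new Token(image, normalized, stemmed, stop));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Token t : preprocess("The Clusters of documents")) {
            System.out.println(t);
        }
    }
}
```

In this sketch each step consumes the output of the previous one, which is why stop-word marking looks at the case-normalized form and not at the raw token.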
  • Field Details

    • wordDfThreshold

      public final AttrInteger wordDfThreshold
      Word document frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
    • caseNormalizer

      protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
      Case normalizer used by the algorithm.
    • stemming

      protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
      Stemmer used by the algorithm.
    • stopListMarker

      protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
      Stop list marker used by the algorithm. Contains bindable attributes.
    • tokenizer

      protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
      Tokenizer used by the algorithm.
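To make the effect of the word document frequency threshold described for wordDfThreshold concrete, here is a small sketch in plain Java. `DfThresholdSketch` and its methods are hypothetical and do not reflect how Carrot2 implements the attribute. The sketch assumes that a word's document frequency is the number of documents containing it at least once, and that words with a frequency below the threshold are dropped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Carrot2 API.
class DfThresholdSketch {
    // Counts, for each word, the number of documents it appears in at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within a document so repeats count only once.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keeps words whose document frequency is at least dfThreshold.
    static Set<String> retainedWords(List<List<String>> docs, int dfThreshold) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequency(docs).entrySet()) {
            if (e.getValue() >= dfThreshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("cluster", "search", "search"),
            List.of("cluster", "engine"),
            List.of("search", "cluster"));
        // "engine" occurs in only one document, so a threshold of 2 drops it.
        System.out.println(retainedWords(docs, 2)); // prints [cluster, search]
    }
}
```

With a threshold of 1 every word would be retained, since each word appears in at least one document.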
  • Constructor Details

  • Method Details