Package org.carrot2.text.preprocessing
Class BasicPreprocessingPipeline
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.BasicPreprocessingPipeline
- All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
public class BasicPreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs basic preprocessing steps on the provided documents. The preprocessing consists of the
following steps:
- InputTokenizer
- CaseNormalizer
- LanguageModelStemmer
- StopListMarker
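To make the order of the steps concrete, here is a minimal sketch that uses only the Java standard library. It does not call the Carrot2 classes named above and is not their implementation: `PipelineStepsSketch`, `Token`, `toyStem` and the hard-coded stop word set are hypothetical names introduced for illustration, and the real components are driven by the supplied language resources rather than by fixed rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative stand-in for the four steps; NOT the Carrot2 implementation. */
public class PipelineStepsSketch {
    /** One token with the result of each step attached. */
    public static final class Token {
        public final String image;       // text as it appeared in the input
        public final String normalized;  // lower-cased form
        public final String stem;        // result of the toy stemmer
        public final boolean stopWord;   // true if found in the stop word set

        Token(String image, String normalized, String stem, boolean stopWord) {
            this.image = image;
            this.normalized = normalized;
            this.stem = stem;
            this.stopWord = stopWord;
        }

        @Override
        public String toString() {
            return image + " -> " + normalized + " / " + stem + (stopWord ? " [stop]" : "");
        }
    }

    // Assumption: a tiny fixed stop list, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("the", "of", "and", "a");

    /** Toy stemmer (assumption): strips a trailing "s" from words longer than three characters. */
    public static String toyStem(String word) {
        return word.length() > 3 && word.endsWith("s")
            ? word.substring(0, word.length() - 1)
            : word;
    }

    public static List<Token> process(String text) {
        List<Token> tokens = new ArrayList<>();
        // Step 1: tokenize on runs of characters that are neither letters nor digits.
        for (String image : text.split("[^\\p{L}\\p{N}]+")) {
            if (image.isEmpty()) {
                continue; // split() yields a leading empty string if the text starts with a delimiter
            }
            String normalized = image.toLowerCase(Locale.ROOT);  // Step 2: case normalization
            String stem = toyStem(normalized);                   // Step 3: stemming
            boolean stopWord = STOP_WORDS.contains(normalized);  // Step 4: stop word marking
            tokens.add(new Token(image, normalized, stem, stopWord));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : process("The Clustering of Documents")) {
            System.out.println(token);
        }
    }
}
```

The sketch runs all four steps per token for brevity; a staged pipeline would typically run each step over the whole input before starting the next.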
-
Field Summary
Fields
Modifier and Type | Field | Description
protected org.carrot2.text.preprocessing.CaseNormalizer | caseNormalizer | Case normalizer used by the algorithm.
protected org.carrot2.text.preprocessing.LanguageModelStemmer | stemming | Stemmer used by the algorithm.
protected org.carrot2.text.preprocessing.StopListMarker | stopListMarker | Stop list marker used by the algorithm, contains bindable attributes.
protected org.carrot2.text.preprocessing.InputTokenizer | tokenizer | Tokenizer used by the algorithm.
AttrInteger | wordDfThreshold | Word Document Frequency threshold.
-
Constructor Summary
Constructors
Constructor | Description
BasicPreprocessingPipeline() |
-
Method Summary
Modifier and Type | Method | Description
PreprocessingContext | preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel) | Performs preprocessing on the provided list of documents.
-
Field Details
-
wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
-
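The effect of a document frequency threshold can be illustrated with a short standard-library sketch. This is an assumption-laden illustration, not the Carrot2 code: `DfThresholdSketch` and `wordsAtOrAboveThreshold` are hypothetical names, and documents are modeled as plain lists of already-tokenized words. A word's document frequency is the number of documents it occurs in, so repeated occurrences within one document count once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative document-frequency filter; NOT the Carrot2 implementation. */
public class DfThresholdSketch {
    /** Returns the words that occur in at least dfThreshold distinct documents. */
    public static Set<String> wordsAtOrAboveThreshold(List<List<String>> docs, int dfThreshold) {
        Map<String, Integer> documentFrequency = new TreeMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within the document so each word is counted once per document.
            for (String word : new HashSet<>(doc)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            // Words appearing in fewer than dfThreshold documents are ignored.
            if (entry.getValue() >= dfThreshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("data", "mining", "data"),
            List.of("data", "clustering"),
            List.of("text", "clustering"));
        // "data" and "clustering" each occur in two documents; "mining" and "text" in one.
        System.out.println(wordsAtOrAboveThreshold(docs, 2)); // prints [clustering, data]
    }
}
```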
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
-
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
-
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.
-
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
-
-
Constructor Details
-
BasicPreprocessingPipeline
public BasicPreprocessingPipeline()
-
-
Method Details
-
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Performs preprocessing on the provided list of documents. Results can be obtained from the returned PreprocessingContext.
- Specified by:
preprocess in interface ContextPreprocessor
-