Package org.carrot2.text.preprocessing
Class BasicPreprocessingPipeline
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.BasicPreprocessingPipeline
All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
public class BasicPreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs basic preprocessing steps on the provided documents. The preprocessing consists of the
following steps:
InputTokenizer
CaseNormalizer
LanguageModelStemmer
StopListMarker
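The four steps above can be pictured with a small, self-contained sketch. This is not the Carrot2 implementation and uses none of its classes: `ToyPipeline`, `Token`, the stop list, and the plural-stripping "stemmer" are all hypothetical stand-ins, and lowercasing is assumed here as the simplest possible form of case normalization. It only illustrates the order in which tokenization, case normalization, stemming, and stop-word marking are applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy illustration of a four-step preprocessing chain (not Carrot2 code). */
public class ToyPipeline {
  /** One processed token: original image, normalized form, stem, stop-word flag. */
  public static final class Token {
    public final String image;
    public final String normalized;
    public final String stem;
    public final boolean stopWord;

    Token(String image, String normalized, String stem, boolean stopWord) {
      this.image = image;
      this.normalized = normalized;
      this.stem = stem;
      this.stopWord = stopWord;
    }
  }

  // Hypothetical stop list; a real one would come from language resources.
  static final Set<String> STOP_WORDS = Set.of("the", "of", "and", "a");

  // Step 1: split the input on anything that is not a letter or a digit.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
      if (!t.isEmpty()) {
        tokens.add(t);
      }
    }
    return tokens;
  }

  // Step 2: assumed normalization, plain lowercasing.
  static String normalizeCase(String token) {
    return token.toLowerCase(Locale.ROOT);
  }

  // Step 3: naive "stemmer" that strips a trailing 's' from longer words.
  static String stem(String word) {
    return word.length() > 3 && word.endsWith("s")
        ? word.substring(0, word.length() - 1)
        : word;
  }

  // Step 4: mark words found on the stop list.
  static boolean isStopWord(String word) {
    return STOP_WORDS.contains(word);
  }

  /** Runs the four steps in order and returns one Token per input word. */
  public static List<Token> process(String text) {
    List<Token> out = new ArrayList<>();
    for (String image : tokenize(text)) {
      String normalized = normalizeCase(image);
      out.add(new Token(image, normalized, stem(normalized), isStopWord(normalized)));
    }
    return out;
  }
}
```

Calling `ToyPipeline.process("The Clusters of Data")` yields four tokens; "Clusters" is normalized to "clusters" and stemmed to "cluster", while "The" and "of" are flagged as stop words but kept in the output, since marking is not the same as removal.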
Field Summary
Fields
Modifier and Type | Field | Description
protected org.carrot2.text.preprocessing.CaseNormalizer | caseNormalizer | Case normalizer used by the algorithm.
protected org.carrot2.text.preprocessing.LanguageModelStemmer | stemming | Stemmer used by the algorithm.
protected org.carrot2.text.preprocessing.StopListMarker | stopListMarker | Stop list marker used by the algorithm, contains bindable attributes.
protected org.carrot2.text.preprocessing.InputTokenizer | tokenizer | Tokenizer used by the algorithm.
AttrInteger | wordDfThreshold | Word Document Frequency threshold.
Constructor Summary
Constructors
BasicPreprocessingPipeline()
Method Summary
Modifier and Type | Method | Description
PreprocessingContext | preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel) | Performs preprocessing on the provided stream of documents.
Field Details
wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
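A document frequency threshold can be sketched in a few lines. The example below is a standalone illustration, not Carrot2 code: `DfThresholdSketch` and its methods are hypothetical names, and documents are assumed to be already tokenized into lists of words. A word's document frequency is the number of documents it occurs in at least once, so repeated occurrences inside one document count only once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy document-frequency filter (hypothetical helper, not part of Carrot2). */
public class DfThresholdSketch {
  /** Counts, for each word, the number of documents that contain it. */
  static Map<String, Integer> documentFrequencies(List<List<String>> documents) {
    Map<String, Integer> df = new TreeMap<>();
    for (List<String> doc : documents) {
      // De-duplicate within the document so each word counts once per document.
      for (String word : new HashSet<>(doc)) {
        df.merge(word, 1, Integer::sum);
      }
    }
    return df;
  }

  /** Keeps only words that appear in at least dfThreshold documents. */
  static Set<String> retainedWords(List<List<String>> documents, int dfThreshold) {
    Set<String> kept = new TreeSet<>();
    for (Map.Entry<String, Integer> e : documentFrequencies(documents).entrySet()) {
      if (e.getValue() >= dfThreshold) {
        kept.add(e.getKey());
      }
    }
    return kept;
  }
}
```

With the three documents `[data, mining, data]`, `[data, clustering]`, and `[clustering, search]`, a threshold of 2 keeps only "clustering" and "data", while a threshold of 1 keeps every word.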
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
Constructor Details
BasicPreprocessingPipeline
public BasicPreprocessingPipeline()
Method Details
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Performs preprocessing on the provided stream of documents. Results can be obtained from the returned PreprocessingContext.
Specified by:
preprocess in interface ContextPreprocessor