Package org.carrot2.text.preprocessing
Class BasicPreprocessingPipeline
java.lang.Object
  org.carrot2.attrs.AttrComposite
    org.carrot2.text.preprocessing.BasicPreprocessingPipeline

All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
public class BasicPreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps:
- InputTokenizer
- CaseNormalizer
- LanguageModelStemmer
- StopListMarker
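The order of the steps listed above can be illustrated with a small, self-contained sketch. This is not the Carrot2 implementation: the class `PipelineOrderSketch`, its `Token` type, the hard-coded stop word set, and the trailing-"s" stemming rule are all hypothetical stand-ins, and plain lower-casing is assumed here in place of real case normalization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PipelineOrderSketch {
    /** Hypothetical per-token result; not a Carrot2 type. */
    static final class Token {
        final String image;      // token as it appeared in the input
        final String normalized; // case-normalized form
        final String stem;       // stemmed form
        final boolean stopWord;  // true if the token is on the stop list

        Token(String image, String normalized, String stem, boolean stopWord) {
            this.image = image;
            this.normalized = normalized;
            this.stem = stem;
            this.stopWord = stopWord;
        }
    }

    // Assumed toy stop list; a real pipeline takes this from language resources.
    static final Set<String> STOP_WORDS = Set.of("the", "of", "and", "a");

    static List<Token> preprocess(String text) {
        List<Token> tokens = new ArrayList<>();
        // Step 1: tokenize on runs of characters that are neither letters nor digits.
        for (String image : text.split("[^\\p{L}\\p{N}]+")) {
            if (image.isEmpty()) {
                continue; // split() yields an empty first element if text starts with a delimiter
            }
            // Step 2: case normalization (assumed: simple lower-casing).
            String normalized = image.toLowerCase(Locale.ROOT);
            // Step 3: stemming (toy rule: strip one trailing "s" from longer words).
            String stem = normalized.length() > 3 && normalized.endsWith("s")
                    ? normalized.substring(0, normalized.length() - 1)
                    : normalized;
            // Step 4: stop list marking.
            boolean stopWord = STOP_WORDS.contains(normalized);
            tokens.add(new Token(image, normalized, stem, stopWord));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : preprocess("The Clusters of Data")) {
            System.out.println(t.image + " -> " + t.stem + (t.stopWord ? " (stop word)" : ""));
        }
    }
}
```

Each step only consumes what the previous one produced, which is why the order matters: stop words are matched against the normalized form, not the raw token.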
Field Summary
Fields
Modifier and Type | Field | Description
protected org.carrot2.text.preprocessing.CaseNormalizer | caseNormalizer | Case normalizer used by the algorithm.
protected org.carrot2.text.preprocessing.LanguageModelStemmer | stemming | Stemmer used by the algorithm.
protected org.carrot2.text.preprocessing.StopListMarker | stopListMarker | Stop list marker used by the algorithm, contains modifiable parameters.
protected org.carrot2.text.preprocessing.InputTokenizer | tokenizer | Tokenizer used by the algorithm.
AttrInteger | wordDfThreshold | Word Document Frequency threshold.
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
-
-
Constructor Summary
Constructors
Constructor | Description
BasicPreprocessingPipeline() |
-
Method Summary
Modifier and Type | Method | Description
PreprocessingContext | preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel) | Performs preprocessing on the provided list of documents.
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
Field Detail
-
wordDfThreshold
public final AttrInteger wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
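As a rough illustration of what a document frequency threshold means, the sketch below counts the number of distinct documents each word occurs in and keeps only words that reach the threshold. It is a plain-Java approximation under stated assumptions, not the library's code: `DfThresholdSketch` and `wordsMeetingThreshold` are hypothetical names, and documents are assumed to be already tokenized into lists of words.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class DfThresholdSketch {
    /**
     * Returns the words that occur in at least dfThreshold distinct documents.
     * Words occurring in fewer documents are dropped.
     */
    static Set<String> wordsMeetingThreshold(List<List<String>> docs, int dfThreshold) {
        Map<String, Integer> documentFrequency = new TreeMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within a document so repeated words count once per document.
            for (String word : new HashSet<>(doc)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            if (entry.getValue() >= dfThreshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("data", "mining", "data"),
                List.of("data", "clustering"),
                List.of("web", "clustering"));
        // "data" and "clustering" each occur in two documents; "mining" and "web" in one.
        System.out.println(wordsMeetingThreshold(docs, 2)); // prints [clustering, data]
    }
}
```

Note that "data" appears twice in the first document but still contributes only one to its document frequency.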
-
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
-
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
-
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains modifiable parameters.
-
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
-
-
Method Detail
-
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Performs preprocessing on the provided list of documents. Results can be obtained from the returned PreprocessingContext.
Specified by:
preprocess in interface ContextPreprocessor