Package org.carrot2.text.preprocessing
Class BasicPreprocessingPipeline
java.lang.Object
  org.carrot2.attrs.AttrComposite
    org.carrot2.text.preprocessing.BasicPreprocessingPipeline
All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
public class BasicPreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps:
- InputTokenizer
- CaseNormalizer
- LanguageModelStemmer
- StopListMarker
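The four steps listed above can be pictured with a small, self-contained sketch. This is not Carrot2 code: `ToyPipeline`, its `Token` class, the hard-coded stop list, and the naive plural stripper are all hypothetical stand-ins chosen for illustration, and the real components may behave differently (for example, case normalization here is plain lowercasing).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy tokenize -> normalize case -> stem -> mark stop words pipeline (illustration only). */
final class ToyPipeline {
  /** One processed token: normalized word, crude stem, and stop-word flag. */
  static final class Token {
    final String word;
    final String stem;
    final boolean stopWord;

    Token(String word, String stem, boolean stopWord) {
      this.word = word;
      this.stem = stem;
      this.stopWord = stopWord;
    }
  }

  // Hypothetical stop list; real stop lists are language-specific resources.
  private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("the", "a", "of", "and"));

  static List<Token> process(String text) {
    List<Token> tokens = new ArrayList<>();
    // Step 1: tokenize on runs of characters that are neither letters nor digits.
    for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
      if (raw.isEmpty()) {
        continue;
      }
      // Step 2: case normalization (assumed here to be simple lowercasing).
      String word = raw.toLowerCase(Locale.ROOT);
      // Step 3: stemming; a naive trailing-"s" stripper stands in for a real stemmer.
      String stem = (word.length() > 3 && word.endsWith("s"))
          ? word.substring(0, word.length() - 1)
          : word;
      // Step 4: stop-word marking.
      tokens.add(new Token(word, stem, STOP_WORDS.contains(word)));
    }
    return tokens;
  }
}
```

Calling `ToyPipeline.process("The Clusters of Documents")` yields four tokens; "the" and "of" are flagged as stop words, and "clusters" carries the stem "cluster".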
Field Summary
Fields (modifier and type, field, description):
- protected org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer: Case normalizer used by the algorithm.
- protected org.carrot2.text.preprocessing.LanguageModelStemmer stemming: Stemmer used by the algorithm.
- protected org.carrot2.text.preprocessing.StopListMarker stopListMarker: Stop list marker used by the algorithm, contains bindable attributes.
- protected org.carrot2.text.preprocessing.InputTokenizer tokenizer: Tokenizer used by the algorithm.
- AttrInteger wordDfThreshold: Word Document Frequency threshold.
Fields inherited from class org.carrot2.attrs.AttrComposite: attributes
Constructor Summary
Constructors:
- BasicPreprocessingPipeline()
Method Summary
Methods (modifier and type, method, description):
- PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel): Performs preprocessing on the provided list of documents.
Methods inherited from class org.carrot2.attrs.AttrComposite: accept
Field Detail
wordDfThreshold
public final AttrInteger wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
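As a rough illustration of what a document frequency threshold means, the sketch below counts, for each word, the number of distinct documents that contain it and keeps only words that meet the threshold. The class `DfThresholdSketch` and its method are hypothetical and not part of Carrot2; the sketch assumes documents are already tokenized into word lists and says nothing about how the library applies the attribute internally.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy document-frequency filter (illustration only, not Carrot2 code). */
final class DfThresholdSketch {
  /** Returns the words that occur in at least dfThreshold distinct documents. */
  static Set<String> frequentWords(List<List<String>> documents, int dfThreshold) {
    Map<String, Integer> documentFrequency = new HashMap<>();
    for (List<String> document : documents) {
      // De-duplicate so that each word counts at most once per document.
      for (String word : new HashSet<>(document)) {
        documentFrequency.merge(word, 1, Integer::sum);
      }
    }
    // Keep words at or above the threshold; words below it are ignored.
    Set<String> kept = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      if (entry.getValue() >= dfThreshold) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }
}
```

With the three documents `[data, mining, data]`, `[data, cluster]`, and `[cluster, web]` and a threshold of 2, only "cluster" and "data" are kept: "mining" and "web" each appear in a single document, and the repeated "data" in the first document is counted once.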
-
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
-
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
-
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.
-
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
Method Detail
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Performs preprocessing on the provided list of documents. Results can be obtained from the returned PreprocessingContext.
Specified by:
preprocess in interface ContextPreprocessor