Package org.carrot2.text.preprocessing
Class CompletePreprocessingPipeline
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.CompletePreprocessingPipeline
All Implemented Interfaces: AcceptingVisitor, ContextPreprocessor
public class CompletePreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs complete preprocessing of the provided documents. The preprocessing consists of the
following steps:
InputTokenizer
CaseNormalizer
LanguageModelStemmer
StopListMarker
PhraseExtractor
LabelFilterProcessor
DocumentAssigner
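The idea of running a fixed sequence of stages over shared state can be sketched as follows. This is a toy illustration only, not Carrot2 code: the `PipelineSketch` class, its `Context` holder, the three simplified stages, and the tiny stop list are all hypothetical, and the real components listed above do considerably more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical toy pipeline; it only mimics the "stages applied in order" idea.
final class PipelineSketch {
  /** Shared state that each stage reads and extends (far simpler than PreprocessingContext). */
  static final class Context {
    final String text;
    final List<String> tokens = new ArrayList<>();
    final List<Boolean> stopWord = new ArrayList<>();

    Context(String text) {
      this.text = text;
    }
  }

  /** Stage 1 (assumed behavior): split the input on non-word characters. */
  static void tokenize(Context ctx) {
    for (String token : ctx.text.split("\\W+")) {
      if (!token.isEmpty()) {
        ctx.tokens.add(token);
      }
    }
  }

  /** Stage 2 (assumed behavior): lower-case every token in place. */
  static void normalizeCase(Context ctx) {
    ctx.tokens.replaceAll(token -> token.toLowerCase(Locale.ROOT));
  }

  /** Stage 3 (assumed behavior): flag tokens found in a small, made-up stop list. */
  static void markStopWords(Context ctx) {
    Set<String> stopList = Set.of("the", "of", "a");
    for (String token : ctx.tokens) {
      ctx.stopWord.add(stopList.contains(token));
    }
  }

  /** Runs the stages in a fixed order; later stages rely on the output of earlier ones. */
  static Context run(String text) {
    Context ctx = new Context(text);
    List<Consumer<Context>> stages = List.of(
        PipelineSketch::tokenize,
        PipelineSketch::normalizeCase,
        PipelineSketch::markStopWords);
    for (Consumer<Context> stage : stages) {
      stage.accept(ctx);
    }
    return ctx;
  }

  public static void main(String[] args) {
    Context ctx = run("The Mining of Data");
    // prints "[the, mining, of, data] [true, false, true, false]"
    System.out.println(ctx.tokens + " " + ctx.stopWord);
  }
}
```

The order matters in this sketch: stop-word marking compares lower-cased tokens, so it has to run after case normalization.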
Field Summary
Fields (modifier and type, field, description):
protected org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer: Case normalizer used by the algorithm.
DocumentAssigner documentAssigner: Document assigner used by the algorithm, contains bindable attributes.
LabelFilterProcessor labelFilters: Label filtering is a composite of individual filters.
AttrInteger phraseDfThreshold: Phrase Document Frequency threshold.
protected org.carrot2.text.preprocessing.LanguageModelStemmer stemming: Stemmer used by the algorithm.
protected org.carrot2.text.preprocessing.StopListMarker stopListMarker: Stop list marker used by the algorithm, contains bindable attributes.
protected org.carrot2.text.preprocessing.InputTokenizer tokenizer: Tokenizer used by the algorithm.
AttrInteger wordDfThreshold: Word Document Frequency threshold.
Constructor Summary
Constructors (constructor, description):
CompletePreprocessingPipeline()
Method Summary
Methods (modifier and type, method, description):
PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Field Details
wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
phraseDfThreshold
Phrase Document Frequency threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.
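To make the document frequency rule concrete, here is a minimal, self-contained sketch of thresholding by document frequency. It does not use Carrot2 classes and is not the library's implementation: the `DfThresholdSketch` class and its method names are hypothetical, and documents are assumed to be pre-tokenized lists of words.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of a document frequency (DF) threshold.
final class DfThresholdSketch {
  /** Counts, for each word, the number of documents that contain it at least once. */
  static Map<String, Integer> documentFrequencies(List<List<String>> documents) {
    Map<String, Integer> df = new TreeMap<>();
    for (List<String> document : documents) {
      // De-duplicate first so repeated occurrences within one document count once.
      for (String word : new HashSet<>(document)) {
        df.merge(word, 1, Integer::sum);
      }
    }
    return df;
  }

  /** Keeps words whose DF is at least dfThreshold; words in fewer documents are ignored. */
  static Set<String> retainFrequent(List<List<String>> documents, int dfThreshold) {
    Set<String> kept = new TreeSet<>();
    documentFrequencies(documents).forEach((word, count) -> {
      if (count >= dfThreshold) {
        kept.add(word);
      }
    });
    return kept;
  }

  public static void main(String[] args) {
    List<List<String>> documents = List.of(
        List.of("data", "mining", "data"),
        List.of("data", "clustering"),
        List.of("web", "mining"));
    // "data" and "mining" each occur in two documents; the other words occur in one.
    System.out.println(retainFrequent(documents, 2)); // prints "[data, mining]"
  }
}
```

Note that "data" appears three times overall but only in two documents, so its document frequency is 2, not 3. The same counting would apply to phrases with phraseDfThreshold, under the assumption that a phrase is treated as a single unit.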
labelFilters
Label filtering is a composite of individual filters.
documentAssigner
Document assigner used by the algorithm, contains bindable attributes.
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
Constructor Details
CompletePreprocessingPipeline
public CompletePreprocessingPipeline()
Method Details
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Specified by: preprocess in interface ContextPreprocessor