Package org.carrot2.text.preprocessing
Class CompletePreprocessingPipeline
- java.lang.Object
  - org.carrot2.attrs.AttrComposite
    - org.carrot2.text.preprocessing.CompletePreprocessingPipeline
All Implemented Interfaces: AcceptingVisitor, ContextPreprocessor
public class CompletePreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs complete preprocessing of the provided documents. The preprocessing consists of the following steps:
- InputTokenizer
- CaseNormalizer
- LanguageModelStemmer
- StopListMarker
- PhraseExtractor
- LabelFilterProcessor
- DocumentAssigner
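The idea of running stages in a fixed order over shared state can be illustrated with a minimal, self-contained sketch. It covers only simplified stand-ins for the first, second, and fourth steps (tokenizing, case normalization, stop word marking). Every name in it (`PipelineSketch`, `Context`, the stage methods, and the tiny stop list) is hypothetical and chosen for illustration; it does not use or reproduce the Carrot2 classes listed above, which are considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative only: a toy staged pipeline, not Carrot2's implementation.
final class PipelineSketch {
    /** Hypothetical shared state that each stage reads and updates. */
    static final class Context {
        final String text;
        final List<String> tokens = new ArrayList<>();
        final List<Boolean> stopWordFlags = new ArrayList<>();

        Context(String text) {
            this.text = text;
        }
    }

    // Stage 1: split the input on anything that is not a letter or digit.
    static void tokenize(Context ctx) {
        for (String token : ctx.text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                ctx.tokens.add(token);
            }
        }
    }

    // Stage 2: lowercase every token so that case variants compare equal.
    static void normalizeCase(Context ctx) {
        ctx.tokens.replaceAll(token -> token.toLowerCase(Locale.ROOT));
    }

    // Stage 3: flag tokens found in a tiny, hard-coded stop list (an assumption).
    static void markStopWords(Context ctx) {
        Set<String> stopList = Set.of("the", "of", "a");
        for (String token : ctx.tokens) {
            ctx.stopWordFlags.add(stopList.contains(token));
        }
    }

    /** Runs the stages in order; each stage depends on the output of the previous ones. */
    static Context run(String text) {
        Context ctx = new Context(text);
        List<Consumer<Context>> stages = List.of(
            PipelineSketch::tokenize,
            PipelineSketch::normalizeCase,
            PipelineSketch::markStopWords);
        for (Consumer<Context> stage : stages) {
            stage.accept(ctx);
        }
        return ctx;
    }

    public static void main(String[] args) {
        Context ctx = run("The Clustering of Documents");
        System.out.println(ctx.tokens);        // [the, clustering, of, documents]
        System.out.println(ctx.stopWordFlags); // [true, false, true, false]
    }
}
```

The order matters in this sketch: stop word marking runs after case normalization, so "The" is matched against the lowercase stop list entry.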
Field Summary
Modifier and Type                                              Field              Description
protected org.carrot2.text.preprocessing.CaseNormalizer        caseNormalizer     Case normalizer used by the algorithm.
DocumentAssigner                                               documentAssigner   Document assigner used by the algorithm, contains bindable attributes.
LabelFilterProcessor                                           labelFilters       Label filtering is a composite of individual filters.
AttrInteger                                                    phraseDfThreshold  Phrase Document Frequency threshold.
protected org.carrot2.text.preprocessing.LanguageModelStemmer  stemming           Stemmer used by the algorithm.
protected org.carrot2.text.preprocessing.StopListMarker        stopListMarker     Stop list marker used by the algorithm, contains bindable attributes.
protected org.carrot2.text.preprocessing.InputTokenizer        tokenizer          Tokenizer used by the algorithm.
AttrInteger                                                    wordDfThreshold    Word Document Frequency threshold.
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
Constructor Summary
Constructor                      Description
CompletePreprocessingPipeline()
-
Method Summary
Modifier and Type     Method                                                                                        Description
PreprocessingContext  preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
-
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
Field Detail
-
wordDfThreshold
public final AttrInteger wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
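As an illustration of the document-frequency rule described here, the following self-contained sketch counts in how many documents each word occurs and keeps only the words that reach the threshold. The class and method names (`DfThresholdSketch`, `retainFrequentWords`) and the sample data are hypothetical; this is a plain-Java approximation of the rule as stated, not the code Carrot2 runs internally.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a document-frequency cut-off over pre-tokenized documents.
final class DfThresholdSketch {
    /** Returns the words that occur in at least dfThreshold distinct documents. */
    static Set<String> retainFrequentWords(List<List<String>> documents, int dfThreshold) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            // Deduplicate first: repeats inside one document count only once.
            for (String word : new HashSet<>(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }

        // Words seen in fewer than dfThreshold documents are ignored.
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            if (entry.getValue() >= dfThreshold) {
                retained.add(entry.getKey());
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        List<List<String>> documents = List.of(
            List.of("data", "mining", "data"),
            List.of("data", "clustering"),
            List.of("text", "clustering"));

        // "mining" and "text" each occur in one document only, so they are dropped.
        System.out.println(retainFrequentWords(documents, 2)); // [clustering, data]
    }
}
```

Note that "data" appears twice in the first document but contributes a document frequency of 2 overall, because the count is per document rather than per occurrence.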
-
phraseDfThreshold
public final AttrInteger phraseDfThreshold
Phrase Document Frequency threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.
-
labelFilters
public LabelFilterProcessor labelFilters
Label filtering is a composite of individual filters.
-
documentAssigner
public DocumentAssigner documentAssigner
Document assigner used by the algorithm, contains bindable attributes.
-
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
-
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
-
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.
-
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
Method Detail
-
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Specified by: preprocess in interface ContextPreprocessor
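Two properties of the `Stream<? extends Document>` parameter follow from the Java language and `java.util.stream` rather than from anything specific to this class: the wildcard accepts a stream of any subtype of the element type, and a stream can be traversed only once. The sketch below demonstrates both with hypothetical stand-ins (`Doc`, `SimpleDoc`, `consume`); none of these are Carrot2 types, and `consume` merely plays the role of a method that reads the whole stream.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Illustrative only: stand-in types showing how a Stream<? extends T> argument behaves.
final class StreamArgumentSketch {
    /** Hypothetical document type; not Carrot2's Document interface. */
    interface Doc {
        String title();
    }

    /** Hypothetical implementation of the stand-in document type. */
    static final class SimpleDoc implements Doc {
        private final String title;

        SimpleDoc(String title) {
            this.title = title;
        }

        @Override
        public String title() {
            return title;
        }
    }

    /** Stand-in for a method that fully consumes the stream it is given. */
    static long consume(Stream<? extends Doc> documents) {
        return documents.filter(doc -> !doc.title().isEmpty()).count();
    }

    public static void main(String[] args) {
        List<SimpleDoc> docs = List.of(new SimpleDoc("Data mining"), new SimpleDoc("Text clustering"));

        // A Stream<SimpleDoc> is accepted where Stream<? extends Doc> is expected.
        Supplier<Stream<SimpleDoc>> freshStream = docs::stream;
        System.out.println(consume(freshStream.get())); // 2
        System.out.println(consume(freshStream.get())); // 2, a new stream each time

        // Reusing one stream instance fails on the second traversal.
        Stream<SimpleDoc> once = docs.stream();
        consume(once);
        try {
            consume(once);
        } catch (IllegalStateException e) {
            System.out.println("stream already consumed");
        }
    }
}
```

Callers that need to process the same documents more than once would therefore typically keep the underlying collection and create a new stream per call.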