Package org.carrot2.text.preprocessing
Class CompletePreprocessingPipeline
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.CompletePreprocessingPipeline
- All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
public class CompletePreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs complete preprocessing of the provided documents. Preprocessing consists of the
following steps:
1. InputTokenizer
2. CaseNormalizer
3. LanguageModelStemmer
4. StopListMarker
5. PhraseExtractor
6. LabelFilterProcessor
7. DocumentAssigner
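To give a feel for how such stages chain together, here is a minimal, self-contained sketch of three toy stages (tokenizing, case normalization, stop word marking) applied in sequence. It does not use the Carrot2 classes named above, and it skips stemming entirely. The class name, method names, the lowercase-only normalization, and the tiny stop list are all assumptions made for illustration; the real components are likely more sophisticated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PreprocessingStagesSketch {
    // Hypothetical stop list, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("the", "of", "a");

    /** Toy tokenizer: splits on runs of non-letter characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Toy case normalizer: lowercases every token (an assumed simplification). */
    static List<String> normalizeCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Toy stop list marker: flags stop words rather than deleting them. */
    static boolean[] markStopWords(List<String> tokens) {
        boolean[] flags = new boolean[tokens.size()];
        for (int i = 0; i < flags.length; i++) {
            flags[i] = STOP_WORDS.contains(tokens.get(i));
        }
        return flags;
    }

    public static void main(String[] args) {
        // Each stage consumes the output of the previous one.
        List<String> tokens = normalizeCase(tokenize("The Art of Clustering"));
        boolean[] stop = markStopWords(tokens);
        // prints: [the, art, of, clustering] [true, false, true, false]
        System.out.println(tokens + " " + Arrays.toString(stop));
    }
}
```

Marking stop words with flags, instead of removing them, keeps token positions intact for any later stage that works on contiguous word sequences.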
-
Field Summary
Modifier and Type | Field | Description
protected org.carrot2.text.preprocessing.CaseNormalizer | caseNormalizer | Case normalizer used by the algorithm.
DocumentAssigner | documentAssigner | Document assigner used by the algorithm, contains bindable attributes.
LabelFilterProcessor | labelFilters | Label filtering is a composite of individual filters.
AttrInteger | phraseDfThreshold | Phrase Document Frequency threshold.
protected org.carrot2.text.preprocessing.LanguageModelStemmer | stemming | Stemmer used by the algorithm.
protected org.carrot2.text.preprocessing.StopListMarker | stopListMarker | Stop list marker used by the algorithm, contains bindable attributes.
protected org.carrot2.text.preprocessing.InputTokenizer | tokenizer | Tokenizer used by the algorithm.
AttrInteger | wordDfThreshold | Word Document Frequency threshold.
-
Constructor Summary
Constructor | Description
CompletePreprocessingPipeline() |
-
Method Summary
Modifier and Type | Method | Description
PreprocessingContext | preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel) |
-
Field Details
-
wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
-
phraseDfThreshold
Phrase Document Frequency threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.
-
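Both thresholds describe the same rule: count the number of documents an item occurs in, then ignore items whose count is below the threshold. The sketch below illustrates that rule on pre-tokenized documents. It is not the Carrot2 implementation; the class and method names are hypothetical, and treating a "phrase" as a contiguous sequence of n words is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DfThresholdSketch {
    /**
     * Returns the n-word sequences that occur in at least dfThreshold documents.
     * Use n = 1 for single words and n = 2 for two-word phrases.
     */
    static Set<String> retained(List<List<String>> documents, int n, int dfThreshold) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : documents) {
            // Collect distinct sequences so each document counts at most once.
            Set<String> seen = new HashSet<>();
            for (int i = 0; i + n <= doc.size(); i++) {
                seen.add(String.join(" ", doc.subList(i, i + n)));
            }
            for (String sequence : seen) {
                df.merge(sequence, 1, Integer::sum);
            }
        }
        // Keep only sequences whose document frequency reaches the threshold.
        Set<String> kept = new TreeSet<>();
        df.forEach((sequence, count) -> {
            if (count >= dfThreshold) {
                kept.add(sequence);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("data", "mining", "tools", "data"),
            List.of("fast", "data", "mining"),
            List.of("data", "tools"));
        System.out.println(retained(docs, 1, 2)); // prints: [data, mining, tools]
        System.out.println(retained(docs, 2, 2)); // prints: [data mining]
    }
}
```

Note that "data" appears twice in the first document but still contributes only one to its document frequency; the threshold counts documents, not occurrences.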
labelFilters
Label filtering is a composite of individual filters.
-
documentAssigner
Document assigner used by the algorithm, contains bindable attributes.
-
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
-
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
-
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.
-
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
-
-
Constructor Details
-
CompletePreprocessingPipeline
public CompletePreprocessingPipeline()
-
-
Method Details
-
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
- Specified by:
preprocess in interface ContextPreprocessor
-