Package org.carrot2.text.preprocessing
Class CompletePreprocessingPipeline
java.lang.Object
  org.carrot2.attrs.AttrComposite
    org.carrot2.text.preprocessing.CompletePreprocessingPipeline

All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
public class CompletePreprocessingPipeline extends AttrComposite implements ContextPreprocessor
Performs complete preprocessing of the provided documents. The preprocessing consists of the following steps:
- InputTokenizer
- CaseNormalizer
- LanguageModelStemmer
- StopListMarker
- PhraseExtractor
- LabelFilterProcessor
- DocumentAssigner

Field Summary
Fields
Modifier and Type | Field | Description
protected org.carrot2.text.preprocessing.CaseNormalizer | caseNormalizer | Case normalizer used by the algorithm.
DocumentAssigner | documentAssigner | Document assigner used by the algorithm, contains bindable attributes.
LabelFilterProcessor | labelFilters | Label filtering is a composite of individual filters.
AttrInteger | phraseDfThreshold | Phrase Document Frequency threshold.
protected org.carrot2.text.preprocessing.LanguageModelStemmer | stemming | Stemmer used by the algorithm.
protected org.carrot2.text.preprocessing.StopListMarker | stopListMarker | Stop list marker used by the algorithm, contains bindable attributes.
protected org.carrot2.text.preprocessing.InputTokenizer | tokenizer | Tokenizer used by the algorithm.
AttrInteger | wordDfThreshold | Word Document Frequency threshold.

Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
Constructor Summary
Constructors
Constructor | Description
CompletePreprocessingPipeline() |
-
Method Summary
Modifier and Type | Method | Description
PreprocessingContext | preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel) |

Methods inherited from class org.carrot2.attrs.AttrComposite
accept
Field Detail
-
wordDfThreshold
public final AttrInteger wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
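To make the threshold rule concrete, here is a minimal, self-contained sketch of document-frequency filtering. It does not use or reproduce Carrot2's implementation: the class DfThresholdSketch, the method wordsMeetingThreshold and the whitespace tokenization are all hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: NOT Carrot2 code. Shows the general idea of a
// document-frequency (DF) threshold applied to words.
public class DfThresholdSketch {

    /** Returns the words that occur in at least minDf distinct documents. */
    public static Set<String> wordsMeetingThreshold(List<String> documents, int minDf) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : documents) {
            // A word counts once per document, however often it repeats there.
            Set<String> seen = new HashSet<>(
                Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")));
            seen.remove(""); // guard against leading whitespace or empty documents
            for (String word : seen) {
                df.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDf) { // fewer than minDf documents: ignored
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("data mining tools", "data clustering", "web mining");
        System.out.println(wordsMeetingThreshold(docs, 2)); // prints [data, mining]
    }
}
```

With a threshold of 2, "tools", "clustering" and "web" are dropped because each appears in only one document.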
-
phraseDfThreshold
public final AttrInteger phraseDfThreshold
Phrase Document Frequency threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.
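The same rule can be sketched for phrases. The example below treats a phrase as two adjacent words (a bigram), which is a simplifying assumption and not how Carrot2's PhraseExtractor works. The class PhraseDfSketch and the method bigramsMeetingThreshold are hypothetical names. The point illustrated is that a phrase repeated within one document still contributes only 1 to its document frequency.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: NOT Carrot2 code. Phrases are simplified to bigrams.
public class PhraseDfSketch {

    /** Returns the bigrams that occur in at least minDf distinct documents. */
    public static Set<String> bigramsMeetingThreshold(List<String> documents, int minDf) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : documents) {
            String[] tokens = doc.trim().toLowerCase(Locale.ROOT).split("\\s+");
            // Collect each distinct bigram of this document exactly once.
            Set<String> seen = new HashSet<>();
            for (int i = 0; i + 1 < tokens.length; i++) {
                seen.add(tokens[i] + " " + tokens[i + 1]);
            }
            for (String phrase : seen) {
                df.merge(phrase, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "data mining tools", "web data mining", "data mining data mining");
        System.out.println(bigramsMeetingThreshold(docs, 2)); // prints [data mining]
    }
}
```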
-
labelFilters
public LabelFilterProcessor labelFilters
Label filtering is a composite of individual filters.
-
documentAssigner
public DocumentAssigner documentAssigner
Document assigner used by the algorithm, contains bindable attributes.
-
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
-
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
-
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.
-
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
Method Detail
-
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Specified by:
preprocess in interface ContextPreprocessor