Package org.carrot2.text.preprocessing
Class BasicPreprocessingPipeline
- java.lang.Object
  - org.carrot2.attrs.AttrComposite
    - org.carrot2.text.preprocessing.BasicPreprocessingPipeline

All Implemented Interfaces: AcceptingVisitor, ContextPreprocessor
 
public class BasicPreprocessingPipeline extends AttrComposite implements ContextPreprocessor

Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps:
- InputTokenizer
- CaseNormalizer
- LanguageModelStemmer
- StopListMarker
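
The four steps listed above can be pictured with a small, self-contained sketch. It is a toy illustration written against the Java standard library only: the class `ToyPipelineSketch`, its `Token` type, the tiny stop list, and the naive suffix-stripping stemmer are all assumptions made for this example and are not Carrot2 classes or algorithms. It only shows the kind of work each stage performs: tokenization, case normalization, stemming, and stop-word marking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only: none of these names are Carrot2 API. */
final class ToyPipelineSketch {
    /** One token with the result of each later step attached. */
    static final class Token {
        final String image;       // original text of the token
        final String normalized;  // case-normalized form
        final String stem;        // crude stem of the normalized form
        final boolean stopWord;   // true if the normalized form is on the stop list

        Token(String image, String normalized, String stem, boolean stopWord) {
            this.image = image;
            this.normalized = normalized;
            this.stem = stem;
            this.stopWord = stopWord;
        }

        @Override
        public String toString() {
            return image + " -> " + normalized + " / " + stem + (stopWord ? " [stop]" : "");
        }
    }

    // Hypothetical, tiny stop list; a real one would come from language resources.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "of", "and", "on"));

    /** Step 1: split the text on runs of characters that are neither letters nor digits. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    /** Step 2: case normalization, here plain lowercasing. */
    static String normalizeCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    /** Step 3: a deliberately naive suffix stripper standing in for a real stemmer. */
    static String stem(String word) {
        if (word.length() > 4 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Runs steps 1 to 3 and then marks stop words (step 4). */
    static List<Token> preprocess(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String image : tokenize(text)) {
            String normalized = normalizeCase(image);
            tokens.add(new Token(image, normalized, stem(normalized),
                STOP_WORDS.contains(normalized)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        preprocess("Clustering the Documents").forEach(System.out::println);
    }
}
```

In this sketch every stage is a plain static method, so each one can be swapped or tested in isolation. The real pipeline instead exposes its stages as the protected fields documented on this page.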
 
Field Summary

Fields (Modifier and Type, Field, Description):
- protected org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer: Case normalizer used by the algorithm.
- protected org.carrot2.text.preprocessing.LanguageModelStemmer stemming: Stemmer used by the algorithm.
- protected org.carrot2.text.preprocessing.StopListMarker stopListMarker: Stop list marker used by the algorithm, contains bindable attributes.
- protected org.carrot2.text.preprocessing.InputTokenizer tokenizer: Tokenizer used by the algorithm.
- AttrInteger wordDfThreshold: Word Document Frequency threshold.

Fields inherited from class org.carrot2.attrs.AttrComposite: attributes
 
Constructor Summary

Constructors (Constructor, Description):
- BasicPreprocessingPipeline()

Method Summary

Instance methods, all concrete (Modifier and Type, Method, Description):
- PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel): Performs preprocessing on the provided list of documents.

Methods inherited from class org.carrot2.attrs.AttrComposite: accept
 
Field Detail

wordDfThreshold
public final AttrInteger wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
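
The effect of a document frequency threshold can be sketched as follows. This is a hedged, standard-library-only illustration of the general idea (count the number of documents each word occurs in, then drop words below the threshold). The class `DfThresholdSketch` and its methods are hypothetical names introduced for this example, and the code is not taken from Carrot2.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy illustration only: the class and method names are hypothetical, not Carrot2 API. */
final class DfThresholdSketch {
    /** Counts, for each word, the number of documents that contain it at least once. */
    static Map<String, Integer> documentFrequency(List<List<String>> documents) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : documents) {
            // A set ensures a word repeated within one document is counted once.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Keeps only words whose document frequency is at least dfThreshold. */
    static Set<String> retainFrequent(List<List<String>> documents, int dfThreshold) {
        Set<String> kept = new TreeSet<>();
        documentFrequency(documents).forEach((word, count) -> {
            if (count >= dfThreshold) {
                kept.add(word);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("data", "mining", "data"),
            Arrays.asList("data", "clustering"),
            Arrays.asList("text", "clustering"));
        // "data" and "clustering" occur in two documents each; "mining" and "text" in one.
        System.out.println(retainFrequent(docs, 2)); // prints [clustering, data]
    }
}
```

Note that "data" appears twice in the first toy document but still counts once toward its document frequency, which is what distinguishes document frequency from raw term frequency.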

caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.

stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.

stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains bindable attributes.

tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
 
Method Detail

preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Performs preprocessing on the provided list of documents. Results can be obtained from the returned PreprocessingContext.
Specified by:
- preprocess in interface ContextPreprocessor