java.lang.Object
- org.carrot2.attrs.AttrComposite
- - org.carrot2.text.preprocessing.BasicPreprocessingPipeline

All Implemented Interfaces:

AcceptingVisitor, ContextPreprocessor
```
public class BasicPreprocessingPipeline
extends AttrComposite
implements ContextPreprocessor
```
Performs basic preprocessing steps on the provided documents. The preprocessing consists of the following steps:
1. InputTokenizer
2. CaseNormalizer
3. LanguageModelStemmer
4. StopListMarker
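
The four stages above can be pictured with a small, JDK-only sketch. This is not the Carrot2 implementation: the class `ToyPreprocessingSketch`, its `Token` type, the four-word stop list, and the naive plural-stripping "stemmer" are all hypothetical stand-ins, and it assumes each stage does roughly what its name suggests (split text, lower-case, reduce to a stem, flag stop words rather than drop them).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy four-stage pipeline. Hypothetical names; not the Carrot2 classes. */
final class ToyPreprocessingSketch {
    /** One processed token: surface form, normalized form, stem, stop-word flag. */
    static final class Token {
        final String image;
        final String normalized;
        final String stem;
        final boolean stopWord;

        Token(String image, String normalized, String stem, boolean stopWord) {
            this.image = image;
            this.normalized = normalized;
            this.stem = stem;
            this.stopWord = stopWord;
        }
    }

    // Assumed, deliberately tiny stop list for illustration only.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    /** Stage 1: split on runs of characters that are neither letters nor digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stage 2: fold case so that "Cats" and "cats" compare equal. */
    static String normalizeCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    /** Stage 3: naive plural stripping, standing in for a real language-specific stemmer. */
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Stage 4: flag stop words; the token is kept, only marked. */
    static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word);
    }

    /** Runs the four stages in order on a single piece of text. */
    static List<Token> process(String text) {
        List<Token> result = new ArrayList<>();
        for (String image : tokenize(text)) {
            String normalized = normalizeCase(image);
            result.add(new Token(image, normalized, stem(normalized), isStopWord(normalized)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Token t : process("The Cats of Rome")) {
            System.out.println(t.image + " -> " + t.stem + (t.stopWord ? " [stop]" : ""));
        }
    }
}
```

Keeping the original token image next to its normalized form and stem mirrors the idea that later processing may need both the displayable text and the canonical form; the real pipeline stores its results in the returned PreprocessingContext instead.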

Field Summary

Fields
Modifier and Type	Field	Description
`protected org.carrot2.text.preprocessing.CaseNormalizer`	`caseNormalizer`	Case normalizer used by the algorithm.
`protected org.carrot2.text.preprocessing.LanguageModelStemmer`	`stemming`	Stemmer used by the algorithm.
`protected org.carrot2.text.preprocessing.StopListMarker`	`stopListMarker`	Stop list marker used by the algorithm, contains bindable attributes.
`protected org.carrot2.text.preprocessing.InputTokenizer`	`tokenizer`	Tokenizer used by the algorithm.
`AttrInteger`	`wordDfThreshold`	Word Document Frequency threshold.

Fields inherited from class org.carrot2.attrs.AttrComposite
attributes

Constructor Summary

Constructors
Constructor	Description
`BasicPreprocessingPipeline()`	

Method Summary

Modifier and Type	Method	Description
`PreprocessingContext`	`preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)`	Performs preprocessing on the provided stream of documents.

Methods inherited from class org.carrot2.attrs.AttrComposite
accept

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

wordDfThreshold
```
public final AttrInteger wordDfThreshold
```
Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents are ignored.
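
As an illustration of what a document-frequency threshold means, the following JDK-only sketch counts in how many documents each word occurs and keeps only the words that reach the threshold. It is not Carrot2 code: `DfThresholdSketch`, `documentFrequency` and `retainedWords` are hypothetical names, and documents are assumed to be already tokenized lists of words.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Document-frequency filtering sketch. Hypothetical names; not the Carrot2 classes. */
final class DfThresholdSketch {
    /** Counts, for each word, the number of documents that contain it at least once. */
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within a document: repeated occurrences count once.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns the words found in at least {@code threshold} documents, sorted. */
    static Set<String> retainedWords(List<List<String>> docs, int threshold) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequency(docs).entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

For the three documents `[data, mining, data]`, `[data, clustering]` and `[web, mining]`, a threshold of 2 retains only `data` and `mining`; `clustering` and `web` each occur in a single document and are dropped. Note that `data` counts as 2, not 3, because frequency is measured per document rather than per occurrence.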

caseNormalizer

protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer

Case normalizer used by the algorithm.

stemming

protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming

Stemmer used by the algorithm.

stopListMarker
```
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
```
Stop list marker used by the algorithm, contains bindable attributes.

tokenizer

protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer

Tokenizer used by the algorithm.

Constructor Detail
BasicPreprocessingPipeline
```
public BasicPreprocessingPipeline()
```

Method Detail

preprocess

public PreprocessingContext preprocess(Stream<? extends Document> documents,
                                       String query,
                                       LanguageComponents langModel)

Performs preprocessing on the provided stream of documents. Results can be obtained from the returned PreprocessingContext.

Specified by: preprocess in interface ContextPreprocessor
