java.lang.Object

org.carrot2.attrs.AttrComposite

org.carrot2.text.preprocessing.CompletePreprocessingPipeline

All Implemented Interfaces: AcceptingVisitor, ContextPreprocessor

public class CompletePreprocessingPipeline
extends AttrComposite
implements ContextPreprocessor

Performs complete preprocessing of the provided documents. The preprocessing consists of the following steps:

InputTokenizer
CaseNormalizer
LanguageModelStemmer
StopListMarker
PhraseExtractor
LabelFilterProcessor
DocumentAssigner
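
If the steps run in the order listed, the pipeline can be pictured as a chain of stages that each read and enrich a shared context. The sketch below is not Carrot2 code: `PipelineSketch`, `Context`, `Stage`, and the toy stop list are hypothetical names made up for illustration, and only three simplified stages (tokenizing, case normalization, stop word marking) are modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PipelineSketch {
  /** Hypothetical shared state passed between stages (not Carrot2's PreprocessingContext). */
  static final class Context {
    final String input;
    final List<String> tokens = new ArrayList<>();
    final List<Boolean> stopWord = new ArrayList<>();

    Context(String input) {
      this.input = input;
    }
  }

  /** Hypothetical stage contract: each stage mutates the shared context. */
  interface Stage {
    void apply(Context ctx);
  }

  /** Toy stop list, assumed for this example only. */
  static final Set<String> STOP_WORDS = Set.of("the", "of", "a");

  /** Stages in the order they are applied. */
  static final List<Stage> STAGES = List.of(
      // 1. Tokenize: split the input on runs of non-word characters.
      ctx -> {
        for (String t : ctx.input.split("\\W+")) {
          if (!t.isEmpty()) ctx.tokens.add(t);
        }
      },
      // 2. Normalize case: lowercase every token in place.
      ctx -> ctx.tokens.replaceAll(t -> t.toLowerCase(Locale.ROOT)),
      // 3. Mark stop words: one flag per token, aligned by index.
      ctx -> {
        for (String t : ctx.tokens) ctx.stopWord.add(STOP_WORDS.contains(t));
      });

  static Context run(String input) {
    Context ctx = new Context(input);
    for (Stage stage : STAGES) {
      stage.apply(ctx);
    }
    return ctx;
  }

  public static void main(String[] args) {
    Context ctx = run("The Clustering of Documents");
    System.out.println(ctx.tokens);   // [the, clustering, of, documents]
    System.out.println(ctx.stopWord); // [true, false, true, false]
  }
}
```

Later stages depend on earlier ones in this sketch: stop word marking only matches because the tokens were lowercased first, which is one reason the order of steps matters.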

Field Summary

Fields
Modifier and Type	Field	Description
`protected org.carrot2.text.preprocessing.CaseNormalizer`	`caseNormalizer`	Case normalizer used by the algorithm.
`DocumentAssigner`	`documentAssigner`	Document assigner used by the algorithm, contains bindable attributes.
`LabelFilterProcessor`	`labelFilters`	Label filtering is a composite of individual filters.
`AttrInteger`	`phraseDfThreshold`	Phrase Document Frequency threshold.
`protected org.carrot2.text.preprocessing.LanguageModelStemmer`	`stemming`	Stemmer used by the algorithm.
`protected org.carrot2.text.preprocessing.StopListMarker`	`stopListMarker`	Stop list marker used by the algorithm, contains bindable attributes.
`protected org.carrot2.text.preprocessing.InputTokenizer`	`tokenizer`	Tokenizer used by the algorithm.
`AttrInteger`	`wordDfThreshold`	Word Document Frequency threshold.

Fields inherited from class org.carrot2.attrs.AttrComposite

attributes

Constructor Summary

Constructors

Constructor	Description
`CompletePreprocessingPipeline()`

Method Summary

Modifier and Type	Method	Description
`PreprocessingContext`	`preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)`

Methods inherited from class org.carrot2.attrs.AttrComposite

accept

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- wordDfThreshold
  
  public final AttrInteger wordDfThreshold
  
  Word Document Frequency threshold. Words appearing in fewer than wordDfThreshold documents will be ignored.
- phraseDfThreshold
  
  public final AttrInteger phraseDfThreshold
  
  Phrase Document Frequency threshold. Phrases appearing in fewer than phraseDfThreshold documents will be ignored.
- labelFilters
  
  public LabelFilterProcessor labelFilters
  
  Label filtering is a composite of individual filters.
- documentAssigner
  
  public DocumentAssigner documentAssigner
  
  Document assigner used by the algorithm, contains bindable attributes.
- caseNormalizer
  
  protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
  
  Case normalizer used by the algorithm.
- stemming
  
  protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
  
  Stemmer used by the algorithm.
- stopListMarker
  
  protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
  
  Stop list marker used by the algorithm, contains bindable attributes.
- tokenizer
  
  protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
  
  Tokenizer used by the algorithm.
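
The effect of `wordDfThreshold` and `phraseDfThreshold` described above can be illustrated with a small standalone sketch. It is not the Carrot2 implementation: `DfThresholdSketch`, `documentFrequency`, and `retained` are hypothetical names, and documents are assumed to be already tokenized into lists of words. Document frequency counts each word once per document, and a word is kept only when that count reaches the threshold.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DfThresholdSketch {
  /** Counts, for each word, the number of documents that contain it at least once. */
  static Map<String, Integer> documentFrequency(List<List<String>> docs) {
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : docs) {
      // De-duplicate within a document so repeated words count once.
      for (String word : new HashSet<>(doc)) {
        df.merge(word, 1, Integer::sum);
      }
    }
    return df;
  }

  /** Returns the words appearing in at least {@code threshold} documents, sorted. */
  static Set<String> retained(List<List<String>> docs, int threshold) {
    Set<String> kept = new TreeSet<>();
    documentFrequency(docs).forEach((word, count) -> {
      if (count >= threshold) {
        kept.add(word);
      }
    });
    return kept;
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("data", "mining", "data"),
        List.of("data", "clustering"),
        List.of("web", "clustering"));

    System.out.println(retained(docs, 1)); // [clustering, data, mining, web]
    System.out.println(retained(docs, 2)); // [clustering, data]
  }
}
```

With a threshold of 2, "mining" and "web" are dropped because each occurs in a single document, even though "data" occurs twice inside the first document and is still counted only once there.
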
Constructor Details
- CompletePreprocessingPipeline
  
  public CompletePreprocessingPipeline()
Method Details
- preprocess
  
  public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
  
  Specified by:
  
  preprocess in interface ContextPreprocessor
