Class PreprocessingContext.AllTokens
- java.lang.Object
-
- org.carrot2.text.preprocessing.PreprocessingContext.AllTokens
-
- Enclosing class:
- PreprocessingContext
public class PreprocessingContext.AllTokens extends Object
Information about all tokens of the input documents. Each element of each of the arrays corresponds to one individual token from the input, or to a synthetic separator inserted between documents, fields and sentences. The last element of each array is a special terminator entry. All arrays in this class have the same length, and values across different arrays correspond to each other for the same index.
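Because the arrays are parallel, a single index describes one token across every array. Below is a minimal sketch of walking them and printing each real token with its document and field; the context.allTokens access path (a public field on PreprocessingContext) is assumed here, and only the array names come from this class.

    import org.carrot2.text.preprocessing.PreprocessingContext;

    static void dumpTokens(PreprocessingContext context) {
        PreprocessingContext.AllTokens t = context.allTokens; // assumed accessor
        for (int i = 0; i < t.image.length; i++) {
            // image is null at separator positions and at the terminator entry.
            if (t.image[i] == null) {
                continue;
            }
            System.out.println(new String(t.image[i])
                + "  doc=" + t.documentIndex[i]
                + "  field=" + t.fieldIndex[i]);
        }
    }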
-
-
Field Summary
Fields

Modifier and Type   Field           Description
int[]               documentIndex   Index of the document this token came from; points to elements of documents.
byte[]              fieldIndex      Document field the token came from.
char[][]            image           Token image as it appears in the input.
int[]               lcp             The Longest Common Prefix for the adjacent suffix-sorted token sequences.
int[]               suffixOrder     The suffix order of tokens.
short[]             type            Token's Tokenizer bit flags.
int[]               wordIndex       A pointer to PreprocessingContext.AllWords arrays for this token.
-
Constructor Summary
Constructors

Constructor     Description
AllTokens()
-
-
-
Field Detail
-
image
public char[][] image
Token image as it appears in the input. On positions where type is equal to one of Tokenizer.TF_TERMINATOR, Tokenizer.TF_SEPARATOR_DOCUMENT or Tokenizer.TF_SEPARATOR_FIELD, image is null.

This array is produced by InputTokenizer.
-
type
public short[] type
Token's Tokenizer bit flags.

This array is produced by InputTokenizer.
-
fieldIndex
public byte[] fieldIndex
Document field the token came from. The index points to arrays in PreprocessingContext.AllFields; equal to -1 for document and field separators.

This array is produced by InputTokenizer.
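A hedged sketch of resolving a token's field name; the allFields.name array assumed below is based only on the cross-reference to PreprocessingContext.AllFields above.

    // Map a token back to its field name; returns null for separators.
    static String fieldNameOf(PreprocessingContext context, int tokenIndex) {
        byte f = context.allTokens.fieldIndex[tokenIndex];
        return f < 0 ? null : context.allFields.name[f];
    }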
-
documentIndex
public int[] documentIndex
Index of the document this token came from; points to elements of documents. Equal to -1 for document separators.

This array is produced by InputTokenizer. This array is accessed in CaseNormalizer and PhraseExtractor to compute by-document statistics, e.g. term frequency by document, which are then needed to build a VSM or assign documents to labels. An alternative to this representation would be to create an AllDocuments holder, keep there an array of start token indexes for each document, and refactor the model-building code to do a binary search to determine the document index for a given token index. This is likely to be a significant performance hit, because the model-building code accesses the documentIndex array essentially at random (in the suffix order), so we would be doing twice-the-number-of-tokens binary searches, unless some other data structure can help here.
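The by-document statistic mentioned above reduces to a single linear pass over the parallel arrays. A sketch, assuming a known document count (the method name and documentCount parameter are illustrative):

    // Count occurrences of the word at AllWords index `word` per document.
    static int[] tfByDocument(PreprocessingContext.AllTokens t, int word, int documentCount) {
        int[] tf = new int[documentCount];
        for (int i = 0; i < t.wordIndex.length; i++) {
            // Skip separators (documentIndex == -1) and tokens of other words.
            if (t.wordIndex[i] == word && t.documentIndex[i] >= 0) {
                tf[t.documentIndex[i]]++;
            }
        }
        return tf;
    }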
-
wordIndex
public int[] wordIndex
A pointer to PreprocessingContext.AllWords arrays for this token. Equal to -1 for document and field separators and for Tokenizer.TT_PUNCTUATION tokens (including sentence separators).

This array is produced by CaseNormalizer.
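A sketch of dereferencing the pointer while guarding against the -1 sentinel; the allWords.image array assumed here mirrors this class's image array and is not documented on this page.

    // Resolve a token to its normalized word image, or null for separators
    // and punctuation (wordIndex == -1).
    static char[] wordImageOf(PreprocessingContext context, int tokenIndex) {
        int w = context.allTokens.wordIndex[tokenIndex];
        return w < 0 ? null : context.allWords.image[w];
    }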
-
suffixOrder
public int[] suffixOrder
The suffix order of tokens. Suffixes starting with a separator come at the end of the array.

This array is produced by PhraseExtractor.
-
lcp
public int[] lcp
The Longest Common Prefix for the adjacent suffix-sorted token sequences.

This array is produced by PhraseExtractor.
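Together, suffixOrder and lcp make repeated token sequences easy to spot: if lcp[k] >= n, the suffixes starting at suffixOrder[k] and suffixOrder[k + 1] share their first n tokens, so that n-token sequence occurs at least twice in the input. A sketch under that assumption (the pairing of lcp[k] with suffixes k and k + 1 is an indexing convention assumed here, not stated on this page):

    // Print every token sequence of at least minLength tokens shared by two
    // adjacent suffixes in the sorted suffix order.
    static void printRepeatedPhrases(PreprocessingContext.AllTokens t, int minLength) {
        for (int k = 0; k + 1 < t.suffixOrder.length; k++) {
            if (t.lcp[k] >= minLength) {
                StringBuilder phrase = new StringBuilder();
                int start = t.suffixOrder[k];
                for (int j = 0; j < t.lcp[k]; j++) {
                    if (t.image[start + j] != null) { // skip separators defensively
                        phrase.append(new String(t.image[start + j])).append(' ');
                    }
                }
                System.out.println("repeated: " + phrase.toString().trim());
            }
        }
    }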
-
-