Package org.carrot2.text.preprocessing
Class PreprocessingContext.AllWords
- java.lang.Object
  - org.carrot2.text.preprocessing.PreprocessingContext.AllWords
- Enclosing class: PreprocessingContext
public class PreprocessingContext.AllWords extends Object
Information about all unique words found in the input documents. An entry in each parallel array corresponds to one conflated form of a word. For example, data and DATA will most likely become a single entry in the words table. However, different grammatical forms of a single lemma (like computer and computers) will have different entries in the words table. See PreprocessingContext.AllStems for inflection-conflated versions.
All arrays in this class have the same length, and values at the same index in different arrays describe the same word.
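The parallel-array layout described above can be illustrated with a minimal, self-contained sketch. The class ParallelArraysSketch and its sample values are hypothetical stand-ins that only mirror the field names of this class; this is not Carrot2 code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the parallel arrays; not the Carrot2 class itself. */
final class ParallelArraysSketch {
    // Index i in every array refers to the same word (sample data, assumed).
    final char[][] image = {
        "data".toCharArray(), "computer".toCharArray(), "computers".toCharArray()
    };
    final int[] tf = {5, 3, 2};
    // "computer" and "computers" are separate words but point to the same stem.
    final int[] stemIndex = {0, 1, 1};

    /** Builds one summary line per word by reading all arrays at the same index. */
    List<String> describeAll() {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < image.length; i++) {
            lines.add(new String(image[i]) + " tf=" + tf[i] + " stem=" + stemIndex[i]);
        }
        return lines;
    }

    public static void main(String[] args) {
        new ParallelArraysSketch().describeAll().forEach(System.out::println);
    }
}
```

Keeping one primitive array per attribute, rather than one object per word, means a word is identified purely by its integer index.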
-
Field Summary
Fields (modifier and type, field, description):
byte[] fieldIndices: Bit-packed indices of all fields in which this word appears at least once.
char[][] image: The most frequently appearing variant of the word with respect to case.
int[] stemIndex: A pointer to the PreprocessingContext.AllStems arrays for this word.
int[] tf: Term Frequency of the word, aggregated across all variants with respect to case.
int[][] tfByDocument: Term Frequency of the word for each document.
short[] type: Token type of this word, copied from PreprocessingContext.AllTokens.type.
-
Constructor Summary
Constructors:
AllWords()
-
Field Detail
-
image
public char[][] image
The most frequently appearing variant of the word with respect to case. For example, if the token MacOS appeared 12 times in the input and macos appeared 3 times, the image will be MacOS.
This array is produced by CaseNormalizer.
-
type
public short[] type
Token type of this word, copied from PreprocessingContext.AllTokens.type. Additional flags are set for each word by CaseNormalizer and LanguageModelStemmer.
This array is produced by CaseNormalizer. This array is modified by LanguageModelStemmer.
- See Also: Tokenizer
-
tf
public int[] tf
Term Frequency of the word, aggregated across all variants with respect to case. Separate frequencies for individual variants are not available.
This array is produced by CaseNormalizer.
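As a rough illustration of how image and tf relate, the sketch below takes per-variant counts and derives the most frequent variant and the aggregated frequency. It is an assumption-laden sketch, not the CaseNormalizer implementation: the class CaseConflationSketch, its method names, and the tie-breaking rule (first variant encountered wins) are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of case conflation; not the CaseNormalizer implementation. */
final class CaseConflationSketch {
    /** Returns the variant with the highest count; ties go to the first one seen (assumed rule). */
    static String mostFrequentVariant(Map<String, Integer> variantCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : variantCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Sums the counts of all case variants, which is what tf stores per word. */
    static int totalFrequency(Map<String, Integer> variantCounts) {
        int sum = 0;
        for (int count : variantCounts.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("MacOS", 12);
        counts.put("macos", 3);
        System.out.println(mostFrequentVariant(counts)); // MacOS
        System.out.println(totalFrequency(counts));      // 15
    }
}
```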
-
tfByDocument
public int[][] tfByDocument
Term Frequency of the word for each document. The length of each word's array is equal to the number of documents the word appeared in (Document Frequency) multiplied by 2. Elements at even indices contain document indices, and elements at odd indices contain the frequency of the word in that document. For example, an array with 4 values, [2, 15, 138, 7], means that the word appeared 15 times in the document at index 2 and 7 times in the document at index 138.
This array is produced by CaseNormalizer. The order of documents in this array is not defined.
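The interleaved layout can be read with a short loop that steps two elements at a time. The sketch below is self-contained and uses the example array from the description; the class TfByDocumentSketch and its method names are hypothetical helpers, not part of Carrot2. Because the document order is not defined, the lookup is a linear scan rather than a binary search.

```java
/** Hypothetical helpers for reading one word's tfByDocument array. */
final class TfByDocumentSketch {
    /** Returns the word's frequency in the given document, or 0 if the document is not listed. */
    static int frequencyIn(int[] pairs, int documentIndex) {
        // Even positions hold document indices, odd positions hold frequencies.
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (pairs[i] == documentIndex) {
                return pairs[i + 1];
            }
        }
        return 0;
    }

    /** Document Frequency is half the array length, since each document takes two slots. */
    static int documentFrequency(int[] pairs) {
        return pairs.length / 2;
    }

    public static void main(String[] args) {
        int[] pairs = {2, 15, 138, 7};
        System.out.println(frequencyIn(pairs, 2));     // 15
        System.out.println(frequencyIn(pairs, 138));   // 7
        System.out.println(documentFrequency(pairs));  // 2
    }
}
```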
-
stemIndex
public int[] stemIndex
A pointer to the PreprocessingContext.AllStems arrays for this word.
This array is produced by LanguageModelStemmer.
-
fieldIndices
public byte[] fieldIndices
Bit-packed indices of all fields in which this word appears at least once. Indexes (positions) of set bits are pointers to the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation and byte[] with index values is done by PreprocessingContext.toFieldIndexes(byte).
This array is produced by CaseNormalizer.
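To make the bit-packed representation concrete, here is a standalone sketch that expands one packed byte into the positions of its set bits. It is not the implementation of PreprocessingContext.toFieldIndexes(byte); the class FieldBitsSketch, the method decodeFieldBits, and the assumption that the least significant bit corresponds to field index 0 are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical decoder for a bit-packed field set; not the Carrot2 implementation. */
final class FieldBitsSketch {
    /** Returns the positions of set bits, assuming bit 0 (least significant) is field 0. */
    static byte[] decodeFieldBits(byte packed) {
        int bits = packed & 0xFF; // read the byte as unsigned so bit 7 is handled correctly
        byte[] indices = new byte[Integer.bitCount(bits)];
        int next = 0;
        for (byte position = 0; position < 8; position++) {
            if ((bits & (1 << position)) != 0) {
                indices[next++] = position;
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // Bits 0 and 2 set: the word appears in fields 0 and 2 under the assumed bit order.
        System.out.println(Arrays.toString(decodeFieldBits((byte) 0b0000_0101))); // [0, 2]
    }
}
```

A single byte can represent at most eight fields this way.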
-
-