Package org.carrot2.text.preprocessing
Class PreprocessingContext.AllStems
- java.lang.Object
  - org.carrot2.text.preprocessing.PreprocessingContext.AllStems
- Enclosing class: PreprocessingContext
public class PreprocessingContext.AllStems extends Object
Information about all unique stems found in the input documents. Each entry in each array corresponds to one base form that different words can be transformed to by the Stemmer used while processing. E.g. the English mining and mine will be aggregated to one entry in the arrays, while they will have separate entries in PreprocessingContext.AllWords.
All arrays in this class have the same length, and values across different arrays correspond to each other for the same index.
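The index correspondence described above can be illustrated with a minimal sketch. It uses a hypothetical stand-in class with two of the arrays and made-up values (the stem string "mine" and the counts are illustrative only), not the Carrot2 class itself.

```java
// Hypothetical stand-in for the parallel arrays; not the Carrot2 class itself.
final class ParallelStemArraysSketch {
    // One entry per unique stem. The same index addresses the same stem in every array.
    static final char[][] IMAGE = { "mine".toCharArray(), "data".toCharArray() };
    static final int[] TF = { 5, 3 };

    /** Describes one stem by reading every array at the same index. */
    static String describe(int stemIndex) {
        return new String(IMAGE[stemIndex]) + " tf=" + TF[stemIndex];
    }

    public static void main(String[] args) {
        for (int i = 0; i < IMAGE.length; i++) {
            System.out.println(describe(i));
        }
    }
}
```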
-
-
Field Summary
Fields
Modifier and Type, Field: Description
- byte[] fieldIndices: Bit-packed indices of all fields in which this stem appears at least once.
- char[][] image: Stem image as produced by the Stemmer; may not correspond to any correct word.
- int[] mostFrequentOriginalWordIndex: Pointer to the PreprocessingContext.AllWords arrays, to the most frequent original form of the stem.
- int[] tf: Term frequency of the stem, i.e. the sum of all PreprocessingContext.AllWords.tf values for which the PreprocessingContext.AllWords.stemIndex points to this stem.
- int[][] tfByDocument: Term frequency of the stem for each document.
-
Constructor Summary
Constructors
- AllStems()
-
-
-
Field Detail
-
image
public char[][] image
Stem image as produced by the Stemmer; may not correspond to any correct word.
This array is produced by LanguageModelStemmer.
-
mostFrequentOriginalWordIndex
public int[] mostFrequentOriginalWordIndex
Pointer to the PreprocessingContext.AllWords arrays, to the most frequent original form of the stem. Pointers to the less frequent variants are not available.
This array is produced by LanguageModelStemmer.
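A minimal sketch of following this pointer to obtain a readable label for a stem. The `wordImage` parameter is an assumption: it stands in for an array of word images indexed the same way as the PreprocessingContext.AllWords arrays. The class and method names are hypothetical.

```java
// Hypothetical helper; the arrays are passed in explicitly so the sketch stays self-contained.
final class MostFrequentFormSketch {
    /**
     * Returns the most frequent original word form for a stem by following
     * the pointer stored for that stem into the word-level image array.
     */
    static String displayForm(int stemIndex, int[] mostFrequentOriginalWordIndex, char[][] wordImage) {
        int wordIndex = mostFrequentOriginalWordIndex[stemIndex];
        return new String(wordImage[wordIndex]);
    }

    public static void main(String[] args) {
        // Illustrative data: words 0 and 1 share stem 0, word 2 belongs to stem 1.
        char[][] wordImage = { "mine".toCharArray(), "mining".toCharArray(), "data".toCharArray() };
        int[] mostFrequent = { 1, 2 };
        System.out.println(displayForm(0, mostFrequent, wordImage));
    }
}
```

Because only the most frequent variant is recorded, the less frequent forms of a stem cannot be recovered from this array alone.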
-
tf
public int[] tf
Term frequency of the stem, i.e. the sum of all PreprocessingContext.AllWords.tf values for which the PreprocessingContext.AllWords.stemIndex points to this stem.
This array is produced by LanguageModelStemmer.
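The summation described above can be sketched as follows. This is an illustration of the definition, not the code LanguageModelStemmer uses. The parameters `wordTf` and `wordStemIndex` stand in for the AllWords.tf and AllWords.stemIndex arrays, and the class and method names are hypothetical.

```java
// Hypothetical illustration of the definition: stem tf = sum of tf over the words mapped to that stem.
final class StemTfSketch {
    static int[] stemTf(int[] wordTf, int[] wordStemIndex, int stemCount) {
        int[] tf = new int[stemCount];
        for (int word = 0; word < wordTf.length; word++) {
            // Each word contributes its frequency to the stem it points to.
            tf[wordStemIndex[word]] += wordTf[word];
        }
        return tf;
    }

    public static void main(String[] args) {
        int[] wordTf = { 3, 2, 4 };        // illustrative counts for three words
        int[] wordStemIndex = { 0, 0, 1 }; // the first two words share stem 0
        System.out.println(java.util.Arrays.toString(stemTf(wordTf, wordStemIndex, 2)));
    }
}
```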
-
tfByDocument
public int[][] tfByDocument
Term frequency of the stem for each document. For the encoding of this array, see PreprocessingContext.AllWords.tfByDocument.
This array is produced by LanguageModelStemmer. The order of documents in this array is not defined.
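The encoding itself is not restated here, so the following sketch rests on an assumption: that each row holds consecutive (document index, frequency) pairs, with document indices at even positions and frequencies at odd positions. Check the PreprocessingContext.AllWords.tfByDocument documentation before relying on it. Since the order of documents is not defined, the lookup scans linearly rather than searching a sorted range. All names are hypothetical.

```java
// Hypothetical lookup under the ASSUMED pair encoding: [doc0, tf0, doc1, tf1, ...].
final class TfByDocumentSketch {
    /** Returns the stem's frequency in the given document, or 0 if the document is not listed. */
    static int tfInDocument(int[] pairs, int documentIndex) {
        // Document order is undefined, so every pair has to be inspected.
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (pairs[i] == documentIndex) {
                return pairs[i + 1];
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        int[] pairs = { 2, 1, 0, 3 }; // illustrative: document 2 once, document 0 three times
        System.out.println(tfInDocument(pairs, 0));
    }
}
```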
-
fieldIndices
public byte[] fieldIndices
Bit-packed indices of all fields in which this stem appears at least once. Indexes (positions) of selected bits are pointers to the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation and byte[] with index values is done by PreprocessingContext.toFieldIndexes(byte).
This array is produced by LanguageModelStemmer.
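To make the bit-packed representation concrete, here is a minimal sketch of unpacking one byte into field index values. It is not the implementation of PreprocessingContext.toFieldIndexes(byte). It assumes that the lowest bit corresponds to field index 0, and the class and method names are hypothetical.

```java
// Hypothetical unpacking routine; assumes bit position i (counted from the lowest bit) means field index i.
final class FieldBitsSketch {
    static byte[] toIndexes(byte packed) {
        int bits = packed & 0xFF; // treat the byte as unsigned
        byte[] indexes = new byte[Integer.bitCount(bits)];
        int next = 0;
        for (byte position = 0; position < 8; position++) {
            if ((bits & (1 << position)) != 0) {
                indexes[next++] = position;
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // 0b00000101 has bits 0 and 2 set, so the stem occurs in fields 0 and 2.
        System.out.println(java.util.Arrays.toString(toIndexes((byte) 0b00000101)));
    }
}
```

A single byte can mark at most eight fields, which is why one byte per stem suffices for this array.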
-
-