Class PreprocessingContext.AllStems

  • Enclosing class:
    PreprocessingContext

    public class PreprocessingContext.AllStems
    extends Object
    Information about all unique stems found in the input documents. Each entry in each array corresponds to one base form to which different words can be reduced by the Stemmer used during processing. For example, the English words mining and mine are aggregated into a single entry in these arrays, whereas they have separate entries in PreprocessingContext.AllWords.

    All arrays in this class have the same length, and values stored at the same index in different arrays describe the same stem.
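
    The parallel-array layout described above can be sketched as follows. This is a minimal illustration only: ParallelStemArrays is a hypothetical stand-in that copies two of the field names documented below, it is not the library class, and the sample values are invented.

```java
// Hypothetical stand-in for AllStems: same field names and types, invented data.
final class ParallelStemArrays {
    final char[][] image;
    final int[] mostFrequentOriginalWordIndex;

    ParallelStemArrays(char[][] image, int[] mostFrequentOriginalWordIndex) {
        // All arrays must have the same length: one slot per unique stem.
        if (image.length != mostFrequentOriginalWordIndex.length) {
            throw new IllegalArgumentException("parallel arrays differ in length");
        }
        this.image = image;
        this.mostFrequentOriginalWordIndex = mostFrequentOriginalWordIndex;
    }

    // Number of unique stems, shared by every array.
    int size() {
        return image.length;
    }

    // Reads the same index from each array to describe a single stem.
    String describe(int stemIndex) {
        return new String(image[stemIndex])
            + " -> word #" + mostFrequentOriginalWordIndex[stemIndex];
    }

    public static void main(String[] args) {
        ParallelStemArrays stems = new ParallelStemArrays(
            new char[][] {"mine".toCharArray(), "data".toCharArray()},
            new int[] {1, 2});
        for (int i = 0; i < stems.size(); i++) {
            System.out.println(stems.describe(i));
        }
    }
}
```

    A stem is therefore identified by its index alone, and the same index is used to read every property of that stem.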

    • Field Detail

      • image

        public char[][] image
        Stem image as produced by the Stemmer. The image may not correspond to any correct word.

        This array is produced by LanguageModelStemmer.

      • mostFrequentOriginalWordIndex

        public int[] mostFrequentOriginalWordIndex
        Index into the PreprocessingContext.AllWords arrays that points to the most frequent original form of the stem. Pointers to the less frequent variants are not available.

        This array is produced by LanguageModelStemmer.
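
        Because a stem image may not be a correct word, a readable label for a stem is typically obtained by following this pointer. The sketch below assumes that the word images are available as a char[][] (named wordImages here as a stand-in for an array in PreprocessingContext.AllWords); the helper class and the sample data are hypothetical.

```java
// Hypothetical helper, not part of the library.
final class StemLabelSketch {
    /**
     * Returns the most frequent original word form for the given stem.
     * wordImages is an assumed stand-in for the word image array in AllWords.
     */
    static String mostFrequentForm(
            int[] mostFrequentOriginalWordIndex, char[][] wordImages, int stemIndex) {
        int wordIndex = mostFrequentOriginalWordIndex[stemIndex];
        return new String(wordImages[wordIndex]);
    }

    public static void main(String[] args) {
        // Invented data: words 0 and 1 share stem 0, word 2 has stem 1.
        char[][] wordImages = {
            "mine".toCharArray(), "mining".toCharArray(), "data".toCharArray()
        };
        int[] mostFrequentOriginalWordIndex = {1, 2};

        // Prints "mining"
        System.out.println(mostFrequentForm(mostFrequentOriginalWordIndex, wordImages, 0));
    }
}
```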

      • tfByDocument

        public int[][] tfByDocument
        Term frequency of the stem for each document. For the encoding of this array, see PreprocessingContext.AllWords.tfByDocument.

        This array is produced by LanguageModelStemmer. The order of documents in this array is not defined.
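
        The encoding itself is specified in PreprocessingContext.AllWords.tfByDocument and is not repeated here. The sketch below assumes a common layout in which each row holds (document index, term frequency) pairs flattened into one int[]; treat that layout as an assumption and check the referenced documentation before relying on it. Since the order of documents is not defined, the lookup scans the row instead of using binary search. The class and method names are hypothetical.

```java
// Hypothetical helper. ASSUMED row layout: [doc0, tf0, doc1, tf1, ...].
final class TfByDocumentSketch {
    // Frequency of the stem in one document, or 0 if the document is not listed.
    static int termFrequencyIn(int[] row, int documentIndex) {
        // Document order is undefined, so a linear scan over the pairs is used.
        for (int i = 0; i + 1 < row.length; i += 2) {
            if (row[i] == documentIndex) {
                return row[i + 1];
            }
        }
        return 0;
    }

    // Total frequency of the stem across all listed documents.
    static int totalTermFrequency(int[] row) {
        int total = 0;
        for (int i = 1; i < row.length; i += 2) {
            total += row[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Invented row: 2 occurrences in document 3, 5 occurrences in document 0.
        int[] row = {3, 2, 0, 5};
        System.out.println(termFrequencyIn(row, 0)); // prints 5
        System.out.println(totalTermFrequency(row)); // prints 7
    }
}
```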

      • fieldIndices

        public byte[] fieldIndices
        A bit-packed index of all fields in which this stem appears at least once. Indexes (positions) of selected bits are pointers to the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation and byte[] with index values is done by PreprocessingContext.toFieldIndexes(byte).

        This array is produced by LanguageModelStemmer.
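
        To illustrate the bit-packed representation, the sketch below decodes one byte into the positions of its set bits. It is a hypothetical stand-alone helper, not the implementation of PreprocessingContext.toFieldIndexes(byte), and it assumes that bit 0 is the least significant bit and corresponds to field index 0.

```java
// Hypothetical helper. ASSUMPTION: bit i (least significant bit = 0) marks field index i.
final class FieldBitsSketch {
    // Returns the positions of all set bits in ascending order.
    static int[] setBitPositions(byte packed) {
        int bits = packed & 0xFF; // treat the byte as unsigned
        int[] positions = new int[Integer.bitCount(bits)];
        int next = 0;
        for (int i = 0; i < 8; i++) {
            if ((bits & (1 << i)) != 0) {
                positions[next++] = i;
            }
        }
        return positions;
    }

    // Tests a single field without decoding the whole byte.
    static boolean appearsInField(byte packed, int fieldIndex) {
        return fieldIndex >= 0 && fieldIndex < 8
            && ((packed & 0xFF) & (1 << fieldIndex)) != 0;
    }

    public static void main(String[] args) {
        byte packed = 0b0000_0101; // bits 0 and 2 set
        // Prints "[0, 2]"
        System.out.println(java.util.Arrays.toString(setBitPositions(packed)));
        // Prints "false"
        System.out.println(appearsInField(packed, 1));
    }
}
```

        A single byte can mark at most eight fields under this layout, which is why the bit test is restricted to indices 0 through 7.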

    • Constructor Detail

      • AllStems

        public AllStems()
    • Method Detail