Class PreprocessingContext.AllWords

  • Enclosing class:
    PreprocessingContext

    public class PreprocessingContext.AllWords
    extends Object
    Information about all unique words found in the input documents. An entry in each parallel array corresponds to one conflated form of a word. For example, data and DATA will most likely become a single entry in the words table. However, different grammatical forms of a single lemma (like computer and computers) will have different entries in the words table. See PreprocessingContext.AllStems for inflection-conflated versions.

    All arrays in this class have the same length, and values stored at the same index in different arrays describe the same word.
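
    As a minimal sketch of the parallel-array layout, the snippet below reads the same index from two arrays to describe one word. The class ParallelArraysSketch and its method describe are hypothetical names used only for illustration, and the sample data merely stands in for the image and tf fields documented below.

```java
final class ParallelArraysSketch {
    /** Builds a description of one word by reading the same index from each array. */
    static String describe(char[][] image, int[] tf, int wordIndex) {
        return new String(image[wordIndex]) + " (tf=" + tf[wordIndex] + ")";
    }

    public static void main(String[] args) {
        // Hypothetical data standing in for AllWords.image and AllWords.tf.
        char[][] image = { "MacOS".toCharArray(), "data".toCharArray() };
        int[] tf = { 15, 4 };
        for (int i = 0; i < image.length; i++) {
            System.out.println(describe(image, tf, i));
        }
    }
}
```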

    • Field Detail

      • image

        public char[][] image
        The most frequently appearing variant of the word with respect to case. For example, if the token MacOS appeared 12 times in the input and macos appeared 3 times, the image will be MacOS.

        This array is produced by CaseNormalizer.
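
        The selection rule described above can be illustrated with a small sketch. It is not the CaseNormalizer implementation: the class ImageSelectionSketch and the method mostFrequentVariant are hypothetical, and the tie-breaking behavior (the first variant encountered wins) is an assumption, since the documentation does not define it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImageSelectionSketch {
    /**
     * Returns the most frequent case variant of the given word among the tokens,
     * or null if the word does not occur. Ties go to the variant seen first
     * (an assumption made for this sketch only).
     */
    static String mostFrequentVariant(String[] tokens, String word) {
        // Count each distinct case variant, preserving first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            if (token.equalsIgnoreCase(word)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```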

      • type

        public short[] type
        Token type of this word copied from PreprocessingContext.AllTokens.type. Additional flags are set for each word by CaseNormalizer and LanguageModelStemmer.

        This array is produced by CaseNormalizer and modified by LanguageModelStemmer.

        See Also:
        Tokenizer

      • tf

        public int[] tf
        Term Frequency of the word, aggregated across all variants with respect to case. Frequencies for each variant separately are not available.

        This array is produced by CaseNormalizer.

      • tfByDocument

        public int[][] tfByDocument
        Term Frequency of the word for each document. Each word's entry is an array whose length is twice the number of documents the word appeared in (its Document Frequency). Elements at even indices contain document indices, and elements at odd indices contain the frequency of the word in that document. For example, an array with 4 values: [2, 15, 138, 7] means that the word appeared 15 times in the document at index 2 and 7 times in the document at index 138.

        This array is produced by CaseNormalizer. The order of documents in this array is not defined.
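
        The interleaved layout can be decoded with a loop that steps through the array two elements at a time. The sketch below assumes only the layout described above; TfByDocumentSketch, decode and documentFrequency are hypothetical names, not part of the documented API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TfByDocumentSketch {
    /** Decodes [docIndex, tf, docIndex, tf, ...] into a map from document index to frequency. */
    static Map<Integer, Integer> decode(int[] pairs) {
        Map<Integer, Integer> tfByDoc = new LinkedHashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            // Even position: document index. Odd position: frequency in that document.
            tfByDoc.put(pairs[i], pairs[i + 1]);
        }
        return tfByDoc;
    }

    /** Document Frequency is half the array length, since each document takes two slots. */
    static int documentFrequency(int[] pairs) {
        return pairs.length / 2;
    }
}
```

        Because the order of documents is not defined, a caller that needs sorted output would have to sort the decoded entries itself.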

      • stemIndex

        public int[] stemIndex
        A pointer to the PreprocessingContext.AllStems arrays for this word.

        This array is produced by LanguageModelStemmer.
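
        Since inflected forms such as computer and computers have separate word entries, stemIndex is what ties them back together. The following sketch groups word images by their stem index. StemIndexSketch and groupByStem are hypothetical names, and the sample stem indices used in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StemIndexSketch {
    /** Groups word images by the stem index stored at the same position. */
    static Map<Integer, List<String>> groupByStem(char[][] image, int[] stemIndex) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (int w = 0; w < image.length; w++) {
            groups.computeIfAbsent(stemIndex[w], k -> new ArrayList<>())
                  .add(new String(image[w]));
        }
        return groups;
    }
}
```

        For instance, with images computer, computers, data and hypothetical stem indices 0, 0, 1, the first two words end up in the same group.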

      • fieldIndices

        public byte[] fieldIndices
        A bit-packed index of all fields in which this word appears at least once. Indexes (positions) of set bits are pointers to the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation and byte[] with index values is done by PreprocessingContext.toFieldIndexes(byte).

        This array is produced by CaseNormalizer.
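
        To make the bit-packed representation concrete, here is a sketch of unpacking one byte into the positions of its set bits. It is an illustration only, not the code behind PreprocessingContext.toFieldIndexes(byte): FieldIndicesSketch and unpack are hypothetical names, the int[] return type is a choice made for this sketch, and it assumes the least significant bit corresponds to field index 0.

```java
final class FieldIndicesSketch {
    /** Returns the positions of the set bits in the packed byte, lowest position first. */
    static int[] unpack(byte packed) {
        int bits = packed & 0xFF; // treat the byte as unsigned
        int[] indices = new int[Integer.bitCount(bits)];
        int next = 0;
        for (int bit = 0; bit < 8; bit++) {
            // Assumption: bit 0 (least significant) maps to field index 0.
            if ((bits & (1 << bit)) != 0) {
                indices[next++] = bit;
            }
        }
        return indices;
    }
}
```

        Under that assumption, a packed value of binary 101 means the word appears in the fields at indices 0 and 2. A single byte can describe at most 8 fields.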

    • Constructor Detail

      • AllWords

        public AllWords()
    • Method Detail