Class PreprocessingContext.AllTokens

  • Enclosing class:
    PreprocessingContext

    public class PreprocessingContext.AllTokens
    extends Object
    Information about all tokens of the input documents. Each element of each array corresponds to one individual token from the input or to a synthetic separator inserted between documents, fields and sentences. The last element of each array is a special terminator entry.

    All arrays in this class have the same length, and the values stored at a given index in the different arrays all describe the same token.
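
    The parallel-array layout can be illustrated with a minimal sketch. It does not use the library itself: ParallelTokenArraysSketch and tokensPerDocument are hypothetical names, and the arrays are passed in as plain parameters shaped like documentIndex and fieldIndex. The sketch assumes that the terminator entry also carries -1 in both arrays, and it ignores the type flags, so sentence separators are not filtered out.

```java
/** Hypothetical helper, not part of the documented API. */
final class ParallelTokenArraysSketch {
    private ParallelTokenArraysSketch() {}

    /**
     * Counts entries per document by walking two parallel arrays with one index.
     * Entries whose document or field index is negative (document separators,
     * field separators and, by assumption, the terminator) are skipped.
     */
    static int[] tokensPerDocument(int[] documentIndex, byte[] fieldIndex, int documentCount) {
        if (documentIndex.length != fieldIndex.length) {
            throw new IllegalArgumentException("Parallel arrays must have equal length");
        }
        int[] counts = new int[documentCount];
        for (int i = 0; i < documentIndex.length; i++) {
            // The same index i addresses the same token in every array.
            if (documentIndex[i] < 0 || fieldIndex[i] < 0) {
                continue;
            }
            counts[documentIndex[i]]++;
        }
        return counts;
    }
}
```

    Because every array shares one index space, a single loop variable is enough to read all properties of a token.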

    • Field Detail

      • type

        public short[] type
        Tokenizer bit flags describing the token.

        This array is produced by InputTokenizer.

      • fieldIndex

        public byte[] fieldIndex
        Index of the document field the token came from. The index points to arrays in PreprocessingContext.AllFields and is equal to -1 for document and field separators.

        This array is produced by InputTokenizer.

      • documentIndex

        public int[] documentIndex
        Index of the document this token came from; it points to elements of documents. Equal to -1 for document separators.

        This array is produced by InputTokenizer.

        This array is accessed in CaseNormalizer and PhraseExtractor to compute by-document statistics, e.g. tf-by-document, which are then needed to build a VSM or to assign documents to labels.

        An alternative to this representation would be to create an AllDocuments holder that keeps an array of start token indexes for each document, and then to refactor the model building code to do a binary search to determine the document index for a given token index. This is likely to be a significant performance hit: the model building code accesses the documentIndex array in an essentially random pattern (in the suffix order), so it would perform twice-the-number-of-tokens binary searches, unless some other data structure can help here.
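
        The binary-search alternative discussed above could look roughly like the following sketch. DocumentStartLookupSketch, documentOf and documentStart are hypothetical names; no AllDocuments holder exists in this class. The sketch assumes the start indexes are strictly ascending and maps every position, separators included, to the document whose start precedes it.

```java
import java.util.Arrays;

/** Hypothetical lookup, sketching the alternative to a per-token documentIndex array. */
final class DocumentStartLookupSketch {
    private DocumentStartLookupSketch() {}

    /**
     * Returns the index d such that documentStart[d] <= tokenIndex and either
     * d is the last document or tokenIndex < documentStart[d + 1].
     * Returns -1 if tokenIndex lies before the first document start.
     */
    static int documentOf(int[] documentStart, int tokenIndex) {
        int pos = Arrays.binarySearch(documentStart, tokenIndex);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the owning
        // document is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

        Each lookup costs O(log d) for d documents, compared with a single array read for documentIndex, which is the trade-off the paragraph above describes.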

      • suffixOrder

        public int[] suffixOrder
        The suffix order of tokens. Suffixes starting with a separator come at the end of the array.

        This array is produced by PhraseExtractor.

      • lcp

        public int[] lcp
        The Longest Common Prefix (LCP) lengths for adjacent token sequences in suffix order.

        This array is produced by PhraseExtractor.
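
        To make the relationship between suffixOrder and lcp concrete, here is a naive sketch over an array of integer token codes. It is not how PhraseExtractor builds these arrays. SuffixOrderSketch and its methods are hypothetical, a negative code stands for any separator, and the convention that lcp[i] compares the suffixes at positions i - 1 and i of the order (with lcp[0] = 0) is an assumption of this example.

```java
import java.util.stream.IntStream;

/** Hypothetical, quadratic-time illustration of suffix order and LCP. */
final class SuffixOrderSketch {
    private SuffixOrderSketch() {}

    // Separators (negative codes) sort after every regular token.
    private static int key(int code) {
        return code < 0 ? Integer.MAX_VALUE : code;
    }

    static int compareSuffixes(int[] tokens, int a, int b) {
        int n = tokens.length;
        while (a < n && b < n) {
            int ka = key(tokens[a]);
            int kb = key(tokens[b]);
            if (ka != kb) {
                return Integer.compare(ka, kb);
            }
            a++;
            b++;
        }
        // The suffix that ran out first is a prefix of the other and sorts first.
        return (a == n ? 0 : 1) - (b == n ? 0 : 1);
    }

    /** Returns the start positions of all suffixes, sorted lexicographically. */
    static int[] suffixOrder(int[] tokens) {
        return IntStream.range(0, tokens.length)
                .boxed()
                .sorted((x, y) -> compareSuffixes(tokens, x, y))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    /** Common prefix length of neighbouring suffixes; stops at separators. */
    static int[] lcp(int[] tokens, int[] order) {
        int n = tokens.length;
        int[] lcp = new int[order.length];
        for (int i = 1; i < order.length; i++) {
            int a = order[i - 1];
            int b = order[i];
            int k = 0;
            while (a + k < n && b + k < n
                    && tokens[a + k] >= 0 && tokens[a + k] == tokens[b + k]) {
                k++;
            }
            lcp[i] = k;
        }
        return lcp;
    }
}
```

        For the token codes {1, 2, -1, 1, 2, 3}, this sketch yields the order {3, 0, 4, 1, 5, 2}, with the separator-initial suffix last, and the LCP values {0, 2, 0, 1, 0, 0}; the value 2 reflects the repeated sequence "1 2".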

    • Constructor Detail

      • AllTokens

        public AllTokens()
    • Method Detail