Interface Tokenizer

  • All Known Implementing Classes:
    ExtendedWhitespaceTokenizer

    public interface Tokenizer
    Splits input characters into tokens such as words, digits, acronyms, and punctuation. For each token, the following information is available:
    token type
    The kind of token: number, URI, punctuation, acronym, and others. See the constants in this interface declared with the TT_ prefix, e.g. TT_TERM.
    token flags
    Additional token flags, such as an indication of whether a punctuation token is a sentence delimiter (TF_SEPARATOR_SENTENCE).
    See Also:
    TokenTypeUtils
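
    The relationship between a token's type and its flags can be illustrated with a small, self-contained sketch. This is not this interface's implementation or its actual method set. It assumes, purely for illustration, that type and flags are packed into one integer code. The constant values, the TYPE_MASK, TT_NUMERIC and TT_PUNCTUATION names, and the TokenSketch class are hypothetical; only the TT_TERM and TF_SEPARATOR_SENTENCE names come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; not the library's tokenizer. */
final class TokenSketch {
  // Hypothetical values chosen for this sketch; the real constants may differ.
  static final int TT_TERM = 0x0001;
  static final int TT_NUMERIC = 0x0002;
  static final int TT_PUNCTUATION = 0x0003;
  static final int TYPE_MASK = 0x00FF;             // assumption: low byte holds the type
  static final int TF_SEPARATOR_SENTENCE = 0x0100; // assumption: one flag bit above the type

  /** A token image together with its combined type-and-flags code. */
  static final class Token {
    final String image;
    final int code;

    Token(String image, int code) {
      this.image = image;
      this.code = code;
    }
  }

  /** Extracts the token type by masking out the flag bits. */
  static int type(int code) {
    return code & TYPE_MASK;
  }

  /** Tests whether the sentence-delimiter flag is set. */
  static boolean isSentenceDelimiter(int code) {
    return (code & TF_SEPARATOR_SENTENCE) != 0;
  }

  /** Splits input into letter/digit runs and single punctuation characters. */
  static List<Token> tokenize(String input) {
    List<Token> tokens = new ArrayList<>();
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      if (Character.isWhitespace(c)) {
        i++;
        continue;
      }
      if (Character.isLetterOrDigit(c)) {
        int start = i;
        boolean allDigits = true;
        while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
          allDigits &= Character.isDigit(input.charAt(i));
          i++;
        }
        tokens.add(new Token(input.substring(start, i), allDigits ? TT_NUMERIC : TT_TERM));
      } else {
        int code = TT_PUNCTUATION;
        if (c == '.' || c == '!' || c == '?') {
          code |= TF_SEPARATOR_SENTENCE; // punctuation that also ends a sentence
        }
        tokens.add(new Token(String.valueOf(c), code));
        i++;
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    for (Token t : tokenize("Call 42, now.")) {
      System.out.println(t.image + " type=" + type(t.code)
          + " sentenceEnd=" + isSentenceDelimiter(t.code));
    }
  }
}
```

    In this sketch both the comma and the full stop are punctuation tokens, but only the full stop carries the sentence-delimiter flag, which is the distinction the token flags entry describes. For the real constants and helper methods, consult this interface and TokenTypeUtils.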