Interface Tokenizer

All Known Implementing Classes:
ExtendedWhitespaceTokenizer

public interface Tokenizer
Splits input characters into tokens such as words, digits, acronyms and punctuation. For each token, the following information is available:
token type
The kind of token: number, URI, punctuation, acronym and others. See the constants declared in this interface with the TT_ prefix, e.g. TT_TERM.
token flags
Additional token flags, such as an indication of whether a punctuation token is a sentence delimiter (TF_SEPARATOR_SENTENCE).
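A minimal sketch of how a token type and token flags might be inspected together. It assumes that both are packed into a single int code, which the description above does not state. TT_TERM and TF_SEPARATOR_SENTENCE are the names mentioned above, but their numeric values here are invented; TT_PUNCTUATION, TYPE_MASK and the TokenSketch class are hypothetical and exist only for illustration.

```java
// Illustrative only: the packed-int layout and all numeric values are assumptions,
// not taken from the Tokenizer interface itself.
final class TokenSketch {
    // Token types (TT_ prefix). TT_TERM is named in the description; its value is assumed.
    static final int TT_TERM = 0x0001;
    // Hypothetical name and value for the punctuation token type.
    static final int TT_PUNCTUATION = 0x0003;

    // Hypothetical mask selecting the type bits of a packed token code.
    static final int TYPE_MASK = 0x00FF;

    // Token flag (TF_ prefix) named in the description; its value is assumed.
    static final int TF_SEPARATOR_SENTENCE = 0x0800;

    // Extracts the token type by discarding the flag bits.
    static int typeOf(int tokenCode) {
        return tokenCode & TYPE_MASK;
    }

    // True only for punctuation tokens that carry the sentence-separator flag.
    static boolean isSentenceDelimiter(int tokenCode) {
        return typeOf(tokenCode) == TT_PUNCTUATION
            && (tokenCode & TF_SEPARATOR_SENTENCE) != 0;
    }

    public static void main(String[] args) {
        int period = TT_PUNCTUATION | TF_SEPARATOR_SENTENCE; // e.g. "." ending a sentence
        int comma = TT_PUNCTUATION;                          // punctuation, no flag set
        int word = TT_TERM;                                  // an ordinary term

        System.out.println(isSentenceDelimiter(period)); // true
        System.out.println(isSentenceDelimiter(comma));  // false
        System.out.println(isSentenceDelimiter(word));   // false
    }
}
```

Under this assumed layout, masking before comparing keeps type checks independent of whichever flags happen to be set. For the actual constants and helper methods, see TokenTypeUtils.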
See Also:
TokenTypeUtils