Package org.carrot2.language
Interface Tokenizer
- All Known Implementing Classes: ExtendedWhitespaceTokenizer
public interface Tokenizer
Splits input characters into tokens representing, for example, words, digits, acronyms, and punctuation. For each token, the following information is available:
- token type: the kind of token, such as a number, URI, punctuation mark, or acronym. See the constants in this interface declared with the TT_ prefix, e.g. TT_TERM.
- token flags: additional flags, such as an indication of whether a punctuation token is a sentence delimiter (TF_SEPARATOR_SENTENCE).
- See Also:
TokenTypeUtils
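The split between token type and token flags can be sketched as follows. This is an illustration only: it assumes that the short returned by nextToken() carries the type in the bits selected by TYPE_MASK and the flags in the remaining bits. The class TokenCodes, its helper methods, and all numeric values are hypothetical stand-ins; the real values are listed under Constant Field Values and are not reproduced here.

```java
// Stand-in constants for illustration. The names mirror the fields documented
// on this page, but every numeric value below is an assumption, not the real one.
final class TokenCodes {
  static final int TYPE_MASK = 0x000F;               // hypothetical value
  static final int TT_TERM = 0x0001;                 // hypothetical value
  static final int TT_PUNCTUATION = 0x0003;          // hypothetical value
  static final short TF_SEPARATOR_SENTENCE = 0x0100; // hypothetical value

  private TokenCodes() {}

  // Keeps only the bits selected by TYPE_MASK, dropping any flag bits.
  static int typeOf(short token) {
    return token & TYPE_MASK;
  }

  // True when the sentence separator flag bit is set on the token.
  static boolean isSentenceSeparator(short token) {
    return (token & TF_SEPARATOR_SENTENCE) != 0;
  }
}
```

With these stand-in values, a punctuation token that ends a sentence would be TT_PUNCTUATION combined with TF_SEPARATOR_SENTENCE; typeOf recovers the type and isSentenceSeparator reports the flag. In real code, prefer the helpers in TokenTypeUtils over hand-written masking.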
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static short TF_COMMON_WORD: The current token is a common word.
- static short TF_QUERY_WORD: The current token is part of the query.
- static short TF_SEPARATOR_DOCUMENT: Current token is a document separator (never returned from parsing).
- static short TF_SEPARATOR_FIELD: Current token separates document's logical fields.
- static short TF_SEPARATOR_SENTENCE: Current token is a sentence separator.
- static short TF_TERMINATOR: Current token terminates the input (never returned from parsing).
- static int TT_ACRONYM
- static int TT_BARE_URL
- static int TT_EMAIL
- static int TT_EOF: Indicates the end of the token stream.
- static int TT_FILE
- static int TT_FULL_URL
- static int TT_HYPHTERM
- static int TT_NUMERIC
- static int TT_PUNCTUATION
- static int TT_TERM
- static int TYPE_MASK
-
Method Summary
Abstract instance methods (Modifier and Type, Method, Description):
- short nextToken(): Returns the next token from the input stream.
- void reset(Reader reader): Resets the tokenizer to process new data.
- void setTermBuffer(MutableCharArray array): Sets the current token image to the provided buffer.
-
Field Detail
-
TYPE_MASK
static final int TYPE_MASK
- See Also:
- Constant Field Values
-
TT_TERM
static final int TT_TERM
- See Also:
- Constant Field Values
-
TT_NUMERIC
static final int TT_NUMERIC
- See Also:
- Constant Field Values
-
TT_PUNCTUATION
static final int TT_PUNCTUATION
- See Also:
- Constant Field Values
-
TT_EMAIL
static final int TT_EMAIL
- See Also:
- Constant Field Values
-
TT_ACRONYM
static final int TT_ACRONYM
- See Also:
- Constant Field Values
-
TT_FULL_URL
static final int TT_FULL_URL
- See Also:
- Constant Field Values
-
TT_BARE_URL
static final int TT_BARE_URL
- See Also:
- Constant Field Values
-
TT_FILE
static final int TT_FILE
- See Also:
- Constant Field Values
-
TT_HYPHTERM
static final int TT_HYPHTERM
- See Also:
- Constant Field Values
-
TT_EOF
static final int TT_EOF
Indicates the end of the token stream.
- See Also:
- Constant Field Values
-
TF_SEPARATOR_SENTENCE
static final short TF_SEPARATOR_SENTENCE
Current token is a sentence separator.
- See Also:
- Constant Field Values
-
TF_SEPARATOR_DOCUMENT
static final short TF_SEPARATOR_DOCUMENT
Current token is a document separator (never returned from parsing).
- See Also:
- Constant Field Values
-
TF_SEPARATOR_FIELD
static final short TF_SEPARATOR_FIELD
Current token separates document's logical fields.
- See Also:
- Constant Field Values
-
TF_TERMINATOR
static final short TF_TERMINATOR
Current token terminates the input (never returned from parsing).
- See Also:
- Constant Field Values
-
TF_COMMON_WORD
static final short TF_COMMON_WORD
The current token is a common word. This flag is not directly available from the tokenizer.
-
TF_QUERY_WORD
static final short TF_QUERY_WORD
The current token is part of the query. This flag is not directly available from the tokenizer.
-
Method Detail
-
reset
void reset(Reader reader) throws IOException
Resets the tokenizer to process new data.
- Parameters:
reader - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.
- Throws:
IOException
-
nextToken
short nextToken() throws IOException
Returns the next token from the input stream.
- Returns:
the type of the token as defined by TT_TERM and other constants, or TT_EOF when the end of the data stream has been reached.
- Throws:
IOException
- See Also:
TokenTypeUtils
-
setTermBuffer
void setTermBuffer(MutableCharArray array)
Sets the current token image to the provided buffer.
- Parameters:
array - buffer in which the current token's image should be stored
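The three methods are typically combined into a reset, nextToken, setTermBuffer loop. The sketch below shows one plausible shape of that loop under stated assumptions; it is not the library's implementation. SketchTokenizer, WhitespaceSketchTokenizer, and TokenLoopSketch are hypothetical stand-ins, StringBuilder takes the place of MutableCharArray, the constant values are made up, and the sketch assumes setTermBuffer is called after each nextToken to copy the current token's image into the caller's buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in mirroring the documented method shapes; NOT the Carrot2 interface.
interface SketchTokenizer {
  int TT_TERM = 0x0001; // hypothetical value
  int TT_EOF = 0x0000;  // hypothetical value

  void reset(Reader reader) throws IOException;
  short nextToken() throws IOException;
  void setTermBuffer(StringBuilder buffer); // StringBuilder replaces MutableCharArray
}

// Toy implementation: splits on whitespace and reports every token as TT_TERM.
final class WhitespaceSketchTokenizer implements SketchTokenizer {
  private Reader reader;
  private final StringBuilder image = new StringBuilder();

  @Override
  public void reset(Reader reader) {
    this.reader = reader; // the caller keeps ownership; the reader is never closed here
    image.setLength(0);
  }

  @Override
  public short nextToken() throws IOException {
    image.setLength(0);
    int c;
    // Skip leading whitespace.
    while ((c = reader.read()) != -1 && Character.isWhitespace(c)) { }
    if (c == -1) {
      return (short) TT_EOF;
    }
    // Collect characters up to the next whitespace or the end of input.
    do {
      image.append((char) c);
    } while ((c = reader.read()) != -1 && !Character.isWhitespace(c));
    return (short) TT_TERM;
  }

  @Override
  public void setTermBuffer(StringBuilder buffer) {
    buffer.setLength(0);
    buffer.append(image); // copy the current token's image into the caller's buffer
  }
}

final class TokenLoopSketch {
  private TokenLoopSketch() {}

  // Collects the image of every token until TT_EOF is returned.
  static List<String> tokens(SketchTokenizer tokenizer, String text) {
    List<String> out = new ArrayList<>();
    StringBuilder buffer = new StringBuilder();
    // The caller closes the reader, since the tokenizer does not.
    try (Reader reader = new StringReader(text)) {
      tokenizer.reset(reader);
      while (tokenizer.nextToken() != SketchTokenizer.TT_EOF) {
        tokenizer.setTermBuffer(buffer);
        out.add(buffer.toString());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }
}
```

The try-with-resources block reflects the note on reset: because the tokenizer leaves the reader open at the end of the stream, closing it is the caller's responsibility. Reusing a single buffer across iterations avoids allocating a new object per token.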