Package org.carrot2.language
Class ExtendedWhitespaceTokenizer
java.lang.Object
  org.carrot2.language.ExtendedWhitespaceTokenizer
Field Summary
Fields inherited from interface org.carrot2.language.Tokenizer
TF_COMMON_WORD, TF_QUERY_WORD, TF_SEPARATOR_DOCUMENT, TF_SEPARATOR_FIELD, TF_SEPARATOR_SENTENCE, TF_TERMINATOR, TT_ACRONYM, TT_BARE_URL, TT_EMAIL, TT_EOF, TT_FILE, TT_FULL_URL, TT_HYPHTERM, TT_NUMERIC, TT_PUNCTUATION, TT_TERM, TYPE_MASK
Constructor Summary
ExtendedWhitespaceTokenizer()
Method Summary
short nextToken()
Returns the next token from the input stream.
void reset(Reader input)
Resets this tokenizer to start parsing another stream.
void setTermBuffer(MutableCharArray array)
Sets the current token image to the provided buffer.
Method Detail
reset
public void reset(Reader input)
Resets this tokenizer to start parsing another stream.
nextToken
public short nextToken() throws IOException
Description copied from interface: Tokenizer
Returns the next token from the input stream.
Specified by:
nextToken in interface Tokenizer
Returns:
the type of the token as defined by the Tokenizer.TT_TERM and other constants or Tokenizer.TT_EOF when the end of the data stream has been reached.
Throws:
IOException
See Also:
TokenTypeUtils
setTermBuffer
public void setTermBuffer(MutableCharArray array)
Description copied from interface: Tokenizer
Sets the current token image to the provided buffer.
Specified by:
setTermBuffer in interface Tokenizer
Parameters:
array - buffer in which the current token's image should be stored
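The three methods documented here suggest a pull-style loop: reset the tokenizer on a Reader, call nextToken() until it returns TT_EOF, and call setTermBuffer after each token to obtain its image. The sketch below illustrates that call order only. To stay runnable without the Carrot2 jar it uses a hypothetical stand-in class, SimpleWhitespaceTokenizer, with placeholder constant values and a StringBuilder in place of MutableCharArray. None of these names or values are part of Carrot2, and the real class distinguishes many more token types (URLs, e-mail addresses, acronyms and so on).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {

  /** Hypothetical stand-in that mirrors the reset / nextToken / setTermBuffer shape. */
  static final class SimpleWhitespaceTokenizer {
    // Placeholder values chosen for this sketch; they are not Carrot2's constants.
    static final short TT_TERM = 1;
    static final short TT_EOF = -1;

    private Reader input;
    private final StringBuilder image = new StringBuilder();

    /** Start parsing another stream. */
    void reset(Reader input) {
      this.input = input;
      image.setLength(0);
    }

    /** Reads the next whitespace-delimited token, or reports end of input. */
    short nextToken() throws IOException {
      image.setLength(0);
      int c = input.read();
      // Skip leading whitespace.
      while (c != -1 && Character.isWhitespace(c)) {
        c = input.read();
      }
      if (c == -1) {
        return TT_EOF;
      }
      // Collect characters up to the next whitespace or end of input.
      while (c != -1 && !Character.isWhitespace(c)) {
        image.append((char) c);
        c = input.read();
      }
      return TT_TERM;
    }

    /** Copies the current token image into the caller's buffer. */
    void setTermBuffer(StringBuilder target) {
      target.setLength(0);
      target.append(image);
    }
  }

  /** Runs the reset / nextToken / setTermBuffer loop and collects token images. */
  static List<String> tokenize(String text) {
    SimpleWhitespaceTokenizer tokenizer = new SimpleWhitespaceTokenizer();
    StringBuilder buffer = new StringBuilder(); // reused for every token
    List<String> terms = new ArrayList<>();
    try {
      tokenizer.reset(new StringReader(text));
      while (tokenizer.nextToken() != SimpleWhitespaceTokenizer.TT_EOF) {
        tokenizer.setTermBuffer(buffer);
        terms.add(buffer.toString());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return terms;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("data mining  tools"));
  }
}
```

Reusing a single buffer across iterations reflects the apparent intent of setTermBuffer, which takes a caller-supplied buffer rather than returning a new object per token. With the real class, the same loop would presumably use ExtendedWhitespaceTokenizer, Tokenizer.TT_EOF and a MutableCharArray instead of the stand-ins.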