Package org.carrot2.language
Class ExtendedWhitespaceTokenizer
java.lang.Object
org.carrot2.language.ExtendedWhitespaceTokenizer
- All Implemented Interfaces:
Tokenizer
public final class ExtendedWhitespaceTokenizer extends Object implements Tokenizer
A tokenizer separating input characters on whitespace, but capable of extracting more complex
tokens, such as URLs, e-mail addresses and sentence delimiters.
-
Field Summary
Fields inherited from interface org.carrot2.language.Tokenizer
TF_COMMON_WORD, TF_QUERY_WORD, TF_SEPARATOR_DOCUMENT, TF_SEPARATOR_FIELD, TF_SEPARATOR_SENTENCE, TF_TERMINATOR, TT_ACRONYM, TT_BARE_URL, TT_EMAIL, TT_EOF, TT_FILE, TT_FULL_URL, TT_HYPHTERM, TT_NUMERIC, TT_PUNCTUATION, TT_TERM, TYPE_MASK
Constructor Summary
Constructors
ExtendedWhitespaceTokenizer()
Method Summary
Modifier and Type, Method, Description
short nextToken(): Returns the next token from the input stream.
void reset(Reader input): Reset this tokenizer to start parsing another stream.
void setTermBuffer(MutableCharArray array): Sets the current token image to the provided buffer.
-
Constructor Details
-
ExtendedWhitespaceTokenizer
public ExtendedWhitespaceTokenizer()
-
-
Method Details
-
reset
public void reset(Reader input)
Reset this tokenizer to start parsing another stream.
nextToken
public short nextToken() throws IOException
Description copied from interface: Tokenizer
Returns the next token from the input stream.
Specified by:
nextToken in interface Tokenizer
Returns:
the type of the token as defined by Tokenizer.TT_TERM and the other constants, or Tokenizer.TT_EOF when the end of the data stream has been reached.
Throws:
IOException
See Also:
TokenTypeUtils
-
setTermBuffer
public void setTermBuffer(MutableCharArray array)
Description copied from interface: Tokenizer
Sets the current token image to the provided buffer.
Specified by:
setTermBuffer in interface Tokenizer
Parameters:
array - buffer in which the current token's image should be stored
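Usage sketch: the three methods above are typically combined into a pull-style loop. The example below is a minimal sketch, not an excerpt from the library. It assumes the Carrot2 core JAR is on the classpath, that MutableCharArray lives in org.carrot2.util, and that it has a constructor accepting a CharSequence; the class name TokenizerSketch and the helper tokenize are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.carrot2.language.ExtendedWhitespaceTokenizer;
import org.carrot2.language.Tokenizer;
import org.carrot2.util.MutableCharArray; // assumed package for MutableCharArray

final class TokenizerSketch {
  /** Hypothetical helper: returns "image [type=N]" for every token in the text. */
  static List<String> tokenize(String text) {
    Tokenizer tokenizer = new ExtendedWhitespaceTokenizer();
    // Assumption: MutableCharArray can be constructed from a CharSequence.
    MutableCharArray image = new MutableCharArray("");
    List<String> tokens = new ArrayList<>();
    try {
      // Point the tokenizer at a new input stream.
      tokenizer.reset(new StringReader(text));
      short type;
      // Pull tokens until the end of the stream is signalled.
      while ((type = tokenizer.nextToken()) != Tokenizer.TT_EOF) {
        // Ask the tokenizer to expose the current token's image in our buffer.
        tokenizer.setTermBuffer(image);
        // TYPE_MASK isolates the TT_* token type from any TF_* flag bits.
        int baseType = type & Tokenizer.TYPE_MASK;
        tokens.add(image.toString() + " [type=" + baseType + "]");
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return tokens;
  }

  public static void main(String[] args) {
    for (String token : tokenize("Contact us at info@example.com or visit http://example.com.")) {
      System.out.println(token);
    }
  }
}
```

The token image is copied out with toString() inside the loop because the buffer passed to setTermBuffer is reused on every iteration. The numeric types printed here can be compared against the TT_ constants inherited from Tokenizer.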
-