Package org.carrot2.language
Class ExtendedWhitespaceTokenizer
java.lang.Object
org.carrot2.language.ExtendedWhitespaceTokenizer
- All Implemented Interfaces:
- Tokenizer
public final class ExtendedWhitespaceTokenizer extends Object implements Tokenizer
A tokenizer separating input characters on whitespace, but capable of extracting more complex
 tokens, such as URLs, e-mail addresses and sentence delimiters.
Field Summary
Fields inherited from interface org.carrot2.language.Tokenizer
- TF_COMMON_WORD, TF_QUERY_WORD, TF_SEPARATOR_DOCUMENT, TF_SEPARATOR_FIELD, TF_SEPARATOR_SENTENCE, TF_TERMINATOR, TT_ACRONYM, TT_BARE_URL, TT_EMAIL, TT_EOF, TT_FILE, TT_FULL_URL, TT_HYPHTERM, TT_NUMERIC, TT_PUNCTUATION, TT_TERM, TYPE_MASK
Constructor Summary
- ExtendedWhitespaceTokenizer()
Method Summary
- short nextToken(): Returns the next token from the input stream.
- void reset(Reader input): Reset this tokenizer to start parsing another stream.
- void setTermBuffer(MutableCharArray array): Sets the current token image to the provided buffer.
Constructor Details
ExtendedWhitespaceTokenizer
public ExtendedWhitespaceTokenizer()
Method Details
reset
Reset this tokenizer to start parsing another stream.
nextToken
Description copied from interface: Tokenizer
Returns the next token from the input stream.
- Specified by:
- nextToken in interface Tokenizer
- Returns:
- the type of the token as defined by the Tokenizer.TT_TERM and other constants, or Tokenizer.TT_EOF when the end of the data stream has been reached.
- Throws:
- IOException
- See Also:
- TokenTypeUtils
 
setTermBuffer
Description copied from interface: Tokenizer
Sets the current token image to the provided buffer.
- Specified by:
- setTermBuffer in interface Tokenizer
- Parameters:
- array - buffer in which the current token's image should be stored
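The methods above imply a simple calling pattern: call reset(Reader) with the input, then call nextToken() repeatedly until it returns TT_EOF. The sketch below illustrates only that loop and is not Carrot2 code. The MiniTokenizer interface, its WhitespaceOnly implementation, the image() accessor and the constant values are all hypothetical stand-ins, defined so the example compiles without the library. With the real class, an ExtendedWhitespaceTokenizer instance would presumably take the place of WhitespaceOnly, and the token image would arrive through the buffer passed to setTermBuffer rather than through image().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
  /** Hypothetical stand-in that mirrors the documented reset/nextToken shape. */
  public interface MiniTokenizer {
    short TT_EOF = -1; // placeholder value, not taken from Carrot2
    short TT_TERM = 1; // placeholder value, not taken from Carrot2

    void reset(Reader input);

    short nextToken() throws IOException;

    /** Hypothetical accessor; the documented API uses setTermBuffer instead. */
    String image();
  }

  /** Splits on whitespace only; it does not detect URLs, e-mails or sentence ends. */
  public static final class WhitespaceOnly implements MiniTokenizer {
    private Reader input;
    private final StringBuilder image = new StringBuilder();

    @Override
    public void reset(Reader input) {
      this.input = input;
      image.setLength(0);
    }

    @Override
    public short nextToken() throws IOException {
      image.setLength(0);
      int c;
      // Skip any whitespace preceding the next token.
      while ((c = input.read()) != -1 && Character.isWhitespace(c)) {
        // nothing to do
      }
      if (c == -1) {
        return TT_EOF;
      }
      // Accumulate characters until the next whitespace or end of input.
      do {
        image.append((char) c);
      } while ((c = input.read()) != -1 && !Character.isWhitespace(c));
      return TT_TERM;
    }

    @Override
    public String image() {
      return image.toString();
    }
  }

  /** Resets the tokenizer on the given input and collects token images until TT_EOF. */
  public static List<String> collect(MiniTokenizer tokenizer, Reader input) throws IOException {
    List<String> images = new ArrayList<>();
    tokenizer.reset(input);
    while (tokenizer.nextToken() != MiniTokenizer.TT_EOF) {
      images.add(tokenizer.image());
    }
    return images;
  }

  public static void main(String[] args) throws IOException {
    MiniTokenizer tokenizer = new WhitespaceOnly();
    System.out.println(collect(tokenizer, new StringReader("first stream")));
    // The same instance is reused for another stream after reset.
    System.out.println(collect(tokenizer, new StringReader("second one")));
  }
}
```

Reusing one instance across streams, as main does, follows from the description of reset as a way "to start parsing another stream"; whether the real tokenizer is safe to share between threads is not stated on this page.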