org.carrot2.language.ExtendedWhitespaceTokenizer

All Implemented Interfaces: Tokenizer

public final class ExtendedWhitespaceTokenizer
extends Object
implements Tokenizer

A tokenizer that splits input on whitespace and can also extract more complex tokens, such as URLs, e-mail addresses, and sentence delimiters.

Field Summary

Fields inherited from interface org.carrot2.language.Tokenizer
TF_COMMON_WORD, TF_QUERY_WORD, TF_SEPARATOR_DOCUMENT, TF_SEPARATOR_FIELD, TF_SEPARATOR_SENTENCE, TF_TERMINATOR, TT_ACRONYM, TT_BARE_URL, TT_EMAIL, TT_EOF, TT_FILE, TT_FULL_URL, TT_HYPHTERM, TT_NUMERIC, TT_PUNCTUATION, TT_TERM, TYPE_MASK
Constructor Summary

Constructors

Constructor Description

ExtendedWhitespaceTokenizer()

Method Summary

Modifier and Type	Method	Description
`short`	`nextToken()`	Returns the next token from the input stream.
`void`	`reset(Reader input)`	Resets this tokenizer to start parsing another stream.
`void`	`setTermBuffer(MutableCharArray array)`	Sets the current token image to the provided buffer.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ExtendedWhitespaceTokenizer
  
  public ExtendedWhitespaceTokenizer()
Method Details
- reset
  
  public void reset(Reader input)
  
  Resets this tokenizer to start parsing another stream.
  
  Specified by:
  
  reset in interface Tokenizer
  
  Parameters:
  
  input - the input to tokenize. The reader will not be closed by the tokenizer when the end of stream is reached.
- nextToken
  
  public short nextToken() throws IOException
  
  Description copied from interface: Tokenizer
  
  Returns the next token from the input stream.
  
  Specified by:
  
  nextToken in interface Tokenizer
  
  Returns:
  
  the type of the token, as defined by Tokenizer.TT_TERM and the other TT_ constants, or Tokenizer.TT_EOF when the end of the data stream has been reached.
  
  Throws:
  
  IOException
  
  See Also:
  
  TokenTypeUtils
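
  The inherited constants include both TT_ type codes and TF_ flags alongside TYPE_MASK, which suggests that a returned value may carry flag bits in addition to the type code. The following is a minimal sketch of separating the two, assuming that TYPE_MASK selects the type bits and that each TF_ constant is a distinct bit. The class name `TokenTypeMaskSketch` and all numeric values are placeholders chosen for illustration; the real values are defined in Tokenizer.

```java
/** Illustrative only: placeholder constants, not the values defined by Tokenizer. */
final class TokenTypeMaskSketch {
  // Assumed layout: low bits hold the token type, higher bits hold flags.
  static final short TYPE_MASK = 0x000F;
  static final short TT_TERM = 0x0001;
  static final short TT_PUNCTUATION = 0x0003;
  static final short TF_SEPARATOR_SENTENCE = 0x0800;

  /** Strips flag bits and returns only the token type code. */
  static int type(short rawToken) {
    return rawToken & TYPE_MASK;
  }

  /** Returns true if the given flag bit is set on the raw token value. */
  static boolean hasFlag(short rawToken, short flag) {
    return (rawToken & flag) != 0;
  }

  public static void main(String[] args) {
    // A punctuation token that also ends a sentence, under the assumed layout.
    short raw = (short) (TT_PUNCTUATION | TF_SEPARATOR_SENTENCE);
    System.out.println("type matches TT_PUNCTUATION: " + (type(raw) == TT_PUNCTUATION));
    System.out.println("ends sentence: " + hasFlag(raw, TF_SEPARATOR_SENTENCE));
  }
}
```

  In real code, prefer the constants from Tokenizer and the helpers in TokenTypeUtils over hand-written masks.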
- setTermBuffer
  
  public void setTermBuffer(MutableCharArray array)
  
  Description copied from interface: Tokenizer
  
  Sets the current token image to the provided buffer.
  
  Specified by:
  
  setTermBuffer in interface Tokenizer
  
  Parameters:
  
  array - buffer in which the current token's image should be stored
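
The three methods above are typically used together: reset the tokenizer on a reader, call nextToken until it reports the end of the stream, and copy each token image into a caller-supplied buffer. The sketch below shows that call pattern with stand-in types so that it runs without Carrot2 on the classpath. `SketchTokenizer`, `WhitespaceOnly`, the `StringBuilder` buffer (in place of MutableCharArray), and the constant values are all hypothetical; the toy implementation splits on whitespace only and does not recognize URLs, e-mail addresses, or sentence delimiters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: stand-ins for the Carrot2 types, not the library's classes. */
final class TokenLoopSketch {
  // Placeholder token codes; the real values are defined in Tokenizer.
  static final short TT_TERM = 1;
  static final short TT_EOF = -1;

  /** Stand-in mirroring the three documented methods. */
  interface SketchTokenizer {
    void reset(Reader input);
    short nextToken() throws IOException;
    void setTermBuffer(StringBuilder buffer); // the real method takes a MutableCharArray
  }

  /** Toy tokenizer: whitespace splitting only, every token reported as TT_TERM. */
  static final class WhitespaceOnly implements SketchTokenizer {
    private Reader input;
    private final StringBuilder image = new StringBuilder();

    @Override public void reset(Reader input) {
      this.input = input;
      image.setLength(0);
    }

    @Override public short nextToken() throws IOException {
      image.setLength(0);
      int c = input.read();
      while (c != -1 && Character.isWhitespace(c)) c = input.read(); // skip separators
      if (c == -1) return TT_EOF;
      while (c != -1 && !Character.isWhitespace(c)) {                // collect one token
        image.append((char) c);
        c = input.read();
      }
      return TT_TERM;
    }

    @Override public void setTermBuffer(StringBuilder buffer) {
      buffer.setLength(0);
      buffer.append(image); // copy the current token image into the caller's buffer
    }
  }

  /** Runs the reset / nextToken / setTermBuffer loop and collects the token images. */
  static List<String> tokens(SketchTokenizer tokenizer, String text) throws IOException {
    List<String> out = new ArrayList<>();
    StringBuilder buffer = new StringBuilder();
    // The tokenizer does not close the reader, so the caller does.
    try (Reader reader = new StringReader(text)) {
      tokenizer.reset(reader);
      while (tokenizer.nextToken() != TT_EOF) {
        tokenizer.setTermBuffer(buffer);
        out.add(buffer.toString());
      }
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(tokens(new WhitespaceOnly(), "see https://example.org now"));
  }
}
```

With the real class, the same loop would use ExtendedWhitespaceTokenizer, a MutableCharArray buffer, and the Tokenizer constants. Because reset accepts a new reader, a single tokenizer instance can be reused across inputs.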
