Package org.carrot2.language
Lexical component interfaces and implementations.

Interface Summary

Interface                   Description
ComponentLoader
LanguageComponentsProvider
LexicalData                 Provides additional word and label filtering information for a given language.
Stemmer                     Simple lemmatization engine transforming an inflected form of a word to its base form or some other unique token.
Tokenizer                   Splits input characters into tokens representing e.g.

Class Summary

Class                              Description
DefaultLanguageComponentsProvider
ExtendedWhitespaceTokenizer        A tokenizer separating input characters on whitespace, but capable of extracting more complex tokens, such as URLs, e-mail addresses and sentence delimiters.
LanguageComponents                 A set of language-specific components.
LanguageComponentsProviderImpl
LexicalDataImpl                    LexicalData implemented on top of a hash set (stopwords) and a regular expression pattern (stoplabels).
SnowballStemmerAdapter             An adapter converting Snowball programs into Stemmer interface.
TokenTypeUtils                     Utility methods for working with Tokenizer attributes.
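
The LexicalDataImpl entry describes word and label filtering built on a hash set (stopwords) and a regular expression pattern (stoplabels). The following is a minimal, self-contained sketch of that idea only. It does not use or reproduce the Carrot2 classes: the class name SketchLexicalData and the methods isStopWord and isStopLabel are hypothetical, and both the case-insensitive word lookup and the whole-label regex match are assumptions of this sketch.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/**
 * Hypothetical illustration of stopword/stoplabel filtering.
 * Not the Carrot2 LexicalDataImpl class; names and behavior are assumptions.
 */
final class SketchLexicalData {
  private final Set<String> stopwords;
  private final Pattern stoplabels;

  SketchLexicalData(Set<String> stopwords, Pattern stoplabels) {
    // Store lower-cased copies so word lookups ignore case (sketch assumption).
    this.stopwords = new HashSet<>();
    for (String w : stopwords) {
      this.stopwords.add(w.toLowerCase(Locale.ROOT));
    }
    this.stoplabels = stoplabels;
  }

  /** Returns true if the single word is in the stopword hash set. */
  boolean isStopWord(CharSequence word) {
    return stopwords.contains(word.toString().toLowerCase(Locale.ROOT));
  }

  /** Returns true if the entire candidate label matches the stoplabel pattern. */
  boolean isStopLabel(CharSequence label) {
    return stoplabels != null && stoplabels.matcher(label).matches();
  }

  public static void main(String[] args) {
    SketchLexicalData data = new SketchLexicalData(
        Set.of("the", "of", "and"),
        Pattern.compile("(?i)(home|contact)( page)?"));

    System.out.println(data.isStopWord("The"));          // true
    System.out.println(data.isStopLabel("Home page"));   // true
    System.out.println(data.isStopLabel("data mining")); // false
  }
}
```

The hash set gives constant-time lookups for individual words, while a single compiled pattern lets a whole family of unwanted labels be expressed in one rule.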