Dictionaries
Dictionaries help Carrot2 to improve the quality of clustering for a specific language. This section discusses the available dictionaries and ways of tuning them to match the content you cluster.
A dictionary defines a set of words or phrases that should be ignored when clustering content in a specific language. Carrot2 comes with basic dictionary files for all the supported languages. Additionally, you can provide per-request dictionaries in the REST API and configure ephemeral dictionaries using the Java API.
Types of dictionaries
Carrot2 supports two types of dictionaries:
- Word filters

  The set of words, commonly known as stop words, that should be ignored during clustering. Typically, word filters contain function words, such as of or between.

  A filtered-out word does not influence the clusters the algorithm creates, but may still appear in cluster labels, such as University of Washington, to aid readability.

  Word filters are stored in JSON syntax, separately for each language, in a file named lang.word-filters.json, where lang is the name of the language, for example english.word-filters.json. The location of the dictionary files is application-specific.

- Label filters

  The set of words or phrases that must not be used as cluster labels. You can use label filters to suppress offensive or non-informative labels, such as Official website.

  Label filters are stored in JSON syntax, separately for each language, in a file named lang.label-filters.json, where lang is the name of the language, for example english.label-filters.json. The location of the dictionary files is application-specific.
Location of dictionary files
The Carrot2 Document Clustering Server (REST API) loads dictionaries from a fixed, non-configurable location described below. The Java API does not dictate the location of these resources, but comes with reasonable defaults and a way to load them from a custom location.
Document Clustering Server
The Document Clustering Server reads dictionary files from the `web/service/resources` directory in the server's distribution directory. For changes applied to dictionary files to take effect, you need to restart the Document Clustering Server.
Java API
By default, the Carrot2 Java API tries to read dictionaries from the JAR file that provides the supplier of the corresponding `LanguageComponents` implementation. The following table lists the source JARs and their resource packages for the languages of the standard Carrot2 distribution.
Language(s) | JAR path (package) | JAR |
---|---|---|
Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish | org/carrot2/language | carrot2-core-4.3.0.jar |
Arabic, Armenian, Brazilian, Bulgarian, Croatian, Czech, Estonian, Galician, Greek, Hindi, Indonesian, Irish, Latvian, Lithuanian, Thai | org/carrot2/language/extras | carrot2-lang-lucene-4.3.0.jar |
Chinese-Simplified, Chinese-Traditional | org/carrot2/language/chinese | carrot2-lang-lucene-chinese-4.3.0.jar |
Japanese | org/carrot2/language/japanese | carrot2-lang-lucene-japanese-4.3.0.jar |
Korean | org/carrot2/language/korean | carrot2-lang-lucene-korean-4.3.0.jar |
Polish | org/carrot2/language/polish | carrot2-lang-polish-4.3.0.jar |
Location of dictionary resources for the languages of the standard Carrot2 distribution.
Note that you should not modify the default dictionaries in-place. Instead, copy the relevant dictionaries to your application-specific location and load them from that location.
Syntax
Both the word- and label filter dictionaries use a common syntax based on matchers. An example word or label filter dictionary is shown below.
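As a minimal sketch of this syntax, the dictionary below assumes the three optional properties named in this section (`exact`, `glob` and `regexp`), each holding an array of strings. The entries themselves are illustrative and are not taken from the default Carrot2 dictionaries.

```json
{
  "exact": [
    "Home page"
  ],
  "glob": [
    "more information *"
  ],
  "regexp": [
    "(?i)^page [0-9]+$"
  ]
}
```

Any of the three properties can be left out when that matcher type is not needed.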
A matcher entry is a description of how to determine whether a word or label should be filtered out. The `exact`, `glob` and `regexp` properties are optional and can contain an array of string entries for the specific matcher types, described in the following sections. If a word or a label matches any matcher of any type, it will be filtered out.
When performing word filtering, Carrot2 will check the lower-cased version of each word against the dictionary. This is to make it easy to paste a lower-cased list of stop words into the dictionary. This also means that mixed-case entries will never match any input word.
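For instance, a word filter dictionary built from a lower-cased stop word list could look like the sketch below. The entries are illustrative, and the layout assumes the `exact` property described in this section.

```json
{
  "exact": [
    "of",
    "between",
    "and"
  ]
}
```

An `exact` entry such as Between would never match here, because each input word is lower-cased before it is checked against the dictionary.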
On the other hand, when performing label filtering, Carrot2 will use the case-sensitive version of the label, as produced by the clustering algorithm. This means that your label filtering entries must take character case into account or use a matcher type that ignores character case, such as the `glob` matcher.
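The sketch below contrasts the two options in a hypothetical label filter dictionary. It assumes the `exact` and `glob` properties described in this section; the entries are illustrative.

```json
{
  "exact": [
    "Official website"
  ],
  "glob": [
    "official website"
  ]
}
```

Going by the rules in this section, the `exact` entry would suppress only the label Official website with that precise capitalization, whereas the `glob` entry would also be expected to suppress Official Website or OFFICIAL WEBSITE.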
Exact matcher
Exact matchers require exact, case-sensitive equality between the word or phrase and the dictionary entry. Exact matcher entries are fast to parse and very fast to apply during clustering.
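A label dictionary with two exact entries, DevOps and Windows 2000, would look roughly as follows. This is a sketch that assumes the `exact` property described in the Syntax section.

```json
{
  "exact": [
    "DevOps",
    "Windows 2000"
  ]
}
```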
The above label dictionary definitions will match labels DevOps and Windows 2000, but will not match Devops or Windows 2000 machine.
For case-insensitive matching, use glob matchers (preferably) or case-insensitive regular expression matchers.
Glob matcher
Glob matcher allows simple word-based wildcard matching. The primary use case of the glob matcher is case-insensitive matching of literal phrases, as well as "begins with…", "ends with…" or "contains…" types of expressions. Glob matcher entries are fast to parse and very fast to apply.
Matching rules
- Each entry must consist of one or more space-separated tokens.
- A token is a sequence of arbitrary characters, such as words, numbers, identifiers.
- Matching is case-insensitive by default. Letter case normalization is performed based on the `ROOT` Java locale, which performs language-neutral case conflation according to Unicode rules.
- A token put in single or double quotes, for example `"Rating***"`, is taken literally: matching is case-sensitive, and the `*` character inside quoted tokens is allowed and compared literally.
- To include quote characters in the token, escape them with the `\` character, for example: `\"information\"`.
- The following wildcard-matching tokens are recognized:
  - The `?` token matches exactly one (any) word (possessive operator).
  - The `*` token matches zero or more words (possessive operator).
  - The `*?` token matches zero or more words (reluctant operator).
  - The `+` token matches one or more words (possessive operator). This operator is functionally equivalent to: `? *`.
  - The `+?` token matches one or more words (reluctant operator). This operator is functionally equivalent to: `? *?`.
A few restrictions apply to wildcard operators.
- Wildcard characters (`*`, `+`) cannot be used to express prefixes or suffixes. For example, `programm*` is not supported.
- The `*?` and `+?` wildcards are reluctant matchers in the regular expression sense: they match the minimal sequence of tokens until the next token in the pattern.
- The `*` and `+` wildcards are possessive matchers in the regular expression sense: they match the maximum sequence of tokens until the next token in the pattern.
- Greedy operators are not supported.
Example entries
The following table shows a number of example glob entries. The "Non-matching strings" column also has an explanation why there is no match.
Entry | Matching strings | Non-matching strings |
---|---|---|
`more information` | | |
`more information *` | | |
`* information *` | | |
`+ information` | | |
`"Information" *` | | |
`data ?` | | |
`"Programm*"` | | |
`\"information\"` | | |
`programm*` | Illegal pattern, combinations of the `*` wildcard and other characters are not supported. | |
`"information` | Illegal pattern, unbalanced double quotes. | |
`*` | Illegal pattern, there must be at least one non-wildcard token. | |
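When glob entries are placed in a JSON dictionary, quote and backslash characters need JSON escaping on top of the glob syntax. The sketch below lists the legal entries from the table above inside a `glob` property, assuming the dictionary layout described in the Syntax section.

```json
{
  "glob": [
    "more information",
    "more information *",
    "* information *",
    "+ information",
    "\"Information\" *",
    "data ?",
    "\"Programm*\"",
    "\\\"information\\\""
  ]
}
```

Each double quote that belongs to a glob entry is written as `\"` in JSON, and in the last entry every backslash of the glob escape is doubled as well. Under the matching rules above, the first entry would be expected to match more information and MORE INFORMATION, but not more information about, since an entry without wildcards can only match strings with the same number of words.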
Regular expression matcher
The regular expression matcher checks words or labels against a list of regular expressions you provide.
The regular expressions must follow Java syntax. If any fragment of a word or label matches any regular expression provided in the dictionary, the word or label will be filtered out.
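Because a match on any fragment is enough, anchoring makes a large difference. The sketch below assumes the `regexp` property described in the Syntax section and uses illustrative patterns. Backslashes in Java regular expressions must be doubled inside JSON strings.

```json
{
  "regexp": [
    "(?i)website",
    "^[0-9]+$",
    "(?i)\\bfaq\\b"
  ]
}
```

The first pattern would filter out any label containing website in any letter case, such as Official Website. The second, anchored with `^` and `$`, only filters labels that consist entirely of digits; without the anchors, `[0-9]+` would filter every label that contains a digit. The third uses word boundaries to filter labels that contain faq as a separate word.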
Per-request dictionaries
In some contexts, it may be desirable to apply additional label or word exclusions for a specific clustering request. One example is allowing users to interactively tune the exclusions dictionary while observing the results in real time.
Per-request dictionaries are available both in the Java API, where ephemeral dictionaries is the more accurate term, and in the REST API. In both cases, the available set of filters and the matcher syntax are the same as described in this section. The per-request dictionaries are added on top of the existing dictionary files.
Please note that using per-request dictionaries will incur an extra performance cost associated with parsing the provided dictionary entries. In case of interactive label exclusion tuning, you may want to transfer the newly created dictionary entries to the main dictionary file, so that the exclusions apply for all further clustering requests.