Dictionaries

Dictionaries help Carrot2 to improve the quality of clustering for a specific language. This section discusses the available dictionaries and ways of tuning them to match the content you cluster.

A dictionary defines a set of words or phrases that should be ignored when clustering content in a specific language. Carrot2 comes with basic dictionary files for all the supported languages. Additionally, you can provide per-request dictionaries in the REST API and configure ephemeral dictionaries using the Java API.

Types of dictionaries

Carrot2 supports two types of dictionaries:

Word filters

The set of words, commonly known as stop words, that should be ignored during clustering. Typically, word filters would contain function words, such as of or between.

If a word is filtered out, it will not influence the clusters the algorithm creates, but it may still appear in cluster labels, such as University of Washington, to aid readability.

Word filters are stored in JSON syntax, separately for each language, in a file named lang.word-filters.json, where lang is the name of the language, for example english.word-filters.json. The location of the dictionary files is application-specific.
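
As an illustration, a minimal english.word-filters.json could look like the sketch below. It assumes the exact matcher described in the Syntax section and uses the two function words mentioned above. The entries are lower-case because words are lower-cased before they are checked against the dictionary.

```json
{
  "exact": [
    "of",
    "between"
  ]
}
```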

Label filters

The set of words or phrases that must not be used as cluster labels. You can use label filters to suppress offensive or non-informative labels, such as Official website.

Label filters are stored in JSON syntax, separately for each language, in a file named lang.label-filters.json, where lang is the name of the language, for example english.label-filters.json. The location of the dictionary files is application-specific.
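
As an illustration, a minimal english.label-filters.json could look like the sketch below. It assumes the glob matcher described in the Syntax section, which is case-insensitive by default, so the single entry would suppress Official website as well as official website.

```json
{
  "glob": [
    "official website"
  ]
}
```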

Location of dictionary files

Carrot2 Document Clustering Server (REST API) loads dictionaries from an explicit non-configurable location provided below. The Java API does not dictate the location of these resources, but comes with reasonable defaults and a way to load them from a custom location.

Document Clustering Server

The Document Clustering Server reads dictionary files from the web/service/resources directory in the server's distribution directory. For changes applied to dictionary files to take effect, you need to restart the Document Clustering Server.

Java API

By default, the Carrot2 Java API tries to read dictionaries from the JAR file that contains the supplier of the corresponding LanguageComponents implementation. The following table presents the source JARs and their resource packages for the languages of the standard Carrot2 distribution.

| Language(s) | JAR path (package) | JAR |
|---|---|---|
| Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish | org/carrot2/language | carrot2-core-4.5.1.jar |
| Arabic, Armenian, Brazilian, Bulgarian, Croatian, Czech, Estonian, Galician, Greek, Hindi, Indonesian, Irish, Latvian, Lithuanian, Thai | org/carrot2/language/extras | carrot2-lang-lucene-4.5.1.jar |
| Chinese-Simplified, Chinese-Traditional | org/carrot2/language/chinese | carrot2-lang-lucene-chinese-4.5.1.jar |
| Japanese | org/carrot2/language/japanese | carrot2-lang-lucene-japanese-4.5.1.jar |
| Korean | org/carrot2/language/korean | carrot2-lang-lucene-korean-4.5.1.jar |
| Polish | org/carrot2/language/polish | carrot2-lang-polish-4.5.1.jar |

Location of dictionary resources for the languages in the standard Carrot2 distribution.

Note that you should not modify the default dictionaries in-place. Instead, copy the relevant dictionaries to your application-specific location and load them from that location.

Syntax

Both word filter and label filter dictionaries use a common syntax based on matchers. An example word or label filter dictionary is shown below.

{
  "exact": [
    "Literal case-sensitive match"
  ],
  "glob": [
    "starts with phrase *",
    "* contains words *"
  ],
  "regexp": [
    "(?).+BadLabel.+",
    "(?)^[0-9]\\s*.*"
  ]
}

An example word or label filter dictionary with entries of three different matcher types.

A matcher entry is a description of how to determine whether a word or label should be filtered out. The exact, glob and regexp properties are optional and can contain an array of string entries for the specific matcher types, described in the following sections. If a word or a label matches any matcher of any type, it will be filtered out.

Heads up, word and label character case!

When performing word filtering, Carrot2 will check the lower-cased version of each word against the dictionary. This is to make it easy to paste a lower-cased list of stop words into the dictionary. This also means that mixed-case entries will never match any input word.

On the other hand, when performing label filtering, Carrot2 will use the case-sensitive version of the label, as produced by the clustering algorithm. This means that your label filtering entries must take character case into account or use a matcher type that ignores character case, such as the glob matcher.

Exact matcher

Exact matchers require exact, case-sensitive equality between the word or phrase and the dictionary entry. Exact matcher entries are fast to parse and very fast to apply during clustering.

{
  "exact": [
    "DevOps",
    "Windows 2000"
  ]
}

An example label filter dictionary with two exact matcher entries.

The above label dictionary definitions will match labels DevOps and Windows 2000, but will not match Devops or Windows 2000 machine.

For case-insensitive matching, use glob matchers (preferably) or case-insensitive regular expression matchers.

Glob matcher

The glob matcher allows simple word-based wildcard matching. Its primary use case is case-insensitive matching of literal phrases, as well as "begins with…", "ends with…" or "contains…" types of expressions. Glob matcher entries are fast to parse and very fast to apply.

{
  "glob": [
    "more information",
    "more information *",
    "* about *",
    "big ?",
    "+ apple"
  ]
}

An example label filter dictionary with glob matcher entries.

Matching rules

  • Each entry must consist of one or more space-separated tokens.

  • A token is a sequence of arbitrary characters, such as words, numbers, identifiers.

  • Matching is case-insensitive by default. Letter case normalization is performed based on the ROOT Java locale, which performs language-neutral case conflation according to Unicode rules.

  • A token put in single or double quotes, for example "Rating***", is taken literally: matching is case-sensitive, and the * character inside quoted tokens is allowed and compared literally.

  • To include quote characters in the token, escape them with the \ character, for example: \"information\".

  • The following wildcard-matching tokens are recognized:

    • The ? token matches exactly one (any) word (possessive operator).

    • The * token matches zero or more words (possessive operator).

    • The *? token matches zero or more words (reluctant operator).

    • The + token matches one or more words (possessive operator). This operator is functionally equivalent to: ? *.

    • The +? token matches one or more words (reluctant operator). This operator is functionally equivalent to: ? *?.

    A few restrictions apply to wildcard operators.

    • Wildcard characters (*, +) cannot be used to express prefixes or suffixes. For example, programm* is not supported.

    • *? and +? wildcards are reluctant matchers in the regular expression matching sense: that is, they match the minimal sequence of tokens until the next token in the pattern.

    • * and + wildcards are possessive matchers in the regular expression matching sense: that is, they match the maximum sequence of tokens until the next token in the pattern.

    • Greedy operators are not supported.
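
Because dictionary files are JSON, quote and backslash characters inside a glob entry need JSON escaping on top of the glob syntax. The sketch below shows how the quoted and escaped forms described above might be written in a dictionary file; the specific entries are illustrative only.

```json
{
  "glob": [
    "'Rating***'",
    "\"Rating***\" *",
    "\\\"information\\\""
  ]
}
```

After JSON decoding, the three entries read 'Rating***', "Rating***" * and \"information\". Single-quoted tokens avoid the extra layer of JSON escaping.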

Example entries

The following examples show a number of glob entries together with strings they match and strings they do not match. Each non-matching string is followed by an explanation of why there is no match.

Entry: more information

  Matching strings:

    • More information
    • MORE INFORMATION

  Non-matching strings:

    • more informations: 'informations' does not match pattern token 'information'.
    • more information about: Pattern does not contain wildcards, only 2-word strings can match.
    • some more information: Pattern does not contain wildcards, only 2-word strings can match.

Entry: more information *

  Matching strings:

    • more information
    • More information about
    • More information about a

  Non-matching strings:

    • more informations: 'informations' does not match pattern token 'information'.
    • more informations about: 'informations' does not match pattern token 'information'.
    • some more information: Pattern does not have wildcards at the beginning, matching strings must start with 'more information'.

Entry: * information *

  Matching strings:

    • information
    • more information
    • information about
    • a lot more information on

  Non-matching strings:

    • informations: 'informations' does not match pattern token 'information'.
    • more informations about: 'informations' does not match pattern token 'information'.
    • some more informations: 'informations' does not match pattern token 'information'.

Entry: + information

  Matching strings:

    • too much information
    • more information

  Non-matching strings:

    • information: + wildcard requires at least one word before 'information'.
    • more information about: 'about' is an extra word not covered by the pattern.

Entry: "Information" *

  Matching strings:

    • Information
    • Information about
    • Information ABOUT

  Non-matching strings:

    • information: "Information" token is case-sensitive, it does not match 'information'.
    • information about: "Information" token is case-sensitive, it does not match 'information'.
    • Informations about: 'Informations' does not match pattern token "Information".

Entry: data ?

  Matching strings:

    • data mining

  Non-matching strings:

    • data: ? operator requires a word after "data".
    • data mining research: "research" token does not match the pattern.

Entry: "Programm*"

  Matching strings:

    • Programm*

  Non-matching strings:

    • Programmer: "Programm*" token is taken literally, it matches only 'Programm*'.
    • Programming: "Programm*" token is taken literally, it matches only 'Programm*'.

Entry: \"information\"

  Matching strings:

    • "information"
    • "INFOrmation" (escaped quotes are taken literally, so the match is case-insensitive)

  Non-matching strings:

    • information: Escaped quotes not found in the string being matched.
    • "information: Escaped quotes not found in the string being matched.

Illegal entries:

    • programm*: Illegal pattern, combinations of the * wildcard and other characters are not supported.
    • "information: Illegal pattern, unbalanced double quotes.
    • *: Illegal pattern, there must be at least one non-wildcard token.

Regular expression matcher

The regular expression matcher checks words or labels against a list of regular expressions you provide.

{
  "regexp": [
    "Windows 9[58]",
    "(?)^[0-9]\\s*.*"
  ]
}

An example label filter dictionary with two regexp matcher entries.

The regular expressions must follow Java syntax. If any fragment of a word or label matches any regular expression provided in the dictionary, the word or label will be filtered out.
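
Because a match on any fragment is enough, an unanchored expression such as Windows 9[58] would also filter out a longer label like Windows 95 drivers. If only the complete label should be filtered, the expression can be anchored with ^ and $, as in the sketch below (the entries are illustrative only). Backslashes in regular expressions need to be doubled in JSON, so \d is written as \\d.

```json
{
  "regexp": [
    "^Windows 9[58]$",
    "^\\d+$"
  ]
}
```

With these entries, Windows 95 and Windows 98 would be filtered out while Windows 95 drivers would be kept; the second entry would filter out labels that consist only of digits.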

Heads up, performance impact!

Regular expression-based matching is a powerful mechanism, but it can also result in a dramatic decrease in clustering performance. Therefore, it should be used only when a similar effect cannot be achieved by a reasonable number of exact and glob matching entries.

Per-request dictionaries

In some contexts, it may be desirable to apply additional label or word exclusions for a specific clustering request. One example is allowing users to interactively tune the exclusions dictionary while observing the results in real time.

Providing per-request dictionaries is possible both in the Java API, where ephemeral dictionaries is the better term, and in the REST API. In both cases, the available set of filters and matcher syntax is the same as described in this section. The per-request dictionaries are added on top of the existing dictionary files.

Please note that using per-request dictionaries will incur an extra performance cost associated with parsing the provided dictionary entries. In the case of interactive label exclusion tuning, you may want to transfer the newly created dictionary entries to the main dictionary file, so that the exclusions apply to all further clustering requests.