Language Components

In this section we describe the role and potential customisation options for language-specific heavyweight components.

LanguageComponents instances provide assistance and hints to improve the quality of clustering for a specific language. The resources associated with this task are typically costly to load and parse, so LanguageComponents instances should be created early and reused for all subsequent clustering calls.

A single LanguageComponents instance can provide a number of resources to aid the algorithm in improving clustering quality:

  • lemmatisation (stemming) routines,
  • tokenisation (word decomposition) and decompounding routines,
  • part of speech dictionaries,
  • ignore lists (stop word lists).

The list above is neither fixed nor complete: each algorithm may require a different set of components to function properly, and each language provides the set of components that makes sense for it. All available languages and the components (interfaces) they provide can be enumerated through Java's ServiceLoader extension point, as shown below.

ServiceLoader<LanguageComponentsProvider> providers =
    ServiceLoader.load(LanguageComponentsProvider.class);
for (LanguageComponentsProvider prov : providers) {
  System.out.println("Provider class: " + prov.name());

  for (String language : prov.languages()) {
    System.out.println("  > " + language);
    for (Class<?> componentClass : prov.load(language).keySet()) {
      System.out.println("    Component: " + componentClass.getName());
    }
  }
}

The code above prints the following when executed, revealing interfaces of components provided by the default provider for English (other languages omitted):

Provider class: Carrot2 (English, Danish, Dutch, [...]
  > English
    Component: org.carrot2.language.Stemmer
    Component: org.carrot2.language.Tokenizer
    Component: org.carrot2.language.LexicalData
    Component: org.carrot2.text.preprocessing.LabelFormatter
  [...]

Each of the interfaces of a language component can be overridden. For the vast majority of applications, however, only minor tweaks to the defaults are needed. Adding ignorable terms and phrases specific to a given domain, for example, is a very effective way to increase clustering quality (patent documents will have a different lexicon than an unrestricted e-mail corpus). Let's look at how this is done first.

Some algorithms may not support all languages

Not every clustering algorithm supports every language, and an algorithm may not work with language component instances assembled manually. Each clustering algorithm has a supports(LanguageComponents lc) method that checks whether it can work with a given instance of language components.

Tweaking lexical data resources

The LexicalData component is responsible for telling the algorithm whether a term or a candidate cluster label should be ignored. Such ignorable elements typically include function words like prepositions, conjunctions or domain-specific boilerplate language that is not a good candidate for a cluster.

public interface LexicalData {
  /** @return true if the provided term should be ignored in processing. */
  boolean ignoreWord(CharSequence word);

  /** @return true if the label candidate should be ignored in processing. */
  boolean ignoreLabel(CharSequence labelCandidate);
}

The most common quality tuning step is modifying the resources provided for the LexicalData component's default implementation (LexicalDataImpl). This class attempts to load two files (where the lang prefix is replaced with the name of the language the component is instantiated for):

lang.stopwords.utf8
A plain-text list of words to ignore. Each line contains exactly one full word (token). For example:
a
about
above
across
...
lang.stoplabels.utf8
A plain-text list of Java regular expression patterns to ignore. Each line contains exactly one expression that, if it matches the input label, causes the label to be discarded and ignored during clustering. An example stop labels file could look like this:
(?i)information (about|on).*
(?i)(index|list) of.*
...
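
As a rough illustration of how the two files might be interpreted, the sketch below uses only the JDK. StopResourcesSketch is a hypothetical stand-in, not Carrot2's LexicalDataImpl. It assumes that stop words are compared exactly as written and that a stop label pattern must match the entire label (Matcher.matches()); the library's actual behaviour may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical stand-in for illustration only; NOT Carrot2's LexicalDataImpl. */
public class StopResourcesSketch {
  private final Set<String> stopWords;
  private final List<Pattern> stopLabels;

  /** Each argument holds the lines of one resource file, already read into memory. */
  public StopResourcesSketch(List<String> wordLines, List<String> labelLines) {
    // One word per line; blank lines are skipped (an assumption).
    this.stopWords =
        wordLines.stream()
            .map(String::trim)
            .filter(line -> !line.isEmpty())
            .collect(Collectors.toCollection(HashSet::new));
    // One Java regular expression per line, compiled once up front.
    this.stopLabels =
        labelLines.stream()
            .map(String::trim)
            .filter(line -> !line.isEmpty())
            .map(Pattern::compile)
            .collect(Collectors.toList());
  }

  public boolean ignoreWord(CharSequence word) {
    return stopWords.contains(word.toString());
  }

  public boolean ignoreLabel(CharSequence labelCandidate) {
    // Assumption: a pattern has to match the whole label, not just a part of it.
    for (Pattern pattern : stopLabels) {
      if (pattern.matcher(labelCandidate).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    StopResourcesSketch data =
        new StopResourcesSketch(
            Arrays.asList("a", "about", "above", "across"),
            Arrays.asList("(?i)information (about|on).*", "(?i)(index|list) of.*"));

    System.out.println(data.ignoreWord("about"));
    System.out.println(data.ignoreLabel("Information about clustering"));
    System.out.println(data.ignoreLabel("Clustering algorithms"));
  }
}
```

Compiling the patterns once in the constructor keeps the per-label check cheap, which matters because the check runs for every label candidate.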

The LanguageComponents class has a variant of the load() method that takes a custom ResourceLookup object for locating and loading external resources. We can use this facility to load our own modified resources without touching any of the default components. A good way to start is to copy the default resources from the Carrot2 JARs to a new location and modify them there.

ResourceLookup resLookup = new ClassRelativeResourceLookup(E03_CustomLanguageComponents.class);
LanguageComponents custom = LanguageComponents.load("English", resLookup);

Once loaded, the LanguageComponents object is ready to be used for subsequent clustering calls.

Tweaking components directly

It may sometimes be desirable to provide an entirely custom implementation of a given preprocessing component. An instance of the LanguageComponents class consists of a set of components (interfaces) and suppliers providing implementations of these interfaces. We can override any of these suppliers, creating a custom language preprocessing pipeline.

As an example, let's modify the English pipeline to use a custom word stemming component. The supplier for this component always returns the same closure, converting the input to lower case:

Supplier<Stemmer> stemmerSupplier;
stemmerSupplier = () -> (word) -> word.toString().toLowerCase(Locale.ROOT);
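
The snippet passes Locale.ROOT to toLowerCase, which makes the result independent of the JVM's default locale. The JDK-only sketch below shows why that matters. WordStemmer is a hypothetical stand-in interface defined here for illustration, not the Carrot2 Stemmer type.

```java
import java.util.Locale;
import java.util.function.Supplier;

public class LowercaseStemmerSketch {
  /** Hypothetical stand-in for a stemmer interface; not the Carrot2 type. */
  public interface WordStemmer {
    CharSequence stem(CharSequence word);
  }

  /** A stateless closure created once and shared. */
  public static final WordStemmer LOWERCASE =
      word -> word.toString().toLowerCase(Locale.ROOT);

  /** The supplier hands out the same instance on every call. */
  public static final Supplier<WordStemmer> SUPPLIER = () -> LOWERCASE;

  public static void main(String[] args) {
    // Locale.ROOT applies locale-neutral case mapping: prints "title".
    System.out.println(SUPPLIER.get().stem("TITLE"));

    // Turkish case rules map 'I' to the dotless 'ı' (U+0131): prints "tıtle".
    System.out.println("TITLE".toLowerCase(Locale.forLanguageTag("tr")));
  }
}
```

Without an explicit locale, the same input could produce different stems on machines with different default locales.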

We will also replace the default LexicalData implementation with simple custom filtering based on hardcoded rules:

final Set<String> ignored = new HashSet<>(Arrays.asList("from", "what"));
Supplier<LexicalData> lexicalDataSupplier =
    () ->
        new LexicalData() {
          @Override
          public boolean ignoreLabel(CharSequence candidate) {
            // Ignore any label that has a substring 'data' in it.
            return candidate.toString().toLowerCase(Locale.ROOT).contains("data");
          }

          @Override
          public boolean ignoreWord(CharSequence word) {
            return word.length() < 4 || ignored.contains(word.toString());
          }
        };

Note how the initialization of the ignored set is pulled out of the supplier. Setting up a supplier for a heavy component can be costly, but once it is set up, the supplier should return instances of its component quickly.
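
The difference between the two placements of the setup code can be made visible with a small JDK-only sketch. All names here are hypothetical, and a Predicate stands in for the component; the counter records how often the (pretend) expensive setup runs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class SupplierSetupSketch {
  /** Counts how many times the pretend-expensive setup has run. */
  public static final AtomicInteger setupCount = new AtomicInteger();

  /** Stands in for costly work such as parsing a large dictionary. */
  static Set<String> loadIgnoredWords() {
    setupCount.incrementAndGet();
    return new HashSet<>(Arrays.asList("from", "what"));
  }

  /** Setup runs once, when the supplier is created; get() only wraps the shared set. */
  public static Supplier<Predicate<CharSequence>> cheapSupplier() {
    final Set<String> ignored = loadIgnoredWords();
    return () -> word -> word.length() < 4 || ignored.contains(word.toString());
  }

  /** For comparison: setup is repeated on every get() call. */
  public static Supplier<Predicate<CharSequence>> costlySupplier() {
    return () -> {
      final Set<String> ignored = loadIgnoredWords();
      return word -> word.length() < 4 || ignored.contains(word.toString());
    };
  }

  public static void main(String[] args) {
    Supplier<Predicate<CharSequence>> cheap = cheapSupplier();
    cheap.get();
    cheap.get();
    cheap.get();
    System.out.println("setup runs (cheap): " + setupCount.get());

    setupCount.set(0);
    Supplier<Predicate<CharSequence>> costly = costlySupplier();
    costly.get();
    costly.get();
    costly.get();
    System.out.println("setup runs (costly): " + setupCount.get());
  }
}
```

The first variant mirrors the structure of the example above: the set is built once and captured by the closure.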

Once we have custom suppliers, overriding their default implementations for English is trivial:

LanguageComponents customized =
    LanguageComponents.load("English")
        .override(Stemmer.class, stemmerSupplier)
        .override(LexicalData.class, lexicalDataSupplier);

Support for other languages

The core Carrot2 JAR supports only a limited set of popular languages. Additional modules add support for other languages and bring in extra resources required for these languages to work properly.

All language extensions live under the org.carrot2.lang artifact group namespace. Note that many of them come with sizeable dependencies of their own, such as Apache Lucene analyzers or dictionaries.

  • carrot2-lang-lucene-chinese: Chinese (traditional and simplified).
  • carrot2-lang-lucene-japanese: Japanese.
  • carrot2-lang-lucene-korean: Korean.
  • carrot2-lang-lucene: adds support for many other languages supported by the Apache Lucene project.
  • carrot2-lang-polish: Polish.

A language extension is automatically added to the list of supported language components if it can be located and loaded by Java's standard ServiceLoader.