Language Components

In this section we describe the role and potential customisation options for language-specific heavyweight components.

LanguageComponent instances provide assistance and hints to improve the quality of clustering for a specific language. The resources associated with this task typically require costly processing to load and parse, so LanguageComponent instances should be created early and reused for all subsequent clustering calls.

A single LanguageComponent instance can provide a number of resources to aid the algorithm in improving clustering quality:

  • lemmatisation (stemming) routines,
  • tokenisation (word decomposition) and decompounding routines,
  • part of speech dictionaries,
  • word or label filters (dictionaries).

The above list is neither fixed nor complete: each algorithm may require a different set of components to function properly, and each language will have a specific set of components that makes sense for that particular language. All available languages and their provided components (interfaces) can be enumerated via the service extension point, as shown below.

ServiceLoader<LanguageComponentsProvider> providers =
    ServiceLoader.load(LanguageComponentsProvider.class);

Map<String, List<LanguageComponentsProvider>> langToProviders = new TreeMap<>();
for (LanguageComponentsProvider prov : providers) {
  for (String lang : prov.languages()) {
    langToProviders.computeIfAbsent(lang, (k) -> new ArrayList<>()).add(prov);
  }
}

langToProviders.forEach(
    (language, provList) -> {
      System.out.println("  > " + language);
      provList.forEach(
          provider -> {
            System.out.println("    [Provider: " + provider.name() + "]");
            for (Class<?> componentClass : provider.componentTypes()) {
              System.out.println("      Component: " + componentClass.getName());
            }
          });
    });

The code above prints the following when executed, revealing providers and components supplied for the English language (other languages omitted):

> English
  [Provider: Carrot2 Core (Stemmers)]
    Component: org.carrot2.language.Stemmer
  [Provider: Carrot2 Core (Tokenizers)]
    Component: org.carrot2.language.Tokenizer
  [Provider: Carrot2 Core (Lexical Data)]
    Component: org.carrot2.language.LabelFilter
    Component: org.carrot2.language.StopwordFilter
  [Provider: Carrot2 Core (Label Formatter)]
    Component: org.carrot2.text.preprocessing.LabelFormatter
[...]

Each of the interfaces of a language component can be overridden. However, for the vast majority of applications only minor tweaks to the defaults are needed. For example, adding ignorable terms and phrases specific to a given domain is a simple and effective way to increase clustering quality (patent documents will have a very different lexicon from an unrestricted e-mail corpus, for example). Let's take a look at how this is done first.

Some algorithms may not support all languages

A clustering algorithm in Carrot2 may support only a subset of the languages available from the LanguageComponentsProvider extension point. Each clustering algorithm has a method called supports(LanguageComponents lc) which can be used to test whether the algorithm works with a given language.
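As a defensive sketch of such a check (the class name is illustrative; the loader and algorithm APIs are those shown elsewhere in this section):

```java
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.language.LanguageComponents;

public class CheckLanguageSupport {
  public static void main(String[] args) throws Exception {
    // Load default components for English.
    LanguageComponents english =
        LanguageComponents.loader().load().language("English");

    // Verify the algorithm can work with this language before clustering.
    LingoClusteringAlgorithm lingo = new LingoClusteringAlgorithm();
    if (!lingo.supports(english)) {
      throw new IllegalStateException("Lingo does not support this language.");
    }
  }
}
```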

Tweaking lexical filters

Lexical filters are responsible for telling the algorithm whether a term or a candidate cluster label should be ignored. Such ignorable elements typically include function words (prepositions, conjunctions) as well as domain-specific boilerplate language that does not make a good cluster label.

The Java API defines two components (interfaces) that are part of each loaded LanguageComponents instance:

  • The StopwordFilter interface controls low-level filtering of individual terms:

    public interface StopwordFilter extends Predicate<CharSequence> {
      /**
       * @param word The word to test. Input words are guaranteed to be in lower case (consistent with
       *     {@link Character#toLowerCase(int)}).
       * @return Return {@code false} if the provided term should be ignored in processing.
       */
      boolean test(CharSequence word);
    }

  • The LabelFilter interface controls low-level filtering of undesired cluster labels.

    public interface LabelFilter extends Predicate<CharSequence> {
      /**
       * @param label The label to test. Input labels may have mixed case, depending on the algorithm
       *     and their surface forms collected from input documents.
       * @return Return {@code false} if the label candidate should be ignored in processing.
       */
      boolean test(CharSequence label);
    }

Both of these components can be explicitly provided (or overridden at runtime) in a LanguageComponents instance, but implementing a custom filter from scratch is often impractical. It is much easier to reuse the default implementation, which builds filters from filter dictionaries encoded in JSON.
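For illustration only, such a JSON filter dictionary combines exact matches, glob patterns and regular expressions, mirroring the attributes of DefaultDictionaryImpl used later in this section. Treat the shape below as an assumption and verify the exact schema against the default resources shipped with your Carrot2 version:

```json
{
  "exact": ["of", "the", "click here"],
  "glob": ["* copyright *"],
  "regexp": ["(?i).*terms of service.*"]
}
```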

To instantiate LanguageComponents for a particular language with a custom set of dictionary resources, copy the default resources shipped with the distribution to a location from which your application can load them.

The LanguageComponentsLoader class has a method to provide a custom resource loader (ResourceLookup implementation) responsible for locating any external resources language data may ask for.

To restrict the number of loaded resources to only a subset of all available languages (and/or algorithms), use the loader's limitToLanguages and limitToAlgorithms methods. Here is a full example that loads English resources from a class-relative classpath location:

LanguageComponents custom =
    LanguageComponents.loader()
        // Note we restrict languages to just English because resources for
        // other languages are missing from the location of resource lookup
        // and would have caused an exception.
        .limitToLanguages("English")
        // and we substitute resource lookup locations with our custom location.
        .withResourceLookup(
            provider -> new ClassRelativeResourceLookup(E03_CustomLanguageComponents.class))
        .load()
        .language("English");

Once loaded, the LanguageComponents object is ready to be used for subsequent clustering calls.
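As a minimal sketch of such a call (assuming the stream-based cluster API of recent Carrot2 releases; the input documents here are trivial in-memory stubs):

```java
import java.util.List;
import java.util.stream.Stream;
import org.carrot2.clustering.Cluster;
import org.carrot2.clustering.Document;
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.language.LanguageComponents;

public class ClusterWithLoadedComponents {
  public static void main(String[] args) throws Exception {
    // A LanguageComponents instance loaded once, as shown above.
    LanguageComponents components =
        LanguageComponents.loader().load().language("English");

    // Document is a single-method interface, so a lambda that pushes
    // field (name, value) pairs to the visitor is sufficient here.
    Stream<Document> documents =
        Stream.of("data mining tools", "text clustering algorithms")
            .map(title -> (Document) visitor -> visitor.accept("title", title));

    // Pass the loaded components to the clustering call.
    List<Cluster<Document>> clusters =
        new LingoClusteringAlgorithm().cluster(documents, components);
    clusters.forEach(cluster -> System.out.println(cluster.getLabels()));
  }
}
```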

Note that the language components loader does not track changes to resources or reload anything automatically (this would be a costly operation). If your lexical resources change often, you'll have to resort to ephemeral dictionaries or a custom implementation of the filtering interfaces above.

Ephemeral dictionaries (lexical filters)

When filters for labels or stop words change frequently, reloading a full set of dictionaries would be slow and impractical. Ephemeral dictionaries are compiled for each clustering call and layered on top of the default LanguageComponents implementations. Note that this adds some cost to each clustering call.

In the following example, the clustering algorithm uses the default language data for English but adds request-specific regular expression patterns to exclude undesirable labels:

// Load the default dictionaries for English.
LanguageComponents english =
    LanguageComponents.loader()
        .limitToLanguages("English")
        .limitToAlgorithms(new LingoClusteringAlgorithm())
        .load()
        .language("English");

LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();

// Create an ephemeral label filter by providing a dictionary with a
// few regexp exclusion patterns.
DefaultDictionaryImpl labelFilter = new DefaultDictionaryImpl();
labelFilter.regexp.set("(?i).*data.*", "(?i).*mining.*");
algorithm.dictionaries.labelFilters.set(List.of(labelFilter));

Tweaking LanguageComponents

When the default dictionaries or other components fall short of requirements, a custom replacement may be provided. A LanguageComponents instance consists of a set of interfaces and suppliers providing implementations of these interfaces. We can override any of these suppliers, creating a custom language preprocessing pipeline.

As an example, let's modify the English pipeline to use a custom word stemming component. The supplier for this component always returns the same closure, converting the input to lower case:

Supplier<Stemmer> stemmerSupplier;
stemmerSupplier = () -> (word) -> word.toString().toLowerCase(Locale.ROOT);

We will also replace the default StopwordFilter and LabelFilter components with custom filters based on hardcoded rules:

// Ignore words from the list and anything shorter than 4 characters.
final Set<String> ignored = new HashSet<>(Arrays.asList("from", "what"));
final StopwordFilter wordFilter =
    (word) -> {
      // Ignore any word shorter than 4 characters or on the explicit exclusion list.
      return word.length() >= 4 && !ignored.contains(word.toString());
    };

final LabelFilter labelFilter =
    (label) -> {
      // Ignore any label that has a substring 'data' in it.
      return !label.toString().toLowerCase(Locale.ROOT).contains("data");
    };

Note how the initialization of the ignored hash set is pulled out of the supplier itself: the one-time cost of setting up a heavy component (here, building the hash set) is paid up front, so that the supplier can return instances of the component quickly.

Once we have custom suppliers, overriding their default implementations for English is trivial:

LanguageComponents customized =
    LanguageComponents.loader()
        .load()
        .language("English")
        .override(Stemmer.class, stemmerSupplier)
        // Word and label filters are thread-safe here so we
        // supply the same instance all the time.
        .override(StopwordFilter.class, () -> wordFilter)
        .override(LabelFilter.class, () -> labelFilter);

Support for other languages

The core Carrot2 JAR supports only a limited set of popular languages. Additional modules add support for other languages and bring in extra resources required for these languages to work properly.

All language extensions live under the org.carrot2.lang artifact group namespace. Note that many of them come with sizeable dependencies of their own, such as Apache Lucene analyzers or dictionaries.

  • carrot2-lang-lucene-chinese: Chinese (traditional and simplified).
  • carrot2-lang-lucene-japanese: Japanese.
  • carrot2-lang-lucene-korean: Korean.
  • carrot2-lang-lucene: adds support for many other languages supported by the Apache Lucene project.
  • carrot2-lang-polish: Polish.

A language extension is automatically added to the list of supported language components if it can be located and loaded by Java's standard ServiceLoader.
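For example, enabling Polish support is typically just a matter of adding the corresponding module to the classpath. A hedged build sketch (the group identifier follows the namespace above; the version placeholder must be replaced with your Carrot2 version):

```groovy
dependencies {
  implementation "org.carrot2:carrot2-core:<version>"
  // Language extension; discovered automatically via ServiceLoader.
  implementation "org.carrot2.lang:carrot2-lang-polish:<version>"
}
```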