Language Components
In this section we describe the role and potential customisation options for language-specific heavyweight components.
    LanguageComponents instances provide assistance and hints to improve the
    quality of clustering for a specific language. The resources associated with this task
    typically require costly processing to load and parse, so LanguageComponents
    instances should be created early and reused for all subsequent clustering
    calls.
  
    A single LanguageComponents instance can provide a number of resources to aid
    the algorithm in improving clustering quality:
  
- lemmatisation (stemming) routines,
- tokenisation (word decomposition) and decompounding routines,
- part of speech dictionaries,
- word or label filters (dictionaries).
The list above is neither fixed nor exhaustive: each algorithm may require a different set of components to function properly, and each language will have a specific set of components that makes sense for that particular language. All available languages and the components (interfaces) they provide can be enumerated via the service extension point, as shown below.
ServiceLoader<LanguageComponentsProvider> providers =
    ServiceLoader.load(LanguageComponentsProvider.class);
Map<String, List<LanguageComponentsProvider>> langToProviders = new TreeMap<>();
for (LanguageComponentsProvider prov : providers) {
  for (String lang : prov.languages()) {
    langToProviders.computeIfAbsent(lang, (k) -> new ArrayList<>()).add(prov);
  }
}
langToProviders.forEach(
    (language, provList) -> {
      System.out.println("  > " + language);
      provList.forEach(
          provider -> {
            System.out.println("    [Provider: " + provider.name() + "]");
            for (Class<?> componentClass : provider.componentTypes()) {
              System.out.println("      Component: " + componentClass.getName());
            }
          });
    });
The code above prints the following when executed, revealing providers and components supplied for the English language (other languages omitted):
> English
  [Provider: Carrot2 Core (Stemmers)]
    Component: org.carrot2.language.Stemmer
  [Provider: Carrot2 Core (Tokenizers)]
    Component: org.carrot2.language.Tokenizer
  [Provider: Carrot2 Core (Lexical Data)]
    Component: org.carrot2.language.LabelFilter
    Component: org.carrot2.language.StopwordFilter
  [Provider: Carrot2 Core (Label Formatter)]
    Component: org.carrot2.text.preprocessing.LabelFormatter
[...]
Each of the interfaces of a language component can be overridden. However, the vast majority of applications need only minor tweaks to the defaults. For example, adding ignorable terms and phrases specific to a given domain is a very effective way to improve clustering quality (patent documents have a different lexicon than an unrestricted e-mail corpus). Let's look at how this is done first.
      A clustering algorithm in Carrot2 may support only a subset of the languages
      available from the LanguageComponentsProvider extension point.
      Each clustering algorithm has a method called supports(LanguageComponents lc)
      which can be used to test whether the algorithm works with a given language.
    
Tweaking lexical filters
Lexical filters are responsible for telling the algorithm whether a term or a candidate cluster label should be ignored. Such ignorable elements typically include function words, such as prepositions and conjunctions, and domain-specific boilerplate language that makes a poor cluster label.
      The Java API defines two components (interfaces) that are part of each loaded
      LanguageComponents instance:
    
- 
        The StopwordFilter interface controls low-level filtering of individual terms:
        public interface StopwordFilter extends Predicate<CharSequence> {
          /**
           * @param word The word to test. Input words are guaranteed to be in lower case (consistent with
           *     {@link Character#toLowerCase(int)}).
           * @return Return {@code false} if the provided term should be ignored in processing.
           */
          boolean test(CharSequence word);
        }
- 
        The LabelFilter interface controls low-level filtering of undesired cluster labels:
        public interface LabelFilter extends Predicate<CharSequence> {
          /**
           * @param label The label to test. Input labels may have mixed case, depending on the algorithm
           *     and their surface forms collected from input documents.
           * @return Return {@code false} if the label candidate should be ignored in processing.
           */
          boolean test(CharSequence label);
        }
      Both these components can be explicitly provided (or overridden at runtime) in a
      LanguageComponents
      instance but implementing a custom filter is often impractical.
      It is much easier to reuse the default implementation that builds
      filters using filter dictionaries encoded in JSON.
    
      To instantiate LanguageComponents for a particular language with a custom
      set of dictionary resources, copy the
      default resources
      shipped with the distribution to a location from which your application
      can load them.
    
      The LanguageComponentsLoader class has a withResourceLookup method for supplying
      a custom resource loader (ResourceLookup implementation)
      responsible for locating any external resources the language data may ask for.
    
      To restrict the loaded resources to a subset of all available languages
      (and/or algorithms), use the loader's limitToLanguages and
      limitToAlgorithms methods. Here is a full example that
      loads English resources from a class-relative classpath location:
    
LanguageComponents custom =
    LanguageComponents.loader()
        // Note we restrict languages to just English because resources for
        // other languages are missing from the location of resource lookup
        // and would have caused an exception.
        .limitToLanguages("English")
        // and we substitute resource lookup locations with our custom location.
        .withResourceLookup(provider -> new ClassRelativeResourceLookup(E03_Languages.class))
        .load()
        .language("English");
      Once loaded, the LanguageComponents object is ready to be used for subsequent
      clustering calls.
    
Note that the language components loader does not track changes or reload anything automatically (this would be a costly operation). If your lexical resources change often, you'll have to resort to ephemeral dictionaries or a custom implementation of the filtering interfaces described above.
Ephemeral dictionaries (lexical filters)
      When filters for labels or stop words change frequently, reloading
      a full set of dictionaries would be slow and impractical. Ephemeral dictionaries
      are compiled for each clustering call and layered on top
      of the default LanguageComponents implementations.
      Please note that this adds some additional cost to each clustering call.
    
In the following example, the clustering algorithm uses the default language data for English but adds request-specific regular expression patterns to exclude undesirable labels:
// Load the default dictionaries for English.
LanguageComponents english =
    LanguageComponents.loader()
        .limitToLanguages("English")
        .limitToAlgorithms(new LingoClusteringAlgorithm())
        .load()
        .language("English");
LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
// Create an ephemeral label filter by providing a dictionary with a
// few regexp exclusion patterns.
DefaultDictionaryImpl labelFilter = new DefaultDictionaryImpl();
labelFilter.regexp.set("(?i).*data.*", "(?i).*mining.*");
algorithm.dictionaries.labelFilters.set(List.of(labelFilter));
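Conceptually, an ephemeral dictionary narrows what the default filter already accepts. The sketch below illustrates that layering idea with JDK types only. It is not Carrot2's implementation: the nested LabelFilter interface is a local stand-in that mirrors the contract shown earlier, layer is a hypothetical helper, and the sketch assumes a pattern must match the entire label (as the leading and trailing .* in the patterns above suggest).

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LayeredLabelFilterSketch {
  /** Local stand-in for the LabelFilter contract: false means "ignore this label". */
  interface LabelFilter extends Predicate<CharSequence> {}

  /** Hypothetical helper: layers per-request exclusion patterns over a base filter. */
  static LabelFilter layer(LabelFilter base, List<String> exclusionRegexps) {
    // Compile the request-specific patterns once, when the layered filter is built.
    List<Pattern> patterns =
        exclusionRegexps.stream().map(Pattern::compile).collect(Collectors.toList());
    return label -> {
      // A label rejected by the long-lived base filter stays rejected.
      if (!base.test(label)) {
        return false;
      }
      // Otherwise reject it if any ephemeral pattern matches the whole label.
      for (Pattern pattern : patterns) {
        if (pattern.matcher(label).matches()) {
          return false;
        }
      }
      return true;
    };
  }

  public static void main(String[] args) {
    // Assumed base rule for illustration only: reject very short labels.
    LabelFilter base = label -> label.length() > 2;
    LabelFilter layered = layer(base, List.of("(?i).*data.*", "(?i).*mining.*"));

    System.out.println(layered.test("Data Warehouse"));  // false: ephemeral pattern
    System.out.println(layered.test("Text Clustering")); // true: accepted by both layers
    System.out.println(layered.test("ab"));              // false: base filter
  }
}
```

Compiling the patterns when the layered filter is built, rather than on every test call, keeps the per-label cost low, although the compilation itself is still paid once per request.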
Tweaking LanguageComponents
      When the default dictionaries or other components
      fall short of requirements, a custom replacement may be provided.
      A LanguageComponents instance consists
      of a set of interfaces and suppliers providing implementations of these
      interfaces. We can override any of these suppliers, creating a custom language preprocessing
      pipeline.
    
As an example, let's modify the English pipeline to use a custom word stemming component. The supplier for this component always returns the same closure, converting the input to lower case:
Supplier<Stemmer> stemmerSupplier;
stemmerSupplier = () -> (word) -> word.toString().toLowerCase(Locale.ROOT);
      We will also replace the default StopwordFilter and LabelFilter components
      with custom filters based on hardcoded rules:
    
// Ignore words from the list and anything shorter than 4 characters.
final Set<String> ignored = new HashSet<>(Arrays.asList("from", "what"));
final StopwordFilter wordFilter =
    (word) -> {
      // Ignore any word shorter than 4 characters or on the explicit exclusion list.
      return word.length() >= 4 && !ignored.contains(word.toString());
    };
final LabelFilter labelFilter =
    (label) -> {
      // Ignore any label that has a substring 'data' in it.
      return !label.toString().toLowerCase(Locale.ROOT).contains("data");
    };
      Note how the initialization of the ignored hash set is pulled out of the supplier.
      The cost of setting up a heavy component (creating the hash set itself in this case) can be high but
      once set up, suppliers should return instances of the component quickly.
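The split between one-off setup and cheap supplier calls can be sketched with JDK types alone. In the minimal example below, the nested StopwordFilter interface is a local stand-in mirroring the contract shown earlier, and createSupplier and setupCount are hypothetical names used only to make the single setup step observable.

```java
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.Supplier;

class SupplierSetupSketch {
  /** Local stand-in for the StopwordFilter contract: false means "ignore this word". */
  interface StopwordFilter extends Predicate<CharSequence> {}

  /** Counts how many times the (pretend) expensive setup has run. */
  static int setupCount = 0;

  /** Hypothetical factory: heavy setup happens here, not inside the supplier. */
  static Supplier<StopwordFilter> createSupplier() {
    // Expensive part: runs once, when the supplier is created.
    setupCount++;
    Set<String> ignored = Set.of("from", "what");

    // Cheap part: each get() only wraps the already-built, immutable set.
    return () -> word -> !ignored.contains(word.toString());
  }

  public static void main(String[] args) {
    Supplier<StopwordFilter> supplier = createSupplier();
    StopwordFilter first = supplier.get();
    StopwordFilter second = supplier.get();

    System.out.println(first.test("from"));     // false: on the exclusion list
    System.out.println(second.test("cluster")); // true: not excluded
    System.out.println(setupCount);             // 1: setup ran only once
  }
}
```

Because the shared set is immutable, the filters returned by the supplier can safely be used from several threads.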
    
Once we have custom suppliers, overriding their default implementations for English is trivial:
LanguageComponents customized =
    LanguageComponents.loader()
        .load()
        .language("English")
        .override(Stemmer.class, stemmerSupplier)
        // Word and label filters are thread-safe here so we
        // supply the same instance all the time.
        .override(StopwordFilter.class, () -> wordFilter)
        .override(LabelFilter.class, () -> labelFilter);
Support for other languages
The core Carrot2 JAR supports only a limited set of popular languages. Additional modules add support for other languages and bring in extra resources required for these languages to work properly.
      All language extensions live under the org.carrot2.lang artifact
      group namespace. Note that many of them come with sizeable
      dependencies of their own, such as Apache Lucene
      analyzers or dictionaries.
    
- carrot2-lang-lucene-chinese: Chinese (traditional and simplified).
- carrot2-lang-lucene-japanese: Japanese.
- carrot2-lang-lucene-korean: Korean.
- carrot2-lang-lucene: adds support for many other languages supported by the Apache Lucene project.
- carrot2-lang-polish: Polish.
      A language extension is automatically added to the list of supported language components if
      it can be located and loaded by Java's standard ServiceLoader.