Language Components
In this section we describe the role and potential customisation options for language-specific heavyweight components.
LanguageComponents instances provide assistance and hints to improve the quality of clustering for a specific language. The resources associated with this task typically require costly processing to load and parse, so LanguageComponents instances should be created early and reused for all subsequent clustering calls.
A single LanguageComponents instance can provide a number of resources to aid the algorithm in improving clustering quality:
- lemmatisation (stemming) routines,
- tokenisation (word decomposition) and decompounding routines,
- part of speech dictionaries,
- ignore lists (stop word lists).
The list above is neither fixed nor complete: each algorithm may require a different set of components to function properly, and each language will have a specific set of components that makes sense for that particular language. All available languages and their provided components (interfaces) can be enumerated through Java's ServiceLoader extension point, as shown below.
ServiceLoader<LanguageComponentsProvider> providers =
    ServiceLoader.load(LanguageComponentsProvider.class);
for (LanguageComponentsProvider prov : providers) {
  System.out.println("Provider class: " + prov.name());
  for (String language : prov.languages()) {
    System.out.println(" > " + language);
    for (Class<?> componentClass : prov.componentTypes()) {
      System.out.println(" Component: " + componentClass.getName());
    }
  }
}
The code above prints the following when executed, revealing interfaces of components provided by the default provider for English (other languages omitted):
Provider class: Carrot2 (English, Danish, Dutch, [...]
> English
Component: org.carrot2.language.Stemmer
Component: org.carrot2.language.Tokenizer
Component: org.carrot2.language.LexicalData
Component: org.carrot2.text.preprocessing.LabelFormatter
[...]
Each of the interfaces of a language component can be overridden. However, for the vast majority of applications only minor tweaks to the defaults are really needed. For example, adding ignorable terms and phrases specific to a given domain is a very helpful step to increase clustering quality (patent documents will have a different lexicon compared to an unrestricted e-mail corpus, for example). Let's take a look at how this is done first.
Clustering algorithms may support only a subset of languages, and only some of the language component instances assembled manually. Each clustering algorithm has a method called supports(LanguageComponents lc), which can be used to check whether the algorithm works with a given instance of language components.
Tweaking lexical data resources
The LexicalData component is responsible for telling the algorithm whether a term or a candidate cluster label should be ignored. Such ignorable elements typically include function words like prepositions and conjunctions, or domain-specific boilerplate language that is not a good candidate for a cluster label.
public interface LexicalData {
  /** @return Return true if the provided term should be ignored in processing. */
  boolean ignoreWord(CharSequence word);

  /** @return Return true if the label candidate should be ignored in processing. */
  boolean ignoreLabel(CharSequence labelCandidate);
}
The most common quality tuning step is modifying the resources provided for the LexicalData component's default implementation (LexicalDataImpl). This class attempts to load two files (where the lang prefix is replaced with the name of the language the component is instantiated for):
- lang.stopwords.utf8
  A plain-text list of words to ignore. Each line contains exactly one full word (token). Something like the following:
  a
  about
  above
  across
  ...
- lang.stoplabels.utf8
  A plain-text list of Java regular expression patterns to ignore. Each line contains exactly one expression that, if it matches the input label, causes the label to be discarded and ignored during clustering. An example stop labels file could look like this:
  (?i)information (about|on).*
  (?i)(index|list) of.*
  ...
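The two file formats described above can be illustrated with a small, self-contained sketch. StopResourcesSketch is a hypothetical stand-in written for this illustration and is not Carrot2's LexicalDataImpl. It assumes that blank lines are skipped, that stop words are compared exactly, and that a stop label pattern has to match the entire label; the default implementation may behave differently in these details.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of the Carrot2 API. */
final class StopResourcesSketch {
  private final Set<String> stopWords = new HashSet<>();
  private final List<Pattern> stopLabels = new ArrayList<>();

  /**
   * @param stopWordLines lines of a stop word file, one word per line
   * @param stopLabelLines lines of a stop label file, one Java regex per line
   */
  StopResourcesSketch(List<String> stopWordLines, List<String> stopLabelLines) {
    for (String line : stopWordLines) {
      // Assumption: blank lines carry no word and are skipped.
      if (!line.isBlank()) {
        stopWords.add(line.trim());
      }
    }
    for (String line : stopLabelLines) {
      if (!line.isBlank()) {
        // Patterns are compiled once, up front, and reused for every label.
        stopLabels.add(Pattern.compile(line));
      }
    }
  }

  boolean ignoreWord(CharSequence word) {
    return stopWords.contains(word.toString());
  }

  boolean ignoreLabel(CharSequence label) {
    for (Pattern pattern : stopLabels) {
      // Assumption: the pattern must match the whole label, not a fragment.
      if (pattern.matcher(label).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    StopResourcesSketch data =
        new StopResourcesSketch(
            List.of("a", "about", "above", "across"),
            List.of("(?i)information (about|on).*", "(?i)(index|list) of.*"));

    System.out.println(data.ignoreWord("about"));
    System.out.println(data.ignoreLabel("Information on Patents"));
    System.out.println(data.ignoreLabel("Patent information"));
  }
}
```

Under these assumptions the trailing .* in each example pattern is what lets a label with arbitrary text after the boilerplate prefix be discarded, while a label that merely contains the word "information" elsewhere is kept.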
The LanguageComponentsLoader class has methods to restrict the loaded resources to only the needed languages (and/or algorithms). There is also a method for adding a custom ResourceLookup implementation, responsible for locating any external resources that components may require. We can load our custom, modified resources using this facility without touching any of the default components. It's a good idea to start tweaking custom resources by first copying the defaults from Carrot2 JARs and then modifying them in the new location.
ResourceLookup resLookup = new ClassRelativeResourceLookup(E03_CustomLanguageComponents.class);
LanguageComponents custom =
    LanguageComponents.loader()
        // Note we restrict languages to just English because resources for other languages
        // are missing from the location of resource lookup and would have caused an exception.
        .limitToLanguages("English")
        // and we substitute resource lookup locations with our custom location.
        .withResourceLookup(provider -> resLookup)
        .load()
        .language("English");
Once loaded, the LanguageComponents object is ready to be used for subsequent clustering calls.
Tweaking components directly
It is sometimes desirable to provide an entirely custom implementation of a given preprocessing component. An instance of the LanguageComponents class consists of a set of components (interfaces) and suppliers providing implementations of these interfaces. We can override any of these suppliers, creating a custom language preprocessing pipeline.
As an example, let's modify the English pipeline to use a custom word stemming component. The supplier for this component always returns the same closure, converting the input to lower case:
Supplier<Stemmer> stemmerSupplier;
stemmerSupplier = () -> (word) -> word.toString().toLowerCase(Locale.ROOT);
We will also replace the default LexicalData implementation with simple custom filtering based on hardcoded rules:
final Set<String> ignored = new HashSet<>(Arrays.asList("from", "what"));
Supplier<LexicalData> lexicalDataSupplier =
    () ->
        new LexicalData() {
          @Override
          public boolean ignoreLabel(CharSequence candidate) {
            // Ignore any label that has a substring 'data' in it.
            return candidate.toString().toLowerCase(Locale.ROOT).contains("data");
          }

          @Override
          public boolean ignoreWord(CharSequence word) {
            return word.length() < 4 || ignored.contains(word.toString());
          }
        };
Note how the initialization of the ignored set is pulled out of the supplier. The cost of setting up a heavy component (the supplier itself here) can be high but, once set up, suppliers should return instances of a given component quickly.
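The following self-contained sketch illustrates the same idea outside of Carrot2: costly setup runs once, when the supplier is built, and every later call to get() only wraps the state that is already prepared. All names here (SupplierSetupSketch, loadIgnoredWords, newWordFilterSupplier) are hypothetical, and a plain Predicate stands in for a real component interface.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical illustration only; not part of the Carrot2 API. */
final class SupplierSetupSketch {
  /** Counts how many times the (pretend) expensive setup has run. */
  static final AtomicInteger setupCalls = new AtomicInteger();

  /** Stand-in for costly work such as reading and parsing resource files. */
  static Set<String> loadIgnoredWords() {
    setupCalls.incrementAndGet();
    return Set.of("from", "what");
  }

  /** The costly setup happens here, once, and not inside the returned supplier. */
  static Supplier<Predicate<CharSequence>> newWordFilterSupplier() {
    final Set<String> ignored = loadIgnoredWords();
    // Each get() call only creates a cheap closure over the prepared set.
    return () -> word -> word.length() < 4 || ignored.contains(word.toString());
  }

  public static void main(String[] args) {
    Supplier<Predicate<CharSequence>> supplier = newWordFilterSupplier();
    Predicate<CharSequence> first = supplier.get();
    Predicate<CharSequence> second = supplier.get();

    System.out.println(first.test("what"));
    System.out.println(second.test("clustering"));
    System.out.println("setup calls: " + setupCalls.get());
  }
}
```

Had loadIgnoredWords() been called inside the lambda instead, the setup cost would be paid on every get(), which is the situation the note above warns against.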
Once we have custom suppliers, overriding their default implementations for English is trivial:
LanguageComponents customized =
    LanguageComponents.loader()
        .load()
        .language("English")
        .override(Stemmer.class, stemmerSupplier)
        .override(LexicalData.class, lexicalDataSupplier);
Support for other languages
The core Carrot2 JAR supports only a limited set of popular languages. Additional modules add support for other languages and bring in extra resources required for these languages to work properly.
All language extensions live under the org.carrot2.lang artifact group namespace. Note that many of them come with sizeable dependencies of their own, like Apache Lucene analyzers or dictionaries.
- carrot2-lang-lucene-chinese: Chinese (traditional and simplified).
- carrot2-lang-lucene-japanese: Japanese.
- carrot2-lang-lucene-korean: Korean.
- carrot2-lang-lucene: adds support for many other languages supported by the Apache Lucene project.
- carrot2-lang-polish: Polish.
A language extension is automatically added to the list of supported language components if it can be located and loaded by Java's standard ServiceLoader.