Language Components
In this section we describe the role and potential customisation options for language-specific heavyweight components. LanguageComponents instances provide assistance and hints to improve the quality of clustering for a specific language. The resources associated with this task typically require costly processing to load and parse, so LanguageComponents instances should be created early and reused for all subsequent clustering calls.
A single LanguageComponents instance can provide a number of resources to aid the algorithm in improving clustering quality:
- lemmatisation (stemming) routines,
- tokenisation (word decomposition) and decompounding routines,
- part of speech dictionaries,
- word or label filters (dictionaries).
The above list is neither fixed nor exhaustive: each algorithm may require a different set of components to function properly, and each language will have a specific set of components that makes sense for that particular language. All available languages and their provided components (interfaces) can be enumerated via the LanguageComponentsProvider service extension point, as shown below.
ServiceLoader<LanguageComponentsProvider> providers =
    ServiceLoader.load(LanguageComponentsProvider.class);

Map<String, List<LanguageComponentsProvider>> langToProviders = new TreeMap<>();
for (LanguageComponentsProvider prov : providers) {
  for (String lang : prov.languages()) {
    langToProviders.computeIfAbsent(lang, (k) -> new ArrayList<>()).add(prov);
  }
}

langToProviders.forEach(
    (language, provList) -> {
      System.out.println("  > " + language);
      provList.forEach(
          provider -> {
            System.out.println("    [Provider: " + provider.name() + "]");
            for (Class<?> componentClass : provider.componentTypes()) {
              System.out.println("      Component: " + componentClass.getName());
            }
          });
    });
The code above prints the following when executed, revealing providers and components supplied for the English language (other languages omitted):
> English
  [Provider: Carrot2 Core (Stemmers)]
    Component: org.carrot2.language.Stemmer
  [Provider: Carrot2 Core (Tokenizers)]
    Component: org.carrot2.language.Tokenizer
  [Provider: Carrot2 Core (Lexical Data)]
    Component: org.carrot2.language.LabelFilter
    Component: org.carrot2.language.StopwordFilter
  [Provider: Carrot2 Core (Label Formatter)]
    Component: org.carrot2.text.preprocessing.LabelFormatter
[...]
Each of the interfaces of a language component can be overridden. However, for the vast majority of applications only minor tweaks to the defaults are really needed. For instance, adding ignorable terms and phrases specific to a given domain is a very helpful step towards increasing clustering quality (patent documents have a different lexicon than an unrestricted e-mail corpus, for example). Let's take a look at how this is done first.
A clustering algorithm in Carrot2 may support only a subset of the languages available from the LanguageComponentsProvider extension point. Each clustering algorithm has a supports(LanguageComponents lc) method which can be used to test whether the algorithm works with a given language.
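For instance, a defensive check before clustering could look like this (a minimal sketch; LingoClusteringAlgorithm stands in for any algorithm):

LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
LanguageComponents english = LanguageComponents.loader().load().language("English");

// Refuse to proceed if the algorithm cannot work with these language components.
if (!algorithm.supports(english)) {
  throw new IllegalArgumentException("Language not supported by this algorithm.");
}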
Tweaking lexical filters
Lexical filters are responsible for telling the algorithm whether a term or a candidate cluster label should be ignored. Such ignorable elements typically include function words like prepositions and conjunctions, as well as domain-specific boilerplate language that makes a poor cluster label.
The Java API defines two components (interfaces) that are part of each loaded LanguageComponents instance:
- The StopwordFilter interface controls low-level filtering of individual terms:

    public interface StopwordFilter extends Predicate<CharSequence> {
      /**
       * @param word The word to test. Input words are guaranteed to be in lower case
       *     (consistent with {@link Character#toLowerCase(int)}).
       * @return Return {@code false} if the provided term should be ignored in processing.
       */
      boolean test(CharSequence word);
    }
- The LabelFilter interface controls low-level filtering of undesired cluster labels:

    public interface LabelFilter extends Predicate<CharSequence> {
      /**
       * @param label The label to test. Input labels may have mixed case, depending on the
       *     algorithm and their surface forms collected from input documents.
       * @return Return {@code false} if the label candidate should be ignored in processing.
       */
      boolean test(CharSequence label);
    }
Both of these components can be explicitly provided (or overridden at runtime) in a LanguageComponents instance, but implementing a custom filter is often impractical. It is much easier to reuse the default implementation, which builds filters from filter dictionaries encoded in JSON.
To instantiate LanguageComponents for a particular language with a custom set of dictionary resources, copy the default resources shipped with the distribution to a location from which your application can load them.
The LanguageComponentsLoader class has a method to provide a custom resource loader (a ResourceLookup implementation) responsible for locating any external resources the language data may ask for.
To restrict the number of loaded resources to only a subset of all available languages (and/or algorithms), use the loader's limitToLanguages and limitToAlgorithms methods. Here is a full example that loads English resources from a class-relative classpath location:
LanguageComponents custom =
    LanguageComponents.loader()
        // Note we restrict languages to just English because resources for
        // other languages are missing from the location of resource lookup
        // and would have caused an exception.
        .limitToLanguages("English")
        // and we substitute resource lookup locations with our custom location.
        .withResourceLookup(provider -> new ClassRelativeResourceLookup(E03_Languages.class))
        .load()
        .language("English");
Once loaded, the LanguageComponents object is ready to be used for subsequent clustering calls.
Note that the language components loader does not track changes or reload anything automatically (this would be a costly operation). If your lexical resources change often, you'll have to resort to ephemeral dictionaries or a custom implementation of the filtering interfaces described above.
Ephemeral dictionaries (lexical filters)
When filters for labels or stop words change frequently, reloading a full set of dictionaries would be slow and impractical. Ephemeral dictionaries are compiled for each clustering call and layered on top of the default LanguageComponents implementations. Please note that this adds some additional cost to each clustering call.
In the following example, the clustering algorithm uses the default language data for English but adds request-specific regular expression patterns to exclude undesirable labels:
// Load the default dictionaries for English.
LanguageComponents english =
    LanguageComponents.loader()
        .limitToLanguages("English")
        .limitToAlgorithms(new LingoClusteringAlgorithm())
        .load()
        .language("English");

LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();

// Create an ephemeral label filter by providing a dictionary with a
// few regexp exclusion patterns.
DefaultDictionaryImpl labelFilter = new DefaultDictionaryImpl();
labelFilter.regexp.set("(?i).*data.*", "(?i).*mining.*");
algorithm.dictionaries.labelFilters.set(List.of(labelFilter));
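With the ephemeral dictionary attached, clustering proceeds as usual and the extra filters apply to this call (a minimal sketch; the documents stream of input Document instances is assumed to exist):

// Cluster the documents; label candidates matching the regexp patterns
// above are discarded during this call.
List<Cluster<Document>> clusters = algorithm.cluster(documents, english);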
Tweaking LanguageComponents
When the default dictionaries or other components fall short of requirements, a custom replacement may be provided. A LanguageComponents instance consists of a set of interfaces and suppliers providing implementations of these interfaces. We can override any of these suppliers, creating a custom language preprocessing pipeline.
As an example, let's modify the English pipeline to use a custom word stemming component. The supplier for this component always returns the same closure, converting the input to lower case:
Supplier<Stemmer> stemmerSupplier;
stemmerSupplier = () -> (word) -> word.toString().toLowerCase(Locale.ROOT);
We will also replace the default StopwordFilter and LabelFilter components with custom filters based on hardcoded rules:
// Ignore words from the list and anything shorter than 4 characters.
final Set<String> ignored = new HashSet<>(Arrays.asList("from", "what"));
final StopwordFilter wordFilter =
    (word) -> {
      // Ignore any word shorter than 4 characters or on the explicit exclusion list.
      return word.length() >= 4 && !ignored.contains(word.toString());
    };

final LabelFilter labelFilter =
    (label) -> {
      // Ignore any label that has a substring 'data' in it.
      return !label.toString().toLowerCase(Locale.ROOT).contains("data");
    };
Note how the initialization of the ignored hash set is pulled out of the supplier. The cost of setting up a heavy component (creating the hash set itself in this case) can be high, but once set up, suppliers should return instances of the component quickly.
Once we have custom suppliers, overriding their default implementations for English is trivial:
LanguageComponents customized =
    LanguageComponents.loader()
        .load()
        .language("English")
        .override(Stemmer.class, stemmerSupplier)
        // Word and label filters are thread-safe here so we
        // supply the same instance all the time.
        .override(StopwordFilter.class, () -> wordFilter)
        .override(LabelFilter.class, () -> labelFilter);
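As a quick sanity check, we can pull an overridden component back out of the customized instance and exercise it (a sketch, assuming LanguageComponents exposes its components through a get(Class) accessor):

// Our custom stemmer simply lower-cases its input, so this prints "clustering".
Stemmer stemmer = customized.get(Stemmer.class);
System.out.println(stemmer.stem("Clustering"));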
Support for other languages
The core Carrot2 JAR supports only a limited set of popular languages. Additional modules add support for other languages and bring in extra resources required for these languages to work properly.
All language extensions live under the org.carrot2.lang artifact group namespace. Note that many of them come with sizeable dependencies of their own, such as Apache Lucene analyzers or dictionaries.
- carrot2-lang-lucene-chinese: Chinese (traditional and simplified).
- carrot2-lang-lucene-japanese: Japanese.
- carrot2-lang-lucene-korean: Korean.
- carrot2-lang-lucene: adds support for many other languages supported by the Apache Lucene project.
- carrot2-lang-polish: Polish.
A language extension is automatically added to the list of supported language components if it can be located and loaded by Java's standard ServiceLoader.
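For example, once one of the modules above is on the classpath, the corresponding language should appear among the loadable languages (a sketch, assuming the object returned by load() exposes a languages() enumeration):

// Print the names of all languages whose components can be loaded.
System.out.println(LanguageComponents.loader().load().languages());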