Java API Basics

This article will walk you through the basics of the Carrot2 Java API.

Dependencies

Carrot2 is a Java library (JAR) with artifacts published to Maven Central. The following JAR contains the core APIs, language resources for English, and the clustering algorithms:

<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>carrot2-core</artifactId>
  <version>4.0.0</version>
</dependency>

A corresponding Gradle dependency would look like this:

implementation "org.carrot2:carrot2-core:4.0.0"

Clustering non-English languages

Additional JARs may be required to cluster documents in languages other than English. Please refer to the Language components chapter to see what is required.
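
For example, Polish support ships as a separate artifact (the artifact name below is our assumption; the Language components chapter has the authoritative list):

<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>carrot2-lang-polish</artifactId>
  <version>4.0.0</version>
</dependency>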

A binary Carrot2 distribution, including all examples, is always available at the project's GitHub releases page.

API workflow

Using the Carrot2 Java API consists of the following steps:

  1. setting up heavy, thread-safe, reusable components (language dictionaries and auxiliary resources combined in an instance of the LanguageComponents class),

  2. setting up lightweight, per-thread components (clustering algorithms),

  3. preparing input documents and performing the actual clustering.

Heavy and lightweight components

Initialization of heavy components (LanguageComponents) may take significant time. Load them once and then reuse them for all subsequent clustering calls. Heavy components are thread-safe and can be shared between threads.

Lightweight components are cheap to instantiate, so you can create a throw-away instance on demand for each clustering call.
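
As an illustration, the sketch below (not taken from the Carrot2 examples; the executor setup and the batches variable are assumptions made for this snippet) shares a single LanguageComponents instance between worker threads, while each task creates its own throw-away algorithm instance:

// Heavy: load once, share between threads (LanguageComponents is thread-safe).
LanguageComponents shared = LanguageComponents.loader().load().language("English");

// Assumes java.util.concurrent imports; 'batches' is a hypothetical
// collection of Stream<Document> prepared by the caller.
ExecutorService executor = Executors.newFixedThreadPool(4);
List<Future<List<Cluster<Document>>>> results = new ArrayList<>();
for (Stream<Document> batch : batches) {
  results.add(
      executor.submit(() -> {
        // Lightweight: a fresh algorithm instance per clustering call.
        LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
        return algorithm.cluster(batch, shared);
      }));
}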

Clustering

The following example discusses code from the E01_ClusteringBasics.java example. It shows just the key elements required to process a stream of documents in English, without any parameter or language resource tuning.

First, let's load the heavy components: the default resources for the English language. The loaded LanguageComponents instance is thread-safe and should be reused for any subsequent calls to clustering algorithms.

// Our documents are in English so we load appropriate language resources.
// This call can be heavy and an instance of LanguageComponents should be
// created once and reused across different clustering calls.
LanguageComponents languageComponents = LanguageComponents.loader().load().language("English");

Now it's time to create the lightweight component: an instance of a clustering algorithm. Let's pick the Lingo algorithm explicitly:

LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();

Once we have the heavy and lightweight components initialized, we can assemble the input for clustering: a stream of Document instances. Each document must implement a single method that presents its clusterable text fields to the algorithm:

void visitFields(BiConsumer<String, String> fieldConsumer);
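
For instance, a document could be implemented as an explicit class (a minimal sketch; the FieldDocument name is invented for this illustration, and the example code shown below uses a lambda instead):

// Assumes imports of org.carrot2.clustering.Document
// and java.util.function.BiConsumer.
class FieldDocument implements Document {
  private final String title;
  private final String content;

  FieldDocument(String title, String content) {
    this.title = title;
    this.content = content;
  }

  @Override
  public void visitFields(BiConsumer<String, String> fieldConsumer) {
    fieldConsumer.accept("title", title);
    fieldConsumer.accept("content", content);
  }
}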

In this example we will use hardcoded values from an array available in the ExamplesData class:

static final String[][] DOCUMENTS_DATA_MINING = {
  {
    "http://en.wikipedia.org/wiki/Data_mining",
    "Data mining - Wikipedia, the free encyclopedia",
    "Article about knowledge-discovery in databases (KDD), the practice "
        + "of automatically searching large stores of data for patterns."
  },
  // ... (remaining documents omitted)
};

Because the field visitor interface has a single method, it can be implemented with a lambda expression. We convert the data array above into document instances on the fly using Java streams. Note that we only expose the title and snippet fields; the URL is omitted because it is not clusterable text content.

// Create a stream of "documents" for clustering.
// Each such document provides text content fields to a visitor.
Stream<Document> documentStream =
    Arrays.stream(ExamplesData.DOCUMENTS_DATA_MINING)
        .map(
            fields ->
                (Document)
                    fieldVisitor -> {
                      fieldVisitor.accept("title", fields[1]);
                      fieldVisitor.accept("content", fields[2]);
                    });

Everything is now ready to call the clustering algorithm and consume the result. Here we just print the top-level cluster labels and document counts to the console:

// Perform clustering.
List<Cluster<Document>> clusters =
    algorithm.cluster(documentStream, languageComponents);

// Print cluster labels and a document count in each top-level cluster.
for (Cluster<Document> c : clusters) {
  String label = String.join("; ", c.getLabels());
  System.out.println(label + ", documents: " + c.getDocuments().size());
}

When executed, this example should produce the following output:

Data Mining Software, documents: 13
Knowledge Discovery, documents: 13
Data Mining Process, documents: 12
Data Mining Applications, documents: 11
Data Mining Techniques, documents: 11
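
Clusters may also contain subclusters. A small recursive helper like the one below (a sketch, not part of the example) would print the whole cluster tree, indenting subclusters under their parents:

// A sketch: print top-level clusters and all of their subclusters.
static <T> void printClusterTree(List<Cluster<T>> clusters, String indent) {
  for (Cluster<T> c : clusters) {
    System.out.println(
        indent + String.join("; ", c.getLabels())
            + ", documents: " + c.getDocuments().size());
    // Recurse into subclusters (an empty list for leaf clusters).
    printClusterTree(c.getClusters(), indent + "  ");
  }
}

Calling printClusterTree(clusters, "") on the result above prints the same top-level labels, followed by any nested subclusters.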

Tweaking parameters

Each clustering algorithm has its own set of knobs to adjust its behavior. All algorithms in Carrot2 expose public fields, documented in the generated JavaDoc, that allow direct changes to an algorithm's settings. For example:

LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
algorithm.desiredClusterCount.set(10);
algorithm.preprocessing.wordDfThreshold.set(5);
algorithm.preprocessing.phraseDfThreshold.set(5);
algorithm.preprocessing.documentAssigner.minClusterSize.set(4);

Note that even primitive-type attributes like wordDfThreshold above must be modified through their set methods. This is because values are validated eagerly: out-of-range or otherwise incorrect values trigger an exception at the moment they are set.
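
For example (a minimal sketch; we assume here that a constraint violation surfaces as an IllegalArgumentException):

LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
try {
  // A negative document-frequency threshold is out of range, so the
  // attribute should reject it immediately, at set time.
  algorithm.preprocessing.wordDfThreshold.set(-1);
} catch (IllegalArgumentException e) {
  System.out.println("Rejected at set time: " + e.getMessage());
}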

When an attribute's type is an interface, the concrete type variable can be declared and configured prior to setting the final attribute. Local-variable type inference (var), available since Java 10, helps keep these declarations compact while still letting the compiler (and IDE) provide field suggestions as you type:

var factorizationFactory = new LocalNonnegativeMatrixFactorizationFactory();
factorizationFactory.factorizationQuality.set(FactorizationQuality.HIGH);

algorithm.matrixReducer.factorizationFactory = factorizationFactory;

Once an algorithm is configured, it can be reused across many subsequent (sequential) clustering calls. The concurrency section discusses approaches to running algorithms in concurrent threads.
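
For example (a minimal sketch; documentBatches is a hypothetical stand-in for any source of input document streams):

// One configured algorithm instance, reused for many sequential calls.
for (Stream<Document> documents : documentBatches) {
  List<Cluster<Document>> clusters =
      algorithm.cluster(documents, languageComponents);
  System.out.println("Top-level clusters: " + clusters.size());
}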