Java API Basics
This article will walk you through the basics of the Carrot2 Java API.
Dependencies
Carrot2 is a Java library (JAR) with artifacts published to Maven Central. The following dependency provides the core APIs, language resources for English, and the clustering algorithms:
<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>carrot2-core</artifactId>
  <version>4.3.0</version>
</dependency>
A corresponding Gradle dependency would look like this:
implementation "org.carrot2:carrot2-core:4.3.0"
Additional JARs may be required to cluster documents in languages other than English. Please refer to the Language components section to see what's needed.
A binary Carrot2 distribution, including all examples, is always available at the project's GitHub releases page.
API workflow
Using the Carrot2 Java API consists of the following steps:

- setting up heavy, thread-safe, reusable components (language dictionaries and auxiliary resources combined in an instance of the LanguageComponents class),
- setting up lightweight, per-thread components (clustering algorithms),
- preparing input documents and performing the actual clustering.
Heavy and lightweight components
Initialization of heavy components (LanguageComponents) may take significant time. Load them once and reuse the instance for all subsequent clustering calls.
Heavy components are thread-safe and can be reused or shared between threads.
Lightweight components, such as clustering algorithm instances, are cheap to instantiate, so you can create a throw-away instance on demand for each clustering call.
Clustering
This section discusses code from the E01_ClusteringBasics.java
example. It shows just the key elements required to process a stream of documents
in English, without any parameter or language resource tuning.
First, let's load the heavy components: the default resources for the English language. The
loaded LanguageComponents
instance is thread-safe and should be reused for
any subsequent calls to clustering algorithms.
// Our documents are in English so we load appropriate language resources.
// This call can be heavy and an instance of LanguageComponents should be
// created once and reused across different clustering calls.
LanguageComponents languageComponents = LanguageComponents.loader().load().language("English");
Now it's time to create the lightweight component: an instance of a clustering algorithm. Let's pick the Lingo algorithm explicitly:
LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
Once we have the heavy and lightweight components initialized, we can assemble the input
for clustering: a stream of Document
instances. Each document must implement a
single method that presents its clusterable text fields to the algorithm:
void visitFields(BiConsumer<String, String> fieldConsumer);
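Since this is a single abstract method, a lambda can serve as a document. The following self-contained sketch illustrates the mechanism using a hypothetical SimpleDocument stand-in (not the real Carrot2 Document interface, which carries no other requirements but lives in the carrot2-core JAR):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class FieldVisitorDemo {
  // Simplified stand-in for Carrot2's Document interface: a single method
  // that pushes (field name, field value) pairs to a consumer.
  interface SimpleDocument {
    void visitFields(BiConsumer<String, String> fieldConsumer);
  }

  public static void main(String[] args) {
    // Because the interface has a single abstract method, a lambda suffices.
    SimpleDocument doc = visitor -> {
      visitor.accept("title", "Data mining - Wikipedia");
      visitor.accept("content", "Article about knowledge discovery in databases.");
    };

    // Collect the fields the document exposes, much like a clustering
    // algorithm would when consuming the document.
    Map<String, String> fields = new LinkedHashMap<>();
    doc.visitFields(fields::put);
    System.out.println(fields.keySet()); // prints [title, content]
  }
}
```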
In this example we will use hardcoded values from an array available in the
ExamplesData
class:
static final String[][] DOCUMENTS_DATA_MINING = {
  {
    "http://en.wikipedia.org/wiki/Data_mining",
    "Data mining - Wikipedia, the free encyclopedia",
    "Article about knowledge-discovery in databases (KDD), the practice "
        + "of automatically "
        + "searching large stores of data for patterns."
  },
  // ... (remaining entries omitted)
};
Because the field visitor interface is a single method, it can be implemented using a closure. We convert the above data array into document instances dynamically using Java streams. Note how we only expose the title and the snippet fields; the URL is omitted because it is not really clusterable text content.
// Create a stream of "documents" for clustering.
// Each such document provides text content fields to a visitor.
Stream<Document> documentStream =
Arrays.stream(ExamplesData.DOCUMENTS_DATA_MINING)
.map(
fields ->
(fieldVisitor) -> {
fieldVisitor.accept("title", fields[1]);
fieldVisitor.accept("content", fields[2]);
});
Everything is now ready to call the clustering algorithm and consume the result. Here we just print the top-level cluster labels and document counts to the console:
// Perform clustering.
List<Cluster<Document>> clusters;
clusters = algorithm.cluster(documentStream, languageComponents);
// Print cluster labels and a document count in each top-level cluster.
for (Cluster<Document> c : clusters) {
String label = String.join("; ", c.getLabels());
System.out.println(label + ", documents: " + c.getDocuments().size());
}
When executed, this example should result in this output:
Data Mining Software, documents: 13
Knowledge Discovery, documents: 13
Data Mining Process, documents: 12
Data Mining Applications, documents: 11
Data Mining Techniques, documents: 11
Tweaking parameters
Each clustering algorithm has its own set of knobs to adjust its behavior. All algorithms in Carrot2 expose public fields, documented in the generated JavaDoc documentation, that allow direct changes to an algorithm's settings. For example:
LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
algorithm.desiredClusterCount.set(10);
algorithm.preprocessing.wordDfThreshold.set(5);
algorithm.preprocessing.phraseDfThreshold.set(5);
algorithm.preprocessing.documentAssigner.minClusterSize.set(4);
Note that even primitive-type parameters like wordDfThreshold above have to be modified via setter methods. This is because arguments are validated eagerly: out-of-range or otherwise incorrect values trigger an exception at the exact moment they are set in the code.
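This fail-fast design can be illustrated with a small self-contained class (a simplified illustration of the pattern only; it is not Carrot2's actual attribute implementation):

```java
public class EagerValidationDemo {
  // Simplified analogue of a validated attribute: the setter checks the
  // allowed range immediately instead of deferring to clustering time.
  static final class IntAttr {
    private final int min;
    private int value;

    IntAttr(int min, int defaultValue) {
      this.min = min;
      this.value = defaultValue;
    }

    void set(int newValue) {
      if (newValue < min) {
        // Fail at the exact moment of assignment, not later.
        throw new IllegalArgumentException(
            "Value must be >= " + min + ", got: " + newValue);
      }
      this.value = newValue;
    }

    int get() {
      return value;
    }
  }

  public static void main(String[] args) {
    IntAttr wordDfThreshold = new IntAttr(1, 1);
    wordDfThreshold.set(5);      // fine
    try {
      wordDfThreshold.set(-3);   // rejected immediately
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
    System.out.println("Current value: " + wordDfThreshold.get());
  }
}
```

Catching configuration mistakes at the assignment site makes them much easier to trace than a failure surfacing later, deep inside a clustering call.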
When a parameter's type is an interface, the concrete type can be declared and configured in a local variable prior to setting the final parameter. Local-variable type inference (var), available from Java 10 onwards, helps keep these declarations compact while still letting the compiler (and IDE) provide field suggestions as you type:
var factorizationFactory = new LocalNonnegativeMatrixFactorizationFactory();
factorizationFactory.factorizationQuality.set(FactorizationQuality.HIGH);
algorithm.matrixReducer.factorizationFactory = factorizationFactory;
Once an algorithm is configured, it can be reused across many subsequent (sequential) clustering calls. The concurrency section discusses approaches to running algorithms in concurrent threads.
Customizing dictionaries
The simplest way to customize the word and label filter dictionaries is to copy the default dictionaries to an application-specific location and provide a custom ResourceLookup implementation when loading language resources. The following example loads English resources from a class-relative classpath location.
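A sketch of what such loading code might look like, based on the Carrot2 example sources (this fragment requires the carrot2-core dependency and is not compiled here; MyApplication is a hypothetical placeholder for your own class, and exact loader method names may differ between 4.x versions, so verify against the JavaDoc):

```java
// Assumed sketch: resolve dictionary resources relative to the
// application's own class instead of the library defaults.
ResourceLookup resourceLookup = new ClassRelativeResourceLookup(MyApplication.class);

LanguageComponents english =
    LanguageComponents.loader()
        .limitToLanguages("English")
        .withResourceLookup(provider -> resourceLookup)
        .load()
        .language("English");
```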
The Language Components section describes Carrot2 language resource management in detail, including ephemeral dictionaries for temporary per-clustering word and label filtering, LanguageComponents customization for replacing a stemmer or adding custom filtering, and support for other languages.