Algorithms and Concurrency

This article discusses approaches and pitfalls of running clustering algorithms in concurrent environments.

Thread-safety

The Carrot2 Java API components come with the following contracts for concurrent use:

  • clustering algorithm instances are not thread-safe and cannot be used by multiple threads in parallel,
  • language component instances are thread-safe and should be used and reused by parallel threads.

Unlike the 3.x line of Carrot2, the 4.x line has no "managed" support for caching, reusing or coordinating concurrent access to clustering algorithm instances (previously provided by the Controller instance). This is a deliberate decision: algorithm instances are lightweight (cheap to create and discard) and modern JVMs have much more efficient garbage collectors.

The following sections of this article show various approaches to configuring the algorithm once and then reusing that configuration in subsequent, possibly concurrent, clustering calls.

Ephemeral instances

The simplest way to achieve thread-safety is to create the algorithm instance on the fly, configure it appropriately and then discard it after the clustering completes. A simple pattern here would be to create a function that transforms a stream of documents into a list of clusters:

Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      // Algorithm instances are created per-call (per-thread)
      LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
      // ...configured in place
      algorithm.preprocessing.phraseDfThreshold.set(10);
      // and discarded once clustering call completes.
      return algorithm.cluster(documentStream, english);
    };

runConcurrentClustering(processor);

An important assumption here is that the LanguageComponents object (english in the example above) has been initialized beforehand (once) and is reused. See the Language components page for more information on initialization and customization of language resources.
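
The runConcurrentClustering helper called above is not shown in this article. The following is a minimal sketch of what such a helper might look like, assuming a fixed-size thread pool and a supplier that produces a fresh input for each call. The names ConcurrentRunner, runConcurrently, tasks and threads are hypothetical and not part of the Carrot2 API. The sketch uses only JDK classes and generic types, so it works with any function, including the processor defined above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper, not part of Carrot2.
final class ConcurrentRunner {
  /**
   * Applies the processor to a fresh input once per task, using a fixed-size
   * thread pool, and returns the results in submission order. Both the
   * processor and the input supplier are invoked on worker threads, so they
   * must be safe to call concurrently.
   */
  public static <I, O> List<O> runConcurrently(
      Function<I, O> processor, Supplier<I> inputs, int tasks, int threads) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      List<Future<O>> futures = new ArrayList<>();
      for (int i = 0; i < tasks; i++) {
        // Each task obtains its own input and calls the shared processor.
        Callable<O> task = () -> processor.apply(inputs.get());
        futures.add(executor.submit(task));
      }
      List<O> results = new ArrayList<>();
      for (Future<O> future : futures) {
        results.add(future.get()); // Blocks until the task completes.
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for results", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("A clustering task failed", e.getCause());
    } finally {
      executor.shutdown();
    }
  }
}
```

With the processor from the example above, a call could look like ConcurrentRunner.runConcurrently(processor, () -> documents.stream(), 16, 4), where documents is an assumed, pre-loaded list of Document objects. The processor stays safe to share across threads because it creates a new algorithm instance on every call.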

Cloning preconfigured instances

Sometimes the configuration can become fairly complex. Clustering algorithm instances can be converted into and recreated from a map, so we can extract all the details from the preconfigured instance and then deep-clone it on demand, as this example shows:

// Apply any configuration tweaks once.
LingoClusteringAlgorithm preconfigured = new LingoClusteringAlgorithm();
preconfigured.preprocessing.phraseDfThreshold.set(10);
preconfigured.desiredClusterCount.set(10);
preconfigured.matrixReducer.factorizationFactory =
    new NonnegativeMatrixFactorizationKLFactory();

// Populate the map with the algorithm and its attributes.
Map<String, Object> attrs = Attrs.toMap(preconfigured);

// Reuse the previously populated map to create a new cloned instance.
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      ClusteringAlgorithm cloned;
      cloned = Attrs.fromMap(ClusteringAlgorithm.class, attrs);
      return cloned.cluster(documentStream, english);
    };

runConcurrentClustering(processor);
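
The property that makes this approach safe is that the map is captured once and every call builds its own instance from it, so changes to one clone never leak into another. The following stand-alone sketch illustrates that property using only JDK classes. FakeAlgorithm, its two fields with arbitrary default values, and cloningFactory are hypothetical stand-ins for a clustering algorithm and the Attrs.toMap and Attrs.fromMap calls shown above; they are not Carrot2 classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration, not part of Carrot2.
final class SnapshotCloning {
  /** Stand-in for a mutable, non-thread-safe algorithm instance. */
  public static final class FakeAlgorithm {
    public int desiredClusterCount = 30; // Arbitrary default for the sketch.
    public int phraseDfThreshold = 1;    // Arbitrary default for the sketch.

    /** Exports the current settings into a new map. */
    public Map<String, Object> toMap() {
      Map<String, Object> attrs = new HashMap<>();
      attrs.put("desiredClusterCount", desiredClusterCount);
      attrs.put("phraseDfThreshold", phraseDfThreshold);
      return attrs;
    }

    /** Creates a new instance configured from the given map. */
    public static FakeAlgorithm fromMap(Map<String, Object> attrs) {
      FakeAlgorithm algorithm = new FakeAlgorithm();
      algorithm.desiredClusterCount = (Integer) attrs.get("desiredClusterCount");
      algorithm.phraseDfThreshold = (Integer) attrs.get("phraseDfThreshold");
      return algorithm;
    }
  }

  /**
   * Captures the settings of the preconfigured instance once, in a read-only
   * map, and returns a factory that builds an independent instance per call.
   */
  public static Supplier<FakeAlgorithm> cloningFactory(FakeAlgorithm preconfigured) {
    Map<String, Object> snapshot =
        Collections.unmodifiableMap(new HashMap<>(preconfigured.toMap()));
    return () -> FakeAlgorithm.fromMap(snapshot);
  }
}
```

Because the snapshot is read-only and never modified after creation, the returned factory can be called from any number of threads. Later changes to the preconfigured instance, or to any clone, do not affect instances created afterwards.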