Algorithms and Concurrency
This article discusses approaches to, and pitfalls of, running clustering algorithms in concurrent environments.
Thread-safety
The following sentences explain the contracts required for concurrent use of Carrot2 Java API components:
- clustering algorithm instances are not thread-safe and cannot be used by multiple threads in parallel,
- language component instances are thread-safe and should be used and reused by parallel threads.
Unlike the 3.x line of Carrot2, the 4.x line has no "managed" support for caching, reusing or managing concurrency of clustering algorithm instances (previously provided by the Controller instance). This is a deliberate decision: algorithm instances are lightweight (cheap to create and discard) and modern JVMs have much better garbage collection mechanisms, so short-lived instances are inexpensive.
The following sections of this article show various approaches to configuring the algorithm once and then reusing that configuration in subsequent, possibly concurrent, clustering calls.
Ephemeral instances
The simplest way to achieve thread-safety is to create the algorithm instance on the fly, configure it appropriately and then discard it after the clustering completes. A simple pattern here would be to create a function that transforms a stream of documents into a list of clusters:
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      // Algorithm instances are created per-call (per-thread)
      LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
      // ...configured in place
      algorithm.preprocessing.phraseDfThreshold.set(10);
      // and discarded once clustering call completes.
      return algorithm.cluster(documentStream, english);
    };
runConcurrentClustering(processor);
An important assumption here is that the LanguageComponents object (english in the example above) has been initialized beforehand (once) and is reused. See the Language components page for more information on initialization and customization of language resources.
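The runConcurrentClustering helper used above is not shown in this article. The following is a minimal sketch of what such a driver might look like, using only JDK classes. The names runConcurrently and WordGrouper are hypothetical: WordGrouper is a toy stand-in for a cheap, non-thread-safe algorithm and is not part of the Carrot2 API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ConcurrentDriverSketch {
  /** Hypothetical stand-in for an algorithm: mutable, not thread-safe, cheap to create. */
  static final class WordGrouper {
    int minLength = 1; // configurable attribute (toy default)
    private final List<String> scratch = new ArrayList<>(); // mutable working state

    List<String> group(List<String> words) {
      scratch.clear();
      for (String w : words) {
        if (w.length() >= minLength) {
          scratch.add(w);
        }
      }
      return new ArrayList<>(scratch);
    }
  }

  /** Applies the processor to every input on a fixed thread pool; results keep input order. */
  static <I, O> List<O> runConcurrently(Function<I, O> processor, List<I> inputs, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<O>> futures = new ArrayList<>();
      for (I input : inputs) {
        Callable<O> task = () -> processor.apply(input);
        futures.add(pool.submit(task));
      }
      List<O> results = new ArrayList<>();
      for (Future<O> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for results", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Processing task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // A fresh instance is created, configured and discarded inside each call,
    // so no mutable state is ever shared between threads.
    Function<List<String>, List<String>> processor =
        words -> {
          WordGrouper grouper = new WordGrouper();
          grouper.minLength = 4;
          return grouper.group(words);
        };

    List<List<String>> inputs =
        List.of(List.of("data", "of", "mining"), List.of("a", "cluster"));
    System.out.println(runConcurrently(processor, inputs, 2));
  }
}
```

Because the processor function captures only read-only state and builds its own instance on every call, it can be handed to any number of worker threads without extra synchronization.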
Cloning preconfigured instances
Sometimes the configuration can become fairly complex. Clustering algorithm instances can be converted into and recreated from a map, so we can extract all the details from the preconfigured instance and then deep-clone it on demand, as this example shows:
// Apply any configuration tweaks once.
LingoClusteringAlgorithm preconfigured = new LingoClusteringAlgorithm();
preconfigured.preprocessing.phraseDfThreshold.set(10);
preconfigured.desiredClusterCount.set(10);
preconfigured.matrixReducer.factorizationFactory =
    new NonnegativeMatrixFactorizationKLFactory();
// Populate the map with algorithm and its attributes.
Map<String, Object> attrs = Attrs.toMap(preconfigured);
// Reuse the previously populated map to create a new cloned instance.
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      ClusteringAlgorithm cloned = Attrs.fromMap(ClusteringAlgorithm.class, attrs);
      return cloned.cluster(documentStream, english);
    };
runConcurrentClustering(processor);
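To illustrate why a round trip through a map yields independent copies, here is a minimal sketch of the idea using only JDK classes. ToyAlgorithm, its attributes, their default values and its toMap and fromMap methods are hypothetical stand-ins; they are not the Carrot2 Attrs implementation, which handles nested components and types generically.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CloneViaMapSketch {
  /** Hypothetical stand-in for a configurable, non-thread-safe algorithm. */
  static final class ToyAlgorithm {
    int desiredClusterCount = 30; // toy default, an assumption
    int phraseDfThreshold = 1; // toy default, an assumption

    /** Captures every attribute as a plain value; the map holds no reference to this instance. */
    Map<String, Object> toMap() {
      Map<String, Object> attrs = new LinkedHashMap<>();
      attrs.put("desiredClusterCount", desiredClusterCount);
      attrs.put("phraseDfThreshold", phraseDfThreshold);
      return attrs;
    }

    /** Builds a brand new instance and applies the stored attribute values to it. */
    static ToyAlgorithm fromMap(Map<String, Object> attrs) {
      ToyAlgorithm algorithm = new ToyAlgorithm();
      algorithm.desiredClusterCount = (Integer) attrs.get("desiredClusterCount");
      algorithm.phraseDfThreshold = (Integer) attrs.get("phraseDfThreshold");
      return algorithm;
    }
  }

  public static void main(String[] args) {
    ToyAlgorithm preconfigured = new ToyAlgorithm();
    preconfigured.desiredClusterCount = 10;

    // An immutable snapshot can be shared freely between threads.
    Map<String, Object> attrs = Map.copyOf(preconfigured.toMap());

    ToyAlgorithm first = ToyAlgorithm.fromMap(attrs);
    ToyAlgorithm second = ToyAlgorithm.fromMap(attrs);
    first.desiredClusterCount = 99; // mutating one clone...
    System.out.println(second.desiredClusterCount); // ...leaves the other at 10
  }
}
```

The shared map is only ever read after it is populated, while each call works on its own freshly built instance, which is what makes the pattern safe under concurrency.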