Tuning performance

This document presents a few hints for improving clustering performance, especially for larger data sets.

Carrot2 clustering algorithms are designed to be fast, but the trade-off is that they keep all their data structures in memory. The Java virtual machine's heap usage grows quickly with the overall size of the input text, and the more documents you pass in and the longer they are, the longer clustering will take.

Below are a few generic guidelines on improving clustering performance.

Reduce size of input

In many cases, clustering short document excerpts works just as well as, or even better than, clustering full documents (a simple truncation sketch follows the list below). Consider replacing the full content with:

  • query-matching document fragments (such as search result snippets), if input documents are a result of some type of user-entered query,
  • titles and abstracts of documents, if they are available,
  • just the leading few sentences or paragraphs of each document.
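
For instance, keeping only the leading sentences of each document can be done with a small helper applied before the text is handed to the clustering algorithm. The sketch below is plain JDK code, not part of the Carrot2 API; the helper name and the sentence limit are illustrative only.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class InputTrimmer {
  /**
   * Returns at most the first {@code maxSentences} sentences of {@code text}.
   * A plain-JDK sketch; sentence detection quality depends on BreakIterator.
   */
  static String leadingSentences(String text, int maxSentences) {
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
    it.setText(text);
    int end = text.length();
    for (int boundary = it.first(), count = 0;
         boundary != BreakIterator.DONE;
         boundary = it.next()) {
      if (count++ == maxSentences) {
        end = boundary; // cut at the end of the last wanted sentence
        break;
      }
    }
    return text.substring(0, end).trim();
  }
}
```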

Batch and merge smaller clustering runs

When the input collection of documents is too large to cluster as a whole, dividing the input into smaller batches (or sampling smaller batches from the input), clustering each batch separately, and finally merging the results based on cluster label text gives very reasonable results.

This approach works because cluster labels that recur across smaller batches are very likely to be significant for the entire collection. The downside is that very small clusters, containing just a few documents, are likely to be lost in the process.
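
A minimal sketch of this batching approach is shown below. It assumes the Carrot2 4.x Java API (LingoClusteringAlgorithm, Cluster.getLabels(), Cluster.getDocuments()); the helper name, the batch size parameter, and the use of a plain map keyed on label text to hold merged results are illustrative choices, not part of Carrot2.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.carrot2.clustering.Cluster;
import org.carrot2.clustering.Document;
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.language.LanguageComponents;

final class BatchClustering {
  /**
   * Clusters documents in batches of {@code batchSize} and merges per-batch
   * clusters that share the same label text. Small clusters that appear in
   * only one batch may be lost, as noted above.
   */
  static <T extends Document> Map<String, List<T>> clusterInBatches(
      List<T> documents, int batchSize, LanguageComponents languageComponents) {
    Map<String, List<T>> merged = new LinkedHashMap<>();
    for (int start = 0; start < documents.size(); start += batchSize) {
      List<T> batch =
          documents.subList(start, Math.min(start + batchSize, documents.size()));
      List<Cluster<T>> clusters =
          new LingoClusteringAlgorithm().cluster(batch.stream(), languageComponents);
      for (Cluster<T> cluster : clusters) {
        // Merge on the concatenated label text of each per-batch cluster.
        String label = String.join(", ", cluster.getLabels());
        merged.computeIfAbsent(label, k -> new ArrayList<>())
              .addAll(cluster.getDocuments());
      }
    }
    return merged;
  }
}
```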

Tune algorithm attributes

In many cases the default attribute settings of each algorithm may not be suitable for very large inputs. Below are some attribute tweaks worth considering. You will likely need to experiment a bit to adjust their values to the size of your particular input.

STC, Bisecting k-Means

wordDfThreshold

Increase the minimum document frequency (the minimum number of documents a word must appear in) to a higher value. Something like 0.5% of the number of documents typically works well. For example, for a collection of 5000 documents, set the attribute to 25.
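
As an example, assuming the Carrot2 4.x Java API and its preprocessing.wordDfThreshold attribute path, the setting for a 5000-document collection could look like the sketch below (the wrapper class and method name are illustrative only):

```java
import org.carrot2.clustering.stc.STCClusteringAlgorithm;

final class StcTuning {
  static STCClusteringAlgorithm forLargeInput() {
    STCClusteringAlgorithm algorithm = new STCClusteringAlgorithm();
    // Roughly 0.5% of a 5000-document collection, as suggested above.
    algorithm.preprocessing.wordDfThreshold.set(25);
    return algorithm;
  }
}
```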

Lingo

wordDfThreshold, phraseDfThreshold

Increase the minimum document frequency (the minimum number of documents a word or phrase must appear in) to a higher value. Something like 0.5% of the number of documents typically works well. For example, for a collection of 5000 documents, set both attributes to 25.
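
A corresponding sketch for Lingo, again assuming the 4.x Java API and its preprocessing.* attribute paths (wrapper class and method name are illustrative only):

```java
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;

final class LingoDfTuning {
  static LingoClusteringAlgorithm forLargeInput() {
    LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
    // Roughly 0.5% of a 5000-document collection, as suggested above.
    algorithm.preprocessing.wordDfThreshold.set(25);
    algorithm.preprocessing.phraseDfThreshold.set(25);
    return algorithm;
  }
}
```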

factorizationQuality

For algorithm.matrixReducer.factorizationFactory implementations that support this attribute, lower factorizationQuality. This causes the matrix factorization algorithm to perform fewer iterations and hence complete more quickly.

Alternatively, you can set algorithm.matrixReducer.factorizationFactory to an instance of PartialSingularValueDecompositionFactory, which is slightly faster than the other factorizations and does not expose an explicit factorizationQuality attribute.
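
A sketch of the partial-SVD swap, assuming the Carrot2 4.x Java API; the class package (org.carrot2.math.matrix) and the plain field assignment on matrixReducer should be checked against your Carrot2 version, and lowering factorizationQuality is done analogously on factory implementations that expose that attribute:

```java
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.math.matrix.PartialSingularValueDecompositionFactory;

final class LingoFactorizationTuning {
  static LingoClusteringAlgorithm withPartialSvd() {
    LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
    // Partial SVD: slightly faster and has no factorizationQuality attribute.
    algorithm.matrixReducer.factorizationFactory =
        new PartialSingularValueDecompositionFactory();
    return algorithm;
  }
}
```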

maximumMatrixSize

Lower the maximum matrix size (the maximumMatrixSize attribute of matrixBuilder). This will make the matrix factorization complete more quickly and use less memory. The trade-off is that with a small matrix, Lingo may not be able to discover smaller clusters.
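
A short sketch, again assuming the 4.x Java API; the cap value below is an arbitrary illustration, not a recommended setting:

```java
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;

final class LingoMatrixSizeTuning {
  static LingoClusteringAlgorithm withSmallerMatrix() {
    LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
    // Illustrative cap only; smaller values are faster but may hide small clusters.
    algorithm.matrixBuilder.maximumMatrixSize.set(25_000);
    return algorithm;
  }
}
```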