Tuning performance
This document presents hints for improving clustering performance, especially on larger data sets.
Carrot2 clustering algorithms are designed to be fast, but the trade-off is that they keep all their data structures in memory. The Java virtual machine's heap usage grows quickly with the overall size of the input text. Clustering time also increases with the number of input documents and their length.
Below are a few generic guidelines on improving clustering performance.
Reduce size of input
In many cases, clustering short document excerpts works just as well as, or even better than, clustering full documents. Consider replacing the full content with:
- query-matching document fragments (such as search result snippets), if the input documents are the result of a user-entered query,
- titles and abstracts of documents, if they are available,
- just the leading few sentences or paragraphs of each document.
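As a rough sketch of the last option, the leading sentences of each document can be extracted with the JDK's BreakIterator before the text is handed to clustering. The truncateToSentences helper and the sentence count below are illustrative choices, not part of the Carrot2 API:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class LeadingSentences {
  /** Returns at most maxSentences leading sentences of text. */
  static String truncateToSentences(String text, int maxSentences) {
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
    it.setText(text);
    int end = it.first(); // boundary at index 0
    for (int i = 0; i < maxSentences; i++) {
      int next = it.next();
      if (next == BreakIterator.DONE) {
        break; // fewer sentences than requested; keep the whole text
      }
      end = next;
    }
    return text.substring(0, end).trim();
  }

  public static void main(String[] args) {
    String doc = "First sentence. Second sentence. Third sentence. Fourth sentence.";
    System.out.println(truncateToSentences(doc, 2));
    // prints: First sentence. Second sentence.
  }
}
```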
Batch and merge smaller clustering runs
When the input collection of documents is too large to cluster as a whole, dividing it into smaller batches (or sampling smaller batches from the input), clustering each batch separately, and finally merging the results based on cluster label text gives very reasonable results.
The above approach works because cluster labels that recur across smaller batches are very likely to be significant for the entire collection. The downside is that very small clusters, containing just a few documents, are likely to be lost in the process.
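A minimal sketch of the merge step, assuming each batch run has been reduced to a map from cluster label text to document identifiers. This Map-based representation is an illustration only, not the Carrot2 cluster model:

```java
import java.util.*;

public class MergeByLabel {
  /** Merges per-batch clustering results by exact label text. */
  static Map<String, List<String>> merge(List<Map<String, List<String>>> batchResults) {
    Map<String, List<String>> merged = new LinkedHashMap<>();
    for (Map<String, List<String>> batch : batchResults) {
      batch.forEach((label, docs) ->
          merged.computeIfAbsent(label, k -> new ArrayList<>()).addAll(docs));
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, List<String>> batch1 =
        Map.of("Data Mining", List.of("d1", "d2"), "Clustering", List.of("d3"));
    Map<String, List<String>> batch2 =
        Map.of("Data Mining", List.of("d7"), "Visualization", List.of("d8"));
    // "Data Mining" recurs in both batches, so its documents are pooled.
    System.out.println(merge(List.of(batch1, batch2)));
  }
}
```

A real merge would also drop labels that occur in only one batch with very few documents, mirroring the loss of small clusters described above.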
Tune algorithm attributes
In many cases the default attribute settings of each algorithm may not be suitable for very large inputs. Below are some attribute-tweaking suggestions to consider. You will likely need to experiment a little to match the values to the size of your particular input.
STC, Bisecting k-Means
wordDfThreshold
Increase the minimum document frequency (minimum number of occurrences) of terms and phrases. A value of around 0.5% of the number of documents typically works well; for example, for a collection of 5000 documents, set the attribute to 25.
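The 0.5% rule of thumb is easy to compute from the collection size. The dfThreshold helper below is purely illustrative; the computed value is what you would assign to wordDfThreshold (and, for Lingo, phraseDfThreshold):

```java
public class DfThreshold {
  /** Rule of thumb: minimum document frequency at roughly 0.5% of the collection, at least 1. */
  static int dfThreshold(int documentCount) {
    return Math.max(1, Math.round(documentCount * 0.005f));
  }

  public static void main(String[] args) {
    System.out.println(dfThreshold(5000)); // prints 25, matching the example above
  }
}
```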
Lingo
wordDfThreshold, phraseDfThreshold
Increase the minimum document frequency (minimum number of occurrences) of terms and phrases. A value of around 0.5% of the number of documents typically works well; for example, for a collection of 5000 documents, set the attribute to 25.
factorizationQuality
For algorithm.matrixReducer.factorizationFactory implementations that support this attribute, lower factorizationQuality. This will cause the matrix factorization algorithm to perform fewer iterations and hence complete more quickly.
Alternatively, you can set algorithm.matrixReducer.factorizationFactory to an implementation of PartialSingularValueDecompositionFactory, which is slightly faster than the other factorizations and does not have an explicit factorizationQuality attribute.
maximumMatrixSize
Lower the maximum matrix size in matrixBuilder. This will cause the matrix factorization algorithm to complete more quickly and use less memory. The trade-off is that with small matrix sizes, Lingo may not be able to discover smaller clusters.
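Putting the Lingo suggestions together, a configuration sketch might look like the following. The attribute paths follow the names used in this section, but the exact field layout and setter calls (for example the preprocessing path and .set(...)) are assumptions to verify against your Carrot2 version's API, and 10000 is an arbitrary example matrix size:

```java
// Sketch only: assumes the Carrot2 4.x attribute API; verify names against your version.
LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();

// Raise minimum document frequencies (0.5% of a 5000-document collection).
algorithm.preprocessing.wordDfThreshold.set(25);
algorithm.preprocessing.phraseDfThreshold.set(25);

// Swap in the slightly faster factorization without a factorizationQuality attribute.
algorithm.matrixReducer.factorizationFactory = new PartialSingularValueDecompositionFactory();

// Cap the term-document matrix size (example value; smaller matrices are
// faster and leaner but may lose small clusters).
algorithm.matrixBuilder.maximumMatrixSize.set(10000);
```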