Choosing the algorithm

This section offers some suggestions on how to choose the clustering algorithm suitable for specific input data.

Algorithms to choose from

Carrot2 comes with a number of clustering algorithms of varying performance and quality characteristics. There is also Lingo3G, a fast highly-tunable commercial algorithm that seamlessly integrates with Carrot2.

Lingo

The Lingo algorithm employs term-document matrix dimensionality reduction techniques to figure out the structure of topics present in the input. The algorithm is reasonably fast and nicely separates diverse topics into separate groups. This will typically be the default algorithm you may want to use.

STC

The key data structure in the STC algorithm is the Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred frequently in the input documents and merges sub-groups of documents with high overlap. The algorithm is very fast, even for large sets of documents, but suffers from the lack of cluster diversity.

Bisecting k-means

Bisecting k-means is an implementation of a generic cluster analysis technique, contrary to Lingo and STC which are text-specific. The algorithm attempts to recursively split the documents into smaller and smaller groups until the splits become unfeasible. The disadvantage of this algorithm is that it creates single-term labels, which are often subpar compared to phrase-based labels produced by STC or Lingo.

Lingo3G

Lingo3G is a commercial successor to the Lingo algorithm. The algorithm offers much higher performance, lower memory consumption and many more tuning options compared to Lingo. Please refer to Carrot Search Lingo3G web site for details.

Algorithm characteristics

The following table summarizes key characteristics of each algorithm available in Carrot2.

Feature Lingo STC k-means
Cluster diversity High, many small (outlier) clusters highlighted Low, small (outlier) clusters rarely highlighted Low, small (outlier) clusters rarely highlighted
Cluster labels Longer, often more descriptive Shorter, but still appropriate One-word only, may not always describe all documents in the cluster
Scalability Low. For more than about 1000 documents, Lingo clustering will take a long time and large memory. High Low, based on similar data structures as Lingo.
Overlapping clusters Yes. A document can belong to more than one cluster. Yes. A document can belong to more than one cluster. No. A document can belong to only one cluster.

A comparison of key characteristics of Carrot2 clustering algorithms.

Algorithm recommendations

It's quite difficult to give one clear recommendation as to which algorithm is "the best". The decision is influenced by multiple criteria such as performance or cluster label legibility.

Many people (including us) have a subjective feeling that the Lingo-family of algorithms (Lingo and Lingo3G) delivers more intuitive and diverse clusters compared to other algorithms. Sometimes it is other characteristics, such as performance or cluster structure, that should be taken into account.

With this in mind, the following table lists the scenarios for which the specific algorithm is best suited.

Requirement Use Lingo Use STC Use k-means Consider Lingo3G
Well-formed longer labels required
Highlighting of small (outlier) clusters required
High clustering performance or large document set processing required
Need non-overlapping clusters

Optimum usage scenarios for Carrot2 clustering algorithms.

While the table above can be useful to determine which algorithm to choose, the ultimate judgment should be based on an empirical evaluation with your specific documents Make sure to also have a look at performance tuning, and quality tuning tips.