Choosing the algorithm

This section offers some suggestions on how to choose the clustering algorithm suitable for specific input data.

Algorithms to choose from

Carrot² comes with a number of clustering algorithms of varying performance and quality characteristics. There is also Lingo3G, a fast highly-tunable commercial algorithm that seamlessly integrates with Carrot².

Lingo

The Lingo algorithm employs term-document matrix dimensionality reduction techniques to figure out the structure of topics present in the input. The algorithm is reasonably fast and nicely separates diverse topics into separate groups. This will typically be the default algorithm you may want to use.

STC

The key data structure in the STC algorithm is the Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred frequently in the input documents and merges sub-groups of documents with high overlap. The algorithm is very fast, even for large sets of documents, but suffers from the lack of cluster diversity.

Bisecting k-means

Bisecting k-means is an implementation of a generic cluster analysis technique, contrary to Lingo and STC which are text-specific. The algorithm attempts to recursively split the documents into smaller and smaller groups until the splits become unfeasible. The disadvantage of this algorithm is that it creates single-term labels, which are often subpar compared to phrase-based labels produced by STC or Lingo.

Lingo3G

Lingo3G is a commercial successor to the Lingo algorithm. The algorithm offers much higher performance, lower memory consumption and many more tuning options compared to Lingo. Please refer to Carrot Search Lingo3G web site for details.

Algorithm characteristics

The following table summarizes key characteristics of each algorithm available in Carrot².

Feature	Lingo	STC	k-means
Cluster diversity	High, many small (outlier) clusters highlighted	Low, small (outlier) clusters rarely highlighted	Low, small (outlier) clusters rarely highlighted
Cluster labels	Longer, often more descriptive	Shorter, but still appropriate	One-word only, may not always describe all documents in the cluster
Scalability	Low. For more than about 1000 documents, Lingo clustering will take a long time and large memory.	High	Low, based on similar data structures as Lingo.
Overlapping clusters	Yes. A document can belong to more than one cluster.	Yes. A document can belong to more than one cluster.	No. A document can belong to only one cluster.

A comparison of key characteristics of Carrot² clustering algorithms.

Algorithm recommendations

It's quite difficult to give one clear recommendation as to which algorithm is "the best". The decision is influenced by multiple criteria such as performance or cluster label legibility.

Many people (including us) have a subjective feeling that the Lingo-family of algorithms (Lingo and Lingo3G) delivers more intuitive and diverse clusters compared to other algorithms. Sometimes it is other characteristics, such as performance or cluster structure, that should be taken into account.

With this in mind, the following table lists the scenarios for which the specific algorithm is best suited.

Requirement	Use Lingo	Use STC	Use k-means	Consider Lingo3G
Well-formed longer labels required	✔			✔
Highlighting of small (outlier) clusters required	✔			✔
High clustering performance or large document set processing required		✔		✔
Need non-overlapping clusters			✔

Optimum usage scenarios for Carrot² clustering algorithms.

While the table above can be useful to determine which algorithm to choose, the ultimate judgment should be based on an empirical evaluation with your specific documents Make sure to also have a look at performance tuning, and quality tuning tips.

previous article
Dictionaries

next article
Tuning clustering

API elements

Sections and content