Algorithms

Carrot2 offers clustering algorithms of varying characteristics described below. There is also Lingo3G, a fast highly-tunable commercial algorithm that seamlessly integrates with Carrot2.

Each clustering algorithm tackles the problem of identifying groups of related documents in a different way. Therefore, for the same input the algorithms will produce different clusters and will have different performance characteristics.

For example, compare the following cluster labels produced for the same input (data mining search results) by our three clustering algorithms:

The same input clustered with different algorithms.

The same input ("data mining" search results) clustered with three different Carrot 2 clustering algorithms: STC (left), Lingo (middle) and k-means (right).

There is no single "best" text clustering algorithm. The choice will depend on the characteristics and size of the input texts as well as your particular needs and expectations. While we provide some general algorithm choice advice and performance tuning hints, you will ultimately need to experiment, tweak parameters, and select what fits your data and use case.

Lingo algorithm

The Lingo algorithm employs term-document matrix dimensionality reduction techniques to figure out the structure of topics present in the input. The algorithm is reasonably fast and nicely separates diverse topics into separate groups. This will typically be the default algorithm you may want to use.

STC algorithm

The key data structure in the STC algorithm is the Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred frequently in the input documents and merges sub-groups of documents with high overlap. The algorithm is very fast, even for large sets of documents, but suffers from the lack of cluster diversity.

Bisecting k-means

Bisecting k-means is an implementation of a generic cluster analysis technique, contrary to Lingo and STC which are text-specific. The algorithm attempts to recursively split the documents into smaller and smaller groups until the splits become unfeasible. The disadvantage of this algorithm is that it creates single-term labels, which are often subpar compared to phrase-based labels produced by STC or Lingo.

Lingo3G algorithm

Lingo3G is a commercial successor to the Lingo algorithm. The algorithm offers much higher performance, lower memory consumption and many more tuning options compared to Lingo. Please refer to Carrot Search Lingo3G web site for details.