Choosing the algorithm
This section offers some suggestions on how to choose the clustering algorithm suitable for specific input data.
Algorithms to choose from
Carrot2 comes with a number of clustering algorithms of varying performance and quality characteristics. There is also Lingo3G, a fast highly-tunable commercial algorithm that seamlessly integrates with Carrot2.
Lingo
The Lingo algorithm employs term-document matrix dimensionality reduction techniques to figure out the structure of topics present in the input. The algorithm is reasonably fast and nicely separates diverse topics into separate groups. This will typically be the default algorithm you may want to use.
STC
The key data structure in the STC algorithm is the Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred frequently in the input documents and merges sub-groups of documents with high overlap. The algorithm is very fast, even for large sets of documents, but suffers from the lack of cluster diversity.
Bisecting k-means
Bisecting k-means is an implementation of a generic cluster analysis technique, contrary to Lingo and STC which are text-specific. The algorithm attempts to recursively split the documents into smaller and smaller groups until the splits become unfeasible. The disadvantage of this algorithm is that it creates single-term labels, which are often subpar compared to phrase-based labels produced by STC or Lingo.
Lingo3G
Lingo3G is a commercial successor to the Lingo algorithm. The algorithm offers much higher performance, lower memory consumption and many more tuning options compared to Lingo. Please refer to Carrot Search Lingo3G web site for details.
Algorithm characteristics
The following table summarizes key characteristics of each algorithm available in Carrot2.
Feature | Lingo | STC | k-means |
---|---|---|---|
Cluster diversity | High, many small (outlier) clusters highlighted | Low, small (outlier) clusters rarely highlighted | Low, small (outlier) clusters rarely highlighted |
Cluster labels | Longer, often more descriptive | Shorter, but still appropriate | One-word only, may not always describe all documents in the cluster |
Scalability | Low. For more than about 1000 documents, Lingo clustering will take a long time and large memory. | High | Low, based on similar data structures as Lingo. |
Overlapping clusters | Yes. A document can belong to more than one cluster. | Yes. A document can belong to more than one cluster. | No. A document can belong to only one cluster. |
A comparison of key characteristics of Carrot2 clustering algorithms.
Algorithm recommendations
It's quite difficult to give one clear recommendation as to which algorithm is "the best". The decision is influenced by multiple criteria such as performance or cluster label legibility.
Many people (including us) have a subjective feeling that the Lingo-family of algorithms (Lingo and Lingo3G) delivers more intuitive and diverse clusters compared to other algorithms. Sometimes it is other characteristics, such as performance or cluster structure, that should be taken into account.
With this in mind, the following table lists the scenarios for which the specific algorithm is best suited.
Requirement | Use Lingo | Use STC | Use k-means | Consider Lingo3G |
---|---|---|---|---|
Well-formed longer labels required | ✔ | ✔ | ||
Highlighting of small (outlier) clusters required | ✔ | ✔ | ||
High clustering performance or large document set processing required | ✔ | ✔ | ||
Need non-overlapping clusters | ✔ |
Optimum usage scenarios for Carrot2 clustering algorithms.
While the table above can be useful to determine which algorithm to choose, the ultimate judgment should be based on an empirical evaluation with your specific documents Make sure to also have a look at performance tuning, and quality tuning tips.