Tuning quality
This article discusses approaches to tuning Carrot2 clustering algorithms.
Desirable characteristics of input documents
The quality of clusters and their labels largely depends on the quality of the documents provided as input. Although there is no general rule for optimum document content, below are some tips worth considering.
- Carrot2 is designed for small to medium collections of documents. Carrot2 algorithms perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, check out the Apache Mahout project or Carrot Search Lingo4G.
- Provide a minimum of 100 documents. Carrot2 clustering algorithms work better as the amount of data increases. About a hundred input documents is probably the minimum needed for the features discovered by the algorithm to have any statistical significance. In general, the optimum number of documents would probably fall between 100 and a few thousand. More than that may cause problems due to in-memory processing.
- Provide query-context snippets instead of entire documents. If the input documents are the result of a search query, provide contextual snippets related to that query (similar to what web search engines return) instead of full document content. Not only will this speed up processing, but it should also guide the clustering algorithm to discover a query-related spectrum of topics.
- Minimize noise in the input documents. Noisy fragments in the input, such as truncated sentences, random alphanumeric strings or repeated boilerplate, may decrease the quality of cluster labels. If you are extracting query context for clustering, retrieving complete sentences instead of truncated fragments should improve cluster labels even further.
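The snippet and noise tips above can be approximated in a preprocessing step before documents reach the clustering algorithm. The following is a minimal Python sketch and is not part of Carrot2: the helper names, the naive sentence splitter and the "long mixed letter/digit token" noise rule are all assumptions made for illustration.

```python
import re

# Hypothetical helpers for illustration; none of this is Carrot2 API.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumed noise rule: 12+ character tokens mixing letters and digits (ids, hashes).
NOISE_TOKEN = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{12,}$")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def is_complete(sentence):
    """Treat a sentence as complete if it ends with terminal punctuation."""
    return sentence[-1] in ".!?"


def strip_noise(sentence):
    """Drop tokens that match the assumed noise rule."""
    return " ".join(t for t in sentence.split() if not NOISE_TOKEN.match(t))


def query_snippet(text, query, max_sentences=2):
    """Return up to max_sentences complete sentences that mention a query term."""
    terms = {t.lower() for t in query.split()}
    picked = []
    for sentence in split_sentences(text):
        if len(picked) >= max_sentences:
            break
        words = set(re.findall(r"\w+", sentence.lower()))
        if is_complete(sentence) and terms & words:
            picked.append(strip_noise(sentence))
    return " ".join(picked)
```

Truncated trailing fragments are skipped rather than repaired, so a document whose only match is a cut-off sentence yields an empty snippet. A production pipeline would likely use a proper sentence segmenter and the search engine's own highlighter instead.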
Choosing the clustering algorithm
Carrot2 offers several clustering algorithms. They take different algorithmic approaches and vary in performance characteristics and in how they find and label document clusters.
It's quite difficult to give one clear recommendation as to which algorithm is "the best": the decision is a problem of weighing multiple criteria such as performance or cluster label legibility. Compare the following three sets of cluster labels acquired for the same input using different algorithms:
Many people (including us) have a subjective feeling that the Lingo family of algorithms (Lingo and Lingo3G) delivers more intuitive and diverse clusters than other algorithms. Sometimes, however, other characteristics such as performance or cluster structure should be taken into account.
The following table summarizes key characteristics of each algorithm shipped in Carrot2.
Feature | Lingo | STC | k-means |
---|---|---|---|
Cluster diversity | High, many small (outlier) clusters highlighted | Low, small (outlier) clusters rarely highlighted | Low, small (outlier) clusters rarely highlighted |
Cluster labels | Longer, often more descriptive | Shorter, but still appropriate | One-word only, may not always describe all documents in the cluster |
Scalability | Low. For more than about 1000 documents, Lingo clustering will take a long time and require a lot of memory. | High | Low, based on data structures similar to Lingo's. |
Overlapping clusters | Yes. A document can belong to more than one cluster. | Yes. A document can belong to more than one cluster. | No. A document can belong to only one cluster. |
Lingo3G, the commercial algorithm that plugs into Carrot2, has a separate page where its features are compared against the open source algorithms.
While the table above can be useful to determine which algorithm to choose, the ultimate judgment should be based on an empirical evaluation with your specific document collection.
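One way to start such an empirical evaluation is to compute a few simple statistics for each algorithm's output on the same input and compare them side by side. The sketch below is a hypothetical helper, not part of Carrot2. It assumes you have already converted each algorithm's result into a mapping from cluster label to the set of document ids in that cluster.

```python
from collections import Counter


def summarize_clusters(clusters, total_documents):
    """Summarize one clustering result.

    clusters maps a cluster label to the set of document ids assigned to it;
    total_documents is the number of documents that were clustered.
    """
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    # How many clusters each document appears in.
    membership = Counter(doc for docs in clusters.values() for doc in docs)
    label_lengths = [len(label.split()) for label in clusters]
    return {
        "clusters": len(clusters),
        # Fraction of documents placed in at least one cluster.
        "coverage": len(membership) / total_documents,
        # Fraction of documents placed in more than one cluster.
        "overlap": sum(1 for n in membership.values() if n > 1) / total_documents,
        "mean_label_words": (
            sum(label_lengths) / len(label_lengths) if label_lengths else 0.0
        ),
    }
```

These numbers only describe structure (coverage, overlap, label length). They say nothing about whether the labels are meaningful for your users, so they complement rather than replace a manual review of the clusters.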
Tweaking parameters
The default settings that come with each algorithm are good for an average case. Each algorithm comes with a number of attributes which can be used to tune its output and runtime characteristics. You can easily adjust attributes in the Java API. A robust way to override attributes in the REST API is to use request model classes.
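As a rough illustration of overriding attributes outside Java, the sketch below builds a JSON clustering request body in Python without sending it. It rests on assumptions you should verify against the REST API documentation for your Carrot2 version: that the request is a JSON object with `algorithm`, `language`, `parameters` and `documents` fields, and that `desiredClusterCount` (used in the usage note) is a valid Lingo attribute. The function name `build_cluster_request` is hypothetical.

```python
import json


def build_cluster_request(documents, algorithm="Lingo", language="English",
                          parameters=None):
    """Serialize a clustering request body.

    Field names are assumed, not confirmed by this article; check them
    against the REST API reference before use.
    """
    body = {
        "algorithm": algorithm,
        "language": language,
        # Attribute overrides; an empty object keeps the algorithm defaults.
        "parameters": parameters or {},
        "documents": documents,
    }
    return json.dumps(body, indent=2)
```

For example, `build_cluster_request(docs, parameters={"desiredClusterCount": 10})` would produce a body that keeps every default except the one overridden attribute. Keeping overrides in one small dictionary makes it easier to record which settings produced which clusters while tuning.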
In order to achieve good cluster labels and high clustering quality it is vital to adjust the default language resources so that they exclude any common terms, phrases and expressions specific to:
- the selected language (the default resources should contain a reasonable number of these already),
- the domain of documents being clustered.
For example, when clustering documents from the medical domain, certain expressions and terms may be obvious and trivial within that domain. Forming clusters out of these wouldn't be of any value to the users. By excluding such expressions, the algorithms are guided to look for other, perhaps more interesting content.
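Candidates for domain-specific exclusion can be found by looking for terms that occur in a large share of the documents. The following Python sketch is a hypothetical helper, not part of Carrot2. The 50% default threshold and the simple lowercase ASCII tokenizer are assumptions chosen for illustration, and the output is meant to be reviewed by hand before anything is added to the lexical resources.

```python
import re
from collections import Counter


def frequent_terms(documents, min_fraction=0.5):
    """Return terms that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    threshold = min_fraction * len(documents)
    return sorted(t for t, n in doc_freq.items() if n >= threshold)
```

Terms reported this way will include ordinary language stop words as well as domain-trivial vocabulary, so the list is a starting point for review rather than a ready-made exclusion list.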
Lexical resources can be adjusted both in the Java API and in the REST API.