The clustering plugin attempts to automatically group together similar "documents" and assign human-readable labels to these groups. The clusters can be thought-of as "dynamic facets" generated for each unique query and set of search result hits. Take a look at the Carrot2 demo page to see how this can be used in practice.
Each document passed for clustering is composed of several logical parts: the document's identifier, title, main content and language code (only the identifier field is mandatory, everything else is optional). Either the title, main content or both should be provided to make the clustering process reasonable, otherwise there will be no input for the algorithm to work on.
Read this section first, it contains important information about clustering which will help understand what's going on behind the scenes.
Documents indexed in Elasticsearch do not have to follow any predefined schema, so sections of the input JSON document need to be mapped to the logical layout required by the clustering plugin. An example mapping can look as illustrated in the figure below:
Two document fields are mapped to the title. This is not an error: any number of fields can be mapped to either the title or the content—values of those fields will be concatenated and used for clustering.
Another way to provide the title and content fields is to use the highlighter. This can be useful to decrease the amount of text passed to the clustering algorithm (improves performance) or to make the content of documents passed for clustering more focused on the query (because highlights typically provide the context of query terms).
The REST API examples below demonstrate the API for field mapping in details.
The Java API for clustering search results implements all the backend functionality behind all REST requests described in the following parts of this document. For actual code utilizing the API see the source code of the plugin at github, especially the unit and integration tests.
The HTTP REST API contains several methods closely reflecting the Java-side API's functionality. Each of these methods is described in detail below.
/_algorithms
(GET
or POST
)This action lists all available clustering algorithms. The returned identifiers can be used as a parameter to the clustering request.
A request to list the available algorithms is a simple GET
or POST
request to
/_algorithms
URL.
The response is a JSON object with an algorithms
property which is a non-empty
array of algorithm identifiers. The following example shows the algorithms available
on this plugin instance. The default algorithm is the first one on the list.
$.get(window.ES_URL + "/_algorithms", function(response) { $("#list-of-algorithms").text( JSON.stringify(response, null, " ")); });
/_search_with_clusters
(POST
, GET
)/{index}/_search_with_clusters
(POST
, GET
)/{index}/{type}/_search_with_clusters
(POST
, GET
)This action performs a search query, fetches matching hits and clusters them on-the-fly.
The index
and type
URI segments implicitly bind the search request part of the
message to a given index and document type, exactly as in the
search request API.
A clustering request is a HTTP REST request, where the full set of parameters is supported via HTTP POST request with a JSON body. A limited subset of clustering functionality is also exposed via the HTTP GET method (and URL parameters).
A HTTP POST request should contain a JSON object with the following properties.
search_request
required The search request to fetch documents to be clustered. This section follows exactly what the search DSL specifies, including all optional bells and whistles such as sorting, filtering, query DSL, highlighter, etc.
query_hint
required This attribute specifies query terms used to fetch the matching documents. This hint helps the clustering algorithm avoid trivial clusters that would be created around those frequent query terms. Typically the value of this parameter is identical to what the user typed in the search box. If possible, it should be pruned from any boolean or search-engine specific operators which could affect the clustering process. The query hint is obligatory but may be an empty string.
field_mapping
required
Defines how to map actual fields of the documents matching the search_request
to
logical fields of the documents to be clustered. The value should be an object where keys indicate logical
document fields and values are arrays with field source specifications (content of fields
defined by these specifications is concatenated). For example this is a valid field mapping specification:
{ "title": [_source.subject], "content": [_source.abstract, highlight.main], "language": [_source.lang] }
Any of the following logical document field names are valid:
title
The title of the document.
content
The main body (content) of the document.
language
Optional language name for the title and content of a document. The language must be
consistent with language components installed and available in Carrot2. The
list of supported languages is logged at startup. Note that unsupported languages will
default to the main language of the request (language
request parameter).
The field source specification defines where the value is taken from: the search hit's fields, stored document's content, or from the highlighter's output. The syntax of field source specification is as follows:
_source.{fieldname}
highlight.{fieldname}
fields.{fieldname}
_source
syntax.
algorithm
optional Defines which clustering algorithm should be used for clustering. Names of all built-in clustering algorithms are logged at startup and are also returned from the list algorithms request. The default algorithm is used, if value is empty or not provided.
max_hits
optional
If set to a non-negative number, the clustering response will be limited to
contain only a maximum of the given search hits. The clustering will still run
on a full window of results returned by the original search request. This option
may be useful to decrease the size of clustering response; in an extreme case of
max_hits
equal o 0, only clusters are returned
and they can be used as facets (for refining the query, but without the immediate
link to the search hits).
Note that clusters may still reference documents not present in the returned (trimmed) hits window.
create_ungrouped
optional
If set to true
, the documents not assigned to any cluster will form a synthetic
group appended at the end of all clusters. This is effectively the "other topics" group
created by Carrot2 versions prior to 4.0.
language
optional
Declares the default language for documents where field mapping for the logical
language
field is not declared (or empty). The default value of this field
is English
.
attributes
optional A JSON object with attributes overriding the default algorithm settings (per-query or runtime attributes in Carrot2 parlance).
Very important
Clustering requires at least a few dozen documents (hits) in order to make
sense. The clustering plugin clusters search results only (it does not look in the index, it does not
fetch additional documents). Make sure to specify the size
of the
fetch window to be at least 100 documents. If your application does not need so many hits
(document references), the response size can be trimmed by using max_hits
parameter on
the clustering request.
A HTTP GET clustering request supports a superset of HTTP URI parameters defined by Elasticsearch's URI search request. All additional parameters correspond to those typically defined in the body of a clustering request sent via HTTP POST. Namely, the following parameters are supported by HTTP GET:
field_mapping_*
required
This is a wildcard (a family) of parameters, each of which defines a logical field mapping, similar
to field_mapping
map described in the HTTP POST request. A field_mapping_title
will specify the logical title's mapping, wheareas field_mapping_content
will specify
the logical document content mapping.
The value of the mapping parameter is a comma-separated list of mapping specifications, as described in the description of the POST request.
algorithm
optional
Identical semantics to algorithm
attribute described in HTTP POST request.
query_hint
optional
Identical semantics to query_hint
attribute described in HTTP POST request.
For GET requests the query hint is optional; if not present, the q
attribute is used
as the default.
create_ungrouped
optional
If set to true
, the documents not assigned to any cluster will form a synthetic
group appended at the end of all clusters. This is effectively the "other topics" group
created by Carrot2 versions prior to 4.0.
language
optional
Declares the default language for documents where field mapping for the logical
language
field is not declared (or empty). The default value of this field
is English
.
Important
A HTTP GET request offers a subset of the functionality of a full HTTP POST JSON syntax. For example, it is not possible to specify a field mapping to highlighted field values, define custom algorithm attributes, etc. HTTP POST is recommended for production.
An example HTTP GET clustering request is shown below, with the resulting clusters shown on the right-hand side panel.
var getUrl = window.ES_URL + "/test/test/_search_with_clusters?" + "q=data+mining&" + "size=100&" + "field_mapping_title=_source.title&" + "field_mapping_content=_source.content"; // Run HTTP GET via jquery and render cluster labels. $.get(getUrl, function(response) { $("#cluster-httpget-result").text( dumpClusters([], response.clusters).join("\n")); });
The response format is identical to a plain search request response, with extra properties presented in the schematic output below.
{ /* Typical search response fields. */ "hits": { /* ... */ }, /* Clustering response fields. */ "clusters": [ /* Each cluster is defined by the following. */ { "id": /* identifier */, "score": /* numeric score */, "label": /* primary cluster label */, "other_topics": /* if present, and true, this cluster groups unrelated documents (no related topics) */, "phrases": [ /* cluster label array, will include primary. */ ], "documents": [ /* This cluster's document ID references. May be undefined if this cluster holds sub-clusters only. */ ], "clusters": [ /* This cluster's subclusters (recursive objects of the same structure). May be undefined if this cluster holds documents only. */ ], }, /* ...more clusters */ ], "info": { /* Additional information about the clustering: execution times, the algorithm used, etc. */ } }
Given the following function that recursively dumps clusters:
window.dumpClusters = function(arr, clusters, indent) { indent = indent ? indent : ""; clusters.forEach(function(cluster) { arr.push( indent + cluster.label + (cluster.documents ? " [" + cluster.documents.length + " documents]" : "") + (cluster.clusters ? " [" + cluster.clusters.length + " subclusters]" : "")); if (cluster.clusters) { dumpClusters(arr, cluster.clusters, indent + " "); } }); return arr; }
We can dump all cluster labels of a clustering request with the following snippet of javascript:
var request = { "search_request": { "query": {"match" : { "content": "data mining" }}, "size": 100 }, "max_hits": 0, "query_hint": "data mining", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"] } }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#cluster-list-result").text( dumpClusters([], response.clusters).join("\n")); });
The output will vary depending on the choice of clustering algorithm
(and particular documents that made it to the hit list if search is not
deterministic). The following example shows the same result as above
but uses the STC
algorithm. We don't need every search hit here
so we will omit them in the response.
var request = { "search_request": { "query": {"match" : { "content": "data mining" }}, "size": 100 }, "max_hits": 0, "query_hint": "data mining", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"] }, "algorithm": "STC" }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#cluster-list-result2").text( dumpClusters([], response.clusters).join("\n")); });
A full response for a clustering request can look as shown below (note the difference in field mapping in this example).
var request = { "search_request": { "_source": [ "title", "content" ], "query": {"match" : { "content": "data mining" }}, "size": 100 }, "max_hits": 0, "query_hint": "data mining", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"] } }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#simple-request-result").text( JSON.stringify(response, false, " ")); });
The field mapping section provides a connection between actual data and logical data
to cluster on. The different field mapping sources
(_source.*
,
highlight.*
and
fields.*
) can be used to tune the amount of data returned in the request and
the amount of text passed to the clustering engine (and in result the required processing cost).
The _source.*
mapping takes data directly from the source document, if _source
is available as part of the search hit. The content pointed to by this mapping is not returned as
part of the request, it is only used internally for clustering.
Important!
Partial _source
can be configured using
search request filtering attributes.
The fields.*
mapping requires stored fields in the schema. It is superseded by
_source.*
filtering since ES 5.x release.
The highlight.*
mapping also must be accompanied by appropriate
highlight
declaration in the search request. The highlighting request specification can be used to tune the
amount of content passed to the clustering engine (the number of fragments, their width, boundary, etc.). This
is of particular importance when the documents are long (full content is stored): it is typical that
clustering algorithms run perceptually "better" when focused on the context surrounding
the query, rather than when presented with full content of all documents.
Any highlighted content will also be returned as part of the request.
Compare the clustering output for the following two requests. One is clustering on full
content of the content
fields, the other just on highlighted snippets
around the mining
term from the query (highlights come from the
same content
field).
var request = { "search_request": { "_source": ["url", "title", "content"], "query": { "query_string" : { "default_field": "content", "query": "mining" } }, "size": 100 }, "query_hint": "mining", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"] } }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#fields-request").text(JSON.stringify(response, false, " ")); $("#fields-request-clusters").text(dumpClusters([], response.clusters).join("\n")); });
var request = { "search_request": { "_source": ["url", "title"], "query": { "query_string" : { "default_field": "content", "query": "mining" } }, "size": 100, "highlight" : { "pre_tags" : ["", ""], "post_tags" : ["", ""], "fields" : { "content" : { "fragment_size" : 50, "number_of_fragments" : 2 } } }, }, "query_hint": "mining", "field_mapping": { "title": ["_source.title"], "content": ["highlight.content"] } }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#highlight-request").text(JSON.stringify(response, false, " ")); $("#highlight-request-clusters").text(dumpClusters([], response.clusters).join("\n")); });
The clustering plugin comes with several open-source algorithms from the Carrot2 project. An extension plugin must be installed separately to add support for the commercial Lingo3G clustering algorithm.
The question of which algorithm to choose depends on the amount of traffic (STC is faster than Lingo, but arguably produces less intuitive clusters, Lingo3G is the fastest algorithm but is not free or open source), expected result (Lingo3G provides hierarchical clusters, Lingo and STC provide flat clusters), and the input data (each algorithm will cluster the input slightly differently). There is no single answer as to which algorithm is "the best".
Compare the clusters dumped for the following identical search request.
var request = { "search_request": { "query": {"match" : { "content": "data mining" }}, "size": 100 }, "query_hint": "data mining", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"] }, "algorithm": "Lingo" }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#request-algorithm1").text(dumpClusters([], response.clusters).join("\n")); });
var request = { "search_request": { "query": {"match" : { "content": "data mining" }}, "size": 100 }, "query_hint": "data mining", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"] }, "algorithm": "STC" }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#request-algorithm2").text(dumpClusters([], response.clusters).join("\n")); });
By default algorithms run with their default settings. There is currently no way of preconfiguring algorithm defaults. Use per-request overrides.
Clustering algorithms come with many attributes that modify their runtime behavior. If desired, certain attributes can be modified per-request. The following example shows how to modify the number of desired clusters randomly (execute the example a few times to see the difference).
var request = { "search_request": { "query": {"match" : { "content": "data mining" }}, "size": 100 }, "query_hint": "data mining", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"] }, "algorithm": "Lingo", "attributes": { "desiredClusterCount": Math.round(5 + Math.random() * 5) } }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#request-attributes").text(dumpClusters([], response.clusters).join("\n")); });
Each clustering algorithm will have a different set of parameters. Refere to Carrot2 releases as they come with extensive documentation of those parameters.
The field mapping specification can include a language
element, which
defines the name of the Carrot2 language component
to use for clustering. This information can be stored in the index based on apriori knowledge of
the documents' source or a language detection filter applied at indexing time.
The language hint makes it easier for clustering algorithms to separate documents
from different languages on input and to pick the right language resources for clustering.
If you do have multi-lingual query results (or query results in a language different
than English), it is strongly advised to map the language
field
appropriately.
Names of all supported languages are printed at clustering plugin's startup.
Documents with language codes not available in Carrot2 will default to the
language
parameter of the entire request.
The following example demonstrates clustering of documents in multiple languages. Some documents are in German, some are in English. Note the default language is set to "Italian" (but each document declares its own corresponding language so it does not matter).
var request = { "search_request": { "query": {"match_all" : {}}, "size": 100 }, "query_hint": "bundestag", "algorithm": "Lingo", "language": "Italian", "field_mapping": { "title": ["_source.title"], "content": ["_source.content"], "language": ["_source.lang"] }, "attributes": { } }; $.post(window.ES_URL + "/test/test/_search_with_clusters", JSON.stringify(request), function(response) { $("#language-fieldmapping").text(dumpClusters([], response.clusters).join("\n")); });
The plugin comes with sensible defaults out of the box and should require no additional configuration. Customize only if really necessary.
The following configuration properties can be tweaked at the global ES configuration level.
{es.home}/config/elasticsearch.yml
,{es.home}/config/elasticsearch.json
,{es.home}/config/elasticsearch.properties
The main ES configuration file can be used to enable/ disable the plugin and to tweak the resources assigned to clustering requests.
carrot2.enabled
false
disables the plugin, even if it is installed.threadpool.search.*
The following configuration files and properties can be found inside the
plugin's ZIP file (elasticsearch/config/
folder) or
under {es.home}/config/elasticsearch-carrot2/
after
plugin installation).
carrot2.yml
,carrot2.json
,carrot2.properties
The master configuration file for the plugin.
resources
Resource lookup path for loading language component and other resources.
Note that if this option is defined, all required resources must be present under this location. No other location will be scanned (no defaults from classpath).
If Lingo3G extension plugin is installed and the algorithm is available, the license needs to be installed at any of the following locations.
{es.home}/config/license*.xml
,{plugin.zip}/elasticsearch/config/license*.xml