REST API Basics

This article will walk you through the methods offered by the Carrot2 HTTP/REST service.

Carrot2 DCS (document clustering server) is a set of resources that expose much of the Java API as a stateless HTTP/REST service. The DCS contains the following:

  • HTTP REST service endpoints (/service context) for document clustering, dynamic inspection of service components and an OpenAPI descriptor,
  • an example search frontend (/frontend context) for PubMed and a general-purpose meta search engine utilizing the service,
  • a few Java examples that make use of model classes and query the service,
  • Jetty HTTP server,
  • this documentation.

In the examples below we will refer to the DCS and the REST service interchangeably, although the service application context can be separated and deployed on any other web application container, such as Apache Tomcat.

Installation

The DCS is shipped with the Carrot2 distribution under the dcs folder. It requires Java to be installed on the host system and binds to port 8080 by default. An alternative port can be selected by passing the --port option to the launch script, for example:

> dcs --port 8081
09:47:32: DCS context initialized [algorithms: [Bisecting K-Means, Lingo, STC], templates: [frontend-default, lingo, stc, bkmeans]]
09:47:32: Service started on port 8081.

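Once started, the service is ready to accept requests. With the default port the address is http://localhost:8080/service/; an instance launched with --port 8081, as above, listens on port 8081 instead.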

API workflow

The document clustering service is essentially a single, stateless endpoint accepting JSON requests and returning JSON responses.

A full clustering request is a JSON file containing the following elements:

  • clustering algorithm identifier,
  • language (language components) identifier,
  • text fields of documents to be clustered.

Such a request file must be sent using the HTTP POST method to the /cluster service endpoint, which returns either a successful response containing clusters or an error response with additional diagnostic information.

Note that the request contains elements that may require some a priori knowledge, such as the clustering algorithm's identifier and the language. You can assume that certain components, such as the Lingo algorithm or the English language, always exist in your DCS distribution. Alternatively, you can enumerate the available components dynamically using the /list service endpoint.

Clustering

This section will go through a very basic example of a full request-response cycle.

Let's start by assembling the request JSON. We need to know the algorithm to be used for clustering and the language in which our documents are written, so that appropriate preprocessing is applied to the input text before clustering. In this example we will use hardcoded values for the Lingo algorithm and the English language.

Documents for clustering are composed of one or more fields, where each field is a pair consisting of an identifier (name of the field) and value (a string or an array of strings). You should limit input documents to just those fields that should be clustered. In this example we will have three documents, each with one field:

[
  { "field": "foo bar" },
  { "field": "bar" },
  { "field": "baz" }
]

We have everything we need to put together the entire request body:

{
  "language": "English",
  "algorithm": "Lingo",
  "parameters": {
    "preprocessing": {
      "phraseDfThreshold": 1,
      "wordDfThreshold": 1
    }
  },
  "documents": [
    { "field": "foo bar" },
    { "field": "bar" },
    { "field": "baz" }
  ]
}

You probably noticed one element in the above request that we have not discussed yet: the parameters block. This block alters Lingo's default settings. Our document list is tiny, so the request sets both preprocessing thresholds to 1, forcing the algorithm to produce at least one group so that we can see what it looks like in the response.
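If you prefer to assemble the same request body in code without the DCS model classes, the following is a minimal sketch that uses only the Java standard library. The class and method names (RequestBodySketch, quote, build) are hypothetical and not part of Carrot2; the field name "field" and the two thresholds are taken from the request above.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch: builds a /cluster request body by hand (hypothetical helper, not a Carrot2 class). */
final class RequestBodySketch {

  /** Wraps a string in quotes and escapes the characters JSON does not allow verbatim. */
  static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters become a four-digit hexadecimal escape.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }

  /** Each text becomes one document with a single field named "field". */
  static String build(
      String language, String algorithm, int phraseDf, int wordDf, List<String> texts) {
    String docs = texts.stream()
        .map(t -> "{\"field\":" + quote(t) + "}")
        .collect(Collectors.joining(","));
    return "{\"language\":" + quote(language)
        + ",\"algorithm\":" + quote(algorithm)
        + ",\"parameters\":{\"preprocessing\":{\"phraseDfThreshold\":" + phraseDf
        + ",\"wordDfThreshold\":" + wordDf + "}}"
        + ",\"documents\":[" + docs + "]}";
  }

  public static void main(String[] args) {
    System.out.println(build("English", "Lingo", 1, 1, List.of("foo bar", "bar", "baz")));
  }
}
```

The output is a compact, single-line equivalent of the request shown above. For anything beyond a quick experiment, a JSON library is likely a safer choice than manual string concatenation.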

Assuming the DCS is running in the background, the clustering service's default endpoint is at http://localhost:8080/service/cluster. We are ready to send the above JSON for clustering using a command-line tool, such as curl:

curl -X POST --header "Content-Type: text/json" --data-binary @cluster-request.json "http://localhost:8080/service/cluster?indent"

Note that the MIME type for JSON must be set correctly (Content-Type: text/json). The response received from the service should look something like this:

{
  "clusters" : [
    {
      "labels" : [
        "Bar"
      ],
      "documents" : [
        0,
        1
      ],
      "clusters" : [ ],
      "score" : 0.09746237168594427
    }
  ]
}

The response is a potentially recursive hierarchy of document clusters, where each cluster has the following properties:

labels
Cluster description label or labels.
documents
An array of references to documents contained in the cluster. Each reference is a 0-based index of the document within the clustering request.
clusters
An array of subclusters of this cluster (if the algorithm supports hierarchical clustering).
score
The cluster's quality score. The score is not normalized in any way but represents relative quality of each cluster within this request.

In the response above we see a single cluster of documents 0 and 1, labeled Bar.
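Because clusters can nest, client code typically walks the response recursively. The sketch below illustrates one way to do that. The Cluster class is a hypothetical holder that mirrors the four properties described above; it is not the DCS model class, and parsing the JSON into it is left to a JSON library.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: recursive traversal of a cluster hierarchy (hypothetical types, not the DCS model). */
final class ClusterWalkSketch {

  /** Mirrors the labels, documents, clusters and score properties of a response cluster. */
  static final class Cluster {
    final List<String> labels;
    final List<Integer> documents;
    final List<Cluster> clusters;
    final double score;

    Cluster(List<String> labels, List<Integer> documents, List<Cluster> clusters, double score) {
      this.labels = labels;
      this.documents = documents;
      this.clusters = clusters;
      this.score = score;
    }
  }

  /** Appends one line per cluster, indented by nesting depth, then descends into subclusters. */
  static void describe(List<Cluster> clusters, int depth, List<String> out) {
    for (Cluster c : clusters) {
      out.add("  ".repeat(depth) + String.join(", ", c.labels) + " " + c.documents);
      describe(c.clusters, depth + 1, out);
    }
  }

  public static void main(String[] args) {
    // The single cluster from the example response.
    Cluster bar = new Cluster(List.of("Bar"), List.of(0, 1), List.of(), 0.09746237168594427);
    List<String> lines = new ArrayList<>();
    describe(List.of(bar), 0, lines);
    lines.forEach(System.out::println); // prints "Bar [0, 1]"
  }
}
```

The document indices printed for each cluster refer back to positions in the documents array of the request, so the caller needs to keep the original document list at hand to resolve them.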

Request and response models

While it is perfectly fine to assemble the request JSON by hand, the DCS distribution comes with data model Java classes that can be used to build requests and parse responses. The example shown in the previous section can be expressed in Java code by the following snippet:

// Configure a Lingo instance with the same thresholds as the JSON example.
LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
algorithm.preprocessing.phraseDfThreshold.set(1);
algorithm.preprocessing.wordDfThreshold.set(1);

ClusterRequest request = new ClusterRequest();
request.algorithm = LingoClusteringAlgorithm.NAME;
request.language = "English";
// Convert the configured algorithm into the request's parameters block.
request.parameters = Attrs.extract(algorithm);
// One document per input string, each with a single field named "field".
request.documents =
    Stream.of("foo bar", "bar", "baz")
        .map(
            value -> {
              ClusterRequest.Document doc = new ClusterRequest.Document();
              doc.setField("field", value);
              return doc;
            })
        .collect(Collectors.toList());

The request can then be serialized into JSON using the Jackson library. The DCS Java examples contain a few command-line applications that make extensive use of these model classes; please refer to them for details.

Alternatively, the OpenAPI descriptor can be used to generate service binding code for Java and many other languages.
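However the request JSON is produced, it still has to be posted to the /cluster endpoint. The following sketch shows one way to do this with the java.net.http client available since Java 11, mirroring the curl call from the Clustering section. The class and method names (ClusterCallSketch, clusterRequest) are hypothetical, and the service address assumes a DCS running locally on the default port.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Sketch: posts a request body to the DCS /cluster endpoint (hypothetical helper). */
final class ClusterCallSketch {

  /** Builds the POST request; serviceUrl is the service context, e.g. http://localhost:8080/service. */
  static HttpRequest clusterRequest(String serviceUrl, String json) {
    return HttpRequest.newBuilder(URI.create(serviceUrl + "/cluster"))
        .header("Content-Type", "text/json") // same MIME type as in the curl example
        .POST(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
        .build();
  }

  public static void main(String[] args) throws InterruptedException {
    // The request body from the Clustering section, written as a single string.
    String json =
        "{\"language\":\"English\",\"algorithm\":\"Lingo\","
            + "\"parameters\":{\"preprocessing\":{\"phraseDfThreshold\":1,\"wordDfThreshold\":1}},"
            + "\"documents\":[{\"field\":\"foo bar\"},{\"field\":\"bar\"},{\"field\":\"baz\"}]}";

    HttpRequest request = clusterRequest("http://localhost:8080/service", json);
    try {
      HttpResponse<String> response =
          HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println("HTTP " + response.statusCode());
      System.out.println(response.body());
    } catch (IOException e) {
      // Typically means no DCS is listening at the assumed address.
      System.out.println("Could not reach the DCS: " + e);
    }
  }
}
```

Separating request construction from sending keeps the first part easy to check without a running server. The indent query parameter used with curl is omitted here, so the response body may come back without pretty-printing.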

Service configuration

The second service endpoint the DCS exposes is called /list. When invoked with a single HTTP GET (without any parameters), the service returns information on the available algorithms and languages. An example response can look like this:

{
  "algorithms" : {
    "Bisecting K-Means" : [
      "Arabic",
      "English",
      ...
    ],
    "Lingo" : [
      ...
    ],
    "STC" : [
      ...
    ]
  },
  "templates" : [
    "frontend-default",
    "lingo",
    "stc",
    "bkmeans"
  ]
}

Note that each algorithm has an associated list of languages it supports. The templates block enumerates preconfigured request templates.
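A client can use this response to check that an algorithm and language combination is available before sending a clustering request. The sketch below assumes the algorithms block has already been parsed into a map by a JSON library; the class and method names (ListEndpointSketch, listRequest, supports) are hypothetical, and the sample data contains only the entries visible in the response above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.Map;

/** Sketch: working with the /list endpoint (hypothetical helper, not a Carrot2 class). */
final class ListEndpointSketch {

  /** Builds a parameterless GET request for the /list endpoint. */
  static HttpRequest listRequest(String serviceUrl) {
    return HttpRequest.newBuilder(URI.create(serviceUrl + "/list")).GET().build();
  }

  /** True if the parsed "algorithms" block lists the language under the given algorithm. */
  static boolean supports(
      Map<String, List<String>> algorithms, String algorithm, String language) {
    return algorithms.getOrDefault(algorithm, List.of()).contains(language);
  }

  public static void main(String[] args) {
    // Only the entries shown in the example response; the elided ones are left out.
    Map<String, List<String>> algorithms =
        Map.of("Bisecting K-Means", List.of("Arabic", "English"));

    System.out.println(listRequest("http://localhost:8080/service").uri());
    System.out.println(supports(algorithms, "Bisecting K-Means", "English"));
    System.out.println(supports(algorithms, "Bisecting K-Means", "Klingon"));
  }
}
```

An unknown algorithm name is treated the same as an unsupported language: supports returns false in both cases.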

OpenAPI service descriptor

The DCS comes with an OpenAPI service specification descriptor, by default accessible at http://localhost:8080/service/openapi/dcs.yaml. This descriptor contains documentation and working examples for all service endpoints and parameters.

The DCS ships with three OpenAPI specification browsers:

RapiDoc's representation of DCS's OpenAPI descriptor.
