REST API Basics
This article will walk you through the methods offered by the Carrot2 HTTP/REST service.
Carrot2 DCS (document clustering server) is a set of resources that expose much of the Java API as a stateless HTTP/REST service. The DCS contains the following:
- HTTP REST service endpoints (/service context) for document clustering, dynamic inspection of service components and an OpenAPI descriptor,
- an example search frontend (/frontend context) for PubMed and a general-purpose meta search engine utilizing the service,
- a few Java examples that make use of model classes and query the service,
- Jetty HTTP server,
- this documentation.
In the examples below we will refer to the DCS and the REST service interchangeably, although the service application context can be separated and deployed on any other web application container, such as Apache Tomcat.
Installation
The DCS is shipped with the Carrot2 distribution under the dcs folder.
It requires Java to be installed on the host system and will bind to port 8080 by default. An alternative port can be selected by passing the --port option to the launch script, for example:
> dcs --port 8081
09:47:32: DCS context initialized [algorithms: [Bisecting K-Means, Lingo, STC], templates: [frontend-default, lingo, stc, bkmeans]]
09:47:32: Service started on port 8081.
Once started, the service is ready to accept requests, by default at http://localhost:8080/service/.
API workflow
The document clustering service is essentially a single, stateless endpoint accepting JSON requests and returning JSON responses.
A full clustering request is a JSON file containing the following elements:
- clustering algorithm identifier,
- language (language components) identifier,
- text fields of documents to be clustered.
Such a request file must be sent using the HTTP POST method to the /cluster service endpoint, which returns either a successful response containing clusters or an error response with some additional diagnostic information.
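As an illustration of this workflow, here is a minimal sketch of the POST call using the JDK's built-in HTTP client (Java 11 or later). The class and method names (ClusterCall, buildRequest, send) are hypothetical and not part of the DCS. The host, the /service context and the text/json content type are taken from the defaults used in this article, so adjust them to your deployment.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of the DCS distribution.
final class ClusterCall {
  // Builds a POST request for the /cluster endpoint of a DCS assumed to run on localhost.
  static HttpRequest buildRequest(int port, String jsonBody) {
    return HttpRequest.newBuilder(URI.create("http://localhost:" + port + "/service/cluster"))
        .header("Content-Type", "text/json") // same content type as the article's curl example
        .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
        .build();
  }

  // Sends the request and returns the response; the body holds either
  // the clusters or the diagnostic information of an error response.
  static HttpResponse<String> send(HttpRequest request) throws Exception {
    return HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```

With a DCS running, ClusterCall.send(ClusterCall.buildRequest(8080, body)) returns a response whose statusCode() and body() can then be inspected.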
Note that the request contains elements that may require some a priori knowledge, such as the clustering algorithm's identifier and the language. You can assume that certain components, such as the Lingo algorithm or the English language, always exist in your DCS distribution. Alternatively, you can enumerate the available components dynamically using the /list service endpoint.
Clustering
This section will go through a very basic example of a full request-response cycle.
Let's start by assembling the request JSON. We need to know the algorithm to be used for clustering and the language in which our documents are written, so that appropriate preprocessing is applied to the input text before clustering.
In this example we will use hardcoded values for the Lingo algorithm and the English language.
Documents for clustering are composed of one or more fields, where each field is a pair consisting of an identifier (name of the field) and value (a string or an array of strings). You should limit input documents to just those fields that should be clustered. In this example we will have three documents, each with one field:
[
  { "field": "foo bar" },
  { "field": "bar" },
  { "field": "baz" }
]
We have everything we need to put together the entire request body:
{
  "language": "English",
  "algorithm": "Lingo",
  "parameters": {
    "preprocessing": {
      "phraseDfThreshold": 1,
      "wordDfThreshold": 1
    }
  },
  "documents": [
    { "field": "foo bar" },
    { "field": "bar" },
    { "field": "baz" }
  ]
}
You probably noticed that there is one element in the above request we have not discussed yet: the parameters block. This block is used to alter Lingo's default settings. Our document list is tiny, so we use these settings to force the algorithm to produce at least one group and see what it looks like in the response.
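The same request body can also be assembled programmatically. The sketch below builds it with plain string concatenation and no JSON library. The names RequestBody, quote and build are hypothetical, the two threshold parameters are hardcoded to the values used above, and the escaping covers only quotes and backslashes. A real client would more likely use a JSON library.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the DCS distribution.
final class RequestBody {
  // Wraps a value in double quotes, escaping backslashes and quotes only;
  // control characters are not handled in this sketch.
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // Builds a clustering request in which every document has one field named "field".
  static String build(String algorithm, String language, List<String> texts) {
    String docs =
        texts.stream()
            .map(t -> "{ \"field\": " + quote(t) + " }")
            .collect(Collectors.joining(", "));
    return "{ \"language\": " + quote(language)
        + ", \"algorithm\": " + quote(algorithm)
        + ", \"parameters\": { \"preprocessing\": "
        + "{ \"phraseDfThreshold\": 1, \"wordDfThreshold\": 1 } }"
        + ", \"documents\": [" + docs + "] }";
  }
}
```

Calling RequestBody.build("Lingo", "English", List.of("foo bar", "bar", "baz")) yields a single-line equivalent of the request shown above.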
Assuming the DCS is running in the background, the clustering service's default endpoint is at http://localhost:8080/service/cluster.
Save the above JSON to a file (cluster-request.json in the command below) and send it for clustering using a command-line tool, such as curl:
curl -X POST --header "Content-Type: text/json" --data-binary @cluster-request.json "http://localhost:8080/service/cluster?indent"
Note that the MIME type for JSON must be properly set (Content-Type: text/json). The response received from the service should be something like this:
{
  "clusters" : [
    {
      "labels" : [
        "Bar"
      ],
      "documents" : [
        0,
        1
      ],
      "clusters" : [ ],
      "score" : 0.09746237168594427
    }
  ]
}
The response is a potentially recursive hierarchy of document clusters, where each cluster has the following properties:
- labels: Cluster description label or labels.
- documents: An array of references to documents contained in the cluster. Each reference is a 0-based index of the document within the clustering request.
- clusters: An array of subclusters of this cluster (if the algorithm supports hierarchical clustering).
- score: The cluster's quality score. The score is not normalized in any way but represents relative quality of each cluster within this request.
In the response above we see a single cluster of documents 0 and 1, labeled Bar.
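To show how the recursive structure and the 0-based document references fit together, the sketch below walks a cluster hierarchy depth-first and resolves each reference back to the input text. It assumes the response JSON has already been parsed into the hypothetical Cluster record (Java 16 or later). The record and the describe method are illustrative and are not the model classes shipped with the DCS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model and traversal, not the DCS model classes.
final class ClusterTree {
  // Mirrors the four properties of a cluster in the response.
  record Cluster(
      List<String> labels, List<Integer> documents, List<Cluster> clusters, double score) {}

  // Returns one line per cluster, indented by its depth in the hierarchy.
  static List<String> describe(List<Cluster> clusters, List<String> inputTexts, int depth) {
    List<String> lines = new ArrayList<>();
    for (Cluster c : clusters) {
      List<String> members = new ArrayList<>();
      for (int index : c.documents()) {
        // Each reference is a 0-based index into the request's document list.
        members.add(inputTexts.get(index));
      }
      lines.add("  ".repeat(depth) + String.join(", ", c.labels()) + " -> " + members);
      // Subclusters are described recursively, one level deeper.
      lines.addAll(describe(c.clusters(), inputTexts, depth + 1));
    }
    return lines;
  }
}
```

For the example response and the three input documents, describe returns the single line "Bar -> [foo bar, bar]".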
Request and response models
While it is perfectly fine to assemble the request JSON by hand, the DCS distribution comes with data model Java classes that can be used to build requests and parse responses. The example shown in the previous section can be expressed in Java code by the following snippet:
// Configure Lingo with the same thresholds as in the JSON request above.
LingoClusteringAlgorithm algorithm = new LingoClusteringAlgorithm();
algorithm.preprocessing.phraseDfThreshold.set(1);
algorithm.preprocessing.wordDfThreshold.set(1);

ClusterRequest request = new ClusterRequest();
request.algorithm = LingoClusteringAlgorithm.NAME;
request.language = "English";
// Copy the algorithm's settings into the request's parameters block.
request.parameters = Attrs.extract(algorithm);
// One document per input string, each with a single field named "field".
request.documents =
    Stream.of("foo bar", "bar", "baz")
        .map(
            value -> {
              ClusterRequest.Document doc = new ClusterRequest.Document();
              doc.setField("field", value);
              return doc;
            })
        .collect(Collectors.toList());
The request can then be serialized into JSON using the Jackson library. The DCS Java examples contain a few command-line applications that make extensive use of these model classes. Please refer to them for details.
Alternatively, the OpenAPI descriptor can be used to generate service binding code for Java and many other languages.
Service configuration
The second service endpoint the DCS exposes is called /list. When invoked with a single HTTP GET (without any parameters), the service returns information on the available algorithms and languages. An example response may look like this:
{
  "algorithms" : {
    "Bisecting K-Means" : [
      "Arabic",
      "English",
      ...
    ],
    "Lingo" : [
      ...
    ],
    "STC" : [
      ...
    ]
  },
  "templates" : [
    "frontend-default",
    "lingo",
    "stc",
    "bkmeans"
  ]
}
Note that each algorithm has an associated list of language codes it supports. The templates block enumerates preconfigured request templates.
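Querying this endpoint from code takes a single parameterless GET. The sketch below uses the JDK's HTTP client (Java 11 or later). The class ListCall and its methods are hypothetical, and the service base URL is assumed to be the default one used throughout this article.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of the DCS distribution.
final class ListCall {
  // Builds a GET request for the /list endpoint; serviceBase is expected to
  // end with a slash, for example "http://localhost:8080/service/".
  static HttpRequest buildRequest(String serviceBase) {
    return HttpRequest.newBuilder(URI.create(serviceBase + "list")).GET().build();
  }

  // Returns the raw JSON listing the available algorithms, languages and templates.
  static String fetch(String serviceBase) throws Exception {
    HttpResponse<String> response =
        HttpClient.newHttpClient()
            .send(buildRequest(serviceBase), HttpResponse.BodyHandlers.ofString());
    return response.body();
  }
}
```

A client could call ListCall.fetch("http://localhost:8080/service/") at startup and check that the algorithm and language it intends to use are present before sending clustering requests.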
OpenAPI service descriptor
The DCS comes with an OpenAPI service specification descriptor, by default accessible at http://localhost:8080/service/openapi/dcs.yaml. This descriptor contains documentation and working examples for all service endpoints and parameters.
The DCS ships with three OpenAPI specification browsers: