Carrot2 4.1.x

Release history for Carrot2 4.1.x and bugfix releases.

Version 4.1.0

This release changes the lexical data dictionary formats, adds ephemeral per-request dictionaries and introduces minor adjustments to the Java and REST APIs.

New features

Carrot2 Workbench

Carrot2 Clustering Workbench has been rewritten as a browser-based application.

Carrot2 Clustering Workbench, light theme.
Carrot2 Clustering Workbench, dark theme.

You can use Workbench to cluster documents from local XML, JSON, Excel and CSV files, as well as from Solr and Elasticsearch instances. A set of sliders is available to change clustering parameters in real time; you can also export the parameters as JSON, ready for pasting into REST API requests. Finally, you can export the clustering results as JSON or as an Excel spreadsheet.

#36

JSON dictionaries

Carrot2 word and label filtering dictionaries are now stored in the JSON format. This change adds more expressive matching modes, such as globs for simple phrase-level filtering or regular expressions for complete control of the filtering. Please refer to the dictionaries section for an in-depth overview of what's available.
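To illustrate how the three kinds of matching differ in expressiveness, here is a minimal Java sketch. It is not Carrot2 code: the class and method names are hypothetical, and the glob rule used here ("*" stands for zero or more whole words) is a simplifying assumption. The dictionaries section remains the reference for the actual syntax and semantics.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical helpers, not Carrot2 classes or rules. */
final class MatchingModesSketch {
  /** Exact mode: the whole label must equal the entry, ignoring case. */
  static boolean matchesExact(String entry, String label) {
    return entry.equalsIgnoreCase(label);
  }

  /** Glob mode (simplified assumption): "*" matches zero or more whole words. */
  static boolean matchesGlob(String glob, String label) {
    String[] tokens = glob.trim().toLowerCase(Locale.ROOT).split("\\s+");
    String[] words = label.trim().toLowerCase(Locale.ROOT).split("\\s+");
    return match(tokens, 0, words, 0);
  }

  private static boolean match(String[] tokens, int ti, String[] words, int wi) {
    if (ti == tokens.length) {
      return wi == words.length; // All tokens used: succeed only if no words remain.
    }
    if (tokens[ti].equals("*")) {
      // Let the wildcard consume 0, 1, ... up to all remaining words.
      for (int next = wi; next <= words.length; next++) {
        if (match(tokens, ti + 1, words, next)) {
          return true;
        }
      }
      return false;
    }
    return wi < words.length
        && tokens[ti].equals(words[wi])
        && match(tokens, ti + 1, words, wi + 1);
  }

  /** Regex mode: the pattern must match the entire label. */
  static boolean matchesRegex(String regex, String label) {
    return Pattern.compile(regex).matcher(label).matches();
  }
}
```

In this sketch, the glob "more *" covers both "more" and "more information about", while a regular expression such as "(?i)page \\d+" can express constraints that word-level wildcards cannot.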

As a follow-up, the plain-text dictionaries have been deprecated and the file naming convention for the default dictionary files has changed. A dictionary file conversion utility is available.

#51

Per-request dictionaries

Support for per-request (ephemeral) label and word filtering has been added. This feature allows you to pass word and cluster label filters with an individual request; they are applied in addition to the default language resources. See the ephemeral dictionary sections of the Java API and REST API documentation for more information.
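The "in addition to" behavior can be pictured as a logical OR of two filters. The sketch below is a conceptual illustration only: the class and method names are hypothetical and do not correspond to the Carrot2 Java API, and it assumes filter entries are stored in lower case.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative only: hypothetical names, not part of the Carrot2 API. */
final class PerRequestFilterSketch {
  /** Builds a predicate that returns true for words that should be ignored. */
  static Predicate<String> ignoreWords(Set<String> lowerCaseWords) {
    return word -> lowerCaseWords.contains(word.toLowerCase(Locale.ROOT));
  }

  /** A word is ignored if either the default or the per-request filter ignores it. */
  static Predicate<String> combine(Predicate<String> defaults, Predicate<String> perRequest) {
    return defaults.or(perRequest);
  }
}
```

With defaults built from {"the", "of"} and a per-request filter built from {"carrot2"}, the combined predicate ignores all three words for that request, while the defaults alone are unaffected for other requests.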

#44

API changes

Plain text dictionaries deprecated

As a follow-up to the new JSON dictionaries feature, the plain-text dictionary format has been deprecated.

The file naming convention for default language resources has changed. For backward compatibility, if old resources are found in the resource lookup location, they will still be used and a warning will be issued via the Java logging system.

If you have language resources in the old format, please convert them to the JSON format. A simple conversion utility is included in the Carrot2 core JAR. Run it with:

java -cp carrot2-core-4.1.0.jar org.carrot2.language.ConvertLegacyResources [dir]

Here, dir points to the directory containing the old resources. New resources, named according to the new convention, will be written alongside the old ones. The old resources must be deleted manually once the conversion completes successfully.

#51

More details from /list method

The /service/list endpoint of the REST API now returns the language and algorithm for all of the available request templates.

The response format of the endpoint has changed. Previously, the templates element was a list of template names; it now contains an object with template names as keys and template content as values, for example:

...
"templates" : {
  "english-lingo" : {
    "language" : "English",
    "algorithm" : "Lingo"
  },
  "stc" : {
    "algorithm" : "STC"
  }
}
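A client that needs to work with servers both before and after this change could branch on the shape of the templates element. The following is a minimal sketch, assuming the response has already been parsed by a JSON library that maps arrays to java.util.List and objects to java.util.Map. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: reads an already-parsed "templates" value in either format. */
final class TemplateListSketch {
  /** Returns template names from the old (list) or new (object) format. */
  static List<String> templateNames(Object templates) {
    List<String> names = new ArrayList<>();
    if (templates instanceof Map) {
      // 4.1.0 format: template names are the keys of an object.
      for (Object key : ((Map<?, ?>) templates).keySet()) {
        names.add(String.valueOf(key));
      }
    } else if (templates instanceof List) {
      // Earlier format: a plain list of template names.
      for (Object name : (List<?>) templates) {
        names.add(String.valueOf(name));
      }
    }
    return names;
  }

  /** Returns the template's "algorithm" value, or null if it is not available. */
  static String algorithmOf(Object templates, String templateName) {
    if (templates instanceof Map) {
      Object template = ((Map<?, ?>) templates).get(templateName);
      if (template instanceof Map) {
        Object algorithm = ((Map<?, ?>) template).get("algorithm");
        return algorithm == null ? null : algorithm.toString();
      }
    }
    return null; // Old format or unknown template: no details to report.
  }
}
```

Note that a template may omit a key, as the stc template above omits language, so callers should treat both language and algorithm as optional.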
#38

Lingo filter parameter change

The Lingo algorithm's filter parameters have been changed from Booleans to objects with a dedicated enabled parameter. Unless you set these parameters explicitly, no action is needed.

#43

LexicalData interface split

The LexicalData interface (LanguageComponents component) has been split into two independent components: StopwordFilter and LabelFilter. The default implementations and abstract classes have been changed accordingly.

#45

Improvements

GZIP compression

The built-in REST API server now supports GZIP compression.
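Compression is normally negotiated per request: the client sends an Accept-Encoding: gzip header, and the server marks a compressed response with Content-Encoding: gzip. Java's built-in java.net.http.HttpClient does not decompress response bodies on its own, so a client may need a small helper like the sketch below. The class and method names are hypothetical, and the helper assumes the body is UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/** Illustrative only: decodes a response body based on its Content-Encoding header. */
final class GzipBodySketch {
  /**
   * @param body raw response bytes
   * @param contentEncoding value of the Content-Encoding header, or null if absent
   */
  static String decodeBody(byte[] body, String contentEncoding) throws IOException {
    InputStream in = new ByteArrayInputStream(body);
    if ("gzip".equalsIgnoreCase(contentEncoding)) {
      in = new GZIPInputStream(in); // Decompress only when the server says it compressed.
    }
    try (InputStream stream = in) {
      return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    }
  }
}
```

Checking the response header rather than assuming compression keeps the client working against servers or proxies that return an uncompressed body.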

#66

Request processing information

Clustering and request processing times have been added to the clustering response. This information is optional and is returned when the serviceInfo HTTP parameter is enabled on a clustering request.

#35

Java module system support improved

Improved support for the Java module system by providing the Automatic-Module-Name entry in JAR manifests.

#59

Bug fixes

Clustering of multi-value fields fails

Carrot2 4.0.x fails to cluster documents containing multi-value fields (array of strings). Version 4.1.0 fixes the issue.

#34

Infrastructure

Dependency updates

Third-party dependencies have been upgraded to their newest versions:

  • Apache Lucene: 8.7.0
  • Jetty: v20201120
  • Jackson: 2.12.0

#61 #33

IntelliJ experience

Only recompile code under IntelliJ IDEA to improve developer experience.

#37