Carrot2

User and Developer Manual

for version 3.17.0-SNAPSHOT

Abstract

This document serves as documentation for the Carrot2 framework. It describes the Carrot2 application suite and the API that developers can use to integrate Carrot2 clustering algorithms into their code. It also provides a reference of all Carrot2 components and their attributes.

Carrot2 Online Demo: http://search.carrot2.org
Carrot2 website: http://project.carrot2.org


Table of Contents

1. Introduction
2. FAQ
2.1. Is Carrot2 suitable for me?
2.2. How do I use Carrot2?
2.3. How can I improve clustering?
3. Tools and APIs
3.1. Carrot2 Document Clustering Workbench
3.2. Carrot2 Java API
3.3. Carrot2 C# API
3.4. Carrot2 Document Clustering Server
3.5. Carrot2 Web Application
3.6. Carrot2 Command Line Interface
3.7. Carrot2 clustering in Apache Solr
3.8. Carrot2 clustering in ElasticSearch
4. Getting started
4.1. Requirements
4.2. Trying Carrot2 clustering
4.2.1. Clustering results from common search engines
4.2.2. Clustering documents from XML files
4.2.3. Clustering documents from XML feeds
4.2.4. Clustering documents from a Lucene index
4.2.5. Clustering documents from a Solr index
4.2.6. Saving documents or clusters for further processing
4.3. Integrating Carrot2 with your software
4.3.1. Compiling a Java program using Carrot2 API
4.3.2. Adding Carrot2 dependency to a Maven2 project
4.3.3. Setting up a Carrot2 project in Eclipse IDE
4.3.4. Setting up Carrot2 source code in Eclipse IDE
4.3.5. Compiling a C# program using Carrot2 API
4.3.6. Calling Carrot2 clustering from non-Java software
4.3.7. Java Dependencies
5. Tuning clustering
5.1. Desirable characteristics of documents for clustering
5.2. Choosing the clustering algorithm
5.3. Tuning clustering in Carrot2 Document Clustering Workbench
5.4. Modifying the list of stop words
5.5. Excluding specific clusters from results
5.6. Reducing the size of the Other Topics cluster
5.7. Improving clustering performance
5.7.1. Improving performance of Lingo
5.7.2. Improving performance of STC
5.8. Benchmarking clustering performance
6. Lexical resources
6.1. Location of lexical resources
6.2. Tuning lexical resources in Carrot2 Document Clustering Workbench
6.3. Stop word files
6.4. Label filtering files
7. Customization
7.1. Component suites and attributes
7.1.1. Component suites
7.1.2. Component attributes
7.2. Adding document sources to Carrot2 Web Application
7.3. Adding document sources to Carrot2 Document Clustering Server
7.4. Customizing Lingo for Carrot2 Web Application
7.5. Customizing Lingo for Carrot2 Document Clustering Server
7.6. Customizing Lingo for Carrot2 Command Line Interface
7.7. Customizing Lingo in Carrot2 Java API
7.8. Adding document sources to Carrot2 Document Clustering Workbench
8. Advanced topics
8.1. Running Carrot2 in Eclipse IDE
8.1.1. Running Carrot2 Document Clustering Workbench in Eclipse IDE
8.1.2. Running Carrot2 Web Application in Eclipse IDE
8.2. Building Carrot2 from source code
8.2.1. Building Carrot2 Document Clustering Workbench
8.2.2. Building Carrot2 Web Application
8.3. Using Carrot2 Document Clustering Server with curl
8.4. Working with HTTP proxies
8.5. HTTP BASIC or DIGEST authentication
9. Troubleshooting
9.1. Troubleshooting Carrot2 Document Clustering Workbench
9.1.1. Increasing memory size
9.1.2. Getting exception stack trace
9.2. Troubleshooting Carrot2 Web Application
9.2.1. "?" characters instead of Unicode special characters
10. Architecture and API
10.1. Carrot2 architecture overview
10.1.1. Processing component pipeline
10.1.2. Processing component attributes
10.2. Carrot2 XML data formats
10.2.1. Carrot2 input XML format
10.2.2. Carrot2 output XML format
10.3. Carrot2 JSON data format
10.3.1. Carrot2 output JSON format
11. Carrot2 Development
11.1. Stable release procedure
11.2. Versioning scheme
11.3. QA check list
12. Component reference
12.1. By Source Clustering
12.2. By URL Clustering
12.3. Bisecting k-means
12.3.11. Ungrouped
12.4. Lingo Clustering
12.4.12. Ungrouped
12.5. Suffix Tree Clustering
12.5.13. Ungrouped
12.6. eTools Metasearch Engine
12.7. Bing Web Search
12.8. Bing News Search
12.9. PubMed medical database
12.10. XML
12.11. Lucene Document Source
12.12. Solr Search Engine
12.13. Open Search
12.14. IDOL Search
12.15. Ambient Test Set
12.16. ODP239 Test Set

List of Figures

3.1. Carrot2 Document Clustering Workbench screenshot
3.2. Carrot2 Document Clustering Server quick start screen
3.3. Carrot2 Web Application results screen
4.1. Carrot2 Document Clustering Workbench XML search view
4.2. News feed XML to Carrot2 format transformation
4.3. Document attribute that contains a list of values.
4.4. Carrot2 Document Clustering Workbench Lucene search view
4.5. Carrot2 Document Clustering Workbench Solr search view
4.6. Setting up Carrot2 Java API in Eclipse IDE
4.7. Eclipse IDE Carrot2 project import step 1
4.8. Eclipse IDE Carrot2 project import step 2
5.1. Lingo and STC clusters for the 'data mining' search results
5.2. Tuning clustering in Carrot2 Document Clustering Workbench
5.3. Attributes view's context menu
5.4. Carrot2 Document Clustering Workbench Benchmark view
6.1. Preprocessing attributes section
6.2. Carrot2 Document Clustering Workbench restart clustering button
7.1. Example Carrot2 component suite
7.2. Example Carrot2 attribute set
8.1. Workbench Run Configuration
8.2. Using DCS and curl to cluster data from document source
8.3. Using DCS and curl to cluster data from document source
9.1. Carrot2 Document Clustering Workbench error dialog
9.2. Carrot2 Document Clustering Workbench Show View dialog
9.3. Carrot2 Document Clustering Workbench Error Log view
9.4. Carrot2 Document Clustering Workbench Event Details dialog
10.1. Carrot2 input XML format
10.2. Carrot2 output XML format
10.3. Carrot2 output JSON format

List of Tables

5.1. Characteristics of Lingo and STC clustering algorithms
5.2. Optimum usage scenarios for Lingo and STC

List of Examples

6.1. A sample stop word file for English: stopwords.en
6.2. A sample stop label file for English: stoplabels.en

1 Introduction

What Carrot2 is and what it is not

Carrot2 is a library and a set of supporting applications you can use to build a search results clustering engine. Such an engine will organize your search results into topics, fully automatically and without external knowledge such as taxonomies or preclassified content.

Carrot2 contains two document clustering algorithms designed specifically for search results clustering: Suffix Tree Clustering and Lingo. Carrot2 offers components for fetching data from search engines that provide the required APIs (for example Microsoft Bing or PubMed), as well as other sources of documents like Lucene, Apache Solr or ElasticSearch.

Carrot2 is not a search engine itself: it has neither a crawler nor an indexer. There are a number of Open Source projects you can use to crawl (Nutch), index and search (Lucene, Solr) your content, which can then be queried and clustered by Carrot2.

In most cases your workflow with Carrot2 applications would be the following:

  1. Use Carrot2 Document Clustering Workbench and possibly other applications from Carrot2 application suite to see what the clustering results are like for your content. If the results are promising, you can use the Carrot2 Document Clustering Workbench to further tune the clustering algorithm's settings.

  2. If you are developing Java software, use Carrot2 API and JAR to integrate clustering into your code. For non-Java environments, set up the Carrot2 Document Clustering Server and call Carrot2 clustering using the REST protocol.

Chapter 2 answers the questions most frequently asked on Carrot2 mailing lists; it can also serve as a question-based index to the rest of this manual. Chapter 3 introduces applications available in Carrot2 distribution and Chapter 4 shows how to quickly set up Carrot2 to cluster your own data. Chapter 5 discusses topics related to tuning Carrot2 clustering, while Chapter 7 shows how to customize Carrot2 applications. Chapter 8 covers some more advanced use cases of Carrot2 and Chapter 9 provides solutions to common problems. Finally, Chapter 10 discusses Carrot2 architecture and internals, while Chapter 12 is an in-depth reference of Carrot2 components.

2 FAQ

Frequently Asked Questions

This chapter answers the questions most frequently asked on Carrot2 mailing lists. As it extensively links to further sections of the manual, it can also be treated as a kind of question-based index for this manual.

2.1 Is Carrot2 suitable for me?

Can I use Carrot2 in a commercial project?
How can I acknowledge the use of Carrot2 on my site?
Can Carrot2 crawl my website?
Can I use Carrot2 to cluster something else than search results?
How does Carrot2 clustering scale with respect to the number and length of documents?
Can I force Carrot2 to cluster my documents to some predefined clusters / labels?
Can Carrot2 cluster content in other languages than English?

Can I use Carrot2 in a commercial project?

Yes. The only requirement is that you properly acknowledge the use of Carrot2 (on your project's website and documentation) and let us know about your project. Please also remember to read the license.

How can I acknowledge the use of Carrot2 on my site?

Please put a statement equivalent to "This product includes software developed by the Carrot2 Project" on your site and link it to Carrot2's website (http://www.carrot2.org). Additionally, you can use some of our powered-by logos if you like.

Can Carrot2 crawl my website?

No. Carrot2 can add clustering of search results to an existing search engine. You can use other open source projects (like Nutch or Heritrix) to crawl your website.

Can I use Carrot2 to cluster something else than search results?

Absolutely. Carrot2 came about as a framework for building search results clustering engines but its algorithms should successfully cluster up to about a thousand text documents, a few paragraphs each.

How does Carrot2 clustering scale with respect to the number and length of documents?

The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, depending on the algorithm, Carrot2 should successfully deal with up to a few thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

Can I force Carrot2 to cluster my documents to some predefined clusters / labels?

No. Assigning documents to a set of predefined categories is a problem called text classification / categorization and Carrot2 was not designed to solve it. For text classification components you may want to see the LingPipe project.

Can Carrot2 cluster content in other languages than English?

Yes. Currently, Carrot2 can cluster content in 19 languages:

  • Arabic (experimental)
  • Chinese Simplified (experimental)
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Korean
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

Please note, however, that for some of the languages you may need to tune the stop words to achieve best results.

2.2 How do I use Carrot2?

What is the query syntax in Carrot2?
Which Carrot2 clustering algorithm is the best?
Does Carrot2 support boolean querying?

What is the query syntax in Carrot2?

As Carrot2 is not a search engine on its own, there is no common query syntax in Carrot2. The syntax depends on the underlying search engine you set Carrot2 to use, e.g. Bing, Solr, Lucene or any other. Carrot2 passes your query without any modifications to the search engine and clusters the results it returns. For this reason, any syntax supported by the search engine is automatically supported in Carrot2.

Which Carrot2 clustering algorithm is the best?

There is no one clear answer to this question. The choice of the algorithm depends on the input data and the desired characteristics of clusters. Please see Section 5.2 for some guidelines.

Does Carrot2 support boolean querying?

If the underlying search engine supports boolean queries, so will Carrot2. Please see the question on query syntax above for more details.

2.3 How can I improve clustering?

What is the most suitable content for clustering in Carrot2?
How can I remove meaningless cluster labels?
How can I improve the performance of Carrot2?

What is the most suitable content for clustering in Carrot2?

Please see Section 5.1 for the answer.

How can I remove meaningless cluster labels?

Occasionally, Carrot2 may create meaningless cluster labels like "read" or "site". Please see Section 5.5 for information on how to remove them.

How can I improve the performance of Carrot2?

Please see Section 5.7 for some clustering performance tips.

3 Tools and APIs

Carrot2 distribution suite

Carrot2 comes with a suite of tools and APIs that you can use to quickly set up clustering on your own data, tune clustering results, call Carrot2 clustering from your Java or C# code or access Carrot2 clustering as a remote service.

Carrot2 distribution contains the following elements:

  • Carrot2 Document Clustering Workbench, which is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data,

  • Carrot2 Java API, for calling Carrot2 document clustering from your Java code,

  • Carrot2 C# API, for calling Carrot2 document clustering from your C# or .NET code,

  • Carrot2 Document Clustering Server, which exposes Carrot2 clustering as a REST service,

  • Carrot2 Command Line Interface applications, which allow invoking Carrot2 clustering from the command line,

  • Carrot2 Web Application, which exposes Carrot2 clustering as a web application for end users.

3.1 Carrot2 Document Clustering Workbench

Carrot2 Document Clustering Workbench is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data.

You can use Carrot2 Document Clustering Workbench to:

  • Quickly test Carrot2 clustering with your own data. Please see Chapter 4 for instructions for the most common scenarios.

  • Fine tune Carrot2 clustering algorithms' settings to work best with your specific data. Please see Chapter 5 for more details.

  • Run simple performance benchmarks using different settings to predict maximum clustering throughput on a single machine. Please see Section 5.8 for details.

Carrot2 Document Clustering Workbench features include:

  • Various document sources included.  Carrot2 Document Clustering Workbench can fetch and cluster documents from a number of sources, including major search engines, indexing engines (Lucene, Solr) as well as generic XML feeds and files.

  • Live tuning of clustering algorithm attributes.  Carrot2 Document Clustering Workbench enables modifying clustering algorithm's attributes and observing the results in real time.

  • Performance benchmarking.  Carrot2 Document Clustering Workbench can run simple performance benchmarks of Carrot2 clustering algorithms.

  • Attractive visualizations.  Carrot2 Document Clustering Workbench comes with two visualizations of the cluster structure, one developed within the Carrot2 project and another one from Aduna Software.

  • Modular architecture and extendability.  Carrot2 Document Clustering Workbench is based on Eclipse Rich Client Platform, which makes it easily extendable.

Figure 3.1 Carrot2 Document Clustering Workbench screenshot


3.1.1 Installation and running

To run Carrot2 Document Clustering Workbench:

  1. Download and install Java Runtime Environment (version 1.8 or newer) if you have not done so.

  2. Download Carrot2 Document Clustering Workbench Windows binaries or Linux binaries and extract the archive to some local disk location.

  3. Run carrot2-workbench.exe (Windows) or carrot2-workbench (Linux).

3.2 Carrot2 Java API

The Carrot2 Java API package contains Carrot2 JAR files along with all dependencies, JavaDoc API reference and Java code examples. You can use this package to integrate Carrot2 clustering into your Java software. Please see Section 4.3.1 and Section 4.3.3 for instructions.

3.3 Carrot2 C# API

The Carrot2 C# API package contains all DLL libraries required to run Carrot2, C# API reference and code examples. You can use this package to integrate Carrot2 clustering into your C# / .NET software. Please see Section 4.3.5 for instructions.

3.4 Carrot2 Document Clustering Server

Carrot2 Document Clustering Server (DCS) exposes Carrot2 clustering as a REST service. It can cluster documents from an external source (e.g. a search engine) or documents provided directly as an XML stream and returns results in XML or JSON formats.

You can use Carrot2 Document Clustering Server to:

  • Integrate Carrot2 with your non-Java software.

  • Build a high-throughput document clustering system by setting up a number of load-balanced instances of the DCS.

Carrot2 Document Clustering Server features include:

  • XML and JSON response formats.  Carrot2 Document Clustering Server can return results both in XML and JSON formats. JSON-P (with callback) is also supported.

  • Various document sources included.  Carrot2 Document Clustering Server can fetch and cluster documents from a large number of sources, including major search engines and indexing engines (Lucene, Solr).

  • Direct XML feed.  Carrot2 Document Clustering Server can cluster documents fed directly in a simple XML format.

  • PHP and C# examples included.  Carrot2 Document Clustering Server ships with ready-to-use examples of calling Carrot2 DCS services from PHP (version 5), C#, Ruby, Java and curl.

  • Quick start screen.  A simple quick start screen will let you make your first DCS request straight from your browser.

Figure 3.2 Carrot2 Document Clustering Server quick start screen


3.4.1 Installation and running

To run Carrot2 Document Clustering Server:

  1. Download and install Java Runtime Environment (version 1.8.0 or newer) if you have not done so.

  2. Download Carrot2 Document Clustering Server binaries and extract the archive to some local disk location.

  3. Run dcs.cmd (Windows) or dcs.sh (Linux).

  4. Point your browser to http://localhost:8080 for further instructions.

  5. See the examples/ directory in the distribution archive for PHP, C#, Ruby and Java code examples. You can also invoke DCS clustering using the curl command.

Tip

If you need to start the DCS on a port other than 8080, you can use the -port option:

dcs -port 9090

Tip

To deploy the DCS in an external servlet container, such as Apache Tomcat, use the carrot2-dcs.war file from the war/ folder of the DCS distribution.
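
As an illustration of calling the DCS from Java code, the following minimal sketch builds a form-encoded clustering request and, if a service URL is passed on the command line, posts it to a running DCS. The parameter names (dcs.source, query, dcs.output.format), the etools source identifier and the /dcs/rest path mentioned below are assumptions that this chapter does not confirm; please check them against the quick start screen of your DCS instance before relying on them. The class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class DcsRequestSketch {

    /** Encodes parameters as an application/x-www-form-urlencoded body. */
    static String formBody(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always available", ex);
        }
        return body.toString();
    }

    /** Sends the body as an HTTP POST and returns the response text. */
    static String post(String url, String body) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("dcs.source", "etools");       // assumed document source identifier
        params.put("query", "data mining");
        params.put("dcs.output.format", "JSON");  // assumed parameter name
        String body = formBody(params);
        System.out.println(body);
        if (args.length > 0) {
            // Only contact the server when a service URL is given explicitly.
            System.out.println(post(args[0], body));
        }
    }
}
```

Run without arguments, the program only prints the encoded request body. To send the request, pass the service URL as the first argument, for example http://localhost:8080/dcs/rest (an assumed path).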

3.5 Carrot2 Web Application

Carrot2 Web Application exposes Carrot2 clustering as a web application for end users. It allows users to browse clusters using a conventional tree view, but also in an attractive visualization.

Carrot2 Web Application features include:

  • Two cluster views.  Carrot2 Web Application offers two views of the clusters generated by Carrot2: conventional tree view and spatial visualizations.

  • All Carrot2 document sources and algorithms included.  Carrot2 Web Application contains a large number of document sources, including major search engines. Optionally, further document sources can be added, such as Lucene or Solr ones. It also contains all Carrot2's clustering algorithms.

  • XSLT and JavaScript-based presentation layer.  Look & feel of the Carrot2 Web Application can be easily changed by editing a number of XSLT style sheets. All common style sheets and JavaScripts can be re-used when implementing a new look & feel.

  • High-performance front-end.  The front-end of the Carrot2 Web Application has been optimized for fast loading by using such techniques as JavaScript and CSS merging and minification, as well as using CSS sprites.

Figure 3.3 Carrot2 Web Application results screen


3.5.1 Installation and running

To run Carrot2 Web Application:

  1. Make sure you have access to a Servlet API 2.4 compliant container, such as Apache Tomcat.

  2. Download Carrot2 Web Application WAR file.

  3. Deploy the WAR file to your servlet container.

3.6 Carrot2 Command Line Interface

Carrot2 Command Line Interface (CLI) is a set of applications that allow invoking Carrot2 clustering from the command line. Currently, the only available CLI application is Carrot2 Batch Processor, which performs Carrot2 clustering on one or more files in the Carrot2 XML format and saves the results as XML or JSON. Apart from clustering a large number of document sets in one run, you can use the Carrot2 Batch Processor to integrate Carrot2 with your non-Java applications.

3.6.1 Installation and running

To run Carrot2 Batch Processor:

  1. Download and install Java Runtime Environment (version 1.8.0 or newer) if you have not done so.

  2. Download Carrot2 Command Line Interface binaries and extract the archive to some local disk location.

  3. Run batch.cmd (Windows) or batch.sh (Linux) for an overview of the syntax. The Carrot2 Batch Processor ships with two example input data sets located in the input/ directory. Below is a list of some common example invocations.

    • To cluster one or more input files, specify their paths:

      batch input/data-mining.xml input/seattle.xml

      Clustering will be performed using the default clustering algorithm and the results in the XML format will be saved to the output directory relative to the current working directory.

    • You can also cluster files from one or more directories:

      batch input/

      Each directory will be processed recursively, i.e. including subdirectories. For each specified input directory, a corresponding directory with results will be created in the output directory.

    • To save results in a non-default directory, use the -o option:

      batch input/ -o results

    • To include the input documents in the output, use the -d option:

      batch input/ -d

    • To save the results in JSON, use the -f JSON option:

      batch input/ -f JSON

    • To use a different clustering algorithm, use the -a option followed by the identifier of the algorithm:

      batch input/ -a url

      To see the list of available algorithm identifiers, run the application without arguments.

    • In case of processing errors, you can use the -v option to see detailed messages and stack traces.
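
When integrating the Carrot2 Batch Processor with another application, the input files have to be produced by that application. The sketch below shows one possible way of generating such a file from Java using only the JDK. It assumes the element layout that appears in Figure 4.2 (a searchresult root with document elements containing title, snippet and url); the class name, method names and sample data are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BatchInputSketch {

    /** Builds a searchresult document; each row is {title, snippet, url}. */
    static String toCarrot2Xml(String[][] rows) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = dom.createElement("searchresult");
            dom.appendChild(root);
            for (String[] row : rows) {
                Element document = dom.createElement("document");
                root.appendChild(document);
                appendText(document, "title", row[0]);
                appendText(document, "snippet", row[1]);
                appendText(document, "url", row[2]);
            }
            // An identity transform serializes the DOM and escapes special characters.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a child element with the given name and text content. */
    private static void appendText(Element parent, String name, String text) {
        Element child = parent.getOwnerDocument().createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"Data mining", "Extracting patterns from large data sets.", "http://example.com/1"},
            {"Text clustering", "Grouping similar documents together.", "http://example.com/2"},
        };
        System.out.println(toCarrot2Xml(rows));
    }
}
```

The printed XML can be saved to a file (encoded as UTF-8, to match the XML declaration) and passed to the batch application in the same way as the bundled example data sets.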

3.7 Carrot2 clustering in Apache Solr

Carrot2 clustering can be performed directly within Solr by means of the Solr Clustering Component contrib extension.

A whitepaper discussing several integration strategies between Solr and Carrot2 clustering algorithms can be found at a separate GitHub repository.

3.8 Carrot2 clustering in ElasticSearch

Carrot2 search results clustering can be performed directly in ElasticSearch by installing the dedicated elasticsearch-carrot2 plugin. Installation instructions are described in detail on the plugin's GitHub site. The plugin's API documentation is rendered dynamically once the plugin is installed (see the installation instructions).

4 Getting started

Trying Carrot2 clustering with your own data

This chapter will show you how to use Carrot2 in a number of typical scenarios such as trying clustering on your own documents or integrating Carrot2 with your software.

4.1 Requirements

All Carrot2 applications require Java Runtime Environment version 1.8 or later. The Carrot2 Document Clustering Workbench is distributed for Windows, Linux and MacOSX.

The Carrot2 C# API package requires the .NET Framework version 3.5 or later; it does not require a Java Runtime Environment.

4.2 Trying Carrot2 clustering

This section shows how to apply Carrot2 clustering on documents from various sources.

4.2.1 Clustering results from common search engines

To try Carrot2 clustering on results from search engines (such as Microsoft Bing), you can either:

  • Use the Carrot2 Web Application, for example the online demo at http://search.carrot2.org,

or

  • Use the Carrot2 Document Clustering Workbench, which can fetch and cluster documents from the same search engines as the Carrot2 Web Application.

4.2.2 Clustering documents from XML files

To try Carrot2 clustering on documents or search results stored in a single XML file you can use the Carrot2 Document Clustering Workbench.

  1. In the Search view of Carrot2 Document Clustering Workbench, choose XML source.

  2. Set path to your XML file in the XML Resource field.

  3. (Optional) If your file is not in Carrot2 format, create an XSLT style sheet that transforms your data into Carrot2 format; see Section 4.2.3 for an example. Provide a path to your style sheet in the XSLT Stylesheet field in the Medium section.

  4. If you know the query that generated the documents in your XML file, you can provide it in the Query field, which may improve the clustering results. Press the Process button to see the results.

Figure 4.1 Carrot2 Document Clustering Workbench XML search view


4.2.3 Clustering documents from XML feeds

To try Carrot2 clustering on documents or search results fetched from a remote XML feed, you can use the Carrot2 Document Clustering Workbench. As an example, we will cluster a news feed from BBC:

  1. In the Search view of Carrot2 Document Clustering Workbench, choose XML source.

  2. Set the URL of your XML feed in the XML Resource field. Optionally, the URL can contain two special placeholders that will be replaced with the Query and Results number you set in the search view.

    In our example, we will use the BBC News RSS feed.

  3. Create an XSLT style sheet that will transform the XML feed into Carrot2 format. For the news feed we can use the stylesheet shown in Figure 4.2. To add more colour to our results, the XSLT transform extracts thumbnail URLs from the feed and passes them to Carrot2 in a special attribute. Attributes that are a sequence of values can be embedded as shown in Figure 4.3.

  4. Provide a path to the transformation style sheet in the XSLT Stylesheet field in the Medium section.

  5. Press the Process button to see the results.

Figure 4.2 News feed XML to Carrot2 format transformation

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
     xmlns:media="http://search.yahoo.com/mrss">

  <xsl:output indent="yes" omit-xml-declaration="no"
       media-type="application/xml" encoding="UTF-8" />

  <xsl:template match="/">
    <searchresult>
      <xsl:apply-templates select="/rss/channel/item" />
    </searchresult>
  </xsl:template>

  <xsl:template match="item">
    <document>
      <title><xsl:value-of select="title" /></title>
      <snippet>
        <xsl:value-of select="description" />
      </snippet>
      <url><xsl:value-of select="link" /></url>
      <xsl:if test="media:thumbnail">
        <field key="thumbnail-url">
           <value type="java.lang.String"
                  value="{media:thumbnail/@url}"/>
        </field>
      </xsl:if>
    </document>
  </xsl:template>
</xsl:stylesheet>

Figure 4.3 Document attribute that contains a list of values.

<field key="key">
  <value><wrapper class="org.carrot2.util.simplexml.ListSimpleXmlWrapper">
    <list>
      <value value="value1"/>
      <value value="value2"/>
    </list>
  </wrapper></value>
</field>
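
Before pointing the Workbench at a style sheet, it can be convenient to check the transformation on a small sample. The following sketch applies a trimmed-down version of the Figure 4.2 style sheet (without the thumbnail handling) to an inline sample feed, using the XSLT processor bundled with the JDK. The sample feed content and the class name are made up for illustration; this is not how Carrot2 itself applies the style sheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FeedTransformSketch {

    /** Trimmed-down version of the Figure 4.2 style sheet. */
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/'>"
        + "<searchresult><xsl:apply-templates select='/rss/channel/item'/></searchresult>"
        + "</xsl:template>"
        + "<xsl:template match='item'>"
        + "<document>"
        + "<title><xsl:value-of select='title'/></title>"
        + "<snippet><xsl:value-of select='description'/></snippet>"
        + "<url><xsl:value-of select='link'/></url>"
        + "</document>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Made-up sample feed with a single item. */
    static final String RSS =
        "<rss version='2.0'><channel><title>Sample feed</title>"
        + "<item><title>First story</title>"
        + "<description>Text of the first story.</description>"
        + "<link>http://example.com/1</link></item>"
        + "</channel></rss>";

    /** Applies the style sheet to the XML input and returns the result as text. */
    static String transform(String xslt, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSLT, RSS));
    }
}
```

If the output contains one document element per feed item, with title, snippet and url filled in, the same style sheet can be provided in the XSLT Stylesheet field.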

4.2.4 Clustering documents from a Lucene index

To try Carrot2 clustering on documents from a local Lucene index, you can use Carrot2 Document Clustering Workbench:

  1. In the Search view of Carrot2 Document Clustering Workbench, choose Lucene source.

  2. Choose the path to your Lucene index in the Index directory field.

  3. In the Medium section, choose fields from your Lucene index in at least one of the Document title field and Document content field combo boxes.

  4. Type a query and press the Process button to see the results.

Figure 4.4 Carrot2 Document Clustering Workbench Lucene search view


4.2.5 Clustering documents from a Solr index

To try Carrot2 clustering on documents from an instance of Apache Solr, you can use Carrot2 Document Clustering Workbench:

  1. In the Search view of Carrot2 Document Clustering Workbench, choose Solr source.

  2. In the Advanced section, provide the URL at which your Solr instance is available in the Service URL field.

  3. In the Medium section, provide fields that should be used as document title, content and URL (optional) in the Title field name, Summary field name and URL field name fields, respectively.

  4. Type a query and press the Process button to see the results.

Tip

Carrot2 clustering can also be performed directly within Solr by means of Solr's Carrot2 Clustering Component.

Figure 4.5 Carrot2 Document Clustering Workbench Solr search view


4.2.6 Saving documents or clusters for further processing

To save documents and/or clusters produced by Carrot2 for further processing:

  1. Use Carrot2 Document Clustering Workbench to perform clustering on documents from the source of your choice.

  2. Use the File > Save as... dialog to save the documents and/or clusters into a file in the Carrot2 XML format.

Tip

Saving documents into XML can be particularly useful when there is a need to capture the output of some remote or non-public document source to a local file, which can then be passed on to someone else for further inspection. Documents saved into XML can be opened for clustering within Carrot2 Document Clustering Workbench using the XML document source.
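
A saved file can also be read by your own code. The sketch below lists the titles of the documents stored in such a file using the JDK's DOM parser. It assumes the layout that appears in Figure 4.2 (document elements with a title child directly under the searchresult root) and deliberately ignores everything else in the file, such as cluster information; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class SavedResultsSketch {

    /** Returns the titles of the top-level document elements. */
    static List<String> documentTitles(InputStream xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xml);
            List<String> titles = new ArrayList<>();
            // Only direct children of the root are visited, so document
            // elements nested deeper in the file are skipped.
            for (Node n = dom.getDocumentElement().getFirstChild();
                 n != null; n = n.getNextSibling()) {
                if (n instanceof Element && "document".equals(n.getNodeName())) {
                    NodeList title = ((Element) n).getElementsByTagName("title");
                    titles.add(title.getLength() > 0
                        ? title.item(0).getTextContent().trim() : "");
                }
            }
            return titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Read a file saved from the Workbench.
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                System.out.println(documentTitles(in));
            }
        } else {
            // Fall back to a tiny inline sample.
            String sample = "<searchresult><document><title>First</title></document>"
                + "<document><title>Second</title></document></searchresult>";
            System.out.println(documentTitles(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))));
        }
    }
}
```

Pass the path of the saved file as the first argument; without arguments the program processes a small inline sample.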

4.3 Integrating Carrot2 with your software

4.3.1 Compiling a Java program using Carrot2 API

The easiest way to integrate Carrot2 with your Java programs is to use the Carrot2 Java API package:

  1. Download Carrot2 Java API and unpack it to some local directory.

  2. Make sure that carrot2-core.jar and all JARs from the lib/ directory are available in the classpath of your program.

  3. Look in the examples/ directory for some sample code. Good places to start are ClusteringDocumentList and ClusteringDataFromDocumentSources. For a complete description of Carrot2 Java API, please see Javadoc documentation in the javadoc/ directory.

  4. You can use the build.xml Ant script to compile and run code from the examples/ directory.

    Tip

    For easier experimenting with Carrot2 Java API, you may want to set up a Carrot2 project in Eclipse IDE.

4.3.2 Adding Carrot2 dependency to a Maven2 project

To add Carrot2 as a dependency to an existing Maven2 project:

  1. Add the following fragment to the dependencies section of your pom.xml:

    <dependency>
      <groupId>org.carrot2</groupId>
      <artifactId>carrot2-core</artifactId>
      <version>3.17.0-SNAPSHOT</version>
    </dependency>

    Check the carrot2-core POM file and enable optional dependencies if required. For example, to enable Polish stemming, add Morfologik to the dependencies section of your pom.xml (the version should match the one declared in Carrot2's POM):

    <dependency>
      <groupId>org.carrot2</groupId>
      <artifactId>morfologik-stemming</artifactId>
      <version>...</version>
    </dependency>

  2. To support snapshot builds, add the following fragment to the repositories section of your pom.xml:

    <repository>
        <id>sonatype-nexus-public</id>
        <name>SonaType public snapshots and releases repository</name>
        <url>https://oss.sonatype.org/content/groups/public</url>
        <releases> 
            <!-- set to true if you wish to fetch releases from this repo too. -->
            <enabled>false</enabled>
        </releases> 
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>

4.3.3 Setting up a Carrot2 project in Eclipse IDE

Carrot2 Java API examples can be easily set up in Eclipse IDE. The description below assumes you are using Eclipse IDE version 3.4 or newer.

  1. Download Carrot2 Java API and unpack it to some local directory.

  2. In your Eclipse IDE choose File > New > Java Project.

  3. In the New Java Project dialog (Figure 4.6), type a name for the new project, e.g. carrot2-examples. Then choose the Create project from existing source option, provide the directory to which you unpacked the Carrot2 Java API archive and click Finish.

  4. Once Eclipse has compiled the example classes, open one of them, e.g. ClusteringDocumentList, and choose Run > Run As > Java Application. The output of the example program should be visible in the Console view.

Figure 4.6 Setting up Carrot2 Java API in Eclipse IDE

4.3.4 Setting up Carrot2 source code in Eclipse IDE

Important

To set up Carrot2 source code, you will need Eclipse IDE version 3.5 or later with the Plug-in Development Environment (PDE). The required plugins are available e.g. in the Eclipse for Plug-in Developers and Eclipse Classic distributions, which can be downloaded from http://www.eclipse.org/downloads.

  1. Check out Carrot2 source code using git:

    git clone git://github.com/carrot2/carrot2.git
  2. In the Package Explorer view in Eclipse IDE, choose Import... (see Figure 4.7), select General > Existing Projects into Workspace and click Next.

    Figure 4.7 Eclipse IDE Carrot2 project import step 1

  3. In the Import projects dialog provide your local Carrot2 checkout directory in the Select root directory field. Uncheck the org.carrot2.antlib project (see Figure 4.8) and click Finish.

    Figure 4.8 Eclipse IDE Carrot2 project import step 2

  4. All Carrot2 source code should compile without errors. If it does not:

    • Make sure your Eclipse's Java compiler compliance level is set to 1.5 or higher (Preferences > Java > Compiler).

    • Make sure your Eclipse's workspace encoding is set to UTF-8 (Preferences > General > Workspace > Text file encoding).

4.3.5 Compiling a C# program using Carrot2 API

The easiest way to integrate Carrot2 with your C# / .NET programs is to use the Carrot2 C# API package:

  1. Make sure you have .NET framework version 3.5 or later installed in your environment.

  2. Download Carrot2 C# API and unpack it to some local directory.

  3. Compile example code based on the provided msbuild project file:

    CD examples
    C:\Windows\Microsoft.NET\Framework\v4.0.30319\msbuild Carrot2.Examples.csproj
  4. Try running the executable files generated in the examples\ folder.

Tip

The provided msbuild project is not directly compatible with Visual Studio. To create a Carrot2 project in Visual Studio, import the example source code and all the referenced DLLs to an existing or newly created project.

4.3.6 Calling Carrot2 clustering from non-Java software

To integrate Carrot2 with your non-Java system, you can use the Carrot2 Document Clustering Server, which exposes Carrot2 clustering as a REST/XML service. Please see Section 3.4.1 for installation instructions and the examples/ directory in the distribution archive for example code in PHP, C# and Ruby.

4.3.7 Java Dependencies

Required

Carrot2 clustering requires a number of JAR files to run. The required JARs are available in the lib/required/ folder of the Carrot2 Java API package. Some of the JARs may not be required in certain specific situations:

  • log4j, slf4j-log4j  Required only if using the Log4j logging framework. If your code uses a different logging framework, add a corresponding SLF4J binding to your classpath.

Optional

A number of optional JARs can be used to increase the quality of clustering in certain languages or to fetch search results from external sources. The purpose of the optional JARs is the following:

  • commons-codec, httpclient, httpcore, httpmime  Used by document sources that fetch results from remote search engines, such as Bing7DocumentSource.

  • lucene-core, lucene-highlighter, lucene-memory  Used by the LuceneDocumentSource.

  • rome, rome-fetcher, jdom  Used by the OpenSearchDocumentSource.

  • lucene-analyzers-*  Required for clustering Chinese and Thai content.

  • lucene-analyzers  Required for clustering Arabic content.

  • morfologik-stemming  Required for clustering Polish content.

5 Tuning clustering

Fine-tuning Carrot2 clustering

This chapter discusses a number of typical fine-tuning scenarios for Carrot2 clustering algorithms. Some of the scenarios are relevant to all Carrot2 algorithms, while others are specific to individual algorithms.

5.1 Desirable characteristics of documents for clustering

The quality of clusters and their labels largely depends on the characteristics of documents provided on the input. Although there is no general rule for optimum document content, below are some tips worth considering.

  • Carrot2 is designed for small to medium collections of documents.  The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

  • Provide a minimum of 20 documents.  Carrot2 clustering algorithms will work best with a set of documents similar to what is normally returned by a typical search engine. While about 20 is the minimum number of documents you can reasonably cluster, the optimum would fall in the 100 – 500 range.

  • Provide contextual snippets if possible.  If the input documents are a result of some search query, provide contextual snippets related to that query, similar to what web search engines return, instead of full document content. Not only will this speed up processing, but also should help the clustering algorithm to cover the full spectrum of topics dealt with in the search results.

  • Minimize "noise" in the input documents.  All kinds of "noise" in the documents, such as truncated sentences (sometimes resulting from contextual snippet extraction suggested above) or random alphanumerical strings may decrease the quality of cluster labels. If you have access to e.g. a few sentences' abstract of each document, it is worth checking the quality of clustering based on those abstracts. If you can combine this with the previous tip, i.e. extract complete sentences matching user's query, this should improve the clusters even further.
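The last tip, extracting complete sentences that match the user's query, can be sketched in a few lines of Java. This is only an illustration: the QuerySentenceExtractor class is hypothetical and not part of the Carrot2 API, sentence boundaries are detected naively (a period, exclamation mark or question mark followed by whitespace), and query terms are matched as plain case-insensitive substrings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of the Carrot2 API. */
final class QuerySentenceExtractor {
    /** Returns the complete sentences of text that contain at least one query term. */
    static List<String> extract(String text, String query) {
        String[] terms = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> matching = new ArrayList<>();
        // Naive sentence boundary: '.', '!' or '?' followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            String lower = sentence.toLowerCase(Locale.ROOT);
            for (String term : terms) {
                // Plain substring match; an empty query matches nothing.
                if (!term.isEmpty() && lower.contains(term)) {
                    matching.add(sentence.trim());
                    break;
                }
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        String text = "Data mining finds patterns in large data sets. "
            + "The weather was nice. Mining tools vary widely!";
        // Keeps the first and the third sentence, drops the unrelated one.
        System.out.println(extract(text, "mining"));
    }
}
```

The extracted sentences can then be joined and passed to clustering in place of the full document content.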

Let us once again stress that there are no definite generic guidelines for the best content for clustering; it is always worth experimenting with different combinations. You can also describe your specific application on the Carrot2 mailing list and ask for advice.

5.2 Choosing the clustering algorithm

Currently, Carrot2 offers two specialized search results clustering algorithms: Lingo and STC as well as an implementation of the bisecting k-means clustering. The algorithms differ in terms of the main clustering principle and hence have different quality and performance characteristics. This section describes briefly the algorithms and provides some recommendations for choosing the most suitable one.

The key characteristic of the Lingo algorithm is that it reverses the traditional clustering pipeline: it first identifies cluster labels and only then assigns documents to the labels to form final clusters. To find the labels, Lingo builds a term-document matrix for all input documents and decomposes the matrix to obtain a number of base vectors that well approximate the matrix in a low-dimensional space. Each such vector gives rise to one cluster label. To complete the clustering process, each label is assigned documents that contain the label's words.

The key data structure used in the Suffix Tree Clustering (STC) algorithm is a Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred more than once in the input documents. Each such word or phrase gives rise to one base cluster. The last stage of the clustering process is merging base clusters to form the final clusters.
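To make the notion of a base cluster more concrete, the sketch below finds words and phrases shared by at least two documents and groups the documents under each of them. It is a deliberate simplification and not Carrot2 code: it uses brute-force counting of word n-grams instead of a Generalized Suffix Tree, it requires a phrase to appear in more than one document (rather than counting all occurrences), and it neither filters stop words nor merges base clusters. The BaseClusterSketch class and its maxWords parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: brute-force phrase counting, not a suffix tree and not Carrot2 code. */
final class BaseClusterSketch {
    /**
     * Maps each phrase of up to maxWords words to the indices of the documents
     * containing it, keeping only phrases found in at least two documents.
     */
    static Map<String, Set<Integer>> baseClusters(List<String> docs, int maxWords) {
        Map<String, Set<Integer>> byPhrase = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            String[] words = docs.get(d).toLowerCase(Locale.ROOT).split("\\W+");
            for (int start = 0; start < words.length; start++) {
                StringBuilder phrase = new StringBuilder();
                for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                    String word = words[start + len - 1];
                    if (word.isEmpty()) {
                        break; // split() may yield a leading empty token
                    }
                    if (len > 1) {
                        phrase.append(' ');
                    }
                    phrase.append(word);
                    byPhrase.computeIfAbsent(phrase.toString(), k -> new TreeSet<>()).add(d);
                }
            }
        }
        // A phrase seen in a single document does not form a base cluster.
        byPhrase.values().removeIf(docIds -> docIds.size() < 2);
        return byPhrase;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Data mining tools", "Open source data mining", "Weather report");
        System.out.println(baseClusters(docs, 2));
    }
}
```

A suffix tree finds the same shared phrases without enumerating every n-gram, which is what makes STC scale to larger inputs.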

The two algorithms have two features in common. They both create overlapping clusterings, in which one document can be assigned to more than one cluster. Also, with both algorithms a certain number of documents can remain unclustered and fall into the Other Topics group.

Bisecting k-means is a generic clustering algorithm that can also be applied to clustering textual data. As opposed to Lingo and STC, bisecting k-means creates non-overlapping clusters and does not produce the Other Topics group. Its current limitation is that it labels clusters using individual words, and not all documents in a cluster may correspond to the words included in the cluster label.

Table 5.1 compares the characteristics of Lingo, STC and k-means under their default settings and Figure 5.1 shows clusters generated by Lingo and STC for data mining search results.

Table 5.1 Characteristics of Lingo and STC clustering algorithms

Feature | Lingo | STC | k-means
Cluster diversity | High, many small (outlier) clusters highlighted | Low, small (outlier) clusters rarely highlighted | Low, small (outlier) clusters rarely highlighted
Cluster labels | Longer, often more descriptive | Shorter, but still appropriate | One-word only, may not always describe all documents in the cluster
Scalability | Low. For more than about 1000 documents, Lingo clustering will take a long time and large memory[a]. | High | Low, based on similar data structures as Lingo.

Figure 5.1 Lingo and STC clusters for the 'data mining' search results

It is difficult to give one clear recommendation as to which algorithm is "better". Many people feel Lingo delivers better-formed and more diverse clusters at the cost of lower performance and scalability. The ultimate judgment, however, should be based on an evaluation with the specific document collection. Table 5.2 highlights the scenarios for which the algorithms are best suited.

Table 5.2 Optimum usage scenarios for Lingo and STC

Feature | Use Lingo | Use STC | Use k-means
Well-formed longer labels required | yes | - | -
Highlighting of small (outlier) clusters required | yes | - | -
High clustering performance or large document set processing required | - | yes | -
Need non-overlapping clusters | - | - | yes

The bottom line is: use Lingo, unless you need high-performance clustering of document sets larger than 1000 documents or need non-overlapping clusters.

Tip

For a more scientifically-oriented discussion and evaluation of the two algorithms, please check the publications on Carrot2 website.

Note

Carrot Search, a company founded by Carrot2 authors, offers a commercial document clustering engine called Lingo3G that produces Lingo-quality hierarchical clusters at a better-than-STC speed. Please contact Carrot Search for details.

5.3 Tuning clustering in Carrot2 Document Clustering Workbench

The best tool for experimenting and tuning Carrot2 clustering is the Carrot2 Document Clustering Workbench. Figure 5.2 shows the main components involved in the tuning process.

Figure 5.2 Tuning clustering in Carrot2 Document Clustering Workbench

  1. The results editor presents documents and clusters. Changes made in the Attributes view will affect the currently active results editor.

  2. The Attributes view, where you can see and change values of clustering algorithm's attributes.

  3. The Attribute Info view, which shows documentation for specific attributes. Hold the mouse pointer over an attribute's label to see its documentation.

Opening the Attributes view.  By default, the Attributes view shows on the right hand side of the Carrot2 Document Clustering Workbench. You can open the view at any time by choosing Window > Show view > Attributes.

Setting modified attributes as default for new queries.  If you modified a number of attributes for an algorithm and would like to use the modified values for new queries, choose the Set as defaults for new queries option from the Attributes view's context menu (Figure 5.3).

Figure 5.3 Attributes view's context menu

Restoring default attribute values.  To reset the attributes to their default values, choose the Reset to defaults option from the Attributes view's context menu (Figure 5.3). To bring the attributes back to their factory defaults, choose the Reset to factory defaults option.

Loading and saving attribute values to XML.  To load or save attribute values to an XML file, use the Open and Save as... options available under the icon on the Attributes view's menu bar.

Accessing attribute documentation.  To see the documentation for a specific attribute, hold the mouse pointer over the attribute's label and its documentation will show in the Attribute Info view.

5.4 Modifying the list of stop words

Please see Section 6.3 and Section 6.2 of Chapter 6 for details.

5.5 Excluding specific clusters from results

Please see Section 6.4 and Section 6.2 of Chapter 6 for details.

5.6 Reducing the size of the Other Topics cluster

The Other Topics cluster contains documents that do not belong to any other cluster generated by the algorithm. Depending on the input documents, the size of this cluster may vary from a few to tens of documents.

By tuning parameters of the clustering algorithm, you can reduce the number of unclustered documents; however, bringing the number down to 0 is unachievable in most cases. Please note that minimizing the Other Topics cluster size is usually achieved by forcing the algorithm to create more clusters, which may degrade the perceived clustering quality.

Tip

The easiest way to try different clustering algorithm settings is to use the Carrot2 Document Clustering Workbench.

Tuning Lingo algorithm for smallest Other Topics cluster

To reduce the size of the Other Topics cluster generated by Lingo, you can try applying the following settings:

  1. Change the Factorization method attribute to LocalNonnegativeMatrixFactorizationFactory.

  2. Increase the Cluster count base above the default value.

  3. Decrease the Phrase label boost. Note that this will increase the number of one-word labels, which may not always be desirable.

Tip

To apply the changes to the Carrot2 applications, please follow instructions from Chapter 7.

5.7 Improving clustering performance

As a rule of thumb, the more documents you provide as input and the longer the documents are, the longer clustering takes. Interestingly, in many cases short document excerpts (such as contextual snippets for search results, or titles and abstracts or the first few sentences of other documents) may work just as well or even better than full documents. Hence the two most important performance tuning tips:

Reduce the size of the input documents  You can achieve this in a few ways:

  • Rather than full text of documents, use their titles and abstracts, if available.

  • In case of search results, use the contextual snippet rather than the full document text. Not only will this improve clustering performance, but it will very likely increase the quality of clusters as well because you will be clustering specifically the fragments the users asked for in their query.

  • If you don't have document abstracts, but have access to some automatically generated summaries, use them. Otherwise, try clustering the title and the first few sentences of each document.

  • In certain cases, you may get decent clustering results with document titles only; this variant is worth trying too.

Reduce the number of input documents  While removing large part of the input document set may not always be an option, in many cases dividing the input into two or more batches, clustering separately and then merging based on cluster label text may give reasonable results. The downside of this approach is that very small clusters containing just a few documents are likely to be lost during this process.
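The batch approach described above could be sketched as follows. The code assumes each batch has already been clustered and that its result is available as a map from cluster label to the identifiers of the documents in that cluster; this representation, the ClusterBatchMerger class and the normalization rule (trimming and lower-casing the label) are illustrative assumptions, not part of the Carrot2 API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of the Carrot2 API. */
final class ClusterBatchMerger {
    /**
     * Merges per-batch clusters (label -> document ids). Clusters whose labels are
     * equal after trimming and lower-casing are joined into one cluster.
     */
    static Map<String, List<String>> merge(List<Map<String, List<String>>> batches) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> batch : batches) {
            for (Map.Entry<String, List<String>> cluster : batch.entrySet()) {
                String key = cluster.getKey().trim().toLowerCase(Locale.ROOT);
                merged.computeIfAbsent(key, k -> new ArrayList<>())
                      .addAll(cluster.getValue());
            }
        }
        return merged;
    }
}
```

Exact label matching is the simplest possible merging rule. Clusters that describe the same topic under slightly different labels stay separate, so a real application may need a more tolerant comparison.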

Further performance tuning tips are specific to each clustering algorithm.

5.7.1 Improving performance of Lingo

You can change a number of attributes to increase the performance of Lingo. Most often, performance gain will be achieved at the cost of lowered clustering quality or significant change in the structure of clusters.

  • Lower Factorization quality, which will cause the matrix factorization algorithm to perform fewer iterations and hence complete quicker. Alternatively, you can set Factorization method to org.carrot2.matrix.factorization.PartialSingularValueDecompositionFactory, which is slightly faster than the other factorizations. In the latter case Factorization quality becomes irrelevant.

  • Lower Maximum matrix size, which would cause the matrix factorization algorithm to complete quicker and use less memory. With small matrix sizes, Lingo may not be able to discover smaller clusters.

5.7.2 Improving performance of STC

Not yet covered, please contact us if you need this section.

5.8 Benchmarking clustering performance

You can use the Carrot2 Document Clustering Workbench to run simple performance benchmarks of Carrot2. The benchmarks repeatedly cluster the content of the currently opened editor and report the average clustering time. You can use the benchmarking results to measure the impact of different attribute settings on an algorithm's performance and to estimate the maximum number of clustering requests that the algorithm can process per second.

To perform a performance benchmark:

  1. In the Search view, choose the algorithm to benchmark and perform the query to be used for benchmarking.
  2. Open the Benchmark view.

    Figure 5.4 Carrot2 Document Clustering Workbench Benchmark view

  3. Press Start to start the benchmark. After the benchmark completes, you should see the measured clustering time average, standard deviation, minimum and maximum.

Tip

To assess the performance impact of different attribute settings on one algorithm, you can open two or more editors with the same results clustered by the algorithm, set different attribute values in each editor and run benchmarking for each editor separately. The benchmark view remembers the last result for each editor, so you can compare the performance figures by simply switching between the editors.

Tip

By default, the benchmarking view uses only a single processing unit on multi-processor or multi-core machines. You can increase the number of benchmark threads in the Threads section.

Caution

Benchmark results may vary and be different from the results acquired on production machines due to other programs running in the background, operating system, hardware-specific considerations and even different Java Virtual Machine settings. Always fine-tune your clustering setup in the target deployment environment.
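If you would like to collect similar figures outside the Workbench, for example in the target deployment environment, the measurement itself is simple to reproduce. The sketch below is an illustration rather than the Workbench's implementation: the TimingStats class is hypothetical, the Runnable stands for whatever clustering call your application makes, the standard deviation is the population variant, and the requests-per-second estimate assumes a single thread.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Carrot2: summary statistics for measured times. */
final class TimingStats {
    final double mean;
    final double stdDev;
    final double min;
    final double max;

    TimingStats(double[] millis) {
        if (millis.length == 0) {
            throw new IllegalArgumentException("At least one measurement is required");
        }
        mean = Arrays.stream(millis).average().getAsDouble();
        min = Arrays.stream(millis).min().getAsDouble();
        max = Arrays.stream(millis).max().getAsDouble();
        double squares = 0;
        for (double m : millis) {
            squares += (m - mean) * (m - mean);
        }
        // Population standard deviation (divides by n, not n - 1).
        stdDev = Math.sqrt(squares / millis.length);
    }

    /** Estimated requests per second for one thread, derived from the mean time. */
    double requestsPerSecond() {
        return 1000.0 / mean;
    }

    /** Times the task for the given number of rounds after some untimed warm-up rounds. */
    static TimingStats measure(Runnable task, int warmupRounds, int rounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run(); // lets the JVM compile hot code before timing starts
        }
        double[] millis = new double[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return new TimingStats(millis);
    }
}
```

For instance, measured times of 10, 20 and 30 ms give a mean of 20 ms and therefore an estimate of 50 requests per second on one thread.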

6 Lexical resources

Stemming, tuning common words and filtering cluster labels

Carrot2 will attempt to perform clustering of any textual content, regardless of the actual language the content is written in. However, a certain level of shallow linguistic preprocessing usually helps in achieving better clustering and high-quality cluster labels (this is especially true when clustering smaller content, such as search results). Linguistic preprocessing includes the following components and resources:

stemmer

Stemming is the act of folding grammatical variations of words into their base forms. In English, for example, stemming transforms plural word forms into singular ones. For highly inflectional languages, such as Central European languages, stemming may be the key to achieving good clustering results. Carrot2 uses a built-in set of stemmers from the Snowball, Lucene and Morfologik projects.

stop words

Stop words (or common words) include terms that are meaningless in the language. They are typically function words (is, that, in English) or words that are common in the analyzed body of text and should be marked as ignored. A good set of stop words helps the clustering algorithm in identifying gaps between other phrases that can become valuable cluster labels.

stop labels

When clustering domain-specific texts, it is often desirable to filter out certain frequently occurring expressions that should not be considered clusters (home page for example). This resource provides means of avoiding such cluster labels.

Carrot2 comes with a set of default lexical resources which may be used as a starting point for further tuning. It is recommended to gradually build a set of customized lexical resources that matches the specific content being clustered (for example legal documents will have a different set of stop labels than a corpus of e-mails).

6.1 Location of lexical resources

The user-defined Carrot2 lexical resources are placed at the following application-specific locations:

Carrot2 Batch Processor

Lexical resources are placed in the resources folder under the distribution folder.

Carrot2 Java API

Lexical resources are placed in the resources folder under the distribution folder. The UsingCustomLexicalResources class demonstrates how to configure controllers to use a given path for loading lexical resources.

Carrot2 Web Application

Lexical resources are placed in the WEB-INF/resources folder of the web application archive (WAR) file.

Carrot2 Document Clustering Server

Lexical resources are placed in the WEB-INF/resources folder of the DCS' web application archive (WAR) file. The WAR file is located in the war/ folder under the distribution folder.

Carrot2 Document Clustering Workbench

Lexical resources are extracted to the workspace folder on first launch. The workspace folder is typically under the Workbench's distribution directory, unless its location is changed by passing the -data option to the Workbench launcher at startup.

Carrot2 core JAR file

Lexical resources are placed at the root of the JAR file. The default lookup location for the lexical resource factory is to scan context class loader's resources and typically (if no other class loader or location that precedes the core JAR contains such resources) these resources will be used by the implementation. Carrot2 Java API contains an example called UsingCustomLexicalResources that demonstrates ways of overriding the default location.

Carrot2 C# API

Lexical resources are embedded in the core assembly. At runtime, all assemblies present in the stack trace of the thread initializing the clustering controller (and thus a certain clustering algorithm) are scanned for resources (the defaults are always scanned last). An example class named UsingCustomLexicalResources, that is provided as part of Carrot2 C# API distribution, demonstrates ways of overriding the default lexical resource search locations from .NET.

Apache Solr clustering plugin

The plugin tries to load the lexical resources from the {solr.home}/conf/clustering/carrot2 directory. If a resource is not found in the directory, the default version of the resource is loaded from Carrot2 JAR.

A different location of lexical resources can be provided using the carrot.lexicalResourceDir Solr parameter. In particular, an absolute path can be provided to share the same lexical resources between multiple Solr cores.

6.2 Tuning lexical resources in Carrot2 Document Clustering Workbench

The easiest way to tune the lexical resources is to use the Carrot2 Document Clustering Workbench which will allow observing the effect of the changes in real time. To tune the lexical resources in Carrot2 Document Clustering Workbench:

  1. Start Carrot2 Document Clustering Workbench and run some query on which you'll be observing the results of your changes.

  2. Go to the workspace/ directory which is located in the directory to which you extracted Carrot2 Document Clustering Workbench. Modify lexical resource files as needed and save the changes.

  3. Open the Attributes view and use the view toolbar's button to group the attributes by semantics. In the Preprocessing section, make sure the Processing language is correctly set and check the Reload resources checkbox. Doing the latter will let you see the updated clustering results without restarting Carrot2 Document Clustering Workbench every time you save the changed lexical resource files.

    Figure 6.1 Preprocessing attributes section

  4. To re-run clustering after you've saved changes to the lexical resource files, choose the Restart Processing option from the Search menu, or press Ctrl+R (Command+R on Mac OS).

    Figure 6.2 Carrot2 Document Clustering Workbench restart clustering button

6.3 Stop word files

Stop word files are UTF-8 encoded plain text files with a single word in each line. Lines starting with # are omitted (considered to be comments). Files must follow a naming convention and be named stopwords.lang, where lang is a two-letter language suffix defined in LanguageCode class.

Example 6.1 A sample stop word file for English: stopwords.en

# stop word file for English
ain't
thanks
need
needs
needed
vs
hit

Important

Note that although words provided in the stop word file are handled in a case-insensitive manner, they are otherwise taken literally; that is, no further processing, such as stemming, is applied. As a result, to declare that have, has and having are all function words, three separate entries are required.
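The file format and matching rules described in this section are easy to mimic, which can be handy for validating a stop word file before deploying it. The sketch below is not Carrot2's own loader: the StopWordFileSketch class is hypothetical, and trimming whitespace and skipping blank lines are assumptions that the format description does not spell out. It takes the file content as a list of lines, such as the one returned by Files.readAllLines(path, StandardCharsets.UTF_8).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not Carrot2's loader: mimics the stop word file rules. */
final class StopWordFileSketch {
    /** Collects the stop words from the lines of a stop word file. */
    static Set<String> parse(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            // Lines starting with # are comments; blank lines are skipped (assumption).
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            words.add(word.toLowerCase(Locale.ROOT));
        }
        return words;
    }

    /** Case-insensitive but otherwise literal lookup: no stemming is applied. */
    static boolean isStopWord(Set<String> words, String token) {
        return words.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> words = parse(Arrays.asList("# stop word file for English", "needs", "vs"));
        System.out.println(isStopWord(words, "Needs")); // true: case is ignored
        System.out.println(isStopWord(words, "need"));  // false: only "needs" is listed
    }
}
```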

6.4 Label filtering files

The Lingo clustering algorithm, in addition to stop words editing, offers more precise control over cluster labels by means of "stop label" regular expressions. If a cluster's label matches one of the stop labels, the label will not appear on the list of clusters produced by Lingo.

Label filtering files are UTF-8 encoded plain text files with a single regular expression pattern in each line. Lines starting with # are omitted (considered to be comments). Files must follow a naming convention and be named stoplabels.lang, where lang is a two-letter language suffix defined in LanguageCode class.

Each line of a stop labels file corresponds to one stop label and is a Java regular expression. Please note that in order to be removed, a label as a whole must match at least one of the stop label expressions. A number of example stop label expressions are shown below.

Example 6.2 A sample stop label file for English: stoplabels.en

# stop label patterns for English
(?i)new
(?i)information (about|on).*
(?i)(index|list) of.*
(?i)(information|list|skip|join|cheap|access(es)?|corp(oration)?s?)
(?i).*(page|part|copyright) \d+.*
(?i)(official|offer(ing)?s?|lists|uses?).*
(?i).*(known|information|offer(ing)?s?|a range)

All stop labels shown above start with the (?i) prefix, which enables case-insensitive matching for them. The stop label in the first line suppresses labels consisting solely of the word new. The stop label in the second line removes labels that start with information about or information on, and the stop label in the third line removes labels that start with index of or list of.
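The requirement that a label as a whole must match a stop label expression corresponds to Matcher.matches() in java.util.regex, as opposed to Matcher.find(), which would also accept a partial match. The sketch below can be used to try out patterns before adding them to a stop label file. The StopLabelSketch class is hypothetical and only mirrors the rules described in this section; it is not the code Lingo uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for trying out stop label patterns, not Carrot2 code. */
final class StopLabelSketch {
    private final List<Pattern> patterns = new ArrayList<>();

    /** Compiles one pattern per line, skipping empty lines and # comments. */
    StopLabelSketch(List<String> lines) {
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(line));
        }
    }

    /** Returns true if the whole label matches at least one stop label pattern. */
    boolean isStopLabel(String label) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(label).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StopLabelSketch filter = new StopLabelSketch(Arrays.asList(
            "# stop label patterns for English",
            "(?i)new",
            "(?i)information (about|on).*"));
        System.out.println(filter.isStopLabel("New"));      // true
        System.out.println(filter.isStopLabel("New York")); // false: only part of the label matches
    }
}
```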

7 Customization

Customizing Carrot2 tools

This chapter will show you how to add new document sources and tune clustering in Carrot2 applications.

7.1 Component suites and attributes

Key concepts in customizing and tuning Carrot2 applications are component suites and component attributes described in the following sections.

7.1.1 Component suites

A component suite is a set of Carrot2 components, such as document sources or clustering algorithms, configured to work within a specific Carrot2 application. For each component, the component suite defines the component's identifier, label, description and also a number of component- and application-specific properties, such as the list of example queries.

Component suites are defined in XML files read from application-specific locations described in further sections of this chapter. An example component suite definition is shown in Figure 7.1.

Figure 7.1 Example Carrot2 component suite

<component-suite>
  <sources>
    <source id="lucene"
        component-class="org.carrot2.source.lucene.LuceneDocumentSource"
        attribute-sets-resource="lucene.attributes.xml">
      <label>Lucene</label>
      <title>Apache Lucene</title>
      <mnemonic>L</mnemonic>
      <description>
        Apache Lucene index (local index access).
      </description>
      <icon-path>icons/lucene.png</icon-path>
      <example-queries>
        <example-query>data mining</example-query>
        <example-query>london</example-query>
        <example-query>clustering</example-query>
      </example-queries>
    </source>
  </sources>
  
  <algorithms>
    <algorithm id="lingo" 
        component-class="org.carrot2.clustering.lingo.LingoClusteringAlgorithm" 
        attribute-sets-resource="lingo.attributes.xml">
      <label>Lingo</label>
      <title>Lingo Clustering</title>
    </algorithm>
  </algorithms>
  
  <include suite="source-bing.xml" />
  <include suite="algorithm-stc.xml" />
</component-suite>

The component suite definition can consist of the following elements:

  • sources  Document source definitions, optional.

  • algorithms  Clustering algorithm definitions, optional.

  • include  Includes other XML component suite definitions, optional. The resource specified in the suite attribute will be loaded from the current thread's context class loader.

Common parts of the source and algorithm tags include:

  • id  Identifier of the component within the suite, required. Identifiers must be unique within the component suite scope.

  • component-class  Fully qualified name of the processing component class, required.

  • attribute-sets-resource  XML file to load the component's attributes from. The resource specified in this attribute will be loaded from the current thread's context class loader. For the syntax of the XML file, please see Section 7.1.2.

  • label  A human readable label of the component, required.

  • title  A human readable title of the component, required. The title will be usually slightly longer than the label.

  • description  A longer description of the component, optional.

  • icon-path  Application specific definition of the component's icon.

Additionally, for the source tag you can use the example-queries tag to specify some example queries the applications may show for this source.
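Because component identifiers must be unique within the suite, it can be worth checking a hand-edited suite file before deploying it. The sketch below does this with the JDK's DOM parser only. It is an illustration rather than the validation Carrot2 itself performs: the SuiteIdCheck class is hypothetical, and suites referenced through include elements are not followed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not Carrot2 code: finds duplicate component ids in a suite. */
final class SuiteIdCheck {
    /** Returns the ids used by more than one source or algorithm element. */
    static List<String> duplicateIds(String suiteXml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(suiteXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            Set<String> seen = new HashSet<>();
            List<String> duplicates = new ArrayList<>();
            // Ids share one scope, so sources and algorithms are checked together.
            for (String tag : new String[] {"source", "algorithm"}) {
                NodeList components = root.getElementsByTagName(tag);
                for (int i = 0; i < components.getLength(); i++) {
                    String id = ((Element) components.item(i)).getAttribute("id");
                    if (!seen.add(id) && !duplicates.contains(id)) {
                        duplicates.add(id);
                    }
                }
            }
            return duplicates;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed suite definition", e);
        }
    }
}
```

An empty result means that no identifier is repeated in the file that was checked.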

7.1.2 Component attributes

A component attribute is a specific property of a Carrot2 component that influences its behavior, e.g. the number of search results fetched by a document source or the depth of cluster hierarchy produced by a clustering algorithm. Each attribute is identified by a unique string key. Chapter 12 lists and describes all available components and their attributes.

You can specify attribute values for specific components in the component suite using attribute sets. Attribute sets are defined in XML files referenced by the attribute-sets-resource attribute of the component's entry in the component suite. Figure 7.2 shows an example attribute set definition.

Figure 7.2 Example Carrot2 attribute set

<attribute-sets>
  <attribute-set id="lucene">
    <value-set>
      <label>Lucene</label>
      <attribute key="LuceneDocumentSource.directory">
        <value>
           <wrapper class="org.carrot2.source.lucene.FSDirectoryWrapper">
              <indexPath>/path/to/lucene/index/directory</indexPath>
           </wrapper>
        </value>
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.contentField">
        <value type="java.lang.String" value="summary" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.titleField">
        <value type="java.lang.String" value="title" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.urlField">
        <value type="java.lang.String" value="url" />
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>

An attribute-sets element can contain one or more attribute-set elements. Each attribute-set must specify a unique id and contain a value-set.
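
To illustrate the structure described above, the following Java sketch reads an attribute set XML with the JDK's DOM parser and collects those attributes whose value is written directly in the value attribute of the value element. It is not part of the Carrot2 API: the AttributeSetReader class, its method and the SAMPLE constant (a shortened version of Figure 7.2) are hypothetical, and wrapper-based values such as LuceneDocumentSource.directory are deliberately skipped.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the Carrot2 API.
final class AttributeSetReader {
  // A shortened attribute set modelled on Figure 7.2.
  static final String SAMPLE =
      "<attribute-sets><attribute-set id=\"lucene\"><value-set>"
    + "<label>Lucene</label>"
    + "<attribute key=\"LuceneDocumentSource.directory\"><value>"
    + "<wrapper class=\"org.carrot2.source.lucene.FSDirectoryWrapper\">"
    + "<indexPath>/path/to/lucene/index/directory</indexPath>"
    + "</wrapper></value></attribute>"
    + "<attribute key=\"org.carrot2.source.lucene.SimpleFieldMapper.titleField\">"
    + "<value type=\"java.lang.String\" value=\"title\"/></attribute>"
    + "</value-set></attribute-set></attribute-sets>";

  /** Returns attribute key -> string value for simple (non-wrapper) values. */
  static Map<String, String> readSimpleValues(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      Map<String, String> values = new LinkedHashMap<String, String>();
      NodeList attributes = doc.getElementsByTagName("attribute");
      for (int i = 0; i < attributes.getLength(); i++) {
        Element attribute = (Element) attributes.item(i);
        NodeList valueElements = attribute.getElementsByTagName("value");
        if (valueElements.getLength() == 0) {
          continue;
        }
        Element value = (Element) valueElements.item(0);
        // Wrapper-based values carry no "value" XML attribute; skip them.
        if (value.hasAttribute("value")) {
          values.put(attribute.getAttribute("key"), value.getAttribute("value"));
        }
      }
      return values;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse attribute set XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(readSimpleValues(SAMPLE));
  }
}
```

A listing like this can help you check which simple values an attribute set file overrides before you deploy it.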

Saving attributes to XML using Carrot2 Document Clustering Workbench  As the syntax of the value elements depends on the type of the attribute being set, the easiest way to obtain the XML file is to use the Carrot2 Document Clustering Workbench.

To generate attribute set XML for a document source:

  1. In the Search view, choose the document source for which you would like to save attributes.

  2. Use the Search view to set the desired attribute values.

  3. Choose the Save as... option from Search view's menu bar. Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the document source's attribute-sets-resource attribute.

Note

Please note that the Carrot2 Document Clustering Workbench will remove a number of common attributes from the XML file being saved, including: query, start result index, number of results.

To generate attribute set XML for a clustering algorithm:

  1. In the Search view, choose the clustering algorithm for which you would like to save attributes. Choose any document source and perform processing using the selected algorithm.

  2. Use the Attributes view to set the desired attribute values.

  3. Choose the Save as... option from the Attributes view's menu bar. Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the clustering algorithm's attribute-sets-resource attribute.

Tip

If for some reason you cannot use the Carrot2 Document Clustering Workbench to save attribute set XML files, you can modify the SavingAttributeValuesToXml class from the carrot2-examples package to correspond to the attribute values you would like to set and run the class to print the XML encoding of the attribute values to the standard output.

7.2 Adding document sources to Carrot2 Web Application

To add a document source tab to the Carrot2 Web Application:

  1. Open for editing the suite-webapp.xml file, located in the WEB-INF/suites directory of the WAR file.

  2. Add a descriptor for the document source you want to add to the sources section of the suite-webapp.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 7.1.1 for more information about the component suite XML file.

  3. If the document source you are adding requires setting specific attribute values (e.g. index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.

  4. Deploy the WAR file with the above modifications to your container. If the new document source tab is not showing, clear cookies for the domain on which the web application is deployed.

7.3 Adding document sources to Carrot2 Document Clustering Server

To add a document source tab to the Carrot2 Document Clustering Server:

  1. Open for editing the suite-dcs.xml file, located in the WEB-INF/suites directory of the DCS WAR file, which in turn is located in the war/ directory of the DCS distribution.

  2. Add a descriptor for the document source you want to add to the sources section of the suite-dcs.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 7.1.1 for more information about the component suite XML file.

  3. If the document source you are adding requires setting specific attribute values (e.g. index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.

  4. Restart the DCS. The new document source should be available for processing.

7.4 Customizing Lingo for Carrot2 Web Application

To run the Carrot2 Web Application with custom attributes of the Lingo clustering algorithm:

  1. Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.

  2. Replace the contents of lingo.attributes.xml, located in the WEB-INF/suites directory of the web application WAR file, with the XML file saved in the previous step.

  3. Deploy the WAR file with the above modifications to your container.

You can use the same procedure to customize other algorithms, e.g. STC.

7.5 Customizing Lingo for Carrot2 Document Clustering Server

To run the Carrot2 Document Clustering Server with custom attributes of the Lingo clustering algorithm:

  1. Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.

  2. Replace the contents of algorithm-lingo-attributes.xml, located in the WEB-INF/suites directory of the DCS WAR file, located in the war/ directory of the DCS distribution, with the XML file saved in the previous step.

  3. Restart the DCS.

You can use the same procedure to customize other algorithms, e.g. STC.

7.6 Customizing Lingo for Carrot2 Command Line Interface

To run the Carrot2 Command Line Interface with custom attributes of the Lingo clustering algorithm:

  1. Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.

  2. Replace the contents of algorithm-lingo-attributes.xml, located in the /suites directory of the CLI distribution, with the XML file saved in the previous step.

  3. Run the CLI application.

You can use the same procedure to customize other algorithms, e.g. STC.

7.7 Customizing Lingo in Carrot2 Java API

The Java API distribution package contains examples showing how to customize attributes of the clustering algorithms. Please see the org.carrot2.examples.clustering.UsingAttributes class or the JavaDoc overview page.

7.8 Adding document sources to Carrot2 Document Clustering Workbench

Not yet covered, please contact us if you need this section.

8 Advanced topics

Building and running Carrot2 from source code

This chapter discusses more advanced usage scenarios of Carrot2 such as running Carrot2 applications in Eclipse and building Carrot2 from source code.

8.1 Running Carrot2 in Eclipse IDE

8.1.1 Running Carrot2 Document Clustering Workbench in Eclipse IDE

To run Carrot2 Document Clustering Workbench in Eclipse IDE (version 3.4 or higher required):

  1. Set up Carrot2 source code in your Eclipse IDE.

  2. Choose Window > Preferences and then Run/Debug > String substitution. Add a temp_workspaces variable pointing to an existing disk directory where the Workbench's workspace should be created.

  3. Choose Run > Run Configurations... from the main menu and run the Workbench configuration.

    Figure 8.1 Workbench Run Configuration


8.1.2 Running Carrot2 Web Application in Eclipse IDE

To run Carrot2 Web Application in Eclipse IDE:

  1. Set up Carrot2 source code in your Eclipse IDE.

  2. Choose Run > External Tools > External Tools Configurations... from the main menu and run the Web Application Setup [carrot2] configuration. This will preprocess various configuration files required by the web application.

  3. Choose Run > Run Configurations... from the main menu and run the Web Application Runner [carrot2] configuration.

  4. Point your browser to http://localhost:8080 to access the running web application.

8.2 Building Carrot2 from source code

To build Carrot2 applications from source code, you will need Java Software Development Kit (Java SDK) version 1.8 or higher and Apache Ant version 1.9.3 or higher. You can check out the latest Carrot2 source code using git:

git clone https://github.com/carrot2/carrot2.git

8.2.1 Building Carrot2 Document Clustering Workbench

To build Carrot2 Document Clustering Workbench from source code:

  1. Download Eclipse Target Platform from http://download.carrot2.org/eclipse and extract to some local folder.

  2. Copy local.properties.example from Carrot2 checkout folder to local.properties in the same folder. In local.properties edit the target.platform property to point to the Eclipse Target Platform you have downloaded.

    Important

    The folder pointed to by target.platform must have the eclipse/ folder inside.

    You may also change the configs property to match the platform you want to build Carrot2 Document Clustering Workbench for or rely on auto-detection.

  3. Run:

    ant workbench

    to build Carrot2 Document Clustering Workbench binaries.

  4. Go to the tmp/workbench/tmp/carrot2-workbench folder in the Carrot2 checkout directory and run Carrot2 Document Clustering Workbench.

8.2.2 Building Carrot2 Web Application

To build Carrot2 Web Application from source code:

  1. Run:

    ant webapp

    in the main Carrot2 checkout directory.

  2. Go to the tmp/webapp/ folder in the Carrot2 checkout dir where you will find the web application WAR file.

8.3 Using Carrot2 Document Clustering Server with curl

You can use curl to post requests to the Carrot2 Document Clustering Server. Figure 8.2 shows how to use curl to query an external document source and cluster the results using the DCS. Figure 8.3 shows how to cluster documents from an XML file in Carrot2 format using the DCS. Please see the examples/curl directory of the Carrot2 Document Clustering Server distribution archive for more curl DCS invocation examples.

Figure 8.2 Using DCS and curl to cluster data from document source

curl http://localhost/dcs/rest \
     -F "dcs.source=etools" \
     -F "query=test" \
     -o result.xml

Figure 8.3 Using DCS and curl to cluster documents from an XML file

curl http://localhost/dcs/rest \
     -F "dcs.c2stream=@documents-in-carrot2-format.xml" \
     -o result.xml

Tip

You can download curl for Windows from http://curl.haxx.se/latest.cgi?curl=win32-nossl.
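
If you prefer to call the DCS from Java rather than curl, the request from Figure 8.2 can be approximated with the JDK alone. The sketch below is an assumption-laden illustration, not a client shipped with Carrot2: the DcsFormRequest class and its methods are hypothetical, it handles plain text fields only (no file uploads such as dcs.c2stream), and the main method only prints the request body instead of contacting a server.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client sketch, not part of Carrot2.
final class DcsFormRequest {
  /** Encodes text fields as a multipart/form-data body, similar to curl -F. */
  static String buildBody(Map<String, String> fields, String boundary) {
    StringBuilder body = new StringBuilder();
    for (Map.Entry<String, String> field : fields.entrySet()) {
      body.append("--").append(boundary).append("\r\n")
          .append("Content-Disposition: form-data; name=\"")
          .append(field.getKey()).append("\"\r\n\r\n")
          .append(field.getValue()).append("\r\n");
    }
    return body.append("--").append(boundary).append("--\r\n").toString();
  }

  /** Posts the fields to a running DCS and returns the HTTP status code. */
  static int post(String dcsUrl, Map<String, String> fields) throws Exception {
    String boundary = "dcs-sketch-" + System.nanoTime();
    byte[] payload = buildBody(fields, boundary).getBytes(StandardCharsets.UTF_8);
    HttpURLConnection connection =
        (HttpURLConnection) new URL(dcsUrl).openConnection();
    connection.setRequestMethod("POST");
    connection.setDoOutput(true);
    connection.setRequestProperty(
        "Content-Type", "multipart/form-data; boundary=" + boundary);
    try (OutputStream out = connection.getOutputStream()) {
      out.write(payload);
    }
    return connection.getResponseCode();
  }

  public static void main(String[] args) {
    Map<String, String> fields = new LinkedHashMap<String, String>();
    fields.put("dcs.source", "etools");
    fields.put("query", "test");
    // Only prints the body; call post("http://localhost/dcs/rest", fields)
    // yourself once a DCS instance is running at that address.
    System.out.print(buildBody(fields, "example-boundary"));
  }
}
```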

8.4 Working with HTTP proxies

If your server or development machine connects to HTTP servers via an HTTP proxy, you can configure most Carrot2 document source implementations to take this into account by defining the following global system properties:

http.proxyhost

URL of the HTTP proxy (numeric or full address, but without the port number).

http.proxyport

Proxy server's port number.

Two sources that currently do not support the above properties are: Bing7DocumentSource and OpenSearchDocumentSource.
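
System properties can be passed on the JVM command line with the standard -D option or set programmatically. The following minimal Java sketch shows the programmatic variant. The ProxySettings class and the proxy.example.com:3128 address are placeholders, and the sketch assumes that the properties should be set before any Carrot2 component is created, which this section does not state explicitly.

```java
// Hypothetical helper; host and port values are placeholders.
final class ProxySettings {
  /** Defines the two proxy properties described in this section. */
  static void apply(String host, int port) {
    System.setProperty("http.proxyhost", host);
    System.setProperty("http.proxyport", Integer.toString(port));
  }

  public static void main(String[] args) {
    // Assumption: call this early, before creating Carrot2 components.
    apply("proxy.example.com", 3128);
    System.out.println(System.getProperty("http.proxyhost")
        + ":" + System.getProperty("http.proxyport"));
  }
}
```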

8.5 HTTP BASIC or DIGEST authentication

If your document source initiates HTTP connections to a server protected with BASIC or DIGEST HTTP authentication, you will have to pass the username and password to the application so that such connections can be established. Define the following global system properties (they are picked up once, when the Controller is created):

http.auth.username

Username for BASIC or DIGEST authentication.

http.auth.password

Password for BASIC or DIGEST authentication.

Note that, in general, it's better not to have any HTTP authentication at all since it's a very weak form of protection anyway and only increases network traffic (two HTTP requests may have to be made in order to fetch the remote resource).

9 Troubleshooting

Solving common problems with Carrot2

This chapter discusses solutions to some common problems with Carrot2 code or applications.

9.1 Troubleshooting Carrot2 Document Clustering Workbench

9.1.1 Increasing memory size

To increase Java heap size for Carrot2 Document Clustering Workbench, use the following command line parameters:

carrot2-workbench -vmargs -Xmx256m

Tip

Using the above pattern you can specify any other JVM options if needed.

Tip

You can also add JVM path and options to the eclipse.ini file located in the Carrot2 Document Clustering Workbench installation directory. Please see Eclipse Wiki for a list of all available options.

9.1.2 Getting exception stack trace

To get the stack trace (useful for the Carrot2 team to spot errors) corresponding to a processing error in Carrot2 Document Clustering Workbench, follow this procedure:

  1. Click OK on the Problem Occurred dialog box (Figure 9.1).

    Figure 9.1 Carrot2 Document Clustering Workbench error dialog

  2. Go to Window > Show view > Other... and choose Error Log (Figure 9.2).

    Figure 9.2 Carrot2 Document Clustering Workbench Show View dialog

  3. In the Error Log view double click the line corresponding to the error (Figure 9.3).

    Figure 9.3 Carrot2 Document Clustering Workbench Error Log view

  4. Copy the exception stack trace from the Event Details dialog and pass it to the Carrot2 team (Figure 9.4).

    Figure 9.4 Carrot2 Document Clustering Workbench Event Details dialog


9.2 Troubleshooting Carrot2 Web Application

9.2.1 "?" characters instead of Unicode special characters

Symptoms

Question marks ("?") appear instead of Chinese, Polish or other special Unicode characters in clusters and documents output by the Carrot2 Web Application.

Cause

The Carrot2 Web Application running under a Web application container (such as Tomcat) relies on proper decoding of Unicode characters from the request URI. This decoding is done by the container and must be properly configured at the container level. Unfortunately, this configuration is not part of the J2EE standard and is therefore different for each container.

Solution for Apache Tomcat

For Apache Tomcat, you can enforce the URI decoding code page at the connector configuration level. Locate the server.xml file inside Tomcat's conf folder and add the following attribute to the Connector section:

URIEncoding="UTF-8"

A typical connector configuration should look like this:

<Connector port="8080" maxThreads="25" 
    minSpareThreads="5" maxSpareThreads="10" 
    minProcessors="5" maxProcessors="25" 
    enableLookups="false" redirectPort="8443" 
    acceptCount="10" debug="0" 
    connectionTimeout="20000" URIEncoding="UTF-8" />
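
To see why the decoding charset matters, consider the Polish word Łódź sent as a UTF-8 percent-encoded query. The Java sketch below decodes the same request string with two charsets using the JDK's URLDecoder. It is only an illustration of the general effect: the UriDecodingDemo class is hypothetical and does not reproduce how Tomcat or Carrot2 process requests internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustration only; not how Tomcat or Carrot2 decode requests internally.
final class UriDecodingDemo {
  /** A 4-character Polish word, percent-encoded as UTF-8. */
  static final String ENCODED_QUERY = "%C5%81%C3%B3d%C5%BA";

  static String decode(String encoded, String charsetName) {
    try {
      return URLDecoder.decode(encoded, charsetName);
    } catch (UnsupportedEncodingException e) {
      throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
    }
  }

  public static void main(String[] args) {
    // Matching charset: the original 4 characters are restored.
    System.out.println(decode(ENCODED_QUERY, "UTF-8").length());      // prints 4
    // Wrong charset: each of the 7 bytes becomes an unrelated character.
    System.out.println(decode(ENCODED_QUERY, "ISO-8859-1").length()); // prints 7
  }
}
```

With a mismatched charset the query text is already corrupted before it reaches the application, which is why the fix has to be made at the container level.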

10 Architecture and API

Discussion of Carrot2 internals

This chapter discusses some Carrot2 architecture assumptions, internals and more complex API use cases.

10.1 Carrot2 architecture overview

This section provides a very brief overview of Carrot2 architecture. If you would like us to cover some specific topic in more detail, please let us know on the mailing list.

10.1.1 Processing component pipeline

Processing in Carrot2 is based on a pipeline of processing components. The two main types of Carrot2 processing components are:

  • Document Sources  provide data for further processing. In a typical scenario, such a component would fetch search results from e.g. an external search engine, Lucene / Solr index or an XML file. Currently, Carrot2 distribution contains 12 different document source components.

  • Clustering Algorithms  organize documents provided by document sources into meaningful groups. Currently, two specialized clustering algorithms are available in Carrot2: Lingo and STC. Additionally, a number of "synthetic" clustering algorithms are available, such as by URL clustering.

Carrot2 applications, such as Carrot2 Document Clustering Workbench or Carrot2 Document Clustering Server operate on a pipeline consisting of one document source and one clustering algorithm, but using Carrot2 Java API you can insert additional components at any point in the pipeline. Currently, the only component not falling into the above categories is a component for computing certain cluster quality metrics, but more components may be added in the future, e.g. for spell checking of user queries.
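
The pipeline idea can be pictured with a toy model in plain Java. The sketch below is not Carrot2 code: the ToyPipeline class, its Component interface and the cluster-count key are hypothetical, the "by host" grouping is a simplification of by URL clustering, and the only names borrowed from this manual are the documents and clusters attribute keys and the example URLs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a processing pipeline. None of these types are Carrot2 classes.
final class ToyPipeline {
  /** A component reads and writes entries of a shared attribute map. */
  interface Component {
    void process(Map<String, Object> attributes);
  }

  /** Document source: publishes document URLs under the "documents" key. */
  static final Component SOURCE = attributes ->
      attributes.put("documents", Arrays.asList(
          "http://www.globe.com.ph/", "http://www.globeshoes.com/",
          "http://document.url/1", "http://document.url/2"));

  /** Simplified "by URL" algorithm: one cluster per host name. */
  static final Component BY_HOST = attributes -> {
    Map<String, List<String>> clusters = new LinkedHashMap<String, List<String>>();
    for (Object url : (List<?>) attributes.get("documents")) {
      String host = URI.create(url.toString()).getHost();
      if (!clusters.containsKey(host)) {
        clusters.put(host, new ArrayList<String>());
      }
      clusters.get(host).add(url.toString());
    }
    attributes.put("clusters", clusters);
  };

  /** Additional component inserted after clustering: counts the clusters. */
  static final Component CLUSTER_COUNT = attributes ->
      attributes.put("cluster-count",
          ((Map<?, ?>) attributes.get("clusters")).size());

  /** Runs the components in order over one shared attribute map. */
  static Map<String, Object> run(Component... components) {
    Map<String, Object> attributes = new LinkedHashMap<String, Object>();
    for (Component component : components) {
      component.process(attributes);
    }
    return attributes;
  }

  public static void main(String[] args) {
    System.out.println(run(SOURCE, BY_HOST, CLUSTER_COUNT).get("cluster-count"));
  }
}
```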

10.1.2 Processing component attributes

The behavior of both document sources and clustering algorithms depends on a number of attributes (settings) such as the number of documents to fetch or the number of clusters to produce. The way you provide attribute values for specific components depends on the Carrot2 application you are working with:

  • Carrot2 Document Clustering Workbench.  In Carrot2 Document Clustering Workbench you can provide attributes for document sources (such as number of results to fetch or preferred results language) before you issue a query in the Search view. You can change clustering algorithm attributes using the sliders in the Attributes view.

  • Carrot2 Document Clustering Server.  In Carrot2 Document Clustering Server, you can provide attribute values as additional parameters in the POST request. Name of the POST parameter should be the identifier of the attribute you want to set (see Chapter 12 for attribute identifiers). Carrot2 will attempt to convert the string value of the parameter to the required type (integer, float etc.).

For a complete reference of attributes of each Carrot2 component, please see Chapter 12.

10.2 Carrot2 XML data formats

This section shows examples of Carrot2 input and output XML formats, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Workbench, Carrot2 Document Clustering Server and Carrot2 Web Application.

10.2.1 Carrot2 input XML format

To provide documents for Carrot2 clustering, use the following XML format:

Figure 10.1 Carrot2 input XML format

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
  <query>Globe</query>
  <document id="0">
    <title>default</title>
    <url>http://www.globe.com.ph/</url>
    <snippet>
      Provides mobile communications (GSM) including 
      GenTXT, handyphones, wireline services, and
      broadband Internet services.
    </snippet>
  </document>
  <document id="1">
    <title>Skate Shoes by Globe | Time For Change</title>
    <url>http://www.globeshoes.com/</url>
    <snippet>
      Skaters, surfers, and snowboarders
      designing in their own style.
    </snippet>
  </document>

  ...

</searchresult>
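
If your documents come from your own application, you can generate this format with any XML library. The following Java sketch uses the JDK's XMLStreamWriter, which takes care of escaping special characters. The InputXmlWriter class and the representation of a document as a {title, url, snippet} array are assumptions made for this example only.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of the Carrot2 API.
final class InputXmlWriter {
  /** Writes {title, url, snippet} triples in the format of Figure 10.1. */
  static String write(String query, String[][] documents) {
    try {
      StringWriter out = new StringWriter();
      XMLStreamWriter xml =
          XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xml.writeStartDocument("UTF-8", "1.0");
      xml.writeStartElement("searchresult");
      element(xml, "query", query);
      for (int i = 0; i < documents.length; i++) {
        xml.writeStartElement("document");
        xml.writeAttribute("id", Integer.toString(i));
        element(xml, "title", documents[i][0]);
        element(xml, "url", documents[i][1]);
        element(xml, "snippet", documents[i][2]);
        xml.writeEndElement();
      }
      xml.writeEndElement();
      xml.writeEndDocument();
      xml.close();
      return out.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Could not write input XML", e);
    }
  }

  private static void element(XMLStreamWriter xml, String name, String text)
      throws XMLStreamException {
    xml.writeStartElement(name);
    xml.writeCharacters(text); // escapes &, < and > as needed
    xml.writeEndElement();
  }

  public static void main(String[] args) {
    System.out.println(write("Globe", new String[][] {
        {"Skate Shoes by Globe | Time For Change", "http://www.globeshoes.com/",
         "Skaters, surfers, and snowboarders designing in their own style."}}));
  }
}
```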

10.2.2 Carrot2 output XML format

Carrot2 saves the clusters in the following XML format:

Figure 10.2 Carrot2 output XML format

<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
  <query>Globe</query>
  <document id="0">
    <title>default</title>
    <url>http://www.globe.com.ph/</url>
    <snippet>
      Provides mobile communications (GSM) including 
      GenTXT, handyphones, wireline services, and
      broadband Internet services.
    </snippet>
  </document>
  <document id="1">
    <title>Skate Shoes by Globe | Time For Change</title>
    <url>http://www.globeshoes.com/</url>
    <snippet>
      Skaters, surfers, and snowboarders
      designing in their own style.
    </snippet>
  </document>

  ...

  <group id="0" size="60" score="1.0">
    <title>
      <phrase>com</phrase>
    </title>
    <group id="1" size="2" score="1.0">
      <title>
        <phrase>amazon.com</phrase>
      </title>
      <document refid="43"/>
      <document refid="77"/>
    </group>
    <group id="2" size="2" score="0.8">
      <title>
        <phrase>boston.com</phrase>
      </title>
      <document refid="4"/>
      <document refid="7"/>
    </group>
    
    ...
    
    <group id="7" size="48">
      <title>
        <phrase>Other Sites</phrase>
      </title>
      <attribute key="other-topics">
        <value type="java.lang.Boolean" value="true"/>
      </attribute>
      <document refid="1"/>
      <document refid="2"/>
      ...
    </group>
  </group>
  <group id="8" size="12" score="0.72">
    <title>
      <phrase>org</phrase>
    </title>
    <group id="9" size="2" score="1.0">
      <title>
        <phrase>en.wikipedia.org</phrase>
      </title>
      <document refid="9"/>
      <document refid="14"/>
      ...
    </group>
  </group>
  ...


</searchresult>
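
The nested group elements can be processed with a simple recursive walk. The Java sketch below prints an indented outline of cluster labels together with the refid values of the documents assigned directly to each group. It uses only the JDK's DOM parser. The ClusterXmlReader class is hypothetical, and SAMPLE is a shortened version of Figure 10.2.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the Carrot2 API.
final class ClusterXmlReader {
  // Shortened version of Figure 10.2.
  static final String SAMPLE = "<searchresult><query>Globe</query>"
    + "<group id=\"0\" size=\"60\" score=\"1.0\"><title><phrase>com</phrase></title>"
    + "<group id=\"1\" size=\"2\" score=\"1.0\"><title><phrase>amazon.com</phrase></title>"
    + "<document refid=\"43\"/><document refid=\"77\"/></group></group>"
    + "<group id=\"8\" size=\"12\" score=\"0.72\"><title><phrase>org</phrase></title>"
    + "<group id=\"9\" size=\"2\" score=\"1.0\"><title><phrase>en.wikipedia.org</phrase></title>"
    + "<document refid=\"9\"/><document refid=\"14\"/></group></group>"
    + "</searchresult>";

  /** Returns one line per group: indentation, label, direct document refids. */
  static List<String> outline(String xml) {
    try {
      Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
      List<String> lines = new ArrayList<String>();
      collect(root, "", lines);
      return lines;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse output XML", e);
    }
  }

  private static void collect(Element parent, String indent, List<String> lines) {
    for (Element group : children(parent, "group")) {
      List<String> refIds = new ArrayList<String>();
      for (Element document : children(group, "document")) {
        refIds.add(document.getAttribute("refid"));
      }
      lines.add(indent + label(group) + " " + refIds);
      collect(group, indent + "  ", lines); // descend into subgroups
    }
  }

  private static String label(Element group) {
    for (Element title : children(group, "title")) {
      for (Element phrase : children(title, "phrase")) {
        return phrase.getTextContent().trim();
      }
    }
    return "";
  }

  /** Direct child elements with the given tag name. */
  private static List<Element> children(Element parent, String name) {
    List<Element> result = new ArrayList<Element>();
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (n instanceof Element && name.equals(n.getNodeName())) {
        result.add((Element) n);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    for (String line : outline(SAMPLE)) {
      System.out.println(line);
    }
  }
}
```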

10.3 Carrot2 JSON data format

This section shows examples of Carrot2 output JSON format, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Server and Carrot2 Java API.

10.3.1 Carrot2 output JSON format

Carrot2 saves documents and the clusters in the following JSON format:

Figure 10.3 Carrot2 output JSON format

{
  "clusters": [
    {
      "attributes": {
        "score": 1.0
      }, 
      "documents": [
        0, 
        2
      ], 
      "id": 0, 
      "phrases": [
        "Cluster 1"
      ], 
      "score": 1.0, 
      "size": 2
    }, 
    {
      "attributes": {
        "score": 0.63
      }, 
      "clusters": [
        {
          "attributes": {
            "score": 0.3
          }, 
          "documents": [
            1
          ], 
          "id": 2, 
          "phrases": [
            "Cluster 2.1"
          ], 
          "score": 0.3, 
          "size": 1
        }, 
        {
          "attributes": {
            "score": 0.15
          }, 
          "documents": [
            2
          ], 
          "id": 3, 
          "phrases": [
            "Cluster 2.2"
          ], 
          "score": 0.15, 
          "size": 1
        }
      ], 
      "documents": [
        0
      ], 
      "id": 1, 
      "phrases": [
        "Cluster 2"
      ], 
      "score": 0.63, 
      "size": 3
    }
  ], 
  "documents": [
    {
      "id": 0, 
      "snippet": "Document 1 Content.", 
      "title": "Document 1 Title", 
      "url": "http://document.url/1"
    }, 
    {
      "id": 1, 
      "snippet": "Document 2 Content.", 
      "title": "Document 2 Title", 
      "url": "http://document.url/2"
    }, 
    {
      "id": 2, 
      "snippet": "Document 3 Content.", 
      "title": "Document 3 Title", 
      "url": "http://document.url/3"
    }
  ], 
  "query": "query (optional)"
}
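
In Figure 10.3, the size of each cluster appears to equal the number of unique documents in the cluster and all its subclusters, e.g. "Cluster 2" lists document 0 directly and documents 1 and 2 through its subclusters, giving a size of 3. This is read off the example rather than taken from a format specification. The Java sketch below expresses that reading with a hypothetical in-memory model (JsonClusterModel is not a Carrot2 class and no JSON parsing is involved).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of one cluster from Figure 10.3.
final class JsonClusterModel {
  final List<Integer> documents;          // the "documents" array
  final List<JsonClusterModel> clusters;  // the nested "clusters" array

  JsonClusterModel(List<Integer> documents, JsonClusterModel... clusters) {
    this.documents = documents;
    this.clusters = Arrays.asList(clusters);
  }

  /** Unique document ids in this cluster and all of its subclusters. */
  Set<Integer> allDocuments() {
    Set<Integer> all = new TreeSet<Integer>(documents);
    for (JsonClusterModel subcluster : clusters) {
      all.addAll(subcluster.allDocuments());
    }
    return all;
  }

  /** "Cluster 2" from Figure 10.3 with its two subclusters. */
  static JsonClusterModel cluster2FromFigure() {
    return new JsonClusterModel(Arrays.asList(0),
        new JsonClusterModel(Arrays.asList(1)),
        new JsonClusterModel(Arrays.asList(2)));
  }

  public static void main(String[] args) {
    System.out.println(cluster2FromFigure().allDocuments().size()); // prints 3
  }
}
```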

11 Carrot2 Development

Contributing to Carrot2

This chapter contains information for Carrot2 developers.

11.1 Stable release procedure

Each Carrot2 release should be performed according to the following procedure:

  1. Precondition: resolved issues  All issues scheduled for the release (that is, with the release as their "fix for" version) must be resolved.

  2. Precondition: successful continuous integration builds  The status of all builds must be successful. For bugfixing releases, check the appropriate build on the server.

  3. Update source code headers and line endings 

    ant prerelease

    Commit changes to trunk.

  4. Verify that Maven dependencies are in sync 

    (cd etc/maven/poms; mvn dependency:tree )

    Review Maven POMs to ensure dependencies are in sync with the JAR versions in the repository.

  5. Run all the tests and distribution target 

    git clean -xfd # removes any local files, including settings!
    ant -Dlocal.properties=local.properties.example -Dtools.dir=... clean dist

    Everything should pass. Extra tools repo will be required.

  6. Generate and verify JavaDocs 

    ant javadoc # (already in dist)

    Review JavaDoc documentation, provide missing public and protected members description, provide missing package descriptions.

  7. Generate and verify Carrot2 Manual 

    ant doc # (already in dist)

    Review Carrot2 Manual, modify or add content related to the features implemented in the new release.

  8. Review static code analysis reports 

    ant reports

    Review and fix reasonably-looking flaws.

  9. Update version number strings  Update carrot2.version and remove -SNAPSHOT suffix. This number will be embedded in distribution file names, JavaDoc page title and other version-sensitive places.

  10. Generate API XML file and API differences.  Pick the previous version to compare against (typically the previous version on the branch). Generate API XML and a comparison report:

    ant clean jdiff-compare -Dversion.previous=x.y.z

    Copy API XML report for future comparisons:

    cp tmp/compatibility-report/*.xml etc/jdiff/

    Commit changes. Push.

  11. Trigger stable build in Bamboo.  Go to Carrot2 Bamboo (requires admin privileges) and trigger a stable build. If the build is successful, all distribution files should be available in the download directory. This is the "candidate" release.

  12. Verify the distribution files  Download, unpack and run each distribution file to make sure there are no obvious release blockers.

  13. Create an annotated release tag and push changes. 

    git tag -a release/x.y.z -m "Release x.y.z"

  14. Trigger stable build in Bamboo.  Go to Carrot2 Bamboo (requires admin privileges) and trigger a stable build again. This is the final release.

  15. Publish Maven artifacts.  First, run:

    ant maven.deploy

    This pushes a release to Sonatype's staging area (an appropriate Sonatype server configuration in ~/.m2/settings.xml and GPG keys in ~/.gnupg/ are required). Then log in to Sonatype, close the release bundle and publish it. This can be done later from the tagged revision.

  16. Bump version number strings  Bump version number to the next anticipated version and add -SNAPSHOT. Commit changes.

  17. Update JIRA  Close issues scheduled for the release being made, release the version in JIRA, create a next version in JIRA.

  18. Release on GitHub and update downloads  The staging server holds the build files (/srv/vhosts/get.carrot2.org/head/). Upload them to GitHub (https://github.com/carrot2/carrot2/releases) and create release news for the tag.

  19. Update project website 

    1. Release notes  Add a page named release-[version]-notes that lists new features, major bug fixes and improvements introduced in the new release.

    2. Release note history  Add release date and link to the release's JIRA issues on the release-notes page.

  20. Circulate release news  If appropriate, circulate release news to:

    1. Carrot2 mailing lists

  21. Update Wikipedia page  If appropriate, update Carrot2 page on Wikipedia.

  22. Consider upgrading Carrot2 in dependent projects  If reasonable, upgrade Carrot2 dependency in other known projects, such as Apache Solr.

11.2 Versioning scheme

Carrot2 uses version identifiers consisting of three, dot-separated numbers: product-line.major.minor. This scheme is modelled after Maven's POM versions and has the following interpretation:

product-line

Indicates long-term product line identifier. This number will not change frequently as it reflects major changes in the internal architecture or shipped software components. Reading release notes is a must because the internal programming interfaces have very likely changed significantly.

major

Major revision number changes indicate addition of significant new features, performance optimizations or new front-end software components added to Carrot2. Reading release notes is highly recommended because programming interfaces may change slightly from major to major revision.

minor

Minor revision numbers are reserved for shipped product updates and bug fixes. These may include critical bug fixes as well as patches increasing performance, but not changing the programming interfaces. Reading release notes is recommended, but a drop-in upgrade should work without any extra work.

The git repository is organized so that the master branch tracks the development of the next major revision. Bugfix branches track minor revisions of already shipped versions. A tag is created for each shipped version. Branch and tag names follow the naming conventions below.

master

The master branch is equivalent to the next major software revision being developed and is not numbered explicitly, but corresponds to branch vX.Y.0, where Y is the next major revision to be shipped. It is possible to create a minor release off the master branch directly if the commit log only includes bug fixes.

bugfix/X.Y.Z

A branch named bugfix/X.Y.z tracks the product shipped as X.Y.z, where the z component is the next minor release to be shipped from this branch. Once shipped, a tag should be created.

release/X.Y.Z

A tag named release/X.Y.Z should be created for exactly that development branch at the time of shipment.
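
The scheme above is easy to handle in code. The following Java sketch parses a version identifier, derives the release tag name and checks whether two versions differ only in the minor number, in which case a drop-in upgrade is expected to work. The VersionId class is a hypothetical helper written for this manual and is not part of the Carrot2 build.

```java
// Hypothetical helper for the product-line.major.minor scheme.
final class VersionId {
  private static final String SNAPSHOT_SUFFIX = "-SNAPSHOT";

  final int productLine;
  final int major;
  final int minor;

  private VersionId(int productLine, int major, int minor) {
    this.productLine = productLine;
    this.major = major;
    this.minor = minor;
  }

  /** Parses "3.17.0" or "3.17.0-SNAPSHOT"; the suffix is dropped. */
  static VersionId parse(String text) {
    String numbers = text.endsWith(SNAPSHOT_SUFFIX)
        ? text.substring(0, text.length() - SNAPSHOT_SUFFIX.length())
        : text;
    String[] parts = numbers.split("\\.");
    if (parts.length != 3) {
      throw new IllegalArgumentException(
          "Expected product-line.major.minor: " + text);
    }
    return new VersionId(Integer.parseInt(parts[0]),
        Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
  }

  /** Tag name following the release/X.Y.Z convention. */
  String releaseTag() {
    return "release/" + productLine + "." + major + "." + minor;
  }

  /** True if only the minor number differs from the other version. */
  boolean differsOnlyInMinor(VersionId other) {
    return productLine == other.productLine
        && major == other.major
        && minor != other.minor;
  }

  public static void main(String[] args) {
    System.out.println(parse("3.17.0-SNAPSHOT").releaseTag()); // release/3.17.0
  }
}
```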

11.3 QA check list

This is a very quick quality assurance check list to run through before stable releases. This list also serves as a guideline for further automation of acceptance tests.

Note

Note that this list does not contain many checks for the Carrot2 Web Application, Carrot2 Document Clustering Server and Carrot2 Java API as these are fairly well tested during builds (webtests, smoke-tests).

  1. For each supported platform you can test, check that Carrot2 Document Clustering Workbench:

    1. launches without errors in the error log

    2. executes and clusters a remote search query without errors

    3. executes and clusters a Lucene query without errors (we've had a bug that caused the Lucene directory attribute editor to disappear, hence this step).

    4. can edit a clustering algorithm's attribute

    5. shows both cluster visualizations

    6. executes clustering algorithm benchmarks

  2. Check that the Carrot2 Document Clustering Server starts up correctly from the command line on Windows and Linux. More acceptance tests are performed during builds (but they start Carrot2 Document Clustering Server using the WAR file instead of the command line).

12 Component reference

Detailed description of all Carrot2 components

This section lists and describes attributes of all Carrot2 components. By changing values of these attributes, you can change the behaviour of the component. Please see Chapter 7 for information on how you pass attribute values in different Carrot2 applications.

Each attribute is described by a number of properties:

  • Key  The unique identifier of the attribute.

  • Direction 

    • Input  The attribute is an input for the component, the behaviour of the component depends on its value.

    • Output  The attribute is an output produced by the component.

  • Level  Informs how advanced the attribute is.

    • Basic  Attribute value should be fairly easily tunable by a person without significant experience in text clustering.

    • Medium  Attribute value should be fairly easily tunable by a person with some intuition about text clustering.

    • Advanced  Attribute may require in-depth knowledge of the component for successful tuning.

  • Required  If true and the attribute does not have a default value, a value must be provided for the component to perform processing.

  • Scope 

    • Initialization time  Attribute value will be respected only when the component is initializing; values provided at processing time will be ignored. This scope applies to the attributes that control time-consuming operations performed once per component instance (e.g. parsing of configuration files). As a result, only a handful of attributes fall into the initialization-time only scope.

    • Processing time  Attribute values will be respected both at initialization and clustering time. Most of the attributes fall into this scope.

    Please note that certain attributes can be both initialization- and processing-time. In most such cases it is advisable to provide the value at initialization time because processing the same value passed at processing time may degrade the performance a little (e.g. due to re-reading configuration files).

  • Value type  The Java type of the attribute's value.

  • Default value  The default value of the attribute or none if there is no default value defined for the attribute.
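
As a rough illustration of how keys, value types and scopes fit together, the sketch below collects attribute values in plain java.util.Map instances, one for initialization time and one for processing time. The keys and default values come from the reference that follows; the class and method names (AttributeMapsSketch, initAttributes, processingAttributes) are hypothetical, and the documents list is only a placeholder. Handing the maps over to a Carrot2 component is covered in Chapter 7 and is deliberately not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AttributeMapsSketch {
    /** Values for attributes whose scope includes initialization time. */
    static Map<String, Object> initAttributes() {
        Map<String, Object> attrs = new HashMap<>();
        // Tokenizer.documentFields: scope "Initialization time", value type java.util.Collection.
        attrs.put("Tokenizer.documentFields", Arrays.asList("title", "snippet"));
        return attrs;
    }

    /** Values for attributes with the processing-time scope. */
    static Map<String, Object> processingAttributes(List<?> documents) {
        Map<String, Object> attrs = new HashMap<>();
        // "documents" is an input attribute of value type java.util.List.
        attrs.put("documents", documents);
        // Value type java.lang.Integer, default 30.
        attrs.put("LingoClusteringAlgorithm.desiredClusterCountBase", 30);
        // Value type java.lang.Double, default 0.9.
        attrs.put("TermDocumentMatrixBuilder.maxWordDf", 0.9);
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(initAttributes());
        // Placeholder strings stand in for real documents in this sketch.
        System.out.println(processingAttributes(Arrays.asList("doc1", "doc2")));
    }
}
```

Keeping the two maps separate mirrors the Scope property: values in the first map are meant to be supplied once per component instance, values in the second with every processing request.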

12.1 By Source Clustering

12.1.1 By Source Clustering input attributes by level

12.1.2 By Source Clustering attributes by direction

Output

12.1.3 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required no
Scope Processing time
Value type java.util.List
Default value none
Attribute builder ByFieldClusteringAlgorithmDescriptor.​AttributeBuilder#documents()

12.1.4 Fields

Field name

Key ByAttributeClusteringAlgorithm.fieldName
Direction Input
Level BASIC
Description Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give rise to a single cluster, named using the value returned by org.carrot2.clustering.synthetic.ByFieldClusteringAlgorithm.buildClusterLabel(Object). If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.
Required yes
Scope Processing time
Value type java.lang.String
Default value sources
Value content Must not be blank
Attribute builder ByFieldClusteringAlgorithmDescriptor.​AttributeBuilder#fieldName()
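
The grouping rule described for the field name attribute can be sketched in plain Java as follows. This is not the Carrot2 implementation: documents are modelled as simple maps, clusters are keyed by String.valueOf(value) as a stand-in for buildClusterLabel(Object), and the class name ByFieldGroupingSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ByFieldGroupingSketch {
    /** Groups document indices by the value(s) found in the given field. */
    static Map<String, List<Integer>> group(List<Map<String, Object>> docs, String fieldName) {
        Map<String, List<Integer>> clusters = new LinkedHashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            Object value = docs.get(i).get(fieldName);
            if (value == null) {
                continue; // null field values do not give rise to a cluster
            }
            if (value instanceof Collection) {
                // A collection assigns the document to one cluster per element.
                for (Object element : (Collection<?>) value) {
                    if (element != null) {
                        add(clusters, element, i);
                    }
                }
            } else {
                // Scalars, and arrays (which are not unfolded), map to a single cluster.
                add(clusters, value, i);
            }
        }
        return clusters;
    }

    private static void add(Map<String, List<Integer>> clusters, Object value, int docIndex) {
        // Assumption: the label is simply the string form of the value.
        clusters.computeIfAbsent(String.valueOf(value), k -> new ArrayList<>()).add(docIndex);
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = Arrays.asList(
            Collections.<String, Object>singletonMap("sources", "web"),
            Collections.<String, Object>singletonMap("sources", Arrays.asList("web", "news")));
        System.out.println(group(docs, "sources"));
    }
}
```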

12.1.5 Search result information

Clusters

Key clusters
Direction Output
Description Clusters created by the algorithm.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder ByFieldClusteringAlgorithmDescriptor.​AttributeBuilder#clusters()

12.2 By URL Clustering

12.2.1 By URL Clustering input attributes by level

12.2.2 By URL Clustering attributes by direction

Input

Output

12.2.3 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required no
Scope Processing time
Value type java.util.List
Default value none
Attribute builder ByUrlClusteringAlgorithmDescriptor.​AttributeBuilder#documents()

12.2.4 Search result information

Clusters

Key clusters
Direction Output
Description Clusters created by the algorithm.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder ByUrlClusteringAlgorithmDescriptor.​AttributeBuilder#clusters()

12.3 Bisecting k-means

12.3.3 Clusters

Cluster count

Key BisectingKMeansClusteringAlgorithm.clusterCount
Direction Input
Level BASIC
Description The number of clusters to create. The algorithm will create at most the specified number of clusters.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 25
Min value 2
Attribute builder BisectingKMeansClusteringAlgorithmDescriptor.​AttributeBuilder#clusterCount()

Label count

Key BisectingKMeansClusteringAlgorithm.labelCount
Direction Input
Level BASIC
Description Label count. The minimum number of labels to return for each cluster.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 3
Min value 1
Max value 10
Attribute builder BisectingKMeansClusteringAlgorithmDescriptor.​AttributeBuilder#labelCount()

12.3.4 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Required yes
Scope Processing time
Value type java.util.List
Default value none
Attribute builder BisectingKMeansClusteringAlgorithmDescriptor.​AttributeBuilder#documents()

12.3.5 K-means

Maximum iterations

Key BisectingKMeansClusteringAlgorithm.maxIterations
Direction Input
Level BASIC
Description The maximum number of k-means iterations to perform.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 15
Min value 1
Attribute builder BisectingKMeansClusteringAlgorithmDescriptor.​AttributeBuilder#maxIterations()

Partition count

Key BisectingKMeansClusteringAlgorithm.partitionCount
Direction Input
Level BASIC
Description Partition count. The number of partitions to create at each k-means clustering iteration.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 2
Min value 2
Max value 10
Attribute builder BisectingKMeansClusteringAlgorithmDescriptor.​AttributeBuilder#partitionCount()

Use dimensionality reduction

Key BisectingKMeansClusteringAlgorithm.useDimensionalityReduction
Direction Input
Level BASIC
Description Use dimensionality reduction. If true, k-means will be applied on the dimensionality-reduced term-document matrix with the number of dimensions being equal to twice the number of requested clusters. If the number of dimensions is not lower than the number of input documents, reduction will not be performed. If false, k-means will be performed directly on the original term-document matrix.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder BisectingKMeansClusteringAlgorithmDescriptor.​AttributeBuilder#useDimensionalityReduction()

12.3.6 Labels

Title word boost

Key TermDocumentMatrixBuilder.titleWordsBoost
Direction Input
Level MEDIUM
Description Title word boost. Gives more weight to words that appeared in org.carrot2.core.Document.TITLE fields.
Required no
Scope Processing time
Value type java.lang.Double
Default value 2.0
Min value 0.0
Max value 10.0
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#titleWordsBoost()

12.3.7 Matrix model

Factorization quality

Key TermDocumentMatrixReducer.factorizationQuality
Direction Input
Level ADVANCED
Description Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
Required yes
Scope Processing time
Value type org.carrot2.matrix.factorization.IterationNumberGuesser$FactorizationQuality
Default value HIGH
Allowed values
  • LOW  (Low)
  • MEDIUM  (Medium)
  • HIGH  (High)
Attribute builder TermDocumentMatrixReducerDescriptor.​AttributeBuilder#factorizationQuality()

Maximum matrix size

Key TermDocumentMatrixBuilder.maximumMatrixSize
Direction Input
Level ADVANCED
Description Maximum matrix size. The maximum number of term-document matrix elements. The larger the size, the more accurate, but also the more time- and memory-consuming, the clustering.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 37500
Min value 5000
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#maximumMatrixSize()

Maximum word document frequency

Key TermDocumentMatrixBuilder.maxWordDf
Direction Input
Level ADVANCED
Description Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

Required no
Scope Processing time
Value type java.lang.Double
Default value 0.9
Min value 0.0
Max value 1.0
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#maxWordDf()
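
The effect of maxWordDf can be illustrated with a small, self-contained sketch. It only demonstrates the filtering rule stated in the description (a word is dropped when the fraction of documents containing it exceeds maxWordDf); the class name MaxWordDfSketch and the pre-tokenized input are assumptions made for the example, not part of Carrot2.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class MaxWordDfSketch {
    /** Returns the words whose document frequency fraction does not exceed maxWordDf. */
    static Set<String> keptWords(List<List<String>> docs, double maxWordDf) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            // Count each word at most once per document.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double fraction = e.getValue() / (double) docs.size();
            if (fraction <= maxWordDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("acme", "apple", "pie"),
            Arrays.asList("acme", "apple", "juice"),
            Arrays.asList("acme", "banana", "bread"),
            Arrays.asList("acme", "cherry"));
        // "acme" appears in every document, so it is dropped at 0.9 and kept at 1.0.
        System.out.println(keptWords(docs, 0.9));
        System.out.println(keptWords(docs, 1.0));
    }
}
```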

Term weighting

Key TermDocumentMatrixBuilder.termWeighting
Direction Input
Level ADVANCED
Description Term weighting. The method for calculating weight of words in the term-document matrices.
Required yes
Scope Processing time
Value type org.carrot2.text.vsm.ITermWeighting
Default value org.carrot2.text.vsm.LogTfIdfTermWeighting
Allowed value types Other assignable value types are allowed.
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#termWeighting()

12.3.8 Multilingual clustering

Default clustering language

Key MultilingualClustering.defaultLanguage
Direction Input
Level MEDIUM
Description Default clustering language. The default language to use for documents with undefined org.carrot2.core.Document.LANGUAGE.
Required yes
Scope Processing time
Value type org.carrot2.core.LanguageCode
Default value ENGLISH
Allowed values
  • ARABIC  (Arabic)
  • BULGARIAN  (Bulgarian)
  • CZECH  (Czech)
  • CHINESE_SIMPLIFIED  (Chinese Simplified)
  • CROATIAN  (Croatian)
  • DANISH  (Danish)
  • DUTCH  (Dutch)
  • ENGLISH  (English)
  • ESTONIAN  (Estonian)
  • FINNISH  (Finnish)
  • FRENCH  (French)
  • GERMAN  (German)
  • GREEK  (Greek)
  • HUNGARIAN  (Hungarian)
  • HINDI  (Hindi)
  • ITALIAN  (Italian)
  • IRISH  (Irish)
  • JAPANESE  (Japanese)
  • KOREAN  (Korean)
  • LATVIAN  (Latvian)
  • LITHUANIAN  (Lithuanian)
  • MALTESE  (Maltese)
  • NORWEGIAN  (Norwegian)
  • POLISH  (Polish)
  • PORTUGUESE  (Portuguese)
  • ROMANIAN  (Romanian)
  • RUSSIAN  (Russian)
  • SLOVAK  (Slovak)
  • SLOVENE  (Slovene)
  • SPANISH  (Spanish)
  • SWEDISH  (Swedish)
  • THAI  (Thai)
  • TURKISH  (Turkish)
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#defaultLanguage()

Document languages

Key MultilingualClustering.languageCounts
Direction Output
Description Document languages. The number of documents in each language. Empty string key means unknown language.
Scope Processing time
Value type java.util.Map
Default value none
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#languageCounts()

Language aggregation strategy

Key MultilingualClustering.languageAggregationStrategy
Direction Input
Level MEDIUM
Description Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy for the list of available options.
Required yes
Scope Processing time
Value type org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
Default value FLATTEN_MAJOR_LANGUAGE
Allowed values
  • FLATTEN_ALL  (Flatten clusters from all languages)
  • FLATTEN_MAJOR_LANGUAGE  (Flatten clusters from the majority language)
  • FLATTEN_NONE  (Dedicated parent cluster for each language)
  • CLUSTER_IN_MAJORITY_LANGUAGE  (Cluster all documents assuming the language of the majority)
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#languageAggregationStrategy()

Majority language

Key MultilingualClustering.majorityLanguage
Direction Output
Description Majority language. If org.carrot2.text.clustering.MultilingualClustering.languageAggregationStrategy is org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy.CLUSTER_IN_MAJORITY_LANGUAGE, this attribute will provide the majority language that was used to cluster all the documents. If the majority of the documents have undefined language, this attribute will be empty and the clustering will be performed in the org.carrot2.text.clustering.MultilingualClustering.defaultLanguage.
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#majorityLanguage()
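
The two output attributes above can be pictured with the following sketch, which counts documents per language (using the empty string for an undefined language) and then picks the most frequent one. It is an illustration only: languages are represented as plain strings, "majority" is taken to mean the most frequent language, ties are resolved by sorted key order, and the class name LanguageCountsSketch is hypothetical. None of these choices is confirmed by the attribute descriptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LanguageCountsSketch {
    /** Counts documents per language; null (undefined) is stored under the empty string key. */
    static Map<String, Integer> languageCounts(List<String> languages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String lang : languages) {
            counts.merge(lang == null ? "" : lang, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the most frequent language, or "" when undefined-language documents dominate. */
    static String majorityLanguage(Map<String, Integer> counts) {
        String best = "";
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Strict comparison: on a tie, the first key in sorted order wins.
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            languageCounts(Arrays.asList("ENGLISH", null, "ENGLISH", "POLISH"));
        System.out.println(counts);
        System.out.println(majorityLanguage(counts));
    }
}
```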

12.3.9 Preprocessing

Document fields

Key Tokenizer.documentFields
Direction Input
Level ADVANCED
Description Textual fields of documents that should be tokenized and parsed for clustering.
Required no
Scope Initialization time
Value type java.util.Collection
Default value [title, snippet]
Attribute builder TokenizerDescriptor.​AttributeBuilder#documentFields()

Lexical data factory

Key PreprocessingPipeline.lexicalDataFactory
Direction Input
Level ADVANCED
Description Lexical data factory. Creates the lexical data to be used by the clustering algorithm, including stop word and stop label dictionaries.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ILexicalDataFactory
Default value org.carrot2.text.linguistic.DefaultLexicalDataFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#lexicalDataFactory()

Merge lexical resources

Key merge-resources
Direction Input
Level MEDIUM
Description Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all org.carrot2.core.LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings.
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value true
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#mergeResources()

Reload lexical resources

Key reload-resources
Direction Input
Level MEDIUM
Description Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production.

This flag is reset to false after successful resource reload to prevent multiple resource reloads during the same processing cycle.

Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#reloadResources()

Resource lookup facade

Key resource-lookup
Direction Input
Level ADVANCED
Description Lexical resource lookup facade. By default, resources are sought in the current thread's context class loader. This attribute can be overridden both at initialization time and at processing time.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.util.resource.ResourceLookup
Default value org.carrot2.util.resource.ResourceLookup
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#resourceLookup()

Stemmer factory

Key PreprocessingPipeline.stemmerFactory
Direction Input
Level ADVANCED
Description Stemmer factory. Creates the stemmers to be used by the clustering algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.IStemmerFactory
Default value org.carrot2.text.linguistic.DefaultStemmerFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#stemmerFactory()

Tokenizer factory

Key PreprocessingPipeline.tokenizerFactory
Direction Input
Level ADVANCED
Description Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ITokenizerFactory
Default value org.carrot2.text.linguistic.DefaultTokenizerFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#tokenizerFactory()

Word document frequency threshold

Key CaseNormalizer.dfThreshold
Direction Input
Level ADVANCED
Description Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Max value 100
Attribute builder CaseNormalizerDescriptor.​AttributeBuilder#dfThreshold()

12.3.10 Search result information

Clusters

Key clusters
Direction Output
Description Clusters created by the clustering algorithm.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder BisectingKMeansClusteringAlgorithmDescriptor.​AttributeBuilder#clusters()

12.3.11 Ungrouped

Common preprocessing tasks handler

12.4 Lingo Clustering

12.4.3 Clusters

Cluster count base

Key LingoClusteringAlgorithm.desiredClusterCountBase
Direction Input
Level BASIC
Description Desired cluster count base. Base factor used to calculate the number of clusters based on the number of documents on input. The larger the value, the more clusters will be created. The number of clusters created by the algorithm will be proportional to the cluster count base, but not in a linear way.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 30
Min value 2
Max value 100
Attribute builder LingoClusteringAlgorithmDescriptor.​AttributeBuilder#desiredClusterCountBase()

Cluster merging threshold

Key LingoClusteringAlgorithm.clusterMergingThreshold
Direction Input
Level MEDIUM
Description Cluster merging threshold. The percentage overlap between two clusters' documents required for the clusters to be merged into one cluster. Low values will result in more aggressive merging, which may lead to irrelevant documents in clusters. High values will result in fewer clusters being merged, which may lead to very similar or duplicated clusters.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.7
Min value 0.0
Max value 1.0
Attribute builder ClusterBuilderDescriptor.​AttributeBuilder#clusterMergingThreshold()

Size-Score sorting ratio

Key LingoClusteringAlgorithm.scoreWeight
Direction Input
Level MEDIUM
Description Balance between cluster score and size during cluster sorting. Value equal to 0.0 will cause Lingo to sort clusters based only on cluster size. Value equal to 1.0 will cause Lingo to sort clusters based only on cluster score.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.0
Min value 0.0
Max value 1.0
Attribute builder LingoClusteringAlgorithmDescriptor.​AttributeBuilder#scoreWeight()
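
One way to picture the size-score balance is the sketch below. The ranking formula used here, size^(1 - w) * score^w, is an assumption chosen only because it reproduces the two documented end points (0.0 sorts by size, 1.0 sorts by score); the formula Lingo actually applies between those end points is not stated in this reference. The class ScoreWeightSortSketch and its nested cluster type C are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ScoreWeightSortSketch {
    /** Hypothetical stand-in for a cluster: a label, a size and a score. */
    static final class C {
        final String label;
        final int size;
        final double score;

        C(String label, int size, double score) {
            this.label = label;
            this.size = size;
            this.score = score;
        }
    }

    /** Assumed ranking formula: size^(1 - w) * score^w. */
    static double rank(C c, double scoreWeight) {
        return Math.pow(c.size, 1.0 - scoreWeight) * Math.pow(c.score, scoreWeight);
    }

    /** Returns cluster labels ordered from the highest to the lowest rank. */
    static List<String> sortedLabels(List<C> clusters, double scoreWeight) {
        List<C> copy = new ArrayList<>(clusters);
        copy.sort(Comparator.comparingDouble((C c) -> rank(c, scoreWeight)).reversed());
        List<String> labels = new ArrayList<>();
        for (C c : copy) {
            labels.add(c.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<C> clusters = Arrays.asList(new C("Large", 10, 1.0), new C("Strong", 3, 9.0));
        System.out.println(sortedLabels(clusters, 0.0)); // ordering by size only
        System.out.println(sortedLabels(clusters, 1.0)); // ordering by score only
    }
}
```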

12.4.4 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required yes
Scope Processing time
Value type java.util.List
Default value none
Attribute builder LingoClusteringAlgorithmDescriptor.​AttributeBuilder#documents()

12.4.5 Labels

Cluster label assignment method

Key LingoClusteringAlgorithm.labelAssigner
Direction Input
Level ADVANCED
Description Cluster label assignment method.
Required yes
Scope Processing time
Value type org.carrot2.clustering.lingo.ILabelAssigner
Default value org.carrot2.clustering.lingo.UniqueLabelAssigner
Allowed value types No other assignable value types are allowed.
Attribute builder ClusterBuilderDescriptor.​AttributeBuilder#labelAssigner()

Phrase label boost

Key LingoClusteringAlgorithm.phraseLabelBoost
Direction Input
Level MEDIUM
Description Phrase label boost. The weight of multi-word labels relative to one-word labels. Low values will result in more one-word labels being produced; higher values will favor multi-word labels.
Required no
Scope Processing time
Value type java.lang.Double
Default value 1.5
Min value 0.0
Max value 10.0
Attribute builder ClusterBuilderDescriptor.​AttributeBuilder#phraseLabelBoost()

Phrase length penalty start

Key LingoClusteringAlgorithm.phraseLengthPenaltyStart
Direction Input
Level ADVANCED
Description Phrase length penalty start. The phrase length at which the overlong multi-word labels should start to be penalized. Phrases of length smaller than phraseLengthPenaltyStart will not be penalized.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 8
Min value 2
Max value 8
Attribute builder ClusterBuilderDescriptor.​AttributeBuilder#phraseLengthPenaltyStart()

Phrase length penalty stop

Key LingoClusteringAlgorithm.phraseLengthPenaltyStop
Direction Input
Level ADVANCED
Description Phrase length penalty stop. The phrase length at which the overlong multi-word labels should be removed completely. Phrases of length larger than phraseLengthPenaltyStop will be removed.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 8
Min value 2
Max value 8
Attribute builder ClusterBuilderDescriptor.​AttributeBuilder#phraseLengthPenaltyStop()

Remove labels ending in genitive form

Key GenitiveLabelFilter.enabled
Direction Input
Level BASIC
Description Remove labels ending in genitive form. Removes labels that end in words in the Saxon Genitive form (e.g. "Threatening the Country's").
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder GenitiveLabelFilterDescriptor.​AttributeBuilder#enabled()

Remove leading and trailing stop words

Key StopWordLabelFilter.enabled
Direction Input
Level BASIC
Description Remove leading and trailing stop words. Removes labels that consist of, start with, or end in stop words.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder StopWordLabelFilterDescriptor.​AttributeBuilder#enabled()

Remove numeric labels

Key NumericLabelFilter.enabled
Direction Input
Level BASIC
Description Remove numeric labels. Removes labels that consist only of or start with numbers.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder NumericLabelFilterDescriptor.​AttributeBuilder#enabled()

Remove query words

Key QueryLabelFilter.enabled
Direction Input
Level BASIC
Description Remove query words. Removes labels that consist only of words contained in the query.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder QueryLabelFilterDescriptor.​AttributeBuilder#enabled()

Remove short labels

Key MinLengthLabelFilter.enabled
Direction Input
Level BASIC
Description Remove labels shorter than 3 characters. Removes labels whose total length in characters, including spaces, is less than 3.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder MinLengthLabelFilterDescriptor.​AttributeBuilder#enabled()

Remove stop labels

Key StopLabelFilter.enabled
Direction Input
Level BASIC
Description Remove stop labels. Removes labels that are declared as stop labels in the stoplabels.<lang> files. Please note that adding a long list of regular expressions to the stoplabels file may result in a noticeable performance penalty.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder StopLabelFilterDescriptor.​AttributeBuilder#enabled()

Remove truncated phrases

Key CompleteLabelFilter.enabled
Direction Input
Level BASIC
Description Remove truncated phrases. Tries to remove "incomplete" cluster labels. For example, in a collection of documents related to Data Mining, the phrase Conference on Data is incomplete in the sense that most likely it should be Conference on Data Mining or even Conference on Data Mining in Large Databases. When truncated phrase removal is enabled, the algorithm tries to remove "incomplete" phrases like the first one and leave only the more informative variants.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder CompleteLabelFilterDescriptor.​AttributeBuilder#enabled()

Title word boost

Key TermDocumentMatrixBuilder.titleWordsBoost
Direction Input
Level MEDIUM
Description Title word boost. Gives more weight to words that appeared in org.carrot2.core.Document.TITLE fields.
Required no
Scope Processing time
Value type java.lang.Double
Default value 2.0
Min value 0.0
Max value 10.0
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#titleWordsBoost()

Truncated label threshold

Key CompleteLabelFilter.labelOverrideThreshold
Direction Input
Level ADVANCED
Description Truncated label threshold. Determines the strength of the truncated label filter. The lowest value means strongest truncated labels elimination, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.65
Min value 0.0
Max value 1.0
Attribute builder CompleteLabelFilterDescriptor.​AttributeBuilder#labelOverrideThreshold()

12.4.6 Matrix model

Factorization quality

Key TermDocumentMatrixReducer.factorizationQuality
Direction Input
Level ADVANCED
Description Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
Required yes
Scope Processing time
Value type org.carrot2.matrix.factorization.IterationNumberGuesser$FactorizationQuality
Default value HIGH
Allowed values
  • LOW  (Low)
  • MEDIUM  (Medium)
  • HIGH  (High)
Attribute builder TermDocumentMatrixReducerDescriptor.​AttributeBuilder#factorizationQuality()

Maximum matrix size

Key TermDocumentMatrixBuilder.maximumMatrixSize
Direction Input
Level ADVANCED
Description Maximum matrix size. The maximum number of term-document matrix elements. The larger the size, the more accurate, but also the more time- and memory-consuming, the clustering.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 37500
Min value 5000
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#maximumMatrixSize()

Maximum word document frequency

Key TermDocumentMatrixBuilder.maxWordDf
Direction Input
Level ADVANCED
Description Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

Required no
Scope Processing time
Value type java.lang.Double
Default value 0.9
Min value 0.0
Max value 1.0
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#maxWordDf()

Term weighting

Key TermDocumentMatrixBuilder.termWeighting
Direction Input
Level ADVANCED
Description Term weighting. The method for calculating weight of words in the term-document matrices.
Required yes
Scope Processing time
Value type org.carrot2.text.vsm.ITermWeighting
Default value org.carrot2.text.vsm.LogTfIdfTermWeighting
Allowed value types Other assignable value types are allowed.
Attribute builder TermDocumentMatrixBuilderDescriptor.​AttributeBuilder#termWeighting()

12.4.7 Multilingual clustering

Default clustering language

Key MultilingualClustering.defaultLanguage
Direction Input
Level MEDIUM
Description Default clustering language. The default language to use for documents with undefined org.carrot2.core.Document.LANGUAGE.
Required yes
Scope Processing time
Value type org.carrot2.core.LanguageCode
Default value ENGLISH
Allowed values
  • ARABIC  (Arabic)
  • BULGARIAN  (Bulgarian)
  • CZECH  (Czech)
  • CHINESE_SIMPLIFIED  (Chinese Simplified)
  • CROATIAN  (Croatian)
  • DANISH  (Danish)
  • DUTCH  (Dutch)
  • ENGLISH  (English)
  • ESTONIAN  (Estonian)
  • FINNISH  (Finnish)
  • FRENCH  (French)
  • GERMAN  (German)
  • GREEK  (Greek)
  • HUNGARIAN  (Hungarian)
  • HINDI  (Hindi)
  • ITALIAN  (Italian)
  • IRISH  (Irish)
  • JAPANESE  (Japanese)
  • KOREAN  (Korean)
  • LATVIAN  (Latvian)
  • LITHUANIAN  (Lithuanian)
  • MALTESE  (Maltese)
  • NORWEGIAN  (Norwegian)
  • POLISH  (Polish)
  • PORTUGUESE  (Portuguese)
  • ROMANIAN  (Romanian)
  • RUSSIAN  (Russian)
  • SLOVAK  (Slovak)
  • SLOVENE  (Slovene)
  • SPANISH  (Spanish)
  • SWEDISH  (Swedish)
  • THAI  (Thai)
  • TURKISH  (Turkish)
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#defaultLanguage()

Document languages

Key MultilingualClustering.languageCounts
Direction Output
Description Document languages. The number of documents in each language. Empty string key means unknown language.
Scope Processing time
Value type java.util.Map
Default value none
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#languageCounts()

Language aggregation strategy

Key MultilingualClustering.languageAggregationStrategy
Direction Input
Level MEDIUM
Description Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy for the list of available options.
Required yes
Scope Processing time
Value type org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
Default value FLATTEN_MAJOR_LANGUAGE
Allowed values
  • FLATTEN_ALL  (Flatten clusters from all languages)
  • FLATTEN_MAJOR_LANGUAGE  (Flatten clusters from the majority language)
  • FLATTEN_NONE  (Dedicated parent cluster for each language)
  • CLUSTER_IN_MAJORITY_LANGUAGE  (Cluster all documents assuming the language of the majority)
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#languageAggregationStrategy()

Majority language

Key MultilingualClustering.majorityLanguage
Direction Output
Description Majority language. If org.carrot2.text.clustering.MultilingualClustering.languageAggregationStrategy is org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy.CLUSTER_IN_MAJORITY_LANGUAGE, this attribute will provide the majority language that was used to cluster all the documents. If the majority of the documents have undefined language, this attribute will be empty and the clustering will be performed in the org.carrot2.text.clustering.MultilingualClustering.defaultLanguage.
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#majorityLanguage()

12.4.8 Phrase extraction

Phrase document frequency threshold

Key PhraseExtractor.dfThreshold
Direction Input
Level ADVANCED
Description Phrase Document Frequency threshold. Phrases appearing in fewer than dfThreshold documents will be ignored.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Max value 100
Attribute builder PhraseExtractorDescriptor.​AttributeBuilder#dfThreshold()

12.4.9 Preprocessing

Document fields

Key Tokenizer.documentFields
Direction Input
Level ADVANCED
Description Textual fields of documents that should be tokenized and parsed for clustering.
Required no
Scope Initialization time
Value type java.util.Collection
Default value [title, snippet]
Attribute builder TokenizerDescriptor.​AttributeBuilder#documentFields()

Exact phrase assignment

Key DocumentAssigner.exactPhraseAssignment
Direction Input
Level MEDIUM
Description Only exact phrase assignments. Assign only documents that contain the label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, which results in higher precision of assignment, but also a larger "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder DocumentAssignerDescriptor.​AttributeBuilder#exactPhraseAssignment()
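
The difference between the two assignment modes can be sketched as follows. The exact mode checks that the label words occur in the document as a contiguous, ordered run; the relaxed mode is modelled here as "the document contains every label word in any order", which is an assumption made for illustration and may differ from what Carrot2 does when the option is disabled. The class name PhraseAssignmentSketch is hypothetical, and documents are assumed to be already tokenized.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PhraseAssignmentSketch {
    /** Decides whether a tokenized document would be assigned to a cluster label. */
    static boolean matches(List<String> docWords, List<String> labelWords, boolean exactPhrase) {
        if (exactPhrase) {
            // The label must appear in its original form, including word order.
            return Collections.indexOfSubList(docWords, labelWords) >= 0;
        }
        // Assumed relaxed rule: all label words are present, in any order.
        return docWords.containsAll(labelWords);
    }

    public static void main(String[] args) {
        List<String> label = Arrays.asList("data", "mining");
        List<String> doc = Arrays.asList("mining", "of", "data", "streams");
        System.out.println(matches(doc, label, true));
        System.out.println(matches(doc, label, false));
    }
}
```

In this model a document mentioning "mining of data streams" is left out of a "data mining" cluster in exact mode and included otherwise, which is the precision versus "Other Topics" size trade-off described above.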

Lexical data factory

Key PreprocessingPipeline.lexicalDataFactory
Direction Input
Level ADVANCED
Description Lexical data factory. Creates the lexical data to be used by the clustering algorithm, including stop word and stop label dictionaries.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ILexicalDataFactory
Default value org.carrot2.text.linguistic.DefaultLexicalDataFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#lexicalDataFactory()

Merge lexical resources

Key merge-resources
Direction Input
Level MEDIUM
Description Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all org.carrot2.core.LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings.
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value true
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#mergeResources()
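
The effect of merge-resources on the stop word set can be sketched as follows. This is an illustration only, not the DefaultLexicalDataFactory code: the class StopWordMergeSketch and the per-language map it takes as input are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of the Carrot2 API.
final class StopWordMergeSketch {
    // Returns the stop words in effect for the active language.
    static Set<String> effectiveStopWords(Map<String, Set<String>> stopWordsByLanguage,
                                          String activeLanguage,
                                          boolean mergeResources) {
        Set<String> result = new HashSet<>();
        if (mergeResources) {
            // Merged: stop words of every known language are used together.
            for (Set<String> words : stopWordsByLanguage.values()) {
                result.addAll(words);
            }
        } else {
            // Not merged: only the active language contributes.
            Set<String> words = stopWordsByLanguage.get(activeLanguage);
            if (words != null) {
                result.addAll(words);
            }
        }
        return result;
    }
}
```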

Minimum cluster size

Key DocumentAssigner.minClusterSize
Direction Input
Level MEDIUM
Description Determines the minimum number of documents in each cluster.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 2
Min value 1
Max value 100
Attribute builder DocumentAssignerDescriptor.​AttributeBuilder#minClusterSize()

Reload lexical resources

Key reload-resources
Direction Input
Level MEDIUM
Description Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production.

This flag is reset to false after a successful resource reload to prevent multiple resource reloads during the same processing cycle.

Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#reloadResources()

Resource lookup facade

Key resource-lookup
Direction Input
Level ADVANCED
Description Lexical resource lookup facade. By default, resources are sought in the current thread's context class loader. This attribute can be overridden both at initialization time and at processing time.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.util.resource.ResourceLookup
Default value org.carrot2.util.resource.ResourceLookup
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#resourceLookup()

Stemmer factory

Key PreprocessingPipeline.stemmerFactory
Direction Input
Level ADVANCED
Description Stemmer factory. Creates the stemmers to be used by the clustering algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.IStemmerFactory
Default value org.carrot2.text.linguistic.DefaultStemmerFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#stemmerFactory()

Tokenizer factory

Key PreprocessingPipeline.tokenizerFactory
Direction Input
Level ADVANCED
Description Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ITokenizerFactory
Default value org.carrot2.text.linguistic.DefaultTokenizerFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#tokenizerFactory()

Word document frequency threshold

Key CaseNormalizer.dfThreshold
Direction Input
Level ADVANCED
Description Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Max value 100
Attribute builder CaseNormalizerDescriptor.​AttributeBuilder#dfThreshold()
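
A document frequency threshold of this kind can be sketched as follows. The sketch is not the CaseNormalizer code: the class DfThresholdSketch and its pre-tokenized input are hypothetical, and it only shows the rule stated above, namely that a word counts once per document and is dropped when it appears in fewer than dfThreshold documents.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not part of the Carrot2 API.
final class DfThresholdSketch {
    // Returns the words that appear in at least dfThreshold documents.
    static Set<String> wordsMeetingThreshold(List<List<String>> documents, int dfThreshold) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            // De-duplicate so that a word is counted once per document.
            for (String word : new HashSet<>(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            if (entry.getValue() >= dfThreshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With the default value of 1 nothing is filtered out; raising the threshold removes rare words before clustering.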

12.4.10 Search query

Query

Key query
Direction Input
Level BASIC
Description Query that produced the documents. The query will help the algorithm to create better clusters. Therefore, providing the query is optional but desirable.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder LingoClusteringAlgorithmDescriptor.​AttributeBuilder#query()

12.4.11 Search result information

Clusters

Key clusters
Direction Output
Description Clusters created by the clustering algorithm.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder LingoClusteringAlgorithmDescriptor.​AttributeBuilder#clusters()

12.4.12 Ungrouped

Common preprocessing tasks handler, contains bindable attributes

Key LingoClusteringAlgorithm.preprocessingPipeline
Direction Input
Level ADVANCED
Description Common preprocessing tasks handler, contains bindable attributes.
Required no
Scope Initialization time
Value type org.carrot2.text.preprocessing.pipeline.IPreprocessingPipeline
Default value org.carrot2.text.preprocessing.pipeline.CompletePreprocessingPipeline
Attribute builder LingoClusteringAlgorithmDescriptor.​AttributeBuilder#preprocessingPipeline()

12.5 Suffix Tree Clustering

12.5.3 Base clusters

Document count boost

Key STCClusteringAlgorithm.documentCountBoost
Direction Input
Level MEDIUM
Description Document count boost. A factor in calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster.
Required no
Scope Processing time
Value type java.lang.Double
Default value 1.0
Min value 0.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#documentCountBoost()

Maximum base clusters count

Key STCClusteringAlgorithm.maxBaseClusters
Direction Input
Level ADVANCED
Description Maximum base clusters count. Trims the base cluster array after the N-th position for the merging phase.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 300
Min value 2
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#maxBaseClusters()

Minimum base cluster score

Key STCClusteringAlgorithm.minBaseClusterScore
Direction Input
Level ADVANCED
Description Minimum base cluster score.
Required no
Scope Processing time
Value type java.lang.Double
Default value 2.0
Min value 0.0
Max value 10.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#minBaseClusterScore()

Minimum documents per base cluster

Key STCClusteringAlgorithm.minBaseClusterSize
Direction Input
Level ADVANCED
Description Minimum documents per base cluster.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 2
Min value 2
Max value 20
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#minBaseClusterSize()

Optimal label length

Key STCClusteringAlgorithm.optimalPhraseLength
Direction Input
Level BASIC
Description Optimal label length. A factor in calculation of the base cluster score.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 3
Min value 1
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#optimalPhraseLength()

Phrase length tolerance

Key STCClusteringAlgorithm.optimalPhraseLengthDev
Direction Input
Level MEDIUM
Description Phrase length tolerance. A factor in calculation of the base cluster score.
Required no
Scope Processing time
Value type java.lang.Double
Default value 2.0
Min value 0.5
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#optimalPhraseLengthDev()

Single term boost

Key STCClusteringAlgorithm.singleTermBoost
Direction Input
Level MEDIUM
Description Single term boost. A factor in calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.5
Min value 0.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#singleTermBoost()

12.5.4 Clusters

Merge all stem-equivalent phrases when discovering base clusters

Key STCClusteringAlgorithm.mergeStemEquivalentBaseClusters
Direction Input
Level MEDIUM
Description Merge all stem-equivalent base clusters before running the merge phase.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#mergeStemEquivalentBaseClusters()

Size-Score sorting ratio

Key STCClusteringAlgorithm.scoreWeight
Direction Input
Level MEDIUM
Description Balance between cluster score and size during cluster sorting. A value of 0.0 sorts clusters by size only; a value of 1.0 sorts clusters by score only.
Required no
Scope Processing time
Value type java.lang.Double
Default value 1.0
Min value 0.0
Max value 1.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#scoreWeight()
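
One way to blend size and score that reproduces both endpoints described above is a geometric weighting, sketched below. The formula size^(1 - scoreWeight) * score^scoreWeight is an assumption chosen for illustration and is not confirmed by this reference; the class ScoreWeightSketch and its ClusterInfo holder are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class, not part of the Carrot2 API.
final class ScoreWeightSketch {
    // Minimal stand-in for a cluster: label, number of documents, score.
    static final class ClusterInfo {
        final String label;
        final int size;
        final double score;

        ClusterInfo(String label, int size, double score) {
            this.label = label;
            this.size = size;
            this.score = score;
        }
    }

    // Assumed blend: equals size when scoreWeight is 0.0 and score when it is 1.0.
    static double sortKey(ClusterInfo cluster, double scoreWeight) {
        return Math.pow(cluster.size, 1.0 - scoreWeight) * Math.pow(cluster.score, scoreWeight);
    }

    // Returns cluster labels ordered from the highest to the lowest sort key.
    static List<String> sortedLabels(List<ClusterInfo> clusters, double scoreWeight) {
        if (scoreWeight < 0.0 || scoreWeight > 1.0) {
            throw new IllegalArgumentException("scoreWeight must be in [0.0, 1.0]");
        }
        List<ClusterInfo> copy = new ArrayList<>(clusters);
        copy.sort(Comparator.comparingDouble((ClusterInfo c) -> sortKey(c, scoreWeight)).reversed());
        List<String> labels = new ArrayList<>();
        for (ClusterInfo cluster : copy) {
            labels.add(cluster.label);
        }
        return labels;
    }
}
```

With a large, low-scoring cluster A and a small, high-scoring cluster B, a weight of 0.0 puts A first and a weight of 1.0 puts B first.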

12.5.5 Documents

Documents

Key documents
Direction Input
Level BASIC
Description Documents to cluster.
Required yes
Scope Processing time
Value type java.util.List
Default value none
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#documents()

12.5.6 Labels

Maximum cluster phrase overlap

Key STCClusteringAlgorithm.maxPhraseOverlap
Direction Input
Level ADVANCED
Description Maximum cluster phrase overlap.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.6
Min value 0.0
Max value 1.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#maxPhraseOverlap()

Maximum phrases per label

Key STCClusteringAlgorithm.maxPhrases
Direction Input
Level BASIC
Description Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 3
Min value 1
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#maxPhrases()

Maximum words per label

Key STCClusteringAlgorithm.maxDescPhraseLength
Direction Input
Level BASIC
Description Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 4
Min value 1
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#maxDescPhraseLength()

Minimum general phrase coverage

Key STCClusteringAlgorithm.mostGeneralPhraseCoverage
Direction Input
Level ADVANCED
Description Minimum general phrase coverage. Minimum phrase coverage to appear in cluster description.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.5
Min value 0.0
Max value 1.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#mostGeneralPhraseCoverage()

12.5.7 Merging and output

Base cluster merge threshold

Key STCClusteringAlgorithm.mergeThreshold
Direction Input
Level ADVANCED
Description Base cluster merge threshold.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.6
Min value 0.0
Max value 1.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#mergeThreshold()

Maximum final clusters

Key STCClusteringAlgorithm.maxClusters
Direction Input
Level BASIC
Description Maximum final clusters.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 15
Min value 1
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#maxClusters()
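
Attributes such as the two above are addressed by their string keys. The following minimal sketch collects them into a plain map and checks the documented value ranges first. The class StcAttributesSketch is hypothetical, and the range checks merely mirror the Min value and Max value rows of this reference; they are not Carrot2's own validation code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class, not part of the Carrot2 API.
final class StcAttributesSketch {
    // Builds a key-to-value map for the "Merging and output" attributes.
    static Map<String, Object> mergeAndOutput(double mergeThreshold, int maxClusters) {
        if (mergeThreshold < 0.0 || mergeThreshold > 1.0) {
            throw new IllegalArgumentException("mergeThreshold must be in [0.0, 1.0]");
        }
        if (maxClusters < 1) {
            throw new IllegalArgumentException("maxClusters must be at least 1");
        }
        Map<String, Object> attributes = new LinkedHashMap<>();
        // Keys are copied from the Key rows of the attribute reference.
        attributes.put("STCClusteringAlgorithm.mergeThreshold", mergeThreshold);
        attributes.put("STCClusteringAlgorithm.maxClusters", maxClusters);
        return attributes;
    }
}
```

A map of this shape is what would typically be handed to the Carrot2 Java API together with the documents to cluster; that call is left out here so the sketch stays free of library dependencies.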

12.5.8 Multilingual clustering

Default clustering language

Key MultilingualClustering.defaultLanguage
Direction Input
Level MEDIUM
Description Default clustering language. The default language to use for documents with undefined org.carrot2.core.Document.LANGUAGE.
Required yes
Scope Processing time
Value type org.carrot2.core.LanguageCode
Default value ENGLISH
Allowed values
  • ARABIC  (Arabic)
  • BULGARIAN  (Bulgarian)
  • CZECH  (Czech)
  • CHINESE_SIMPLIFIED  (Chinese Simplified)
  • CROATIAN  (Croatian)
  • DANISH  (Danish)
  • DUTCH  (Dutch)
  • ENGLISH  (English)
  • ESTONIAN  (Estonian)
  • FINNISH  (Finnish)
  • FRENCH  (French)
  • GERMAN  (German)
  • GREEK  (Greek)
  • HUNGARIAN  (Hungarian)
  • HINDI  (Hindi)
  • ITALIAN  (Italian)
  • IRISH  (Irish)
  • JAPANESE  (Japanese)
  • KOREAN  (Korean)
  • LATVIAN  (Latvian)
  • LITHUANIAN  (Lithuanian)
  • MALTESE  (Maltese)
  • NORWEGIAN  (Norwegian)
  • POLISH  (Polish)
  • PORTUGUESE  (Portuguese)
  • ROMANIAN  (Romanian)
  • RUSSIAN  (Russian)
  • SLOVAK  (Slovak)
  • SLOVENE  (Slovene)
  • SPANISH  (Spanish)
  • SWEDISH  (Swedish)
  • THAI  (Thai)
  • TURKISH  (Turkish)
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#defaultLanguage()

Document languages

Key MultilingualClustering.languageCounts
Direction Output
Description Document languages. The number of documents in each language. An empty string key denotes an unknown language.
Scope Processing time
Value type java.util.Map
Default value none
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#languageCounts()

Language aggregation strategy

Key MultilingualClustering.languageAggregationStrategy
Direction Input
Level MEDIUM
Description Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy for the list of available options.
Required yes
Scope Processing time
Value type org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
Default value FLATTEN_MAJOR_LANGUAGE
Allowed values
  • FLATTEN_ALL  (Flatten clusters from all languages)
  • FLATTEN_MAJOR_LANGUAGE  (Flatten clusters from the majority language)
  • FLATTEN_NONE  (Dedicated parent cluster for each language)
  • CLUSTER_IN_MAJORITY_LANGUAGE  (Cluster all documents assuming the language of the majority)
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#languageAggregationStrategy()

Majority language

Key MultilingualClustering.majorityLanguage
Direction Output
Description Majority language. If org.carrot2.text.clustering.MultilingualClustering.languageAggregationStrategy is org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy.CLUSTER_IN_MAJORITY_LANGUAGE, this attribute will provide the majority language that was used to cluster all the documents. If the majority of the documents have undefined language, this attribute will be empty and the clustering will be performed in the org.carrot2.text.clustering.MultilingualClustering.defaultLanguage.
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder MultilingualClusteringDescriptor.​AttributeBuilder#majorityLanguage()
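
The relationship between the language counts and the majority language can be sketched as follows. This is not the MultilingualClustering code: the class LanguageCountsSketch is hypothetical, languages are represented as plain strings, and the tie-breaking rule (the first key in sorted order wins) is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of the Carrot2 API.
final class LanguageCountsSketch {
    // Counts documents per language; a null language is recorded under the empty key.
    static Map<String, Integer> languageCounts(List<String> documentLanguages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String language : documentLanguages) {
            counts.merge(language == null ? "" : language, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the key with the highest count; empty when unknown-language documents dominate.
    static String majorityLanguage(Map<String, Integer> counts) {
        String best = "";
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

An empty result corresponds to the case described above in which clustering falls back to the default language.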

12.5.9 Preprocessing

Document fields

Key Tokenizer.documentFields
Direction Input
Level ADVANCED
Description Textual fields of documents that should be tokenized and parsed for clustering.
Required no
Scope Initialization time
Value type java.util.Collection
Default value [title, snippet]
Attribute builder TokenizerDescriptor.​AttributeBuilder#documentFields()

Lexical data factory

Key PreprocessingPipeline.lexicalDataFactory
Direction Input
Level ADVANCED
Description Lexical data factory. Creates the lexical data to be used by the clustering algorithm, including stop word and stop label dictionaries.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ILexicalDataFactory
Default value org.carrot2.text.linguistic.DefaultLexicalDataFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#lexicalDataFactory()

Merge lexical resources

Key merge-resources
Direction Input
Level MEDIUM
Description Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all org.carrot2.core.LanguageCodes will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings.
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value true
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#mergeResources()

Reload lexical resources

Key reload-resources
Direction Input
Level MEDIUM
Description Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production.

This flag is reset to false after a successful resource reload to prevent multiple resource reloads during the same processing cycle.

Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#reloadResources()

Resource lookup facade

Key resource-lookup
Direction Input
Level ADVANCED
Description Lexical resource lookup facade. By default, resources are sought in the current thread's context class loader. This attribute can be overridden both at initialization time and at processing time.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.util.resource.ResourceLookup
Default value org.carrot2.util.resource.ResourceLookup
Attribute builder DefaultLexicalDataFactoryDescriptor.​AttributeBuilder#resourceLookup()

Stemmer factory

Key PreprocessingPipeline.stemmerFactory
Direction Input
Level ADVANCED
Description Stemmer factory. Creates the stemmers to be used by the clustering algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.IStemmerFactory
Default value org.carrot2.text.linguistic.DefaultStemmerFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#stemmerFactory()

Tokenizer factory

Key PreprocessingPipeline.tokenizerFactory
Direction Input
Level ADVANCED
Description Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.
Required no
Scope Initialization time and Processing time
Value type org.carrot2.text.linguistic.ITokenizerFactory
Default value org.carrot2.text.linguistic.DefaultTokenizerFactory
Attribute builder BasicPreprocessingPipelineDescriptor.​AttributeBuilder#tokenizerFactory()

Word document frequency threshold

Key CaseNormalizer.dfThreshold
Direction Input
Level ADVANCED
Description Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Max value 100
Attribute builder CaseNormalizerDescriptor.​AttributeBuilder#dfThreshold()

12.5.10 Search query

Query

Key query
Direction Input
Level BASIC
Description Query that produced the documents. The query will help the algorithm to create better clusters. Therefore, providing the query is optional but desirable.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#query()

12.5.11 Search result information

Clusters

Key clusters
Direction Output
Description Clusters created by the algorithm.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#clusters()

12.5.12 Word filtering

Maximum word-document ratio

Key STCClusteringAlgorithm.ignoreWordIfInHigherDocsPercent
Direction Input
Level MEDIUM
Description Maximum word-document ratio. A number between 0 and 1. A word that occurs in a larger fraction of snippets than this ratio is ignored.
Required no
Scope Processing time
Value type java.lang.Double
Default value 0.9
Min value 0.0
Max value 1.0
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#ignoreWordIfInHigherDocsPercent()

Minimum word-document recurrences

Key STCClusteringAlgorithm.ignoreWordIfInFewerDocs
Direction Input
Level MEDIUM
Description Minimum word-document recurrences.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 2
Min value 2
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#ignoreWordIfInFewerDocs()
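
The two word filtering attributes can be read as a lower and an upper bound on a word's document frequency. The sketch below shows that reading under stated assumptions: the class StcWordFilterSketch is hypothetical, and the interpretation of ignoreWordIfInFewerDocs as "ignore words found in fewer documents than this value" is inferred from the attribute name rather than confirmed by the description.

```java
// Hypothetical illustration class, not part of the Carrot2 API.
final class StcWordFilterSketch {
    // Returns true when a word should be ignored given its document frequency.
    static boolean isIgnored(int docsContainingWord,
                             int totalDocs,
                             double ignoreWordIfInHigherDocsPercent,
                             int ignoreWordIfInFewerDocs) {
        if (totalDocs <= 0) {
            throw new IllegalArgumentException("totalDocs must be positive");
        }
        // Too rare: found in fewer documents than the minimum (assumed reading).
        boolean tooRare = docsContainingWord < ignoreWordIfInFewerDocs;
        // Too common: found in a larger share of documents than the maximum ratio.
        boolean tooCommon = docsContainingWord > ignoreWordIfInHigherDocsPercent * totalDocs;
        return tooRare || tooCommon;
    }
}
```

With the default values (0.9 and 2) and 100 snippets, a word found in 1 or in 95 snippets would be ignored under this reading, while a word found in 50 snippets would be kept.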

12.5.13 Ungrouped

Common preprocessing tasks handler

Key STCClusteringAlgorithm.preprocessingPipeline
Direction Input
Level ADVANCED
Description Common preprocessing tasks handler.
Required no
Scope Initialization time
Value type org.carrot2.text.preprocessing.pipeline.IPreprocessingPipeline
Default value org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline
Allowed value types Other assignable value types are allowed.
Attribute builder STCClusteringAlgorithmDescriptor.​AttributeBuilder#preprocessingPipeline()

12.6 eTools Metasearch Engine

The eTools document source searches the web using the etools.ch metasearch engine.

12.6.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#documents()

12.6.4 Filtering

Country

Key EToolsDocumentSource.country
Direction Input
Level MEDIUM
Description Determines the country of origin for the returned search results.
Required no
Scope Processing time
Value type org.carrot2.source.etools.EToolsDocumentSource$Country
Default value ALL
Allowed values
  • ALL  (All)
  • AUSTRIA  (Austria)
  • FRANCE  (France)
  • GERMANY  (Germany)
  • GREAT_BRITAIN  (Great Britain)
  • ITALY  (Italy)
  • LICHTENSTEIN  (Lichtenstein)
  • SPAIN  (Spain)
  • SWITZERLAND  (Switzerland)
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#country()

Language

Key EToolsDocumentSource.language
Direction Input
Level MEDIUM
Description Determines the language of the returned search results.
Required no
Scope Processing time
Value type org.carrot2.source.etools.EToolsDocumentSource$Language
Default value ENGLISH
Allowed values
  • ALL  (All)
  • ENGLISH  (English)
  • FRENCH  (French)
  • GERMAN  (German)
  • ITALIAN  (Italian)
  • SPANISH  (Spanish)
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#language()

Safe search

Key EToolsDocumentSource.safeSearch
Direction Input
Level BASIC
Description If enabled, excludes offensive content from the results.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#safeSearch()

Site restriction

Key EToolsDocumentSource.site
Direction Input
Level ADVANCED
Description Site URL or comma-separated list of site URLs to which the returned results should be restricted. For example: wikipedia.org or en.wikipedia.org,de.wikipedia.org. Very large lists of site restrictions (larger than 2000 characters) may result in a processing exception.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#site()
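
A value for this attribute can be assembled from a list of sites with a length check applied up front. The sketch below is an illustration only: the class SiteRestrictionSketch is hypothetical, and rejecting values longer than 2000 characters before they are sent is a design choice of the sketch, based on the limit mentioned in the description.

```java
import java.util.List;

// Hypothetical illustration class, not part of the Carrot2 API.
final class SiteRestrictionSketch {
    // Length above which the description warns of a possible processing exception.
    static final int MAX_LENGTH = 2000;

    // Joins site names into the comma-separated form shown in the description.
    static String join(List<String> sites) {
        String value = String.join(",", sites);
        if (value.length() > MAX_LENGTH) {
            throw new IllegalArgumentException(
                "site restriction is " + value.length() + " characters, limit is " + MAX_LENGTH);
        }
        return value;
    }
}
```

The resulting string would then be stored under the EToolsDocumentSource.site key.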

12.6.5 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch. The query hint can be used by clustering algorithms to avoid creating trivial clusters (combination of query words).
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#results()

Start index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#start()
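
The start and results attributes together define a window of search results. When a search service returns results in pages, that window has to be split into several requests. The sketch below shows the arithmetic only: the class PagingSketch and the pageSize parameter are hypothetical and do not describe how any particular Carrot2 document source pages its requests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of the Carrot2 API.
final class PagingSketch {
    // Splits the window [start, start + results) into {startIndex, count} pairs
    // of at most pageSize results each.
    static List<int[]> pageRequests(int start, int results, int pageSize) {
        if (start < 0 || results < 1 || pageSize < 1) {
            throw new IllegalArgumentException("start >= 0, results >= 1, pageSize >= 1 required");
        }
        List<int[]> pages = new ArrayList<>();
        int end = start + results;
        for (int from = start; from < end; from += pageSize) {
            pages.add(new int[] { from, Math.min(pageSize, end - from) });
        }
        return pages;
    }
}
```

The number of pairs returned corresponds to the kind of figure reported by an output attribute such as SearchEngineStats.pageRequests.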

12.6.6 Search result information

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#compressed()

Page requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#pageRequests()

Successful queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#queries()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#resultsTotal()

12.6.7 Service

Customer ID

Key EToolsDocumentSource.customerId
Direction Input
Level MEDIUM
Description eTools customer identifier. For commercial use of eTools, please e-mail contact@comcepta.com to obtain your customer identifier.
Required no
Scope Processing time
Value type java.lang.String
Default value
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#customerId()

Data sources

Key EToolsDocumentSource.dataSources
Direction Input
Level ADVANCED
Description Determines which data sources to search.
Required no
Scope Processing time
Value type org.carrot2.source.etools.EToolsDocumentSource$DataSources
Default value ALL
Allowed values
  • ALL  (All)
  • FASTEST  (Fastest)
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#dataSources()

Data transfer timeout

Key XmlDocumentSourceHelper.timeout
Direction Input
Level ADVANCED
Description Data transfer timeout. Specifies the data transfer timeout, in seconds. A timeout value of zero is interpreted as an infinite timeout.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 8
Min value 0
Max value 300
Attribute builder XmlDocumentSourceHelperDescriptor.​AttributeBuilder#timeout()

HTTP redirect strategy

Key org.carrot2.source.xml.RemoteXmlSimpleSearchEngineBase.redirectStrategy
Direction Input
Level MEDIUM
Description HTTP redirect response strategy (follow or throw an error).
Required no
Scope Processing time
Value type org.carrot2.util.httpclient.HttpRedirectStrategy
Default value NO_REDIRECTS
Allowed values
  • NO_REDIRECTS  (NO_REDIRECTS)
  • FOLLOW  (FOLLOW)
Attribute builder RemoteXmlSimpleSearchEngineBaseDescriptor.​AttributeBuilder#redirectStrategy()

Partner ID

Key EToolsDocumentSource.partnerId
Direction Input
Level ADVANCED
Description eTools partner identifier. If you have commercial arrangements with eTools, specify your partner id here.
Required no
Scope Processing time
Value type java.lang.String
Default value Carrot2
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#partnerId()

Service URL

Key EToolsDocumentSource.serviceUrlBase
Direction Input
Level ADVANCED
Description Base URL for the eTools service.
Required no
Scope Processing time
Value type java.lang.String
Default value https://www.etools.ch/partnerSearch.do
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#serviceUrlBase()

Timeout

Key EToolsDocumentSource.timeout
Direction Input
Level ADVANCED
Description Maximum time in milliseconds to wait for all data sources to return results.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 4000
Min value 0
Attribute builder EToolsDocumentSourceDescriptor.​AttributeBuilder#timeout()
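
The eTools source has two timeout attributes with different units: XmlDocumentSourceHelper.timeout is given in seconds, with zero meaning no limit, while EToolsDocumentSource.timeout is given in milliseconds. The minimal sketch below makes the units explicit by converting both to java.time.Duration. The class TimeoutUnitsSketch is hypothetical, and the range checks simply mirror the Min value and Max value rows of this reference.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical illustration class, not part of the Carrot2 API.
final class TimeoutUnitsSketch {
    // XmlDocumentSourceHelper.timeout: seconds in 0..300; zero means an infinite timeout,
    // represented here as an empty Optional.
    static Optional<Duration> dataTransferTimeout(int seconds) {
        if (seconds < 0 || seconds > 300) {
            throw new IllegalArgumentException("timeout must be in 0..300 seconds");
        }
        return seconds == 0 ? Optional.empty() : Optional.of(Duration.ofSeconds(seconds));
    }

    // EToolsDocumentSource.timeout: non-negative number of milliseconds.
    static Duration dataSourcesTimeout(int milliseconds) {
        if (milliseconds < 0) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        return Duration.ofMillis(milliseconds);
    }
}
```

With the defaults listed in this section, the data transfer timeout is 8 seconds and the wait for data sources is 4 seconds (4000 milliseconds).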

12.7 Bing Web Search

Searches the Web using Bing Search

12.7.3 Data source paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value CONSERVATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)
Attribute builder MultipageSearchEngineDescriptor.​AttributeBuilder#searchMode()

12.7.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#documents()

12.7.5 Filtering

Market

Key Bing7DocumentSource.market
Direction Input
Level BASIC
Description Language and country/region information for the request.
Required no
Scope Processing time
Value type org.carrot2.source.microsoft.v7.MarketOption
Default value ENGLISH_UNITED_STATES
Allowed values
  • ARABIC_ARABIA  (Arabic – Arabia)
  • BULGARIAN_BULGARIA  (Bulgarian – Bulgaria)
  • CHINESE_CHINA  (Chinese – China)
  • CHINESE_HONG_KONG_SAR  (Chinese – Hong Kong SAR)
  • CHINESE_TAIWAN  (Chinese – Taiwan)
  • CROATIAN_CROATIA  (Croatian – Croatia)
  • CZECH_CZECH_REPUBLIC  (Czech – Czech Republic)
  • DANISH_DENMARK  (Danish – Denmark)
  • DUTCH_BELGIUM  (Dutch – Belgium)
  • DUTCH_NETHERLANDS  (Dutch – Netherlands)
  • ENGLISH_AUSTRALIA  (English – Australia)
  • ENGLISH_ARABIA  (English – Arabia)
  • ENGLISH_CANADA  (English – Canada)
  • ENGLISH_INDIA  (English – India)
  • ENGLISH_INDONESIA  (English – Indonesia)
  • ENGLISH_IRELAND  (English – Ireland)
  • ENGLISH_MALAYSIA  (English – Malaysia)
  • ENGLISH_NEW_ZEALAND  (English – New Zealand)
  • ENGLISH_PHILIPPINES  (English – Philippines)
  • ENGLISH_SINGAPORE  (English – Singapore)
  • ENGLISH_SOUTH_AFRICA  (English – South Africa)
  • ENGLISH_UNITED_KINGDOM  (English – United Kingdom)
  • ENGLISH_UNITED_STATES  (English – United States)
  • ESTONIAN_ESTONIA  (Estonian – Estonia)
  • FINNISH_FINLAND  (Finnish – Finland)
  • FRENCH_BELGIUM  (French – Belgium)
  • FRENCH_FRANCE  (French – France)
  • FRENCH_CANADA  (French – Canada)
  • FRENCH_SWITZERLAND  (French – Switzerland)
  • GERMAN_AUSTRIA  (German – Austria)
  • GERMAN_GERMANY  (German – Germany)
  • GERMAN_SWITZERLAND  (German – Switzerland)
  • GREEK_GREECE  (Greek – Greece)
  • HEBREW_ISRAEL  (Hebrew – Israel)
  • HUNGARIAN_HUNGARY  (Hungarian – Hungary)
  • ITALIAN_ITALY  (Italian – Italy)
  • JAPANESE_JAPAN  (Japanese – Japan)
  • KOREAN_KOREA  (Korean – Korea)
  • LATVIAN_LATVIA  (Latvian – Latvia)
  • LITHUANIAN_LITHUANIA  (Lithuanian – Lithuania)
  • NORWEGIAN_NORWAY  (Norwegian – Norway)
  • POLISH_POLAND  (Polish – Poland)
  • PORTUGUESE_BRAZIL  (Portuguese – Brazil)
  • PORTUGUESE_PORTUGAL  (Portuguese – Portugal)
  • ROMANIAN_ROMANIA  (Romanian – Romania)
  • RUSSIAN_RUSSIA  (Russian – Russia)
  • SLOVAK_SLOVAK_REPUBLIC  (Slovak – Slovak Republic)
  • SLOVENIAN_SLOVENIA  (Slovenian – Slovenia)
  • SPANISH_ARGENTINA  (Spanish – Argentina)
  • SPANISH_CHILE  (Spanish – Chile)
  • SPANISH_LATIN_AMERICA  (Spanish – Latin America)
  • SPANISH_MEXICO  (Spanish – Mexico)
  • SPANISH_SPAIN  (Spanish – Spain)
  • SPANISH_UNITED_STATES  (Spanish – United States)
  • SWEDISH_SWEDEN  (Swedish – Sweden)
  • THAI_THAILAND  (Thai – Thailand)
  • TURKISH_TURKEY  (Turkish – Turkey)
  • UKRAINIAN_UKRAINE  (Ukrainian – Ukraine)
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#market()

Safe search

Key Bing7DocumentSource.adult
Direction Input
Level MEDIUM
Description Adult search restriction (porn filter).
Required no
Scope Processing time
Value type org.carrot2.source.microsoft.v7.AdultOption
Default value none
Allowed values
  • OFF  (Off)
  • MODERATE  (Moderate)
  • STRICT  (Strict)
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#adult()

Site restriction

Key Bing7DocumentSource.site
Direction Input
Level ADVANCED
Description Restricts the returned results to pages under a given URL. Example: http://www.wikipedia.org or simply wikipedia.org.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#site()

12.7.6 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch. The query hint can be used by clustering algorithms to avoid creating trivial clusters (combinations of query words).
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#results()

Start index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#start()
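
The start and results attributes together define a window of search results. A source that fetches results page by page has to split that window into individual requests, as in the minimal sketch below. The class name, the Page type and the page size of 50 are assumptions made for illustration only; the page size actually used depends on the remote service and is not stated in this reference.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only, not part of the Carrot2 API.
final class ResultWindowExample {
    /** One page request: zero-based offset of the first result and the number of results. */
    static final class Page {
        final int offset;
        final int count;

        Page(int offset, int count) {
            this.offset = offset;
            this.count = count;
        }

        @Override
        public String toString() {
            return "[offset=" + offset + ", count=" + count + "]";
        }
    }

    /** Splits the window [start, start + results) into pages of at most pageSize results. */
    static List<Page> pages(int start, int results, int pageSize) {
        if (start < 0) throw new IllegalArgumentException("start must be >= 0");
        if (results < 1) throw new IllegalArgumentException("results must be >= 1");
        if (pageSize < 1) throw new IllegalArgumentException("pageSize must be >= 1");

        List<Page> pages = new ArrayList<>();
        int end = start + results; // exclusive end of the window
        for (int offset = start; offset < end; offset += pageSize) {
            pages.add(new Page(offset, Math.min(pageSize, end - offset)));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Defaults from this reference: start = 0, results = 100.
        // The page size of 50 is a hypothetical value.
        System.out.println(pages(0, 100, 50));
    }
}
```

With the default start of 0 and results of 100, a page size of 50 yields two requests, which is the kind of figure the Page requests output attribute reports.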

12.7.7 Search result information

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#compressed()

Page requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#pageRequests()

Successful queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#queries()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#resultsTotal()

12.7.8 Service

Application API key

Key Bing7DocumentSource.apiKey
Direction Input
Level BASIC
Description The API key used to authenticate requests. You must provide your own API key. There is a free monthly request allowance.

By default, the value is read from the system property bing7.key.

Required yes
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#apiKey()
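
The fallback described above can be pictured as follows. This is a minimal sketch of the documented behaviour, not the code used by Bing7DocumentSource: the class and method names are hypothetical, and only the attribute keys (query, Bing7DocumentSource.apiKey) and the bing7.key system property come from this reference.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; class and method names are hypothetical.
final class Bing7ApiKeyExample {
    /** Returns the explicit key if it is not blank, otherwise the "bing7.key" system property. */
    static String resolveApiKey(String explicitKey) {
        String key = isBlank(explicitKey) ? System.getProperty("bing7.key") : explicitKey;
        if (isBlank(key)) {
            // The attribute is required and must not be blank.
            throw new IllegalArgumentException("Bing7DocumentSource.apiKey must not be blank");
        }
        return key;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Builds a plain attribute map using the keys listed in this reference. */
    static Map<String, Object> attributes(String query, String explicitKey) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("query", query);
        attributes.put("Bing7DocumentSource.apiKey", resolveApiKey(explicitKey));
        return attributes;
    }

    public static void main(String[] args) {
        // Equivalent to starting the JVM with -Dbing7.key=... (placeholder value).
        System.setProperty("bing7.key", "your-api-key");
        System.out.println(attributes("data mining", null));
    }
}
```

Keeping the key in a system property rather than in code makes it easier to keep credentials out of version control.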

HTTP redirect strategy

Key Bing7DocumentSource.redirectStrategy
Direction Input
Level MEDIUM
Description HTTP redirect response strategy (follow or throw an error).
Required no
Scope Processing time
Value type org.carrot2.util.httpclient.HttpRedirectStrategy
Default value NO_REDIRECTS
Allowed values
  • NO_REDIRECTS  (NO_REDIRECTS)
  • FOLLOW  (FOLLOW)
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#redirectStrategy()

Respect request rate limits

Key Bing7DocumentSource.respectRateLimits
Direction Input
Level ADVANCED
Description Respect official guidelines concerning rate limits. If set to false, rate limits are not observed.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#respectRateLimits()

12.8 Bing News Search

Searches news using Bing Search

12.8.3 Data source paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value CONSERVATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)
Attribute builder MultipageSearchEngineDescriptor.​AttributeBuilder#searchMode()

12.8.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#documents()

12.8.5 Filtering

Filter news by age

Key Bing7NewsDocumentSource.freshness
Direction Input
Level BASIC
Description Filter news by age.
Required no
Scope Processing time
Value type org.carrot2.source.microsoft.v7.Freshness
Default value none
Allowed values
  • DAY  (DAY)
  • WEEK  (WEEK)
  • MONTH  (MONTH)
Attribute builder Bing7NewsDocumentSourceDescriptor.​AttributeBuilder#freshness()

Market

Key Bing7DocumentSource.market
Direction Input
Level BASIC
Description Language and country/region information for the request.
Required no
Scope Processing time
Value type org.carrot2.source.microsoft.v7.MarketOption
Default value ENGLISH_UNITED_STATES
Allowed values
  • ARABIC_ARABIA  (Arabic – Arabia)
  • BULGARIAN_BULGARIA  (Bulgarian – Bulgaria)
  • CHINESE_CHINA  (Chinese – China)
  • CHINESE_HONG_KONG_SAR  (Chinese – Hong Kong SAR)
  • CHINESE_TAIWAN  (Chinese – Taiwan)
  • CROATIAN_CROATIA  (Croatian – Croatia)
  • CZECH_CZECH_REPUBLIC  (Czech – Czech Republic)
  • DANISH_DENMARK  (Danish – Denmark)
  • DUTCH_BELGIUM  (Dutch – Belgium)
  • DUTCH_NETHERLANDS  (Dutch – Netherlands)
  • ENGLISH_AUSTRALIA  (English – Australia)
  • ENGLISH_ARABIA  (English – Arabia)
  • ENGLISH_CANADA  (English – Canada)
  • ENGLISH_INDIA  (English – India)
  • ENGLISH_INDONESIA  (English – Indonesia)
  • ENGLISH_IRELAND  (English – Ireland)
  • ENGLISH_MALAYSIA  (English – Malaysia)
  • ENGLISH_NEW_ZEALAND  (English – New Zealand)
  • ENGLISH_PHILIPPINES  (English – Philippines)
  • ENGLISH_SINGAPORE  (English – Singapore)
  • ENGLISH_SOUTH_AFRICA  (English – South Africa)
  • ENGLISH_UNITED_KINGDOM  (English – United Kingdom)
  • ENGLISH_UNITED_STATES  (English – United States)
  • ESTONIAN_ESTONIA  (Estonian – Estonia)
  • FINNISH_FINLAND  (Finnish – Finland)
  • FRENCH_BELGIUM  (French – Belgium)
  • FRENCH_FRANCE  (French – France)
  • FRENCH_CANADA  (French – Canada)
  • FRENCH_SWITZERLAND  (French – Switzerland)
  • GERMAN_AUSTRIA  (German – Austria)
  • GERMAN_GERMANY  (German – Germany)
  • GERMAN_SWITZERLAND  (German – Switzerland)
  • GREEK_GREECE  (Greek – Greece)
  • HEBREW_ISRAEL  (Hebrew – Israel)
  • HUNGARIAN_HUNGARY  (Hungarian – Hungary)
  • ITALIAN_ITALY  (Italian – Italy)
  • JAPANESE_JAPAN  (Japanese – Japan)
  • KOREAN_KOREA  (Korean – Korea)
  • LATVIAN_LATVIA  (Latvian – Latvia)
  • LITHUANIAN_LITHUANIA  (Lithuanian – Lithuania)
  • NORWEGIAN_NORWAY  (Norwegian – Norway)
  • POLISH_POLAND  (Polish – Poland)
  • PORTUGUESE_BRAZIL  (Portuguese – Brazil)
  • PORTUGUESE_PORTUGAL  (Portuguese – Portugal)
  • ROMANIAN_ROMANIA  (Romanian – Romania)
  • RUSSIAN_RUSSIA  (Russian – Russia)
  • SLOVAK_SLOVAK_REPUBLIC  (Slovak – Slovak Republic)
  • SLOVENIAN_SLOVENIA  (Slovenian – Slovenia)
  • SPANISH_ARGENTINA  (Spanish – Argentina)
  • SPANISH_CHILE  (Spanish – Chile)
  • SPANISH_LATIN_AMERICA  (Spanish – Latin America)
  • SPANISH_MEXICO  (Spanish – Mexico)
  • SPANISH_SPAIN  (Spanish – Spain)
  • SPANISH_UNITED_STATES  (Spanish – United States)
  • SWEDISH_SWEDEN  (Swedish – Sweden)
  • THAI_THAILAND  (Thai – Thailand)
  • TURKISH_TURKEY  (Turkish – Turkey)
  • UKRAINIAN_UKRAINE  (Ukrainian – Ukraine)
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#market()

Safe search

Key Bing7DocumentSource.adult
Direction Input
Level MEDIUM
Description Adult search restriction (porn filter).
Required no
Scope Processing time
Value type org.carrot2.source.microsoft.v7.AdultOption
Default value none
Allowed values
  • OFF  (Off)
  • MODERATE  (Moderate)
  • STRICT  (Strict)
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#adult()

Site restriction

Key Bing7DocumentSource.site
Direction Input
Level ADVANCED
Description Restricts results to those found under a given URL. Example: http://www.wikipedia.org or simply wikipedia.org.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#site()

12.8.6 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch. The query hint can be used by clustering algorithms to avoid creating trivial clusters (combinations of query words).
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#results()

Start index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#start()

12.8.7 Search result information

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#compressed()

Page requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#pageRequests()

Successful queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#queries()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#resultsTotal()

12.8.8 Service

Application API key

Key Bing7DocumentSource.apiKey
Direction Input
Level BASIC
Description The API key used to authenticate requests. You must provide your own API key. There is a free monthly request allowance.

By default, the value is read from the system property bing7.key.

Required yes
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#apiKey()

HTTP redirect strategy

Key Bing7DocumentSource.redirectStrategy
Direction Input
Level MEDIUM
Description HTTP redirect response strategy (follow or throw an error).
Required no
Scope Processing time
Value type org.carrot2.util.httpclient.HttpRedirectStrategy
Default value NO_REDIRECTS
Allowed values
  • NO_REDIRECTS  (NO_REDIRECTS)
  • FOLLOW  (FOLLOW)
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#redirectStrategy()

Respect request rate limits

Key Bing7DocumentSource.respectRateLimits
Direction Input
Level ADVANCED
Description Respect official guidelines concerning rate limits. If set to false, rate limits are not observed.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder Bing7DocumentSourceDescriptor.​AttributeBuilder#respectRateLimits()

12.9 PubMed medical database

Searches the PubMed medical abstracts database

12.9.1 PubMed medical database input attributes by level

12.9.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#documents()

12.9.4 Search query

EUtils Registered Tool Name

Key PubMedDocumentSource.toolName
Direction Input
Level ADVANCED
Description Tool name, if registered.
Required no
Scope Initialization time
Value type java.lang.String
Default value Carrot Search
Attribute builder PubMedDocumentSourceDescriptor.​AttributeBuilder#toolName()

Maximum results

Key PubMedDocumentSource.maxResults
Direction Input
Level ADVANCED
Description Maximum results to fetch. No more than the specified number of results will be fetched from PubMed, regardless of the requested number of results.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 150
Min value 1
Attribute builder PubMedDocumentSourceDescriptor.​AttributeBuilder#maxResults()
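
In other words, PubMedDocumentSource.maxResults acts as an upper bound on the results attribute. The sketch below illustrates that reading of the description; the class and method names are hypothetical and the code is not taken from PubMedDocumentSource.

```java
// Illustrative sketch only; names are hypothetical.
final class PubMedResultLimitExample {
    /** Number of results actually fetched: the requested number, capped by maxResults. */
    static int effectiveResults(int requestedResults, int maxResults) {
        if (requestedResults < 1) throw new IllegalArgumentException("results must be >= 1");
        if (maxResults < 1) throw new IllegalArgumentException("maxResults must be >= 1");
        return Math.min(requestedResults, maxResults);
    }

    public static void main(String[] args) {
        // Defaults from this reference: results = 100, maxResults = 150.
        System.out.println(effectiveResults(100, 150)); // prints 100
        // A larger request is capped at maxResults.
        System.out.println(effectiveResults(500, 150)); // prints 150
    }
}
```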

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch. The query hint can be used by clustering algorithms to avoid creating trivial clusters (combinations of query words).
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#results()

Start index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#start()

12.9.5 Search result information

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#compressed()

Page requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#pageRequests()

Successful queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#queries()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#resultsTotal()

12.9.6 Service

HTTP redirect strategy

Key PubMedDocumentSource.redirectStrategy
Direction Input
Level MEDIUM
Description HTTP redirect response strategy (follow or throw an error).
Required no
Scope Processing time
Value type org.carrot2.util.httpclient.HttpRedirectStrategy
Default value NO_REDIRECTS
Allowed values
  • NO_REDIRECTS  (NO_REDIRECTS)
  • FOLLOW  (FOLLOW)
Attribute builder PubMedDocumentSourceDescriptor.​AttributeBuilder#redirectStrategy()

12.10 XML

XML document source retrieves documents from local XML files or remote XML streams. It can optionally apply an XSLT transformation to convert the XML to the required format.

12.10.3 Documents

Documents

Key documents
Direction Output
Description Documents read from the XML data.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#documents()

12.10.4 Search query

Query

Key query
Direction Input and Output
Level BASIC
Description After processing, this field may hold the query read from the XML data, if any. For the semantics of this field on input, see org.carrot2.source.xml.XmlDocumentSource.xml.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#query()

Read all documents

Key XmlDocumentSource.readAll
Direction Input
Level BASIC
Description If true, all documents are read from the input XML stream, regardless of the limit set by org.carrot2.source.xml.XmlDocumentSource.results.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value true
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#readAll()

Results

Key results
Direction Input
Level BASIC
Description The maximum number of documents to read from the XML data if org.carrot2.source.xml.XmlDocumentSource.readAll is false. The query hint can be used by clustering algorithms to avoid creating trivial clusters (combinations of query words).
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#results()
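
The interaction between XmlDocumentSource.readAll and results can be summarised as in the sketch below. It only illustrates the rule stated in the two descriptions; the class and method names are hypothetical and the code is not taken from XmlDocumentSource.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical.
final class XmlReadLimitExample {
    /** Returns the documents to keep: all of them if readAll is true, else at most `results`. */
    static <T> List<T> documentsToKeep(List<T> parsed, boolean readAll, int results) {
        if (readAll) {
            return parsed; // the results limit is ignored
        }
        if (results < 1) throw new IllegalArgumentException("results must be >= 1");
        return parsed.subList(0, Math.min(results, parsed.size()));
    }

    public static void main(String[] args) {
        List<String> parsed = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
        System.out.println(documentsToKeep(parsed, true, 2).size());  // prints 5
        System.out.println(documentsToKeep(parsed, false, 2).size()); // prints 2
    }
}
```

Because readAll defaults to true, setting results alone has no effect on how many documents are read; readAll must be set to false as well.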

12.10.5 Search result information

Clusters

Key clusters
Direction Input and Output
Level BASIC
Description If org.carrot2.source.xml.XmlDocumentSource.readClusters is true and clusters are present in the input XML, they will be deserialized and exposed to components further down the processing chain.
Required no
Scope Processing time
Value type java.util.List
Default value none
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#clusters()

Title

Key processing-result.title
Direction Output
Description The title (file name or query attribute, if present) for the search result fetched from the resource. A typical title for a processing result will be the query used to fetch documents from that source. For certain document sources the query may not be needed (on-disk XML, feed of syndicated news); in such cases, the input component should set its title properly for visual interfaces such as the workbench.
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#title()

12.10.6 Service

Data transfer timeout

Key XmlDocumentSourceHelper.timeout
Direction Input
Level ADVANCED
Description Data transfer timeout, in seconds. A timeout value of zero is interpreted as an infinite timeout.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 8
Min value 0
Max value 300
Attribute builder XmlDocumentSourceHelperDescriptor.​AttributeBuilder#timeout()
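
The sketch below shows one way such a value could be validated and applied to a java.net.URLConnection, whose timeouts are expressed in milliseconds and likewise treat zero as infinite. The class and method names are hypothetical, and whether Carrot2 applies the value to the connect timeout, the read timeout or both is an assumption of this example.

```java
import java.net.URLConnection;

// Illustrative sketch only; names are hypothetical.
final class XmlTimeoutExample {
    /** Validates a timeout given in seconds (0..300) and converts it to milliseconds. */
    static int toMillis(int timeoutSeconds) {
        if (timeoutSeconds < 0 || timeoutSeconds > 300) {
            throw new IllegalArgumentException("timeout must be between 0 and 300 seconds");
        }
        return timeoutSeconds * 1000; // 0 stays 0, which URLConnection treats as infinite
    }

    /** Applies the timeout to both connection phases (an assumption of this sketch). */
    static void apply(URLConnection connection, int timeoutSeconds) {
        int millis = toMillis(timeoutSeconds);
        connection.setConnectTimeout(millis);
        connection.setReadTimeout(millis);
    }

    public static void main(String[] args) {
        System.out.println(toMillis(8)); // default value from this reference; prints 8000
        System.out.println(toMillis(0)); // prints 0, meaning no timeout
    }
}
```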

12.10.7 XML data

XML parameters

Key XmlDocumentSource.xmlParameters
Direction Input
Level ADVANCED
Description Values for custom placeholders in the XML URL. If the type of resource provided in the org.carrot2.source.xml.XmlDocumentSource.xml attribute is org.carrot2.util.resource.URLResourceWithParams, this map provides values for custom placeholders found in the XML URL. Keys of the map correspond to placeholder names, values of the map will be used to replace the placeholders. Please see org.carrot2.source.xml.XmlDocumentSource.xml for the placeholder syntax.
Required no
Scope Initialization time and Processing time
Value type java.util.Map
Default value {}
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#xmlParameters()

XML resource

Key XmlDocumentSource.xml
Direction Input
Level BASIC
Description The resource to load XML data from. You can either create instances of org.carrot2.util.resource.IResource implementations directly or use org.carrot2.util.resource.ResourceLookup to look up org.carrot2.util.resource.IResource instances from a variety of locations.

One special org.carrot2.util.resource.IResource implementation you can use is org.carrot2.util.resource.URLResourceWithParams. It allows you to specify attribute placeholders in the URL that will be replaced with actual values at runtime. The placeholder format is ${attribute}. The following common attributes will be substituted:

  • query will be replaced with the current query being processed. If the query has not been provided, this attribute will fall back to an empty string.
  • results will be replaced with the number of results requested. If the number of results has not been provided, this attribute will be substituted with an empty string.

Additionally, custom placeholders can be used. Values for the custom placeholders should be provided in the org.carrot2.source.xml.XmlDocumentSource.xmlParameters attribute.

Required yes
Scope Initialization time and Processing time
Value type org.carrot2.util.resource.IResource
Default value none
Allowed value types Other assignable value types are allowed.
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#xml()
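
The placeholder substitution described above can be sketched as follows. This is not the URLResourceWithParams implementation: the class and method names are hypothetical, the example URL is made up, and URL-encoding the substituted values is a choice made for this example rather than documented behaviour.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names are hypothetical.
final class UrlPlaceholderExample {
    // Matches ${name} and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} with its value from the map; missing values become "". */
    static String expand(String template, Map<String, ?> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            Object value = values.get(matcher.group(1));
            String text = value == null ? "" : encode(value.toString());
            matcher.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Encoding is an assumption of this sketch, so values are safe inside a query string.
    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> values = new HashMap<>();
        values.put("query", "data mining");
        values.put("results", 50);
        values.put("lang", "en"); // a custom placeholder, as supplied via xmlParameters
        System.out.println(expand(
            "http://example.com/search?q=${query}&n=${results}&l=${lang}", values));
    }
}
```

Here query and results play the role of the common attributes, while lang stands for a custom placeholder whose value would be supplied through XmlDocumentSource.xmlParameters.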

12.10.8 XML transformation

Read clusters from input

Key XmlDocumentSource.readClusters
Direction Input
Level BASIC
Description If clusters are present in the input XML, they will be read and exposed to components further down the processing chain.
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value false
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#readClusters()

XSLT parameters

Key XmlDocumentSource.xsltParameters
Direction Input
Level ADVANCED
Description Parameters to be passed to the XSLT transformer. Keys of the map will be used as parameter names, values of the map as parameter values.
Required no
Scope Initialization time and Processing time
Value type java.util.Map
Default value {}
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#xsltParameters()

XSLT stylesheet

Key XmlDocumentSource.xslt
Direction Input
Level MEDIUM
Description The resource to load the XSLT stylesheet from. The XSLT stylesheet is optional and is useful when the source XML stream does not follow the Carrot2 format. The XSLT transformation will be applied to the source XML stream, and the transformed XML stream will be deserialized into org.carrot2.core.Document instances.

The XSLT org.carrot2.util.resource.IResource can be provided both at initialization time and at processing time. The stylesheet provided at initialization will be cached for the lifetime of the component, while processing-time stylesheets will be compiled every time processing is requested and will override the initialization-time stylesheet.

To pass additional parameters to the XSLT transformer, use the org.carrot2.source.xml.XmlDocumentSource.xsltParameters attribute.

Required no
Scope Initialization time and Processing time
Value type org.carrot2.util.resource.IResource
Default value none
Allowed value types Other assignable value types are allowed.
Attribute builder XmlDocumentSourceDescriptor.​AttributeBuilder#xslt()
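
The caching and override rules for stylesheets can be illustrated with the standard javax.xml.transform API. The sketch below is not the XmlDocumentSource implementation: the class name is hypothetical, and stylesheets are passed as strings instead of IResource instances to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative sketch only; names are hypothetical.
final class XsltCachingExample {
    private final TransformerFactory factory = TransformerFactory.newInstance();
    private final Templates initTimeTemplates; // compiled once and reused; may be null

    XsltCachingExample(String initTimeXslt) throws TransformerException {
        this.initTimeTemplates = initTimeXslt == null ? null : compile(initTimeXslt);
    }

    private Templates compile(String xslt) throws TransformerException {
        return factory.newTemplates(new StreamSource(new StringReader(xslt)));
    }

    /** Transforms xml; a processing-time stylesheet overrides the cached one. */
    String transform(String xml, String processingTimeXslt, Map<String, Object> xsltParameters)
            throws TransformerException {
        Templates templates = processingTimeXslt != null
            ? compile(processingTimeXslt) // compiled on every call
            : initTimeTemplates;          // cached for the lifetime of this object
        if (templates == null) {
            return xml; // the stylesheet is optional
        }
        Transformer transformer = templates.newTransformer();
        for (Map.Entry<String, Object> e : xsltParameters.entrySet()) {
            transformer.setParameter(e.getKey(), e.getValue());
        }
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xslt =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:param name=\"prefix\"/>"
            + "<xsl:template match=\"/\">"
            + "<xsl:value-of select=\"$prefix\"/><xsl:value-of select=\"/doc/title\"/>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
        Map<String, Object> parameters = new HashMap<>();
        parameters.put("prefix", "Title: ");
        XsltCachingExample example = new XsltCachingExample(xslt);
        System.out.println(example.transform("<doc><title>Carrot2</title></doc>", null, parameters));
    }
}
```

Compiling a stylesheet into a Templates object is the expensive step, which is why supplying the stylesheet at initialization time is preferable when the same transformation is applied to many requests.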

12.11 Lucene Document Source

Retrieves documents from an Apache Lucene index. The index directory must be available in the local file system.

12.11.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#documents()

12.11.4 Highlighter

Context fragments

Key org.carrot2.source.lucene.SimpleFieldMapper.contextFragments
Direction Input
Level ADVANCED
Description Number of context fragments for the highlighter.
Required no
Scope Initialization time and Processing time
Value type java.lang.Integer
Default value 3
Min value 1
Attribute builder SimpleFieldMapperDescriptor.​AttributeBuilder#contextFragments()

Formatter

Key org.carrot2.source.lucene.SimpleFieldMapper.formatter
Direction Input
Level ADVANCED
Description Snippet formatter for the highlighter. The highlighter is not used if this attribute is null.
Required no
Scope Initialization time and Processing time
Value type org.apache.lucene.search.highlight.Formatter
Default value org.carrot2.source.lucene.PlainTextFormatter
Allowed value types Other assignable value types are allowed.
Attribute builder SimpleFieldMapperDescriptor.​AttributeBuilder#formatter()

Join string

Key org.carrot2.source.lucene.SimpleFieldMapper.fragmentJoin
Direction Input
Level ADVANCED
Description A string used to join context fragments when highlighting.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value ...
Attribute builder SimpleFieldMapperDescriptor.​AttributeBuilder#fragmentJoin()
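
Taken together, contextFragments and fragmentJoin control how a highlighted summary is assembled: at most contextFragments fragments are kept and concatenated with the join string. The sketch below illustrates that reading with plain strings; the class and method names are hypothetical and the code does not use the Lucene highlighter.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical.
final class FragmentJoinExample {
    /** Joins at most contextFragments fragments using fragmentJoin as the separator. */
    static String summary(List<String> fragments, int contextFragments, String fragmentJoin) {
        if (contextFragments < 1) throw new IllegalArgumentException("contextFragments must be >= 1");
        int kept = Math.min(contextFragments, fragments.size());
        return String.join(fragmentJoin, fragments.subList(0, kept));
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
            "clustering of search results",
            "Lingo algorithm",
            "suffix tree clustering",
            "k-means");
        // Defaults from this reference: contextFragments = 3, fragmentJoin = "..."
        System.out.println(summary(fragments, 3, "..."));
    }
}
```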

12.11.5 Index field mapping

Document content field

Key org.carrot2.source.lucene.SimpleFieldMapper.contentField
Direction Input
Level BASIC
Description Document content field name.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder SimpleFieldMapperDescriptor.​AttributeBuilder#contentField()

Document title field

Key org.carrot2.source.lucene.SimpleFieldMapper.titleField
Direction Input
Level BASIC
Description Document title field name.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder SimpleFieldMapperDescriptor.​AttributeBuilder#titleField()

Document URL field

Key org.carrot2.source.lucene.SimpleFieldMapper.urlField
Direction Input
Level BASIC
Description Document URL field name.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder SimpleFieldMapperDescriptor.​AttributeBuilder#urlField()

Field mapper

Key LuceneDocumentSource.fieldMapper
Direction Input
Level ADVANCED
Description IFieldMapper provides the link between Carrot2 org.carrot2.core.Document fields and Lucene index fields.
Required yes
Scope Initialization time and Processing time
Value type org.carrot2.source.lucene.IFieldMapper
Default value org.carrot2.source.lucene.SimpleFieldMapper
Allowed value types Other assignable value types are allowed.
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#fieldMapper()

Search fields

Key org.carrot2.source.lucene.SimpleFieldMapper.searchFields
Direction Input
Level MEDIUM
Description Index search field names. If not specified, the title and content fields are used.
Required no
Scope Initialization time and Processing time
Value type java.util.List
Default value none
Attribute builder SimpleFieldMapperDescriptor.​AttributeBuilder#searchFields()

12.11.6 Index properties

Analyzer

Key LuceneDocumentSource.analyzer
Direction Input
Level MEDIUM
Description org.apache.lucene.analysis.Analyzer used at indexing time. The same analyzer should be used for querying.
Required yes
Scope Initialization time and Processing time
Value type org.apache.lucene.analysis.Analyzer
Default value org.apache.lucene.analysis.standard.StandardAnalyzer
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#analyzer()

Index directory

Key LuceneDocumentSource.directory
Direction Input
Level BASIC
Description Search index org.apache.lucene.store.Directory. Must be unlocked for reading.
Required yes
Scope Initialization time and Processing time
Value type org.apache.lucene.store.Directory
Default value none
Allowed value types
  • org.apache.lucene.store.RAMDirectory
  • org.apache.lucene.store.FSDirectory
Other assignable value types are allowed.
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#directory()

12.11.7 Search query

Query

Key query
Direction Input
Level BASIC
Description A pre-parsed org.apache.lucene.search.Query object or a String parsed using the built-in classic QueryParser over a set of search fields returned from the org.carrot2.source.lucene.LuceneDocumentSource.fieldMapper.
Required yes
Scope Processing time
Value type java.lang.Object
Default value none
Allowed value types
  • org.apache.lucene.search.Query
  • java.lang.String
Other assignable value types are allowed.
Value content Must not be blank
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents/search results to fetch. The query hint can be used by clustering algorithms to avoid creating trivial clusters (combinations of query words).
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#results()

12.11.8 Search result information

Keep Lucene documents

Key LuceneDocumentSource.keepLuceneDocuments
Direction Input
Level ADVANCED
Description Keeps references to Lucene document instances in Carrot2 documents. Please bear in mind two limitations:
  • Lucene documents will not be serialized to XML/JSON. Therefore, they can only be accessed when invoking clustering through the Carrot2 Java API. To pass some of the fields of Lucene documents to Carrot2 XML/JSON output, implement a custom IFieldMapper that stores those fields as regular Carrot2 fields.
  • Memory usage increases when using an org.carrot2.core.Controller configured to cache the output from LuceneDocumentSource (see org.carrot2.core.ControllerFactory.createCachingPooling(Class...)).
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#keepLuceneDocuments()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder LuceneDocumentSourceDescriptor.​AttributeBuilder#resultsTotal()

12.12 Solr Search Engine

Solr document source queries an instance of the Apache Solr search engine.

12.12.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#documents()

12.12.4 Index field mapping

Copy Solr document fields

Key SolrDocumentSource.copyFields
Direction Input
Level ADVANCED
Description Copy Solr fields from the search result to Carrot2 org.carrot2.core.Document instances (as fields).
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value false
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#copyFields()

Custom XSLT adapter from Solr to Carrot2 format

Key SolrDocumentSource.solrXsltAdapter
Direction Input
Level ADVANCED
Description Provides a custom XSLT stylesheet for converting Solr's output to an XML format parsed by Carrot2. For performance reasons, this attribute can be provided at initialization time only (no processing-time overrides).
Required no
Scope Initialization time
Value type org.carrot2.util.resource.IResource
Default value none
Allowed value types Allowed value types: Other assignable value types are allowed.
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#solrXsltAdapter()

ID field name

Key SolrDocumentSource.solrIdFieldName
Direction Input
Level MEDIUM
Description Document identifier field name (specified in Solr schema). This field is necessary to connect Solr-side clusters or highlighter output to documents.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#solrIdFieldName()

Read Solr clusters if present

Key SolrDocumentSource.readClusters
Direction Input
Level BASIC
Description If clusters are present in the Solr output, they will be read and exposed to components further down the processing chain. Note that org.carrot2.source.solr.SolrDocumentSource.solrIdFieldName is required to match document references.
Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value false
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#readClusters()

Summary field name

Key SolrDocumentSource.solrSummaryFieldName
Direction Input
Level MEDIUM
Description Summary field name. Name of the Solr field that will provide document summary.
Required no
Scope Processing time
Value type java.lang.String
Default value description
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#solrSummaryFieldName()

Title field name

Key SolrDocumentSource.solrTitleFieldName
Direction Input
Level MEDIUM
Description Title field name. Name of the Solr field that will provide document titles.
Required no
Scope Processing time
Value type java.lang.String
Default value title
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#solrTitleFieldName()

URL field name

Key SolrDocumentSource.solrUrlFieldName
Direction Input
Level MEDIUM
Description URL field name. Name of the Solr field that will provide document URLs.
Required no
Scope Processing time
Value type java.lang.String
Default value url
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#solrUrlFieldName()
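The field-mapping keys listed above can be collected into a plain key-to-value map. The following is a minimal sketch that uses only the Java standard library; it does not call any Carrot2 class, and how the map is handed to the framework is not shown. The class name and the Solr field names (id, headline, body, link) are illustrative assumptions, only the attribute keys are taken from this reference.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; only the attribute keys come from the reference above. */
class SolrFieldMappingExample {
    /** Builds an attribute map that points the source at custom Solr fields. */
    static Map<String, Object> fieldMapping() {
        Map<String, Object> attributes = new LinkedHashMap<>();
        // Field names on the right are assumptions about a Solr schema.
        attributes.put("SolrDocumentSource.solrIdFieldName", "id");
        attributes.put("SolrDocumentSource.solrTitleFieldName", "headline");
        attributes.put("SolrDocumentSource.solrSummaryFieldName", "body");
        attributes.put("SolrDocumentSource.solrUrlFieldName", "link");
        return attributes;
    }

    public static void main(String[] args) {
        // Print each key with the value it maps to.
        fieldMapping().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Keys that are left out of the map would keep the default values documented for each attribute (title, description and url).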

Use highlighter output if present

Key SolrDocumentSource.useHighlighterOutput
Direction Input
Level BASIC
Description If highlighter fragments are present in the Solr output, they will be used in preference to full field content. This may be used to decrease the memory required for clustering. Note that when the highlighter is used, Solr will generally not emit the contents of full fields, because doing so makes little sense.

Setting this option to false will disable using the highlighter output entirely.

Required no
Scope Initialization time and Processing time
Value type java.lang.Boolean
Default value true
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#useHighlighterOutput()

12.12.5 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#results()

Start index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#start()

12.12.6 Search result information

Clusters

Key clusters
Direction Input and Output
Level BASIC
Description If org.carrot2.source.solr.SolrDocumentSource.readClusters is true and clusters are present in the input XML, they will be deserialized and exposed to components further down the processing chain.
Required no
Scope Processing time
Value type java.util.List
Default value none
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#clusters()

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#compressed()

Page requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#pageRequests()

Successful queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#queries()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#resultsTotal()

12.12.7 Service

Data transfer timeout

Key XmlDocumentSourceHelper.timeout
Direction Input
Level ADVANCED
Description Data transfer timeout. Specifies the data transfer timeout, in seconds. A timeout value of zero is interpreted as an infinite timeout.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 8
Min value 0
Max value 300
Attribute builder XmlDocumentSourceHelperDescriptor.​AttributeBuilder#timeout()
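The timeout is given in seconds and limited to the range 0 to 300, while the standard java.net.URLConnection API expects milliseconds and also treats zero as an infinite timeout. The sketch below shows one way the value could be validated and converted; the class and method names are hypothetical, and this is not the code Carrot2 itself uses.

```java
import java.net.URLConnection;

/** Hypothetical helper illustrating the documented timeout range. */
class TimeoutExample {
    static final int MIN_SECONDS = 0;   // zero means "wait indefinitely"
    static final int MAX_SECONDS = 300; // documented maximum

    /** Validates the documented range and converts seconds to milliseconds. */
    static int toMillis(int timeoutSeconds) {
        if (timeoutSeconds < MIN_SECONDS || timeoutSeconds > MAX_SECONDS) {
            throw new IllegalArgumentException(
                "Timeout must be within [0, 300] seconds: " + timeoutSeconds);
        }
        return timeoutSeconds * 1000;
    }

    /** Applies the timeout to a connection; a value of 0 disables the read timeout. */
    static void apply(URLConnection connection, int timeoutSeconds) {
        connection.setReadTimeout(toMillis(timeoutSeconds));
    }
}
```

With the default of 8 seconds, toMillis returns 8000.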

Filter query

Key SolrDocumentSource.solrFilterQuery
Direction Input
Level MEDIUM
Description Filter query appended to org.carrot2.source.solr.SolrDocumentSource.serviceUrlBase.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#solrFilterQuery()

HTTP redirect strategy

Key org.carrot2.source.xml.RemoteXmlSimpleSearchEngineBase.redirectStrategy
Direction Input
Level MEDIUM
Description HTTP redirect response strategy (follow or throw an error).
Required no
Scope Processing time
Value type org.carrot2.util.httpclient.HttpRedirectStrategy
Default value NO_REDIRECTS
Allowed values
  • NO_REDIRECTS  (NO_REDIRECTS)
  • FOLLOW  (FOLLOW)
Attribute builder RemoteXmlSimpleSearchEngineBaseDescriptor.​AttributeBuilder#redirectStrategy()
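The two strategies differ only in what happens when the server answers with a redirect. The sketch below illustrates that decision in isolation, assuming a redirect is any 3xx response that carries a Location header. The enum and method are hypothetical stand-ins, not the org.carrot2.util.httpclient.HttpRedirectStrategy type, and the unchecked exception is an illustrative choice.

```java
/** Hypothetical illustration of the "follow or throw an error" choice. */
class RedirectExample {
    enum RedirectMode { NO_REDIRECTS, FOLLOW }

    /**
     * Returns the URL to request next, or null when the response is not a redirect.
     * Throws when a redirect arrives and redirects are not allowed.
     */
    static String nextUrl(int statusCode, String locationHeader, RedirectMode mode) {
        boolean redirect = statusCode >= 300 && statusCode < 400 && locationHeader != null;
        if (!redirect) {
            return null;
        }
        if (mode == RedirectMode.NO_REDIRECTS) {
            throw new IllegalStateException(
                "Unexpected redirect (" + statusCode + ") to " + locationHeader);
        }
        return locationHeader;
    }
}
```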

Service URL

Key SolrDocumentSource.serviceUrlBase
Direction Input
Level ADVANCED
Description Solr service URL base. The URL base can contain additional Solr parameters, for example: http://localhost:8983/solr/select?fq=timestamp:[NOW-24HOUR TO NOW]
Required no
Scope Processing time
Value type java.lang.String
Default value http://localhost:8983/solr/select
Attribute builder SolrDocumentSourceDescriptor.​AttributeBuilder#serviceUrlBase()
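Because the service URL base may already carry parameters, anything appended to it has to be joined with either ? or &. The sketch below shows how a request URL could be assembled from the base, the query, the start index, the number of results and an optional filter query. It uses the standard Solr parameters q, start, rows and fq; the class and method names are hypothetical, and the code is an illustration rather than the request logic Carrot2 actually uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that assembles a Solr select URL. */
class SolrUrlExample {
    /** URL-encodes a single parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String buildUrl(String serviceUrlBase, String query, int start, int rows,
                           String filterQuery) {
        StringBuilder url = new StringBuilder(serviceUrlBase);
        // Continue an existing query string or start a new one.
        url.append(serviceUrlBase.contains("?") ? '&' : '?');
        url.append("q=").append(encode(query));
        url.append("&start=").append(start);
        url.append("&rows=").append(rows);
        // The filter query is optional; skip it when blank.
        if (filterQuery != null && !filterQuery.trim().isEmpty()) {
            url.append("&fq=").append(encode(filterQuery));
        }
        return url.toString();
    }
}
```

For the default base URL, the query "data mining" and no filter, this yields http://localhost:8983/solr/select?q=data+mining&start=0&rows=100.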

12.13 Open Search

Open Search document source retrieves search results from search engines supporting the OpenSearch standard.

12.13.3 Data source paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)
Attribute builder MultipageSearchEngineDescriptor.​AttributeBuilder#searchMode()

12.13.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#documents()

12.13.5 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#results()

Start index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#start()

12.13.6 Search result information

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#compressed()

Page requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#pageRequests()

Successful queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#queries()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#resultsTotal()

12.13.7 Service

Feed URL parameters

Key OpenSearchDocumentSource.feedUrlParams
Direction Input
Level ADVANCED
Description Additional parameters to be appended to org.carrot2.source.opensearch.OpenSearchDocumentSource.feedUrlTemplate on each request.
Required no
Scope Initialization time and Processing time
Value type java.util.Map
Default value none
Attribute builder OpenSearchDocumentSourceDescriptor.​AttributeBuilder#feedUrlParams()

Feed URL template

Key OpenSearchDocumentSource.feedUrlTemplate
Direction Input
Level BASIC
Description URL to fetch the search feed from. The URL template can contain variable placeholders, as defined by the OpenSearch specification, that will be replaced at runtime. The format of a placeholder is ${variable}. The following variables are supported:
  • searchTerms will be replaced by the query.
  • startIndex index of the first result to be searched. Mutually exclusive with startPage.
  • startPage index of the first page of results to be searched. Mutually exclusive with startIndex.
  • count the number of search results per page.

Example URL feed templates for public services:

nature.com

http://www.nature.com/opensearch/request?interface=opensearch&operation=searchRetrieve&query=${searchTerms}&startRecord=${startIndex}&maximumRecords=${count}&httpAccept=application/rss%2Bxml

indeed.com

http://www.indeed.com/opensearch?q=${searchTerms}&start=${startIndex}&limit=${count}

Required yes
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder OpenSearchDocumentSourceDescriptor.​AttributeBuilder#feedUrlTemplate()
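The placeholder substitution described for the feed URL template can be sketched as a plain string replacement. The example below replaces each ${variable} with its URL-encoded value; the class and method names are hypothetical, and the code illustrates the idea rather than the implementation used by the document source.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

/** Hypothetical helper that expands ${variable} placeholders in a feed URL template. */
class FeedUrlTemplateExample {
    /** URL-encodes a single value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Replaces every ${name} in the template with the encoded value for that name. */
    static String expand(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> variable : variables.entrySet()) {
            String placeholder = "${" + variable.getKey() + "}";
            result = result.replace(placeholder, encode(variable.getValue()));
        }
        return result;
    }
}
```

Applied to the indeed.com template with searchTerms set to "data mining", startIndex to 0 and count to 50, the result is http://www.indeed.com/opensearch?q=data+mining&start=0&limit=50.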

Maximum results

Key OpenSearchDocumentSource.maximumResults
Direction Input
Level BASIC
Description Maximum number of results. The maximum number of results the document source can deliver.
Required no
Scope Initialization time and Processing time
Value type java.lang.Integer
Default value 1000
Min value 1
Attribute builder OpenSearchDocumentSourceDescriptor.​AttributeBuilder#maximumResults()

Results per page

Key OpenSearchDocumentSource.resultsPerPage
Direction Input
Level BASIC
Description Results per page. The number of results per page the document source will expect the feed to return.
Required yes
Scope Initialization time and Processing time
Value type java.lang.Integer
Default value 50
Min value 1
Attribute builder OpenSearchDocumentSourceDescriptor.​AttributeBuilder#resultsPerPage()
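The start, results, resultsPerPage and maximumResults attributes together determine which pages have to be requested from the feed. The sketch below computes the page requests under one simple assumption: maximumResults caps the absolute index of the last result the source can deliver. The class and method names are hypothetical, and the actual paging logic in Carrot2 may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a result range into page requests. */
class PagingExample {
    /** Returns {startIndex, count} pairs, one per page request. */
    static List<int[]> pages(int start, int results, int resultsPerPage, int maximumResults) {
        if (resultsPerPage < 1) {
            throw new IllegalArgumentException("resultsPerPage must be at least 1");
        }
        List<int[]> pages = new ArrayList<>();
        // Assumption: the source cannot deliver results beyond maximumResults.
        int end = Math.min(start + results, maximumResults);
        for (int index = start; index < end; index += resultsPerPage) {
            int count = Math.min(resultsPerPage, end - index);
            pages.add(new int[] { index, count });
        }
        return pages;
    }
}
```

Requesting 120 results from index 0 with 50 results per page gives three requests: 50 results from index 0, 50 from index 50 and 20 from index 100.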

User agent

Key OpenSearchDocumentSource.userAgent
Direction Input
Level ADVANCED
Description User agent header. The contents of the User-Agent HTTP header to use when making requests to the feed URL. If an empty or null value is provided, the following User-Agent will be sent: Rome Client (http://tinyurl.com/64t5n) Ver: UNKNOWN.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder OpenSearchDocumentSourceDescriptor.​AttributeBuilder#userAgent()

12.14 IDOL Search

IDOL document source retrieves search results from Autonomy IDOL search engines supporting the OpenSearch standard.

12.14.3 Data source paging

Search Mode

Key search-mode
Direction Input
Level ADVANCED
Description Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
Required no
Scope Processing time
Value type org.carrot2.source.MultipageSearchEngine$SearchMode
Default value SPECULATIVE
Allowed values
  • CONSERVATIVE  (CONSERVATIVE)
  • SPECULATIVE  (SPECULATIVE)
Attribute builder MultipageSearchEngineDescriptor.​AttributeBuilder#searchMode()

12.14.4 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.Collection
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#documents()

12.14.5 Search query

Query

Key query
Direction Input
Level BASIC
Description Query to perform.
Required yes
Scope Processing time
Value type java.lang.String
Default value none
Value content Must not be blank
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#results()

Start index

Key start
Direction Input
Level ADVANCED
Description Index of the first document/search result to fetch. The index starts at zero.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 0
Min value 0
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#start()

12.14.6 Search result information

Compression used

Key SearchEngineBase.compressed
Direction Output
Description Indicates whether the search engine returned a compressed result stream.
Scope Processing time
Value type java.lang.Boolean
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#compressed()

Page requests

Key SearchEngineStats.pageRequests
Direction Output
Description Number of individual page requests issued by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#pageRequests()

Successful queries

Key SearchEngineStats.queries
Direction Output
Description Number of queries handled successfully by this data source.
Scope Processing time
Value type java.lang.Integer
Default value none
Attribute builder SearchEngineStatsDescriptor.​AttributeBuilder#queries()

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder SearchEngineBaseDescriptor.​AttributeBuilder#resultsTotal()

12.14.7 Service

IDOL server address

Key IdolDocumentSource.idolServerName
Direction Input
Level BASIC
Description URL of the IDOL Server.
Required yes
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#idolServerName()

IDOL server port

Key IdolDocumentSource.idolServerPort
Direction Input
Level BASIC
Description IDOL Server Port.
Required yes
Scope Initialization time and Processing time
Value type java.lang.Integer
Default value 0
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#idolServerPort()

IDOL XSL template name

Key IdolDocumentSource.xslTemplateName
Direction Input
Level ADVANCED
Description IDOL XSL Template Name. The reference of an IDOL XSL template that outputs the results in OpenSearch format.
Required yes
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#xslTemplateName()

Maximum results

Key IdolDocumentSource.maximumResults
Direction Input
Level BASIC
Description Maximum number of results. The maximum number of results the document source can deliver.
Required no
Scope Initialization time and Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#maximumResults()

Minimum score

Key IdolDocumentSource.minScore
Direction Input
Level BASIC
Description Minimum IDOL Score. The minimum score of the results returned by IDOL.
Required no
Scope Initialization time and Processing time
Value type java.lang.Integer
Default value 50
Min value 1
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#minScore()

Other IDOLSearch attributes

Key IdolDocumentSource.otherSearchAttributes
Direction Input
Level ADVANCED
Description Any other search attributes (separated by &) from the Autonomy Query Search API. Ensure that all attributes required by the XSL template that will be applied are provided.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#otherSearchAttributes()
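Since the additional attributes are passed as a single string separated by &, it can be useful to check the string before handing it to the document source. The sketch below splits such a string into name and value pairs. The class and method names and the attribute names in the usage note (ParamA, ParamB) are hypothetical placeholders, not real IDOL parameters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that splits an "a=b&c=d" attribute string into pairs. */
class OtherAttributesExample {
    static Map<String, String> parse(String otherSearchAttributes) {
        Map<String, String> result = new LinkedHashMap<>();
        if (otherSearchAttributes == null || otherSearchAttributes.trim().isEmpty()) {
            return result;
        }
        for (String pair : otherSearchAttributes.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray separators such as "a=1&&b=2"
            }
            int equals = pair.indexOf('=');
            // An attribute without "=" is kept with an empty value.
            String name = equals < 0 ? pair : pair.substring(0, equals);
            String value = equals < 0 ? "" : pair.substring(equals + 1);
            result.put(name, value);
        }
        return result;
    }
}
```

For example, parse("ParamA=1&ParamB=two") returns a map with two entries, ParamA mapped to 1 and ParamB mapped to two.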

Results per page

Key IdolDocumentSource.resultsPerPage
Direction Input
Level ADVANCED
Description Results per page. The number of results per page the document source will expect the feed to return.
Required yes
Scope Initialization time and Processing time
Value type java.lang.Integer
Default value 50
Min value 1
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#resultsPerPage()

User agent

Key IdolDocumentSource.userAgent
Direction Input
Level ADVANCED
Description User agent header. The contents of the User-Agent HTTP header to use when making requests to the feed URL. If an empty or null value is provided, the following User-Agent will be sent: Rome Client (http://tinyurl.com/64t5n) Ver: UNKNOWN.
Required no
Scope Initialization time and Processing time
Value type java.lang.String
Default value none
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#userAgent()

User name

Key IdolDocumentSource.userName
Direction Input
Level MEDIUM
Description User name to use for authentication.
Required no
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder IdolDocumentSourceDescriptor.​AttributeBuilder#userName()

12.15 Ambient Test Set

Serves documents from the Ambient test set. Ambient (AMBIguous ENTries) is a data set designed for evaluating subtopic information retrieval. It consists of 44 topics, each with a set of subtopics and a list of 100 ranked documents. For more information, please see: http://credo.fub.it/ambient.

12.15.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#documents()

12.15.4 Filtering

Include documents without topics

Key FubDocumentSource.includeDocumentsWithoutTopic
Direction Input
Level MEDIUM
Description Include documents without topics.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#includeDocumentsWithoutTopic()

Minimum topic size

Key FubDocumentSource.minTopicSize
Direction Input
Level MEDIUM
Description Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#minTopicSize()
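The two filtering attributes can be illustrated with a short sketch in which every document is reduced to its topic identifier and null stands for a document without a topic. The class and method names are hypothetical, and the code only mirrors the behaviour described above; it is not taken from the Carrot2 sources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of minTopicSize and includeDocumentsWithoutTopic. */
class TopicFilterExample {
    /** Each list element is the topic of one document; null means "no topic". */
    static List<String> filter(List<String> documentTopics, int minTopicSize,
                               boolean includeDocumentsWithoutTopic) {
        // Count how many documents belong to each topic.
        Map<String, Integer> topicSizes = new HashMap<>();
        for (String topic : documentTopics) {
            if (topic != null) {
                topicSizes.merge(topic, 1, Integer::sum);
            }
        }
        List<String> kept = new ArrayList<>();
        for (String topic : documentTopics) {
            if (topic == null) {
                if (includeDocumentsWithoutTopic) {
                    kept.add(null);
                }
            } else if (topicSizes.get(topic) >= minTopicSize) {
                // Topics smaller than minTopicSize are dropped entirely.
                kept.add(topic);
            }
        }
        return kept;
    }
}
```

With topics a, a, b and one document without a topic, a minimum topic size of 2 keeps only the two documents of topic a, plus the topic-less document if includeDocumentsWithoutTopic is true.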

12.15.5 Search query

Query

Key query
Direction Output
Description Query to perform.
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 100
Min value 1
Max value 100
Attribute builder AmbientDocumentSourceDescriptor.​AttributeBuilder#results()

12.15.6 Search result information

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder AmbientDocumentSourceDescriptor.​AttributeBuilder#resultsTotal()

12.15.7 Topic ID

Ambient Topic

Key AmbientDocumentSource.topic
Direction Input
Level BASIC
Description Ambient Topic. The Ambient Topic to load documents from.
Required yes
Scope Processing time
Value type org.carrot2.source.ambient.AmbientDocumentSource$AmbientTopic
Default value AIDA
Allowed values
  • AIDA  (Aida)
  • B_52  (B-52)
  • BEAGLE  (Beagle)
  • BRONX  (Bronx)
  • CAIN  (Cain)
  • CAMEL  (Camel)
  • CORAL_SEA  (Coral Sea)
  • CUBE  (Cube)
  • EOS  (Eos)
  • EXCALIBUR  (Excalibur)
  • FAHRENHEIT  (Fahrenheit)
  • GLOBE  (Globe)
  • HORNET  (Hornet)
  • INDIGO  (Indigo)
  • IWO_JIMA  (Iwo Jima)
  • JAGUAR  (Jaguar)
  • LA_PLATA  (La Plata)
  • LABYRINTH  (Labyrinth)
  • LANDAU  (Landau)
  • LIFE_ON_MARS  (Life on Mars)
  • LOCUST  (Locust)
  • MAGIC_MOUNTAIN  (Magic Mountain)
  • MATADOR  (Matador)
  • METAMORPHOSIS  (Metamorphosis)
  • MINOTAUR  (Minotaur)
  • MIRA  (Mira)
  • MIRAGE  (Mirage)
  • MONTE_CARLO  (Monte Carlo)
  • OPPENHEIM  (Oppenheim)
  • OUT_OF_CONTROL  (Out of Control)
  • PELICAN  (Pelican)
  • PURPLE_HAZE  (Purple Haze)
  • RAAM  (Raam)
  • RHEA  (Rhea)
  • SCORPION  (Scorpion)
  • THE_LITTLE_MERMAID  (The Little Mermaid)
  • TORTUGA  (Tortuga)
  • URANIA  (Urania)
  • WINK  (Wink)
  • XANADU  (Xanadu)
  • ZEBRA  (Zebra)
  • ZENITH  (Zenith)
  • ZODIAC  (Zodiac)
  • ZOMBIE  (Zombie)
Attribute builder AmbientDocumentSourceDescriptor.​AttributeBuilder#topic()

Topics and subtopics covered in the output documents

Key FubDocumentSource.topicIds
Direction Output
Description Topics and subtopics covered in the output documents. The set is computed for the output org.carrot2.source.ambient.FubDocumentSource.documents and it may vary for the same main topic depending, for example, on the number of requested results or on org.carrot2.source.ambient.FubDocumentSource.minTopicSize.
Scope Processing time
Value type java.util.Set
Default value none
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#topicIds()

12.16 ODP239 Test Set

Serves documents from the ODP239 test set. ODP239 is a data set designed for evaluating subtopic information retrieval. It consists of 239 topics extracted from the Open Directory Project, each with a set of subtopics and a list of about 100 documents. For more information, please see: http://credo.fub.it/odp239.

12.16.3 Documents

Documents

Key documents
Direction Output
Description Documents returned by the search engine/document retrieval system or documents passed as input to the clustering algorithm.
Scope Processing time
Value type java.util.List
Default value none
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#documents()

12.16.4 Filtering

Include documents without topics

Key FubDocumentSource.includeDocumentsWithoutTopic
Direction Input
Level MEDIUM
Description Include documents without topics.
Required no
Scope Processing time
Value type java.lang.Boolean
Default value false
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#includeDocumentsWithoutTopic()

Minimum topic size

Key FubDocumentSource.minTopicSize
Direction Input
Level MEDIUM
Description Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1
Min value 1
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#minTopicSize()

12.16.5 Search query

Query

Key query
Direction Output
Description Query to perform.
Scope Processing time
Value type java.lang.String
Default value none
Attribute builder FubDocumentSourceDescriptor.​AttributeBuilder#query()

Results

Key results
Direction Input
Level BASIC
Description Maximum number of documents or search results to fetch.
Required no
Scope Processing time
Value type java.lang.Integer
Default value 1000
Min value 1
Max value 1000
Attribute builder Odp239DocumentSourceDescriptor.​AttributeBuilder#results()

12.16.6 Search result information

Total results

Key results-total
Direction Output
Description Estimated total number of matching documents.
Scope Processing time
Value type java.lang.Long
Default value none
Attribute builder Odp239DocumentSourceDescriptor.​AttributeBuilder#resultsTotal()

12.16.7 Topic ID

ODP239 Topic

Key Odp239DocumentSource.topic
Direction Input
Level BASIC
Description ODP239 Topic. The ODP239 Topic to load documents from.
Required yes
Scope Processing time
Value type org.carrot2.source.ambient.Odp239DocumentSource$Odp239Topic
Default value ARTS_ANIMATION
Allowed values
  • ARTS_ANIMATION  (Arts > Animation)
  • ARTS_ARCHITECTURE  (Arts > Architecture)
  • ARTS_BODYART  (Arts > Bodyart)
  • ARTS_COMICS  (Arts > Comics)
  • ARTS_CRAFTS  (Arts > Crafts)
  • ARTS_EDUCATION  (Arts > Education)
  • ARTS_ILLUSTRATION  (Arts > Illustration)
  • ARTS_LITERATURE  (Arts > Literature)
  • ARTS_MOVIES  (Arts > Movies)
  • ARTS_MUSIC  (Arts > Music)
  • ARTS_ONLINE_WRITING  (Arts > Online Writing)
  • ARTS_PEOPLE  (Arts > People)
  • ARTS_PERFORMING_ARTS  (Arts > Performing Arts)
  • ARTS_PHOTOGRAPHY  (Arts > Photography)
  • ARTS_RADIO  (Arts > Radio)
  • ARTS_TELEVISION  (Arts > Television)
  • ARTS_VIDEO  (Arts > Video)
  • ARTS_VISUAL_ARTS  (Arts > Visual Arts)
  • ARTS_WRITERS_RESOURCES  (Arts > Writers Resources)
  • BUSINESS_AGRICULTURE_AND_FORESTRY  (Business > Agriculture and Forestry)
  • BUSINESS_ARTS_AND_ENTERTAINMENT  (Business > Arts and Entertainment)
  • BUSINESS_AUTOMOTIVE  (Business > Automotive)
  • BUSINESS_BUSINESS_SERVICES  (Business > Business Services)
  • BUSINESS_CHEMICALS  (Business > Chemicals)
  • BUSINESS_CONSTRUCTION_AND_MAINTENANCE  (Business > Construction and Maintenance)
  • BUSINESS_CONSUMER_GOODS_AND_SERVICES  (Business > Consumer Goods and Services)
  • BUSINESS_ECOMMERCE  (Business > E-Commerce)
  • BUSINESS_EDUCATION_AND_TRAINING  (Business > Education and Training)
  • BUSINESS_ELECTRONICS_AND_ELECTRICAL  (Business > Electronics and Electrical)
  • BUSINESS_ENERGY  (Business > Energy)
  • BUSINESS_FINANCIAL_SERVICES  (Business > Financial Services)
  • BUSINESS_FOOD_AND_RELATED_PRODUCTS  (Business > Food and Related Products)
  • BUSINESS_HEALTHCARE  (Business > Healthcare)
  • BUSINESS_HOSPITALITY  (Business > Hospitality)
  • BUSINESS_HUMAN_RESOURCES  (Business > Human Resources)
  • BUSINESS_INDUSTRIAL_GOODS_AND_SERVICES  (Business > Industrial Goods and Services)
  • BUSINESS_INFORMATION_TECHNOLOGY  (Business > Information Technology)
  • BUSINESS_INVESTING  (Business > Investing)
  • BUSINESS_MANAGEMENT  (Business > Management)
  • BUSINESS_MARKETING_AND_ADVERTISING  (Business > Marketing and Advertising)
  • BUSINESS_MATERIALS  (Business > Materials)
  • BUSINESS_OPPORTUNITIES  (Business > Opportunities)
  • BUSINESS_REAL_ESTATE  (Business > Real Estate)
  • BUSINESS_RETAIL_TRADE  (Business > Retail Trade)
  • BUSINESS_SMALL_BUSINESS  (Business > Small Business)
  • BUSINESS_TELECOMMUNICATIONS  (Business > Telecommunications)
  • BUSINESS_TEXTILES_AND_NONWOVENS  (Business > Textiles and Nonwovens)
  • BUSINESS_TRANSPORTATION_AND_LOGISTICS  (Business > Transportation and Logistics)
  • COMPUTERS_ALGORITHMS  (Computers > Algorithms)
  • COMPUTERS_ARTIFICIAL_INTELLIGENCE  (Computers > Artificial Intelligence)
  • COMPUTERS_ARTIFICIAL_LIFE  (Computers > Artificial Life)
  • COMPUTERS_CAD_AND_CAM  (Computers > CAD and CAM)
  • COMPUTERS_COMPANIES  (Computers > Companies)
  • COMPUTERS_COMPUTER_SCIENCE  (Computers > Computer Science)
  • COMPUTERS_CONSULTANTS  (Computers > Consultants)
  • COMPUTERS_DATA_COMMUNICATIONS  (Computers > Data Communications)
  • COMPUTERS_DATA_FORMATS  (Computers > Data Formats)
  • COMPUTERS_EMULATORS  (Computers > Emulators)
  • COMPUTERS_GRAPHICS  (Computers > Graphics)
  • COMPUTERS_HACKING  (Computers > Hacking)
  • COMPUTERS_HARDWARE  (Computers > Hardware)
  • COMPUTERS_INTERNET  (Computers > Internet)
  • COMPUTERS_MOBILE_COMPUTING  (Computers > Mobile Computing)
  • COMPUTERS_MULTIMEDIA  (Computers > Multimedia)
  • COMPUTERS_OPEN_SOURCE  (Computers > Open Source)
  • COMPUTERS_PARALLEL_COMPUTING  (Computers > Parallel Computing)
  • COMPUTERS_PROGRAMMING  (Computers > Programming)
  • COMPUTERS_ROBOTICS  (Computers > Robotics)
  • COMPUTERS_SECURITY  (Computers > Security)
  • COMPUTERS_SOFTWARE  (Computers > Software)
  • COMPUTERS_SPEECH_TECHNOLOGY  (Computers > Speech Technology)
  • COMPUTERS_SYSTEMS  (Computers > Systems)
  • COMPUTERS_USENET  (Computers > Usenet)
  • COMPUTERS_VIRTUAL_REALITY  (Computers > Virtual Reality)
  • GAMES_BOARD_GAMES  (Games > Board Games)
  • GAMES_GAMBLING  (Games > Gambling)
  • GAMES_MINIATURES  (Games > Miniatures)
  • GAMES_ROLEPLAYING  (Games > Roleplaying)
  • GAMES_TRADING_CARD_GAMES  (Games > Trading Card Games)
  • GAMES_VIDEO_GAMES  (Games > Video Games)
  • HEALTH_ALTERNATIVE  (Health > Alternative)
  • HEALTH_ANIMAL  (Health > Animal)
  • HEALTH_BEAUTY  (Health > Beauty)
  • HEALTH_CHILD_HEALTH  (Health > Child Health)
  • HEALTH_CONDITIONS_AND_DISEASES  (Health > Conditions and Diseases)
  • HEALTH_DENTISTRY  (Health > Dentistry)
  • HEALTH_FITNESS  (Health > Fitness)
  • HEALTH_MEDICINE  (Health > Medicine)
  • HEALTH_MENTAL_HEALTH  (Health > Mental Health)
  • HEALTH_NURSING  (Health > Nursing)
  • HEALTH_NUTRITION  (Health > Nutrition)
  • HEALTH_OCCUPATIONAL_HEALTH_AND_SAFETY  (Health > Occupational Health and Safety)
  • HEALTH_PROFESSIONS  (Health > Professions)
  • HEALTH_PUBLIC_HEALTH_AND_SAFETY  (Health > Public Health and Safety)
  • HEALTH_REPRODUCTIVE_HEALTH  (Health > Reproductive Health)
  • HEALTH_SENIOR_HEALTH  (Health > Senior Health)
  • HEALTH_WOMENS_HEALTH  (Health > Women's Health)
  • HOME_CONSUMER_INFORMATION  (Home > Consumer Information)
  • HOME_COOKING  (Home > Cooking)
  • HOME_FAMILY  (Home > Family)
  • HOME_GARDENING  (Home > Gardening)
  • HOME_HOME_IMPROVEMENT  (Home > Home Improvement)
  • HOME_PERSONAL_FINANCE  (Home > Personal Finance)
  • KIDS_AND_TEENS_ARTS  (Kids and Teens > Arts)
  • KIDS_AND_TEENS_ENTERTAINMENT  (Kids and Teens > Entertainment)
  • KIDS_AND_TEENS_GAMES  (Kids and Teens > Games)
  • KIDS_AND_TEENS_HEALTH  (Kids and Teens > Health)
  • KIDS_AND_TEENS_INTERNATIONAL  (Kids and Teens > International)
  • KIDS_AND_TEENS_PEOPLE_AND_SOCIETY  (Kids and Teens > People and Society)
  • KIDS_AND_TEENS_PRESCHOOL  (Kids and Teens > Pre-School)
  • KIDS_AND_TEENS_SCHOOL_TIME  (Kids and Teens > School Time)
  • KIDS_AND_TEENS_SPORTS_AND_HOBBIES  (Kids and Teens > Sports and Hobbies)
  • KIDS_AND_TEENS_TEEN_LIFE  (Kids and Teens > Teen Life)
  • NEWS_MEDIA  (News > Media)
  • NEWS_NEWSPAPERS  (News > Newspapers)
  • NEWS_WEATHER  (News > Weather)
  • RECREATION_ANTIQUES  (Recreation > Antiques)
  • RECREATION_AUDIO  (Recreation > Audio)
  • RECREATION_AUTOS  (Recreation > Autos)
  • RECREATION_AVIATION  (Recreation > Aviation)
  • RECREATION_BIRDING  (Recreation > Birding)
  • RECREATION_BOATING  (Recreation > Boating)
  • RECREATION_CAMPS  (Recreation > Camps)
  • RECREATION_CLIMBING  (Recreation > Climbing)
  • RECREATION_COLLECTING  (Recreation > Collecting)
  • RECREATION_FOOD  (Recreation > Food)
  • RECREATION_GUNS  (Recreation > Guns)
  • RECREATION_HUMOR  (Recreation > Humor)
  • RECREATION_KITES  (Recreation > Kites)
  • RECREATION_LIVING_HISTORY  (Recreation > Living History)
  • RECREATION_MODELS  (Recreation > Models)
  • RECREATION_MOTORCYCLES  (Recreation > Motorcycles)
  • RECREATION_OUTDOORS  (Recreation > Outdoors)
  • RECREATION_PETS  (Recreation > Pets)
  • RECREATION_ROADS_AND_HIGHWAYS  (Recreation > Roads and Highways)
  • RECREATION_SCOUTING  (Recreation > Scouting)
  • RECREATION_THEME_PARKS  (Recreation > Theme Parks)
  • RECREATION_TOBACCO  (Recreation > Tobacco)
  • RECREATION_TRAINS_AND_RAILROADS  (Recreation > Trains and Railroads)
  • REFERENCE_ARCHIVES  (Reference > Archives)
  • REFERENCE_DICTIONARIES  (Reference > Dictionaries)
  • REFERENCE_EDUCATION  (Reference > Education)
  • REFERENCE_KNOWLEDGE_MANAGEMENT  (Reference > Knowledge Management)
  • REFERENCE_LIBRARIES  (Reference > Libraries)
  • REFERENCE_MAPS  (Reference > Maps)
  • REFERENCE_MUSEUMS  (Reference > Museums)
  • REFERENCE_QUOTATIONS  (Reference > Quotations)
  • SCIENCE_AGRICULTURE  (Science > Agriculture)
  • SCIENCE_ANOMALIES_AND_ALTERNATIVE_SCIENCE  (Science > Anomalies and Alternative Science)
  • SCIENCE_ASTRONOMY  (Science > Astronomy)
  • SCIENCE_BIOLOGY  (Science > Biology)
  • SCIENCE_CHEMISTRY  (Science > Chemistry)
  • SCIENCE_EARTH_SCIENCES  (Science > Earth Sciences)
  • SCIENCE_EDUCATIONAL_RESOURCES  (Science > Educational Resources)
  • SCIENCE_ENVIRONMENT  (Science > Environment)
  • SCIENCE_INSTRUMENTS_AND_SUPPLIES  (Science > Instruments and Supplies)
  • SCIENCE_MATH  (Science > Math)
  • SCIENCE_PHYSICS  (Science > Physics)
  • SCIENCE_SCIENCE_IN_SOCIETY  (Science > Science in Society)
  • SCIENCE_SOCIAL_SCIENCES  (Science > Social Sciences)
  • SCIENCE_TECHNOLOGY  (Science > Technology)
  • SHOPPING_ANTIQUES_AND_COLLECTIBLES  (Shopping > Antiques and Collectibles)
  • SHOPPING_AUCTIONS  (Shopping > Auctions)
  • SHOPPING_CHILDREN  (Shopping > Children)
  • SHOPPING_CLASSIFIEDS  (Shopping > Classifieds)
  • SHOPPING_CLOTHING  (Shopping > Clothing)
  • SHOPPING_CONSUMER_ELECTRONICS  (Shopping > Consumer Electronics)
  • SHOPPING_CRAFTS  (Shopping > Crafts)
  • SHOPPING_ENTERTAINMENT  (Shopping > Entertainment)
  • SHOPPING_ETHNIC_AND_REGIONAL  (Shopping > Ethnic and Regional)
  • SHOPPING_FOOD  (Shopping > Food)
  • SHOPPING_GENERAL_MERCHANDISE  (Shopping > General Merchandise)
  • SHOPPING_GIFTS  (Shopping > Gifts)
  • SHOPPING_HEALTH  (Shopping > Health)
  • SHOPPING_HOME_AND_GARDEN  (Shopping > Home and Garden)
  • SHOPPING_JEWELRY  (Shopping > Jewelry)
  • SHOPPING_NICHE  (Shopping > Niche)
  • SHOPPING_PETS  (Shopping > Pets)
  • SHOPPING_PHOTOGRAPHY  (Shopping > Photography)
  • SHOPPING_PUBLICATIONS  (Shopping > Publications)
  • SHOPPING_RECREATION  (Shopping > Recreation)
  • SHOPPING_SPORTS  (Shopping > Sports)
  • SHOPPING_TOOLS  (Shopping > Tools)
  • SHOPPING_TOYS_AND_GAMES  (Shopping > Toys and Games)
  • SHOPPING_VEHICLES  (Shopping > Vehicles)
  • SHOPPING_VISUAL_ARTS  (Shopping > Visual Arts)
  • SOCIETY_ACTIVISM  (Society > Activism)
  • SOCIETY_CRIME  (Society > Crime)
  • SOCIETY_DISABLED  (Society > Disabled)
  • SOCIETY_ETHNICITY  (Society > Ethnicity)
  • SOCIETY_FUTURE  (Society > Future)
  • SOCIETY_GAY_LESBIAN_AND_BISEXUAL  (Society > Gay, Lesbian, and Bisexual)
  • SOCIETY_GENEALOGY  (Society > Genealogy)
  • SOCIETY_GOVERNMENT  (Society > Government)
  • SOCIETY_HISTORY  (Society > History)
  • SOCIETY_HOLIDAYS  (Society > Holidays)
  • SOCIETY_ISSUES  (Society > Issues)
  • SOCIETY_LAW  (Society > Law)
  • SOCIETY_LIFESTYLE_CHOICES  (Society > Lifestyle Choices)
  • SOCIETY_MILITARY  (Society > Military)
  • SOCIETY_ORGANIZATIONS  (Society > Organizations)
  • SOCIETY_PARANORMAL  (Society > Paranormal)
  • SOCIETY_PEOPLE  (Society > People)
  • SOCIETY_PHILANTHROPY  (Society > Philanthropy)
  • SOCIETY_PHILOSOPHY  (Society > Philosophy)
  • SOCIETY_POLITICS  (Society > Politics)
  • SOCIETY_RELATIONSHIPS  (Society > Relationships)
  • SOCIETY_RELIGION_AND_SPIRITUALITY  (Society > Religion and Spirituality)
  • SOCIETY_SEXUALITY  (Society > Sexuality)
  • SOCIETY_SUBCULTURES  (Society > Subcultures)
  • SOCIETY_SUPPORT_GROUPS  (Society > Support Groups)
  • SOCIETY_TRANSGENDERED  (Society > Transgendered)
  • SOCIETY_WORK  (Society > Work)
  • SPORTS_ADVENTURE_RACING  (Sports > Adventure Racing)
  • SPORTS_BASEBALL  (Sports > Baseball)
  • SPORTS_BASKETBALL  (Sports > Basketball)
  • SPORTS_BOWLING  (Sports > Bowling)
  • SPORTS_BOXING  (Sports > Boxing)
  • SPORTS_CHEERLEADING  (Sports > Cheerleading)
  • SPORTS_CRICKET  (Sports > Cricket)
  • SPORTS_CYCLING  (Sports > Cycling)
  • SPORTS_DISABLED  (Sports > Disabled)
  • SPORTS_EQUESTRIAN  (Sports > Equestrian)
  • SPORTS_FANTASY  (Sports > Fantasy)
  • SPORTS_GOLF  (Sports > Golf)
  • SPORTS_HOCKEY  (Sports > Hockey)
  • SPORTS_LACROSSE  (Sports > Lacrosse)
  • SPORTS_MARTIAL_ARTS  (Sports > Martial Arts)
  • SPORTS_MOTORSPORTS  (Sports > Motorsports)
  • SPORTS_PAINTBALL  (Sports > Paintball)
  • SPORTS_RESOURCES  (Sports > Resources)
  • SPORTS_RODEO  (Sports > Rodeo)
  • SPORTS_RUNNING  (Sports > Running)
  • SPORTS_SKATEBOARDING  (Sports > Skateboarding)
  • SPORTS_SOCCER  (Sports > Soccer)
  • SPORTS_TENNIS  (Sports > Tennis)
  • SPORTS_TRACK_AND_FIELD  (Sports > Track and Field)
  • SPORTS_VOLLEYBALL  (Sports > Volleyball)
  • SPORTS_WATER_SPORTS  (Sports > Water Sports)
Attribute builder Odp239DocumentSourceDescriptor.AttributeBuilder#topic()

Topics and subtopics covered in the output documents

Key FubDocumentSource.topicIds
Direction Output
Description Topics and subtopics covered in the output documents. The set is computed for the output org.carrot2.source.ambient.FubDocumentSource.documents and may vary for the same main topic depending on, for example, the requested number of results or org.carrot2.source.ambient.FubDocumentSource.minTopicSize.
Scope Processing time
Value type java.util.Set
Default value none
Attribute builder FubDocumentSourceDescriptor.AttributeBuilder#topicIds()