for version 3.17.0-SNAPSHOT
Copyright © 2002-2019 Stanisław Osiński, Dawid Weiss
Abstract
This document serves as documentation for the Carrot2 framework. It describes Carrot2 application suite and the API developers can use to integrate Carrot2 clustering algorithms into their code. It also provides a reference of all Carrot2 components and their attributes.
Carrot2 Online Demo: http://search.carrot2.org
Carrot2 website: http://project.carrot2.org
Carrot2 is a library and a set of supporting applications you can use to build a search results clustering engine. Such an engine will organize your search results into topics, fully automatically and without external knowledge such as taxonomies or preclassified content.
Carrot2 contains two document clustering algorithms designed specifically for search results clustering: Suffix Tree Clustering and Lingo. Carrot2 offers components for fetching data from search engines that provide the required APIs (for example Microsoft Bing or PubMed), as well as other sources of documents like Lucene, Apache Solr or ElasticSearch.
Carrot2 is not a search engine itself; it has no crawler or indexer. There are a number of open source projects you can use to crawl (Nutch) and to index and search (Lucene, Solr) your content, which can then be queried and clustered by Carrot2.
In most cases your workflow with Carrot2 applications would be the following:
Use Carrot2 Document Clustering Workbench and possibly other applications from Carrot2 application suite to see what the clustering results are like for your content. If the results are promising, you can use the Carrot2 Document Clustering Workbench to further tune the clustering algorithm's settings.
If you are developing Java software, use the Carrot2 API and JARs to integrate clustering into your code. For non-Java environments, set up the Carrot2 Document Clustering Server and call Carrot2 clustering using the REST protocol.
Chapter 2 answers the questions most frequently asked on Carrot2 mailing lists; it can also serve as a question-based index to the rest of this manual. Chapter 3 introduces applications available in the Carrot2 distribution and Chapter 4 shows how to quickly set up Carrot2 to cluster your own data. Chapter 5 discusses topics related to tuning Carrot2 clustering, while Chapter 7 shows how to customize Carrot2 applications. Chapter 8 covers some more advanced use cases of Carrot2 and Chapter 9 provides solutions to common problems. Finally, Chapter 10 discusses Carrot2 architecture and internals, while Chapter 12 is an in-depth reference of Carrot2 components.
This chapter answers the questions most frequently asked on Carrot2 mailing lists. As it extensively links to further sections of the manual, it can also be treated as a question-based index to this manual.
Can I use Carrot2 in a commercial project?

Yes. The only requirement is that you properly acknowledge the use of Carrot2 (on your project's website and documentation) and let us know about your project. Please also remember to read the license.

How can I acknowledge the use of Carrot2 on my site?

Please put a statement equivalent to “This product includes software developed by the Carrot2 Project” on your site and link it to Carrot2's website (http://www.carrot2.org). Additionally, you can use some of our powered-by logos if you like.

Can Carrot2 crawl my website?

No. Carrot2 can add clustering of search results to an existing search engine. You can use other open source projects (like Nutch or Heritrix) to crawl your website.

Can I use Carrot2 to cluster something other than search results?

Absolutely. Carrot2 came about as a framework for building search results clustering engines, but its algorithms should successfully cluster up to about a thousand text documents, a few paragraphs each.

How does Carrot2 clustering scale with respect to the number and length of documents?

The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb and depending on the algorithm, Carrot2 should successfully deal with up to a few thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

Can I force Carrot2 to cluster my documents into some predefined clusters / labels?

No. Assigning documents to a set of predefined categories is a problem called text classification / categorization, and Carrot2 was not designed to solve it. For text classification components you may want to see the LingPipe project.

Can Carrot2 cluster content in languages other than English?

Yes. Currently, Carrot2 can cluster content in 19 languages. Please note, however, that for some of the languages you may need to tune the stop words to achieve best results.

What is the query syntax in Carrot2?

As Carrot2 is not a search engine on its own, there is no common query syntax in Carrot2. The syntax depends on the underlying search engine you set Carrot2 to use, e.g. Bing, Solr, Lucene or any other. Carrot2 passes your query without any modifications to the search engine and clusters the results it returns. For this reason, any syntax supported by the search engine is automatically supported in Carrot2.

Which Carrot2 clustering algorithm is the best?

There is no one clear answer to this question. The choice of the algorithm depends on the input data and the desired characteristics of clusters. Please see Section 5.2 for some guidelines.

Does Carrot2 support boolean querying?

If the underlying search engine supports boolean queries, so will Carrot2. Please see the question on query syntax above for more details.

What is the most suitable content for clustering in Carrot2?

Please see Section 5.1 for the answer.

How can I remove meaningless cluster labels?

Occasionally, Carrot2 may create meaningless cluster labels like read or site. Please see Section 5.5 for information on how to remove them.

How can I improve the performance of Carrot2?

Please see Section 5.7 for some clustering performance tips.
Carrot2 comes with a suite of tools and APIs that you can use to quickly set up clustering on your own data, tune clustering results, call Carrot2 clustering from your Java or C# code or access Carrot2 clustering as a remote service.
Carrot2 distribution contains the following elements:
Carrot2 Document Clustering Workbench which is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data,
Carrot2 Java API for calling Carrot2 document clustering from your Java code,
Carrot2 C# API for calling Carrot2 document clustering from your C# or .NET code,
Carrot2 Document Clustering Server which exposes Carrot2 clustering as a REST service,
Carrot2 Command Line Interface applications which allow invoking Carrot2 clustering from command line,
Carrot2 Web Application which exposes Carrot2 clustering as a web application for end users.
Carrot2 Document Clustering Workbench is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data.
You can use Carrot2 Document Clustering Workbench to:
Quickly test Carrot2 clustering with your own data. Please see Chapter 4 for instructions for the most common scenarios.
Fine tune Carrot2 clustering algorithms' settings to work best with your specific data. Please see Chapter 5 for more details.
Run simple performance benchmarks using different settings to predict maximum clustering throughput on a single machine. Please see Section 5.8 for details.
Carrot2 Document Clustering Workbench features include:
Various document sources included. Carrot2 Document Clustering Workbench can fetch and cluster documents from a number of sources, including major search engines, indexing engines (Lucene, Solr) as well as generic XML feeds and files.
Live tuning of clustering algorithm attributes. Carrot2 Document Clustering Workbench enables modifying clustering algorithm's attributes and observing the results in real time.
Performance benchmarking. Carrot2 Document Clustering Workbench can run simple performance benchmarks of Carrot2 clustering algorithms.
Attractive visualizations. Carrot2 Document Clustering Workbench comes with two visualizations of the cluster structure, one developed within the Carrot2 project and another one from Aduna Software.
Modular architecture and extendability. Carrot2 Document Clustering Workbench is based on Eclipse Rich Client Platform, which makes it easily extendable.
To run Carrot2 Document Clustering Workbench:
Download and install Java Runtime Environment (version 1.8 or newer) if you have not done so.
Download Carrot2 Document Clustering Workbench Windows binaries or Linux binaries and extract the archive to some local disk location.
Run carrot2-workbench.exe (Windows) or carrot2-workbench (Linux).
The Carrot2 Java API package contains Carrot2 JAR files along with all dependencies, JavaDoc API reference and Java code examples. You can use this package to integrate Carrot2 clustering into your Java software. Please see Section 4.3.1 and Section 4.3.3 for instructions.
The Carrot2 C# API package contains all DLL libraries required to run Carrot2, C# API reference and code examples. You can use this package to integrate Carrot2 clustering into your C# / .NET software. Please see Section 4.3.5 for instructions.
Carrot2 Document Clustering Server (DCS) exposes Carrot2 clustering as a REST service. It can cluster documents from an external source (e.g. a search engine) or documents provided directly as an XML stream and returns results in XML or JSON formats.
You can use Carrot2 Document Clustering Server to:
Integrate Carrot2 with your non-Java software.
Build a high-throughput document clustering system by setting up a number of load-balanced instances of the DCS.
Carrot2 Document Clustering Server features include:
XML and JSON response formats. Carrot2 Document Clustering Server can return results both in XML and JSON formats. JSON-P (with callback) is also supported.
Various document sources included. Carrot2 Document Clustering Server can fetch and cluster documents from a large number of sources, including major search engines and indexing engines (Lucene, Solr).
Direct XML feed. Carrot2 Document Clustering Server can cluster documents fed directly in a simple XML format.
PHP and C# examples included. Carrot2 Document Clustering Server ships with ready-to-use examples of calling Carrot2 DCS services from PHP (version 5), C#, Ruby, Java and curl.
Quick start screen. A simple quick start screen will let you make your first DCS request straight from your browser.
To run Carrot2 Document Clustering Server:
Download and install Java Runtime Environment (version 1.8.0 or newer) if you have not done so.
Download Carrot2 Document Clustering Server binaries and extract the archive to some local disk location.
Run dcs.cmd (Windows) or dcs.sh (Linux).
Point your browser to http://localhost:8080 for further instructions.

See the examples/ directory in the distribution archive for PHP, C#, Ruby and Java code examples. You can also invoke DCS clustering using the curl command.
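For example, a single curl command can request clustering of search results fetched by the DCS. This is a sketch: the dcs.source identifier and parameter names follow the DCS code examples shipped with the distribution, but the identifiers available in your version are listed on the quick start screen.

```shell
# Cluster search results fetched by the DCS from the "etools" document
# source and save the JSON response to a local file.
curl http://localhost:8080/dcs/rest \
     -F "dcs.source=etools" \
     -F "query=data mining" \
     -F "dcs.output.format=JSON" \
     -o clusters.json
```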
If you need to start the DCS at a port different than 8080, you can use the -port option:

dcs -port 9090

To deploy the DCS in an external servlet container, such as Apache Tomcat, use the carrot2-dcs.war file from the war/ folder of the DCS distribution.
Carrot2 Web Application exposes Carrot2 clustering as a web application for end users. It allows users to browse clusters using a conventional tree view, but also in an attractive visualization.
Carrot2 Web Application features include:
Two cluster views. Carrot2 Web Application offers two views of the clusters generated by Carrot2: conventional tree view and spatial visualizations.
All Carrot2 document sources and algorithms included. Carrot2 Web Application contains a large number of document sources, including major search engines. Optionally, further document sources can be added, such as Lucene or Solr ones. It also contains all Carrot2's clustering algorithms.
XSLT and JavaScript-based presentation layer. Look & feel of the Carrot2 Web Application can be easily changed by editing a number of XSLT style sheets. All common style sheets and JavaScripts can be re-used when implementing a new look & feel.
High-performance front-end. The front-end of the Carrot2 Web Application has been optimized for fast loading by using such techniques as JavaScript and CSS merging and minification, as well as using CSS sprites.
To run Carrot2 Web Application:
Make sure you have access to a Servlet API 2.4 compliant container, such as Apache Tomcat.
Download Carrot2 Web Application WAR file.
Deploy the WAR file to your servlet container.
Carrot2 Command Line Interface (CLI) is a set of applications that allow invoking Carrot2 clustering from the command line. Currently, the only available CLI application is Carrot2 Batch Processor, which performs Carrot2 clustering on one or more files in the Carrot2 XML format and saves the results as XML or JSON. Apart from clustering large numbers of document sets at one time, you can use the Carrot2 Batch Processor to integrate Carrot2 with your non-Java applications.
To run Carrot2 Batch Processor:
Download and install Java Runtime Environment (version 1.8.0 or newer) if you have not done so.
Download Carrot2 Command Line Interface binaries and extract the archive to some local disk location.
Run batch.cmd (Windows) or batch.sh (Linux) for an overview of the syntax. The Carrot2 Batch Processor ships with two example input data sets located in the input/ directory.
Below is a list of some common example invocations.
To cluster one or more input files, specify their paths:
batch input/data-mining.xml input/seattle.xml
Clustering will be performed using the default clustering algorithm and the results, in XML format, will be saved to the output directory relative to the current working directory.
You can also cluster files from one or more directories:
batch input/
Each directory will be processed recursively, i.e. including subdirectories. For each specified input directory, a corresponding directory with results will be created in the output directory.
To save results in a non-default directory, use the -o option:

batch input/ -o results

To repeat the input documents in the output, use the -d option:

batch input/ -d

To save the results in JSON, use the -f JSON option:

batch input/ -f JSON

To use a different clustering algorithm, use the -a option followed by the identifier of the algorithm:

batch input/ -a url

To see the list of available algorithm identifiers, run the application without arguments.

In case of processing errors, you can use the -v option to see detailed messages and stack traces.
Carrot2 clustering can be performed directly within Solr by means of the Solr Clustering Component contrib extension.
A whitepaper discussing several integration strategies between Solr and Carrot2 clustering algorithms can be found at a separate GitHub repository.
Carrot2 search results clustering can be performed directly in ElasticSearch by installing the dedicated elasticsearch-carrot2 plugin. The plugin's installation instructions are described in detail on its GitHub site. The API documentation is rendered dynamically once the plugin is installed (see the installation instructions).
This chapter will show you how to use Carrot2 in a number of typical scenarios such as trying clustering on your own documents or integrating Carrot2 with your software.
All Carrot2 applications require Java Runtime Environment version 1.8 or later. The Carrot2 Document Clustering Workbench is distributed for Windows, Linux and Mac OS X.
The Carrot2 C# API package requires the .NET Framework version 3.5 or later; it does not require a Java Runtime Environment.
This section shows how to apply Carrot2 clustering on documents from various sources.
To try Carrot2 clustering on results from search engines (such as Microsoft Bing), you can either use the Carrot2 Web Application, or use the Carrot2 Document Clustering Workbench, which can fetch and cluster documents from the same search engines as the Carrot2 Web Application.
To try Carrot2 clustering on documents or search results stored in a single XML file you can use the Carrot2 Document Clustering Workbench.
In the Search view of Carrot2 Document Clustering Workbench, choose XML source.
Set path to your XML file in the XML Resource field.
(Optional) If your file is not in Carrot2 format, create an XSLT style sheet that transforms your data into Carrot2 format, see Section 4.2.3 for an example. Provide a path to your style sheet in the XSLT Stylesheet field in the Medium section.
If you know the query that generated the documents in your XML file, you can provide it in the Query field, which may improve the clustering results. Press the Process button to see the results.
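For reference, a minimal input file in the Carrot2 XML format might look like the sketch below (the query element is optional; the titles, snippets and URLs are made-up placeholders):

```xml
<searchresult>
  <query>seattle</query>
  <document>
    <title>Seattle - Example Encyclopedia</title>
    <snippet>Seattle is a seaport city on the West Coast of the United States.</snippet>
    <url>http://example.com/seattle</url>
  </document>
  <document>
    <title>Visit Seattle</title>
    <snippet>A guide to hotels, restaurants and events in Seattle.</snippet>
    <url>http://example.com/visit</url>
  </document>
</searchresult>
```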
To try Carrot2 clustering on documents or search results fetched from a remote XML feed, you can use the Carrot2 Document Clustering Workbench. As an example, we will cluster a news feed from BBC:
In the Search view of Carrot2 Document Clustering Workbench, choose XML source.
Set the URL of your XML feed in the XML Resource field. Optionally, the URL can contain two special placeholders that will be replaced with the Query and Results number you set in the search view.
In our example, we will use the BBC News RSS feed.
Create an XSLT style sheet that will transform the XML feed into Carrot2 format. For the news feed we can use the stylesheet shown in Figure 4.2. To add more colour to our results, the XSLT transform extracts thumbnail URLs from the feed and passes them to Carrot2 in a special attribute. Attributes that are a sequence of values can be embedded as shown in Figure 4.3.
Provide a path to the transformation style sheet in the XSLT Stylesheet field in the Medium section.
Press the Process button to see the results.
Figure 4.2 News feed XML to Carrot2 format transformation
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:media="http://search.yahoo.com/mrss">

  <xsl:output indent="yes" omit-xml-declaration="no"
      media-type="application/xml" encoding="UTF-8" />

  <xsl:template match="/">
    <searchresult>
      <xsl:apply-templates select="/rss/channel/item" />
    </searchresult>
  </xsl:template>

  <xsl:template match="item">
    <document>
      <title><xsl:value-of select="title" /></title>
      <snippet>
        <xsl:value-of select="description" />
      </snippet>
      <url><xsl:value-of select="link" /></url>
      <xsl:if test="media:thumbnail">
        <field key="thumbnail-url">
          <value type="java.lang.String" value="{media:thumbnail/@url}"/>
        </field>
      </xsl:if>
    </document>
  </xsl:template>

</xsl:stylesheet>
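If you want to sanity-check a stylesheet like this before plugging it into the Workbench, you can run it through the JDK's built-in XSLT processor. The sketch below uses a trimmed-down version of the stylesheet (thumbnail handling omitted) and a made-up RSS fragment in place of the real feed:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FeedTransformCheck {

    // Trimmed-down version of the news-feed stylesheet above.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output indent='yes' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<searchresult><xsl:apply-templates select='/rss/channel/item'/></searchresult>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<document>"
      + "<title><xsl:value-of select='title'/></title>"
      + "<snippet><xsl:value-of select='description'/></snippet>"
      + "<url><xsl:value-of select='link'/></url>"
      + "</document>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // A tiny made-up RSS fragment standing in for the real feed.
    static final String RSS =
        "<rss><channel><item>"
      + "<title>Example headline</title>"
      + "<description>Example summary.</description>"
      + "<link>http://example.com/story</link>"
      + "</item></channel></rss>";

    // Apply the stylesheet to the feed and return the resulting XML.
    public static String transform() throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(RSS)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform());
    }
}
```

The printed output should be a searchresult document in the Carrot2 format; once it looks right, point the Workbench's XSLT Stylesheet field at the full stylesheet file.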
To try Carrot2 clustering on documents from a local Lucene index, you can use Carrot2 Document Clustering Workbench:
In the Search view of Carrot2 Document Clustering Workbench, choose Lucene source.
Choose the path to your Lucene index in the Index directory field.
In the Medium section, choose fields from your Lucene index in at least one of Document title field and Document content field combo boxes.
Type a query and press the Process button to see the results.
To try Carrot2 clustering on documents from an instance of Apache Solr, you can use Carrot2 Document Clustering Workbench:
In the Search view of Carrot2 Document Clustering Workbench, choose Solr source.
In the Advanced section, provide the URL at which your Solr instance is available in the Service URL field.
In the Medium section, provide the fields that should be used as document title, content and URL (optional) in the Title field name, Summary field name and URL field name fields, respectively.
Type a query and press the Process button to see the results.
Carrot2 clustering can also be performed directly within Solr by means of Solr's Carrot2 Clustering Component.
To save documents and/or clusters produced by Carrot2 for further processing:
Use Carrot2 Document Clustering Workbench to perform clustering on documents from the source of your choice.
Use the File > Save as... dialog to save the documents and/or clusters into a file in the Carrot2 XML format.
Saving documents into XML can be particularly useful when there is a need to capture the output of some remote or non-public document source to a local file, which can be then passed on to someone else for further inspection. Documents saved into XML can be opened for clustering within Carrot2 Document Clustering Workbench using the XML document source.
The easiest way to integrate Carrot2 with your Java programs is to use the Carrot2 Java API package:
Download Carrot2 Java API and unpack it to some local directory.
Make sure that carrot2-core.jar and all JARs from the lib/ directory are available in the classpath of your program.

Look in the examples/ directory for some sample code. Good places to start are ClusteringDocumentList and ClusteringDataFromDocumentSources.

For a complete description of the Carrot2 Java API, please see the Javadoc documentation in the javadoc/ directory.

You can use the build.xml Ant script to compile and run code from the examples/ directory.
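As a minimal sketch of what such an integration looks like (mirroring the ClusteringDocumentList example; it assumes carrot2-core.jar and the JARs from lib/ are on the classpath, and the document titles, snippets and URLs below are made up):

```java
import java.util.ArrayList;
import java.util.List;

import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class ClusteringExample {
    public static void main(String[] args) {
        // Prepare input documents: title, snippet, URL.
        List<Document> documents = new ArrayList<Document>();
        documents.add(new Document("Data Mining - Overview",
            "Data mining is the process of discovering patterns in data.",
            "http://example.com/1"));
        documents.add(new Document("KDD Conference",
            "The premier conference on knowledge discovery and data mining.",
            "http://example.com/2"));

        // A simple controller is enough for one-off, single-threaded clustering.
        Controller controller = ControllerFactory.createSimple();

        // Cluster the documents with Lingo; the query hint is optional.
        ProcessingResult result = controller.process(
            documents, "data mining", LingoClusteringAlgorithm.class);

        // Print the top-level cluster labels with their sizes.
        for (Cluster cluster : result.getClusters()) {
            System.out.println(cluster.getLabel()
                + " (" + cluster.getAllDocuments().size() + " documents)");
        }
    }
}
```

For long-running, multi-threaded applications, the bundled examples use a pooling controller rather than the simple one; see the Javadoc documentation for the trade-offs.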
For easier experimenting with Carrot2 Java API, you may want to set up a Carrot2 project in Eclipse IDE.
To add Carrot2 as a dependency to an existing Maven2 project:
Add the following fragment to the dependencies section of your pom.xml:
<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>carrot2-core</artifactId>
  <version>3.17.0-SNAPSHOT</version>
</dependency>
You should peek at the POM file above and enable optional dependencies if required. For example, to enable Polish stemming, Morfologik should be added to the dependencies section of your pom.xml (the version should match Carrot2's POM information):
<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>morfologik-stemming</artifactId>
  <version>...</version>
</dependency>
To support snapshot builds, add the following fragment to the repositories section of your pom.xml:
<repository>
  <id>sonatype-nexus-public</id>
  <name>SonaType public snapshots and releases repository</name>
  <url>https://oss.sonatype.org/content/groups/public</url>
  <releases>
    <!-- set to true if you wish to fetch releases from this repo too. -->
    <enabled>false</enabled>
  </releases>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>
Carrot2 Java API examples can be easily set up in Eclipse IDE. The description below assumes you are using Eclipse IDE version 3.4 or newer.
Download Carrot2 Java API and unpack it to some local directory.
In your Eclipse IDE choose File > New > Java Project.
In the New Java Project dialog (Figure 4.6), type a name for the new project, e.g. carrot2-examples. Then choose the Create project from existing source option, provide the directory to which you unpacked the Carrot2 Java API archive and click Finish.
When Eclipse compiles the example classes, you can open one of them, e.g. ClusteringDocumentList, and choose Run > Run As > Java Application. The output of the example program should be visible in the Console view.
To set up Carrot2 source code, you will need Eclipse IDE version 3.5 or later with the Plug-in Development Environment (PDE). The required plugins are available e.g. in the Eclipse for Plug-in Developers and Eclipse Classic distributions available at http://www.eclipse.org/downloads.
Check out Carrot2 source code using git:
git clone git://github.com/carrot2/carrot2.git
In the Package Explorer view in Eclipse IDE, choose Import... (see Figure 4.7), select General > Existing Projects into Workspace and click Next.
In the Import projects dialog, provide your local Carrot2 checkout directory in the Select root directory field. Uncheck the org.carrot2.antlib project (see Figure 4.8) and click Finish.
All Carrot2 source code should compile without errors. If it does not:
Make sure your Eclipse's Java compiler compliance level is set to 1.5 or higher (Preferences > Java > Compiler).
Make sure your Eclipse's workspace encoding is set to UTF-8 (Preferences > General > Workspace > Text file encoding).
The easiest way to integrate Carrot2 with your C# / .NET programs is to use the Carrot2 C# API package:
Make sure you have .NET framework version 3.5 or later installed in your environment.
Download Carrot2 C# API and unpack it to some local directory.
Compile example code based on the provided msbuild project file:

cd examples
C:\Windows\Microsoft.NET\Framework\v4.0.30319\msbuild Carrot2.Examples.csproj

Try running the executable files generated in the examples\ folder.
The provided msbuild project is not directly compatible with Visual Studio. To create a Carrot2 project in Visual Studio, import the example source code and all the referenced DLLs into an existing or newly created project.
To integrate Carrot2 with your non-Java system, you can use the Carrot2 Document Clustering Server, which exposes Carrot2 clustering as a REST/XML service. Please see Section 3.4.1 for installation instructions and the examples/ directory in the distribution archive for example code in PHP, C# and Ruby.
Carrot2 clustering requires a number of JAR files to run. The required JARs are available in the lib/required/ folder of the Carrot2 Java API package. Some of the JARs may not be required in certain specific situations:

log4j, slf4j-log4j Required only if using the Log4j logging framework. If your code uses a different logging framework, add a corresponding SLF4J binding to your classpath.
A number of optional JARs can be used to increase the quality of clustering in certain languages or to fetch search results from external sources. The purpose of the optional JARs is the following:
commons-codec, httpclient, httpcore, httpmime Used by document sources that fetch results from remote search engines, such as Bing7DocumentSource.
lucene-core, lucene-highlighter, lucene-memory Used by the LuceneDocumentSource.
rome, rome-fetcher, jdom Used by the OpenSearchDocumentSource.
lucene-analyzers-* Required for clustering Chinese and Thai content.
lucene-analyzers Required for clustering Arabic content.
morfologik-stemming Required for clustering Polish content.
This chapter discusses a number of typical fine-tuning scenarios for Carrot2 clustering algorithms. Some of the scenarios are relevant to all Carrot2 algorithms, while others are specific to individual algorithms.
The quality of clusters and their labels largely depends on the characteristics of documents provided on the input. Although there is no general rule for optimum document content, below are some tips worth considering.
Carrot2 is designed for small to medium collections of documents. The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.
Provide a minimum of 20 documents. Carrot2 clustering algorithms will work best with a set of documents similar to what is normally returned by a typical search engine. While about 20 is the minimum number of documents you can reasonably cluster, the optimum would fall in the 100 – 500 range.
Provide contextual snippets if possible. If the input documents are a result of some search query, provide contextual snippets related to that query, similar to what web search engines return, instead of full document content. Not only will this speed up processing, but also should help the clustering algorithm to cover the full spectrum of topics dealt with in the search results.
Minimize "noise" in the input documents. All kinds of "noise" in the documents, such as truncated sentences (sometimes resulting from contextual snippet extraction suggested above) or random alphanumerical strings may decrease the quality of cluster labels. If you have access to e.g. a few sentences' abstract of each document, it is worth checking the quality of clustering based on those abstracts. If you can combine this with the previous tip, i.e. extract complete sentences matching user's query, this should improve the clusters even further.
Let us once again stress that there are no definite generic guidelines for the best content for clustering; it is always worth experimenting with different combinations. You can also describe your specific application on the Carrot2 mailing list and ask for advice.
Currently, Carrot2 offers two specialized search results clustering algorithms, Lingo and STC, as well as an implementation of bisecting k-means clustering. The algorithms differ in terms of the main clustering principle and hence have different quality and performance characteristics. This section briefly describes the algorithms and provides some recommendations for choosing the most suitable one.
The key characteristic of the Lingo algorithm is that it reverses the traditional clustering pipeline: it first identifies cluster labels and only then assigns documents to the labels to form final clusters. To find the labels, Lingo builds a term-document matrix for all input documents and decomposes the matrix to obtain a number of base vectors that well approximate the matrix in a low-dimensional space. Each such vector gives rise to one cluster label. To complete the clustering process, each label is assigned documents that contain the label's words.
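In matrix terms, the label-discovery step above can be sketched as follows (a simplified view; the actual implementation adds phrase extraction and pruning stages):

```latex
% A is the term-document matrix: t terms, d documents, weighted entries
A \in \mathbb{R}^{t \times d}
% Low-rank decomposition (e.g. truncated SVD); k is the number of labels
A \approx U_k \Sigma_k V_k^{T}, \qquad k \ll \min(t, d)
% Each column u_i of U_k is one abstract topic (base vector); its label
% is the candidate word or phrase p whose term vector v_p lies closest:
\ell_i = \operatorname{arg\,max}_{p} \; \cos(u_i, v_p)
```

Documents containing the words of label \(\ell_i\) are then assigned to cluster \(i\), completing the process described above.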
The key data structure used in the Suffix Tree Clustering (STC) algorithm is a Generalized Suffix Tree (GST) built for all input documents. The algorithm traverses the GST to identify words and phrases that occurred more than once in the input documents. Each such word or phrase gives rise to one base cluster. The last stage of the clustering process is merging base clusters to form the final clusters.
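The base-cluster merging step can be sketched in pseudocode as follows (thresholds as in the original STC formulation by Zamir and Etzioni; Carrot2's actual parameters are configurable and may differ):

```
# each suffix-tree node holds a phrase P and the set of documents B
# in which P occurs; phrases occurring in 2+ documents form base clusters
for each node (P, B) in the generalized suffix tree:
    if |B| >= 2: create base cluster (P, B)

# merge base clusters that cover nearly the same documents
for each pair of base clusters (B1, B2):
    if |B1 ∩ B2| / |B1| > 0.5 and |B1 ∩ B2| / |B2| > 0.5:
        add edge B1 -- B2

final clusters = connected components of the base-cluster graph
```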
The two algorithms have two features in common. They both create overlapping clusterings, in which one document can be assigned to more than one cluster. Also, in the case of both algorithms a certain number of documents can remain unclustered and fall into the Other Topics group.
Bisecting k-means is a generic clustering algorithm that can also be applied to clustering textual data. As opposed to Lingo and STC, bisecting k-means creates non-overlapping clusters and does not produce the Other Topics group. Its current limitation is that it labels clusters using individual words, so not all of a cluster's documents may correspond to the words included in the cluster label.
Table 5.1 compares the characteristics of Lingo, STC and k-means under their default settings and Figure 5.1 shows clusters generated by Lingo and STC for data mining search results.
Table 5.1 Characteristics of Lingo and STC clustering algorithms
Feature | Lingo | STC | k-means
---|---|---|---
Cluster diversity | High; many small (outlier) clusters highlighted | Low; small (outlier) clusters rarely highlighted | Low; small (outlier) clusters rarely highlighted
Cluster labels | Longer, often more descriptive | Shorter, but still appropriate | One word only; may not always describe all documents in the cluster
Scalability | Low. For more than about 1000 documents, Lingo clustering will take a long time and large memory. | High | Low; based on similar data structures as Lingo.
It is difficult to give one clear recommendation as to which algorithm is "better". Many people feel Lingo delivers better-formed and more diverse clusters at the cost of lower performance and scalability. The ultimate judgment, however, should be based on evaluation with the specific document collection. Table 5.2 highlights the scenarios for which the algorithms are best suited.
Table 5.2 Optimum usage scenarios for Lingo and STC
Feature | Use Lingo | Use STC | Use k-means |
---|---|---|---|
Well-formed longer labels required | ✓ | | |
Highlighting of small (outlier) clusters required | ✓ | | |
High clustering performance or large document set processing required | | ✓ | |
Need non-overlapping clusters | | | ✓ |
The bottom line is: use Lingo, unless you need high-performance clustering of document sets larger than 1000 documents or need non-overlapping clusters.
For a more scientifically-oriented discussion and evaluation of the two algorithms, please see the publications on the Carrot2 website.
Carrot Search, a company founded by Carrot2 authors, offers a commercial document clustering engine called Lingo3G that produces Lingo-quality hierarchical clusters at a better-than-STC speed. Please contact Carrot Search for details.
The best tool for experimenting and tuning Carrot2 clustering is the Carrot2 Document Clustering Workbench. Figure 5.2 shows the main components involved in the tuning process.
Figure 5.2 Tuning clustering in Carrot2 Document Clustering Workbench
The results editor presents documents and clusters. Changes made in the Attributes view will affect the currently active results editor.

The Attributes view, where you can see and change values of the clustering algorithm's attributes.

The Attribute Info view, which shows documentation for specific attributes. Hold the mouse pointer over an attribute's label to see its documentation.
Opening the Attributes view. By default, the Attributes view appears on the right-hand side of the Carrot2 Document Clustering Workbench. You can open the view at any time by choosing Window > Show view > Attributes.
Setting modified attributes as default for new queries. If you modified a number of attributes for an algorithm and would like to use the modified values for new queries, choose the Set as defaults for new queries from the Attributes view's context menu (Figure 5.3).
Restoring default attribute values. To reset the attributes to their default values, choose the Reset to defaults option from the Attributes view's context menu (Figure 5.3). To bring the attributes back to their factory defaults, choose the Reset to factory defaults option.
Loading and saving attribute values to XML. To load or save attribute values to an XML file, use the Open... and Save as... options available from the Attributes view's menu bar.
Accessing attribute documentation. To see the documentation for a specific attribute, hold the mouse pointer over the attribute's label and its documentation will show in the Attribute Info view.
Please see Section 6.3 and Section 6.2 of Chapter 6 for details.
Please see Section 6.4 and Section 6.2 of Chapter 6 for details.
The Other Topics cluster contains documents that do not belong to any other cluster generated by the algorithm. Depending on the input documents, the size of this cluster may vary from a few to tens of documents.
By tuning the parameters of the clustering algorithm you can reduce the number of unclustered documents; bringing the number down to zero, however, is unachievable in most cases. Please note that minimizing the Other Topics cluster size is usually achieved by forcing the algorithm to create more clusters, which may degrade the perceived clustering quality.
The easiest way to try different clustering algorithm settings is to use the Carrot2 Document Clustering Workbench.
To reduce the size of the Other Topics cluster generated by Lingo, you can try applying the following settings:
Change the Factorization method attribute to LocalNonnegativeMatrixFactorizationFactory.
Increase the Cluster count base above the default value.
Decrease the Phrase label boost. Note that this will increase the number of one-word labels, which may not always be desirable.
To apply the changes to the Carrot2 applications, please follow instructions from Chapter 7.
As a rule of thumb, the more documents you put on input and the longer the documents are, the longer clustering will take. Interestingly, in many cases short document excerpts (such as contextual snippets for search results, or titles, abstracts or the first couple of sentences of other documents) may work just as well or even better than full documents. Hence the two most important performance tuning tips:
Reduce the size of the input documents. You can achieve this in a few ways:
Rather than full text of documents, use their titles and abstracts, if available.
In case of search results, use the contextual snippet rather than the full document text. Not only will this improve clustering performance, but it will very likely increase the quality of clusters as well because you will be clustering specifically the fragments the users asked for in their query.
If you don't have document abstracts, but have access to some automatically generated summaries, use them. Otherwise, try clustering the title and the first few sentences of each document.
In certain cases you may get decent clustering results with document titles only; this variant is worth trying, too.
Reduce the number of input documents. While removing a large part of the input document set may not always be an option, in many cases dividing the input into two or more batches, clustering them separately and then merging based on cluster label text may give reasonable results. The downside of this approach is that very small clusters, containing just a few documents, are likely to be lost in the process.
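The first tip can be as simple as trimming each document before it reaches the clustering pipeline. The snippet below is a deliberately naive, illustrative sentence truncation (not part of the Carrot2 API; the sentence-splitting heuristic is crude):

```java
public class InputTruncation {
    // Naive illustration: keep only the first maxSentences sentences of a document.
    // Splits on whitespace that follows sentence-ending punctuation.
    static String firstSentences(String text, int maxSentences) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(maxSentences, sentences.length); i++) {
            if (i > 0) sb.append(' ');
            sb.append(sentences[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String doc = "Carrot2 clusters search results. It is written in Java. "
                   + "This sentence will be dropped.";
        System.out.println(firstSentences(doc, 2));
        // -> "Carrot2 clusters search results. It is written in Java."
    }
}
```

Feeding such truncated texts to the clustering algorithm reduces both clustering time and memory use, often without hurting cluster quality.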
Further performance tuning tips are specific to each clustering algorithm.
You can change a number of attributes to increase the performance of Lingo. Most often, performance gain will be achieved at the cost of lowered clustering quality or significant change in the structure of clusters.
Lower Factorization quality, which will cause the matrix factorization algorithm to perform fewer iterations and hence complete quicker. Alternatively, you can set Factorization method to org.carrot2.matrix.factorization.PartialSingularValueDecompositionFactory, which is slightly faster than the other factorizations. In the latter case, Factorization quality becomes irrelevant.
Lower Maximum matrix size, which would cause the matrix factorization algorithm to complete quicker and use less memory. With small matrix sizes, Lingo may not be able to discover smaller clusters.
Not yet covered, please contact us if you need this section.
You can use the Carrot2 Document Clustering Workbench to run simple performance benchmarks of Carrot2. The benchmarks repeatedly cluster the content of the currently opened editor and report the average clustering time. You can use the benchmarking results to measure the impact of different attribute settings on an algorithm's performance and to estimate the maximum number of clustering requests the algorithm can process per second.
To perform a performance benchmark:
Open the Benchmark view.
To assess the performance impact of different attribute settings on one algorithm, you can open two or more editors with the same results clustered by the algorithm, set different attribute values in each editor and run benchmarking for each editor separately. The Benchmark view remembers the last result for each editor, so you can compare the performance figures by simply switching between the editors.
By default, the benchmarking view uses only a single processing unit on multi-processor or multi-core machines. You can increase the number of benchmark threads in the Threads section.
Benchmark results may vary and differ from results obtained on production machines due to other programs running in the background, the operating system, hardware-specific considerations and even different Java Virtual Machine settings. Always fine-tune your clustering setup in the target deployment environment.
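Conceptually, the Benchmark view repeats a timed run and averages the results. A stand-alone sketch of the same idea, with a warm-up phase (which matters on the JVM because of just-in-time compilation), might look as follows — the placeholder task stands in for an actual clustering call:

```java
public class SimpleBenchmark {
    // Runs the task warmup times (unmeasured), then rounds times, and returns
    // the average wall-clock time in milliseconds over the measured rounds.
    static double averageMillis(Runnable task, int warmup, int rounds) {
        for (int i = 0; i < warmup; i++) task.run();   // let the JIT settle
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) task.run();
        return (System.nanoTime() - start) / 1e6 / rounds;
    }

    public static void main(String[] args) {
        // Placeholder standing in for a clustering call.
        Runnable task = () -> { long s = 0; for (int i = 0; i < 100_000; i++) s += i; };
        System.out.printf("average: %.3f ms%n", averageMillis(task, 5, 20));
    }
}
```

The inverse of the average time gives a rough upper bound on the number of clustering requests per second a single thread could handle.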
Carrot2 will attempt to cluster any textual content, regardless of the language the content is written in. However, a certain level of shallow linguistic preprocessing usually helps to achieve better clustering and higher-quality cluster labels (this is especially true when clustering smaller content, such as search results). Linguistic preprocessing includes the following components and resources:
Stemming is the act of folding grammatical variations of words into their “base” forms. In English, for example, stemming transforms plural word forms into singular ones. For highly inflectional languages, such as Central European languages, stemming may be the key to achieving good clustering results. Carrot2 uses a built-in set of stemmers from the Snowball, Lucene and Morfologik projects.
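As a toy illustration of what stemming buys the clustering algorithm — Carrot2's real stemmers come from Snowball, Lucene and Morfologik, and the crude plural folder below is emphatically not one of them — even two simple rules conflate some English term variants:

```java
public class ToyStemmer {
    // Extremely crude illustration: handles only two regular English plural
    // endings. Real stemmers cover far more morphology than this.
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4)
            return w.substring(0, w.length() - 3) + "y";   // queries -> query
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3)
            return w.substring(0, w.length() - 1);          // clusters -> cluster
        return w;
    }

    public static void main(String[] args) {
        // "Queries"/"query" and "clusters"/"cluster" fold to the same form,
        // so the algorithm can treat them as occurrences of one term.
        System.out.println(stem("Queries") + " " + stem("query"));
        System.out.println(stem("clusters") + " " + stem("cluster"));
    }
}
```

Without such folding, "cluster" and "clusters" would be counted as unrelated terms, weakening both cluster formation and label selection.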
Stop words (or common words) include terms that are meaningless in the language. They are typically function words (“is”, “that”, in English) or words that are common in the analyzed body of text and should be marked as ignored. A good set of stop words helps the clustering algorithm in identifying “gaps” between other phrases that can become valuable cluster labels.
When clustering domain-specific texts, it is often desirable to filter out certain frequently occurring expressions that should not be considered clusters (“home page” for example). This resource provides means of avoiding such cluster labels.
Carrot2 comes with a set of default lexical resources which may be used as a starting point for further tuning. It is recommended to gradually build a set of customized lexical resources that matches the specific content being clustered (for example legal documents will have a different set of stop labels than a corpus of e-mails).
The user-defined Carrot2 lexical resources are placed at the following application-specific locations:
Lexical resources are placed in the resources folder under the distribution folder.
Lexical resources are placed in the resources folder under the distribution folder. The UsingCustomLexicalResources class demonstrates how to configure controllers to use a given path for loading lexical resources.
Lexical resources are placed in the WEB-INF/resources folder of the web application archive (WAR) file.
Lexical resources are placed in the WEB-INF/resources folder of the DCS' web application archive (WAR) file. The WAR file is located in the war/ folder under the distribution folder.
Lexical resources are extracted to the workspace folder on first launch. The workspace folder is typically under the Workbench's distribution directory, unless its location is modified by the -data option passed to the workbench launcher at startup.
Lexical resources are placed at the root of the JAR file. By default, the lexical resource factory scans the context class loader's resources, and typically (if no other class loader or location that precedes the core JAR contains such resources) these resources will be used by the implementation. The Carrot2 Java API contains an example called UsingCustomLexicalResources that demonstrates ways of overriding the default location.
Lexical resources are embedded in the core assembly. At runtime, all assemblies present in the stack trace of the thread initializing the clustering controller (and thus a certain clustering algorithm) are scanned for resources (the defaults are always scanned last). An example class named UsingCustomLexicalResources, provided as part of the Carrot2 C# API distribution, demonstrates ways of overriding the default lexical resource search locations from .NET.
The plugin tries to load the lexical resources from the {solr.home}/conf/clustering/carrot2 directory. If a resource is not found in that directory, the default version of the resource is loaded from the Carrot2 JAR.
A different location of lexical resources can be provided using the carrot.lexicalResourceDir Solr parameter. In particular, an absolute path can be provided to share the same lexical resources between multiple Solr cores.
The easiest way to tune the lexical resources is to use the Carrot2 Document Clustering Workbench which will allow observing the effect of the changes in real time. To tune the lexical resources in Carrot2 Document Clustering Workbench:
Start Carrot2 Document Clustering Workbench and run some query on which you'll be observing the results of your changes.
Go to the workspace/ directory, which is located in the directory to which you extracted the Carrot2 Document Clustering Workbench. Modify the lexical resource files as needed and save the changes.
Open the Attributes view and use the view toolbar's button to group the attributes by semantics. In the Preprocessing section, make sure the Processing language is correctly set and check the Reload resources checkbox. Doing the latter will let you see the updated clustering results without restarting the Carrot2 Document Clustering Workbench every time you save changed lexical resource files.
To re-run clustering after you've saved changes to the lexical resource files, choose the Restart Processing option from the Search menu, or press Ctrl+R (Command+R on Mac OS).
Stop word files are UTF-8 encoded plain text files with a single word on each line. Lines starting with # are treated as comments and ignored. Files must follow a naming convention and be named stopwords.lang, where lang is a two-letter language suffix defined in the LanguageCode class.
Example 6.1 A sample stop word file for English: stopwords.en
# stop word file for English
ain't
thanks
need
needs
needed
vs
hit
Note that although words provided in the stop word file are handled in a case-insensitive manner, they are otherwise taken literally; no further processing, such as stemming, is applied. As a result, to declare that have, has and having are all function words, three entries corresponding to these words are required.
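A parser honoring this format — one word per line, # comments, case-insensitive but otherwise literal matching — might be sketched as follows (illustrative only, not Carrot2's actual loading code):

```java
import java.util.*;

public class StopWordFile {
    // Parses the stopwords.<lang> format: one word per line,
    // lines starting with '#' are comments. Matching is case-insensitive
    // but otherwise literal -- no stemming is applied.
    static Set<String> parse(String content) {
        Set<String> words = new HashSet<>();
        for (String line : content.split("\\r?\\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            words.add(line.toLowerCase(Locale.ROOT));
        }
        return words;
    }

    static boolean isStopWord(Set<String> stopWords, String word) {
        return stopWords.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> stop = parse("# stop word file for English\nain't\nthanks\nneed\nneeds\n");
        // Case-insensitive: "Thanks" is filtered; literal: "needed" is not.
        System.out.println(isStopWord(stop, "Thanks"));  // true
        System.out.println(isStopWord(stop, "needed"));  // false
    }
}
```

Note how "needed" passes through even though "need" and "needs" are stop words — exactly the literal behavior described above.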
The Lingo clustering algorithm, in addition to stop words editing, offers more precise control over cluster labels by means of "stop label" regular expressions. If a cluster's label matches one of the stop labels, the label will not appear on the list of clusters produced by Lingo.
Label filtering files are UTF-8 encoded plain text files with a single regular expression pattern on each line. Lines starting with # are treated as comments and ignored. Files must follow a naming convention and be named stoplabels.lang, where lang is a two-letter language suffix defined in the LanguageCode class.
Each line of a stop labels file corresponds to one stop label and is a Java regular expression. Please note that in order for a label to be removed, it must match, as a whole, at least one of the stop label expressions. A number of example stop label expressions are shown below.
Example 6.2 A sample stop label file for English: stoplabels.en
# stop label patterns for English
(?i)new
(?i)information (about|on).*
(?i)(index|list) of.*
(?i)(information|list|skip|join|cheap|access(es)?|corp(oration)?s?)
(?i).*(page|part|copyright) \d+.*
(?i)(official|offer(ing)?s?|lists|uses?).*
(?i).*(known|information|offer(ing)?s?|a range)
All stop labels shown above start with the (?i) prefix, which enables case-insensitive matching. The stop label in the first line suppresses labels consisting solely of the word new. The stop label in the second line removes labels that start with information about or information on, and the stop label in the third line removes labels that start with index of or list of.
This chapter will show you how to add new document sources and tune clustering in Carrot2 applications.
Key concepts in customizing and tuning Carrot2 applications are component suites and component attributes described in the following sections.
A component suite is a set of Carrot2 components, such as document sources or clustering algorithms, configured to work within a specific Carrot2 application. For each component, the component suite defines the component's identifier, label, description and also a number of component- and application-specific properties, such as the list of example queries.
Component suites are defined in XML files read from application-specific locations described in further sections of this chapter. An example component suite definition is shown in Figure 7.1.
Figure 7.1 Example Carrot2 component suite
<component-suite>
  <sources>
    <source id="lucene"
            component-class="org.carrot2.source.lucene.LuceneDocumentSource"
            attribute-sets-resource="lucene.attributes.xml">
      <label>Lucene</label>
      <title>Apache Lucene</title>
      <mnemonic>L</mnemonic>
      <description>
        Apache Lucene index (local index access).
      </description>
      <icon-path>icons/lucene.png</icon-path>
      <example-queries>
        <example-query>data mining</example-query>
        <example-query>london</example-query>
        <example-query>clustering</example-query>
      </example-queries>
    </source>
  </sources>

  <algorithms>
    <algorithm id="lingo"
               component-class="org.carrot2.clustering.lingo.LingoClusteringAlgorithm"
               attribute-sets-resource="lingo.attributes.xml">
      <label>Lingo</label>
      <title>Lingo Clustering</title>
    </algorithm>
  </algorithms>

  <include suite="source-bing.xml" />
  <include suite="algorithm-stc.xml" />
</component-suite>
The component suite definition can consist of the following elements:
sources
Document source definitions, optional.
algorithms
Clustering algorithm definitions, optional.
include
Includes other XML component suite definitions, optional. The resource specified in the suite attribute will be loaded from the current thread's context class loader.
Common parts of the source and algorithm tags include:
id
Identifier of the component within the suite, required. Identifiers must be unique within the component suite scope.
component-class
Fully qualified name of the processing component class, required.
attribute-sets-resource
XML file to load the component's attributes from. The resource specified in this attribute will be loaded from the current thread's context class loader. For the syntax of the XML file, please see Section 7.1.2.
label
A human readable label of the component, required.
title
A human readable title of the component, required. The title will usually be slightly longer than the label.
description
A longer description of the component, optional.
icon-path
Application specific definition of the component's icon.
Additionally, for the source tag you can use the example-queries tag to specify example queries the applications may show for this source.
A component attribute is a specific property of a Carrot2 component that influences its behavior, e.g. the number of search results fetched by a document source or the depth of the cluster hierarchy produced by a clustering algorithm. Each attribute is identified by a unique string key; Chapter 12 lists and describes all available components and their attributes.
You can specify attribute values for specific components in the component suite using attribute sets. Attribute sets are defined in XML files referenced by the attribute-sets-resource attribute of the component's entry in the component suite. Figure 7.2 shows an example attribute set definition.
Figure 7.2 Example Carrot2 attribute set
<attribute-sets>
  <attribute-set id="lucene">
    <value-set>
      <label>Lucene</label>
      <attribute key="LuceneDocumentSource.directory">
        <value>
          <wrapper class="org.carrot2.source.lucene.FSDirectoryWrapper">
            <indexPath>/path/to/lucene/index/directory</indexPath>
          </wrapper>
        </value>
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.contentField">
        <value type="java.lang.String" value="summary" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.titleField">
        <value type="java.lang.String" value="title" />
      </attribute>
      <attribute key="org.carrot2.source.lucene.SimpleFieldMapper.urlField">
        <value type="java.lang.String" value="url" />
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>
An attribute-sets element can contain one or more attribute-set elements. Each attribute-set must specify a unique id and a value-set.
Saving attributes to XML using Carrot2 Document Clustering Workbench
As the syntax of the value elements depends on the type of the attribute being set, the easiest way to obtain the XML file is to use the Carrot2 Document Clustering Workbench.
To generate attribute set XML for a document source:
In the Search view, choose the document source for which you would like to save attributes.
Use the Search view to set the desired attribute values.
Choose the Save as... option from the Search view's menu bar. The Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the document source's attribute-sets-resource attribute.
Please note that the Carrot2 Document Clustering Workbench will remove a number of common attributes from the XML file being saved, including: query, start result index, number of results.
To generate attribute set XML for a clustering algorithm:
In the Search view, choose the clustering algorithm for which you would like to save attributes. Choose any document source and perform processing using the selected algorithm.
Use the Attributes view to set the desired attribute values.
Choose the Save as... option from the Attributes view's menu bar. The Carrot2 Document Clustering Workbench will suggest the XML file name based on the value of the clustering algorithm's attribute-sets-resource attribute.
If for some reason you cannot use the Carrot2 Document Clustering Workbench to save attribute set XML files, you can modify the SavingAttributeValuesToXml class from the carrot2-examples package to set the attribute values you would like to use, and run the class to print the XML encoding of the attribute values to the standard output.
To add a document source tab to the Carrot2 Web Application:
Open for editing the suite-webapp.xml file, located in the WEB-INF/suites directory of the WAR file.
Add a descriptor for the document source you want to add to the sources section of the suite-webapp.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 7.1.1 for more information about the component suite XML file.
If the document source you are adding requires setting specific attribute values (e.g. the index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.
Deploy the WAR file with the above modifications to your container. If the new document source tab is not showing, clear cookies for the domain on which the web application is deployed.
To add a document source tab to the Carrot2 Document Clustering Server:
Open for editing the suite-dcs.xml file, located in the WEB-INF/suites directory of the DCS WAR file, which in turn is located in the war/ folder of the DCS distribution.
Add a descriptor for the document source you want to add to the sources section of the suite-dcs.xml file. Alternatively, you may want to use the include element to reference one of the example document source descriptors shipped with the application (e.g. source-lucene.xml). Please see Section 7.1.1 for more information about the component suite XML file.
If the document source you are adding requires setting specific attribute values (e.g. the index location for the Lucene document source), use the Carrot2 Document Clustering Workbench to generate the attribute set XML file. Place the generated XML file in WEB-INF/suites and make sure it is appropriately referenced by the attribute-sets-resource attribute of the descriptor added in the previous step.
Restart the DCS. The new document source should be available for processing.
To run the Carrot2 Web Application with custom attributes of the Lingo clustering algorithm:
Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.
Replace the contents of lingo.attributes.xml, located in the WEB-INF/suites directory of the web application WAR file, with the XML file saved in the previous step.
Deploy the WAR file with the above modifications to your container.
You can use the same procedure to customize other algorithms, e.g. STC.
To run the Carrot2 Document Clustering Server with custom attributes of the Lingo clustering algorithm:
Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.
Replace the contents of algorithm-lingo-attributes.xml, located in the WEB-INF/suites directory of the DCS WAR file (found in the war/ directory of the DCS distribution), with the XML file saved in the previous step.
Restart the DCS.
You can use the same procedure to customize other algorithms, e.g. STC.
To run the Carrot2 Command Line Interface with custom attributes of the Lingo clustering algorithm:
Use the Carrot2 Document Clustering Workbench to save the attribute set XML file with the desired Lingo attribute values.
Replace the contents of algorithm-lingo-attributes.xml, located in the /suites directory of the CLI distribution, with the XML file saved in the previous step.
Run the CLI application.
You can use the same procedure to customize other algorithms, e.g. STC.
The Java API distribution package contains examples showing how to customize attributes of the clustering algorithms. Please see the org.carrot2.examples.clustering.UsingAttributes class or the JavaDoc overview page.
Not yet covered, please contact us if you need this section.
This chapter discusses more advanced usage scenarios of Carrot2 such as running Carrot2 applications in Eclipse and building Carrot2 from source code.
To run Carrot2 Document Clustering Workbench in Eclipse IDE (version 3.4 or higher required):
Choose Window > Preferences and then Run/Debug > String substitution. Add a temp_workspaces variable pointing to an existing disk directory where the Workbench's workspace should be created.
Choose Run > Run Configurations... from the main menu and run the Workbench configuration.
To run the Carrot2 Web Application in Eclipse IDE:
Choose Run > External Tools > External Tools Configurations... from the main menu and run the Web Application Setup [carrot2] configuration. This will preprocess various configuration files required by the web application.
Choose Run > Run Configurations... from the main menu and run the Web Application Runner [carrot2] configuration.
Point your browser to http://localhost:8080 to access the running web application.
To build Carrot2 applications from source code, you will need a Java Software Development Kit (Java SDK) version 1.8 or higher and Apache Ant version 1.9.3 or higher. You can check out the latest Carrot2 source code using git:
git clone git://github.com/carrot2/carrot2.git
To build Carrot2 Document Clustering Workbench from source code:
Download Eclipse Target Platform from http://download.carrot2.org/eclipse and extract to some local folder.
Copy local.properties.example from the Carrot2 checkout folder to local.properties in the same folder. In local.properties, edit the target.platform property to point to the Eclipse Target Platform you have downloaded. The folder pointed to by target.platform must have the eclipse/ folder inside. You may also change the configs property to match the platform you want to build the Carrot2 Document Clustering Workbench for, or rely on auto-detection.
Run:
ant workbench
to build Carrot2 Document Clustering Workbench binaries.
Go to the tmp/workbench/tmp/carrot2-workbench folder in the Carrot2 checkout directory and run the Carrot2 Document Clustering Workbench.
You can use curl to post requests to the Carrot2 Document Clustering Server. Figure 8.2 shows how to use curl to query an external document source and cluster the results using the DCS. Figure 8.3 shows how to cluster documents from an XML file in Carrot2 format using the DCS. Please see the examples/curl directory of the Carrot2 Document Clustering Server distribution archive for more curl DCS invocation examples.
Figure 8.2 Using DCS and curl to cluster data from document source
curl http://localhost/dcs/rest \
     -F "dcs.source=etools" \
     -F "query=test" \
     -o result.xml
Figure 8.3 Using DCS and curl to cluster documents from an XML file in Carrot2 format
curl http://localhost/dcs/rest \
     -F "dcs.c2stream=@documents-in-carrot2-format.xml" \
     -o result.xml
You can download curl for Windows from http://curl.haxx.se/latest.cgi?curl=win32-nossl.
If your server or development machine connects to HTTP servers via an HTTP proxy, you can instruct most Carrot2 document source implementations to take this into account by defining the following global system properties:
URL of the HTTP proxy (numeric or full address, but without the port number).
Proxy server's port number.
Two sources that currently do not support the above properties are: Bing7DocumentSource and OpenSearchDocumentSource.
If your document source initiates HTTP connections to a server protected with BASIC or DIGEST HTTP authentication, you will have to pass the username and password to the application so that such connections can be established. Define the following global system properties (they are picked up once, when the Controller is created):
Username for BASIC or DIGEST authentication.
Password for BASIC or DIGEST authentication.
Note that, in general, it's better not to have any HTTP authentication at all since it's a very weak form of protection anyway and only increases network traffic (two HTTP requests may have to be made in order to fetch the remote resource).
This chapter discusses solutions to some common problems with Carrot2 code or applications.
To increase Java heap size for Carrot2 Document Clustering Workbench, use the following command line parameters:
carrot2-workbench -vmargs -Xmx256m
Using the above pattern you can specify any other JVM options if needed.
You can also add JVM path and options to the eclipse.ini file located in the Carrot2 Document Clustering Workbench installation directory. Please see the Eclipse Wiki for a list of all available options.
To get the stack trace corresponding to a processing error in the Carrot2 Document Clustering Workbench (useful for the Carrot2 team to spot errors), follow this procedure:
Click OK on the Problem Occurred dialog box (Figure 9.1).
Go to Window > Show view > Other... and choose Error Log (Figure 9.2).
In the Error Log view double click the line corresponding to the error (Figure 9.3).
Copy the exception stack trace from the Event Details dialog and pass to Carrot2 team (Figure 9.4).
If you see question marks ("?") instead of Chinese, Polish or other special Unicode characters in clusters and documents output by the Carrot2 Web Application, the most likely cause is incorrect decoding of the request URI at the container level.
The Carrot2 Web Application running under a Web application container (such as Tomcat) relies on proper decoding of Unicode characters from the request URI. This decoding is done by the container and must be properly configured at the container level. Unfortunately, this configuration is not part of the J2EE standard and is therefore different for each container.
For Apache Tomcat, you can enforce the URI decoding code page at the connector configuration level. Locate the server.xml file inside Tomcat's conf folder and add the following attribute to the Connector section:
URIEncoding="UTF-8"
A typical connector configuration should look like this:
<Connector port="8080"
           maxThreads="25" minSpareThreads="5" maxSpareThreads="10"
           minProcessors="5" maxProcessors="25"
           enableLookups="false" redirectPort="8443" acceptCount="10"
           debug="0" connectionTimeout="20000"
           URIEncoding="UTF-8" />
This chapter discusses some Carrot2 architecture assumptions, internals and more complex API use cases.
This section provides a very brief overview of Carrot2 architecture. If you would like us to cover some specific topic in more detail, please let us know on the mailing list.
Processing in Carrot2 is based on a pipeline of processing components. The two main types of Carrot2 processing components are:
Document Sources provide data for further processing. In a typical scenario, such a component would fetch search results from e.g. an external search engine, a Lucene / Solr index or an XML file. Currently, the Carrot2 distribution contains 12 different document source components.
Clustering Algorithms organize documents provided by document sources into meaningful groups. Currently, two specialized clustering algorithms are available in Carrot2: Lingo and STC. Additionally, a number of "synthetic" clustering algorithms are available, such as clustering by URL.
Carrot2 applications, such as Carrot2 Document Clustering Workbench or Carrot2 Document Clustering Server operate on a pipeline consisting of one document source and one clustering algorithm, but using Carrot2 Java API you can insert additional components at any point in the pipeline. Currently, the only component not falling into the above categories is a component for computing certain cluster quality metrics, but more components may be added in the future, e.g. for spell checking of user queries.
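The pipeline idea above can be sketched in plain Java. The interfaces and the toy "by first word" clustering below are illustrative stand-ins, not the real Carrot2 API; real components such as a search engine source or the Lingo algorithm would plug into the same shape:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of a Carrot2-style pipeline: a document source
// feeds a clustering component. The real Carrot2 interfaces are richer.
public class PipelineSketch {
    public interface DocumentSource { List<String> fetch(String query); }
    public interface ClusteringAlgorithm { Map<String, List<String>> cluster(List<String> docs); }

    // One source and one clustering algorithm form the basic pipeline.
    public static Map<String, List<String>> run(DocumentSource source,
                                                ClusteringAlgorithm algorithm,
                                                String query) {
        return algorithm.cluster(source.fetch(query));
    }

    public static void main(String[] args) {
        DocumentSource source =
            q -> Arrays.asList("data mining", "data clustering", "web search");
        // Toy algorithm: group documents by their first word.
        ClusteringAlgorithm byFirstWord =
            docs -> docs.stream().collect(Collectors.groupingBy(d -> d.split(" ")[0]));
        System.out.println(run(source, byFirstWord, "ignored"));
    }
}
```

Additional components (such as the cluster quality metrics component mentioned above) would simply be further stages transforming the data between the source and the final output.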
The behavior of both document sources and clustering algorithms depends on a number of attributes (settings) such as the number of documents to fetch or the number of clusters to produce. The way you provide attribute values for specific components depends on the Carrot2 application you are working with:
Carrot2 Document Clustering Workbench. In the Carrot2 Document Clustering Workbench you can provide attributes for document sources (such as the number of results to fetch or the preferred results language) before you issue a query in the Search view. You can change clustering algorithm attributes using the sliders in the Attributes view.
Carrot2 Document Clustering Server. In the Carrot2 Document Clustering Server, you can provide attribute values as additional parameters of the POST request. The name of the POST parameter should be the identifier of the attribute you want to set (see Chapter 12 for attribute identifiers). Carrot2 will attempt to convert the string value of the parameter to the required type (integer, float etc.).
For a complete reference of attributes of each Carrot2 component, please see Chapter 12.
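The string-to-type conversion described above can be sketched as follows. The attribute keys in the comments are real identifiers from Chapter 12, but the converter itself is a simplified illustration, not the actual server code:

```java
// Minimal sketch of the kind of string-to-type conversion the
// Carrot2 Document Clustering Server performs on POST parameters.
public class AttributeConversion {
    public static Object convert(String value, Class<?> targetType) {
        if (targetType == Integer.class) return Integer.valueOf(value);
        if (targetType == Double.class)  return Double.valueOf(value);
        if (targetType == Boolean.class) return Boolean.valueOf(value);
        return value; // fall back to the raw string
    }

    public static void main(String[] args) {
        // e.g. LingoClusteringAlgorithm.desiredClusterCountBase=30
        System.out.println(convert("30", Integer.class));
        // e.g. LingoClusteringAlgorithm.clusterMergingThreshold=0.7
        System.out.println(convert("0.7", Double.class));
        System.out.println(convert("true", Boolean.class));
    }
}
```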
This section shows examples of Carrot2 input and output XML formats, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Workbench, Carrot2 Document Clustering Server and Carrot2 Web Application.
To provide documents for Carrot2 clustering, use the following XML format:
Figure 10.1 Carrot2 input XML format
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
  <query>Globe</query>
  <document id="0">
    <title>default</title>
    <url>http://www.globe.com.ph/</url>
    <snippet>
      Provides mobile communications (GSM) including GenTXT,
      handyphones, wireline services, and broadband Internet services.
    </snippet>
  </document>
  <document id="1">
    <title>Skate Shoes by Globe | Time For Change</title>
    <url>http://www.globeshoes.com/</url>
    <snippet>
      Skaters, surfers, and snowboarders designing in their own style.
    </snippet>
  </document>
  ...
</searchresult>
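A document in this format can be produced with the JDK's own DOM APIs. The sketch below is illustrative (the class name and method are not part of Carrot2); only the element and attribute names come from the figure above:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Builds a document in the Carrot2 input XML format using only JDK APIs.
// Each row of docs is { title, url, snippet }.
public class InputXmlBuilder {
    public static String build(String query, String[][] docs) {
        try {
            Document xml = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = xml.createElement("searchresult");
            xml.appendChild(root);
            Element q = xml.createElement("query");
            q.setTextContent(query);
            root.appendChild(q);
            for (int i = 0; i < docs.length; i++) {
                Element doc = xml.createElement("document");
                doc.setAttribute("id", String.valueOf(i));
                String[] names = { "title", "url", "snippet" };
                for (int f = 0; f < names.length; f++) {
                    Element field = xml.createElement(names[f]);
                    field.setTextContent(docs[i][f]);
                    doc.appendChild(field);
                }
                root.appendChild(doc);
            }
            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.transform(new DOMSource(xml), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("Globe", new String[][] {
            { "default", "http://www.globe.com.ph/", "Provides mobile communications." }
        }));
    }
}
```

Using the DOM API rather than string concatenation guarantees that special characters in titles and snippets are escaped correctly.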
Carrot2 saves the clusters in the following XML format:
Figure 10.2 Carrot2 output XML format
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
  <query>Globe</query>
  <document id="0">
    <title>default</title>
    <url>http://www.globe.com.ph/</url>
    <snippet>
      Provides mobile communications (GSM) including GenTXT,
      handyphones, wireline services, and broadband Internet services.
    </snippet>
  </document>
  <document id="1">
    <title>Skate Shoes by Globe | Time For Change</title>
    <url>http://www.globeshoes.com/</url>
    <snippet>
      Skaters, surfers, and snowboarders designing in their own style.
    </snippet>
  </document>
  ...
  <group id="0" size="60" score="1.0">
    <title>
      <phrase>com</phrase>
    </title>
    <group id="1" size="2" score="1.0">
      <title>
        <phrase>amazon.com</phrase>
      </title>
      <document refid="43"/>
      <document refid="77"/>
    </group>
    <group id="2" size="2" score="0.8">
      <title>
        <phrase>boston.com</phrase>
      </title>
      <document refid="4"/>
      <document refid="7"/>
    </group>
    ...
    <group id="7" size="48">
      <title>
        <phrase>Other Sites</phrase>
      </title>
      <attribute key="other-topics">
        <value type="java.lang.Boolean" value="true"/>
      </attribute>
      <document refid="1"/>
      <document refid="2"/>
      ...
    </group>
  </group>
  <group id="8" size="12" score="0.72">
    <title>
      <phrase>org</phrase>
    </title>
    <group id="9" size="2" score="1.0">
      <title>
        <phrase>en.wikipedia.org</phrase>
      </title>
      <document refid="9"/>
      <document refid="14"/>
      ...
    </group>
  </group>
  ...
</searchresult>
This section shows examples of Carrot2 output JSON format, used consistently by all Carrot2 applications, including Carrot2 Document Clustering Server and Carrot2 Java API.
Carrot2 saves documents and the clusters in the following JSON format:
Figure 10.3 Carrot2 output JSON format
{ "clusters": [ { "attributes": { "score": 1.0 }, "documents": [ 0, 2 ], "id": 0, "phrases": [ "Cluster 1" ], "score": 1.0, "size": 2 }, { "attributes": { "score": 0.63 }, "clusters": [ { "attributes": { "score": 0.3 }, "documents": [ 1 ], "id": 2, "phrases": [ "Cluster 2.1" ], "score": 0.3, "size": 1 }, { "attributes": { "score": 0.15 }, "documents": [ 2 ], "id": 3, "phrases": [ "Cluster 2.2" ], "score": 0.15, "size": 1 } ], "documents": [ 0 ], "id": 1, "phrases": [ "Cluster 2" ], "score": 0.63, "size": 3 } ], "documents": [ { "id": 0, "snippet": "Document 1 Content.", "title": "Document 1 Title", "url": "http://document.url/1" }, { "id": 1, "snippet": "Document 2 Content.", "title": "Document 2 Title", "url": "http://document.url/2" }, { "id": 2, "snippet": "Document 3 Content.", "title": "Document 3 Title", "url": "http://document.url/3" } ], "query": "query (optional)" }
This chapter contains information for Carrot2 developers.
Each Carrot2 release should be performed according to the following procedure:
Precondition: resolved issues All issues scheduled (fix-for version) for the release must be resolved.
Precondition: successful continuous integration builds The status of all builds must be successful. For bugfix releases, check the appropriate build on the server.
Update source code headers and line endings
ant prerelease
Commit changes to trunk.
Verify that Maven dependencies are in sync
(cd etc/maven/poms; mvn dependency:tree)
Review Maven POMs to ensure dependencies are in sync with the JAR versions in the repository.
Run all the tests and distribution target
git clean -xfd # removes any local files, including settings! ant -Dlocal.properties=local.properties.example -Dtools.dir=... clean dist
Everything should pass. Extra tools repo will be required.
Generate and verify JavaDocs
ant javadoc # (already in dist)
Review JavaDoc documentation, provide missing public and protected members description, provide missing package descriptions.
Generate and verify Carrot2 Manual
ant doc # (already in dist)
Review Carrot2 Manual, modify or add content related to the features implemented in the new release.
Review static code analysis reports
ant reports
Review and fix reasonably-looking flaws.
Update version number strings
Update carrot2.version
and remove -SNAPSHOT
suffix. This number will be embedded in
distribution file names, JavaDoc page title and other version-sensitive places.
Generate API XML file and API differences. Pick the previous version to compare against (typically the previous version on the branch). Generate API XML and a comparison report:
ant clean jdiff-compare -Dversion.previous=x.y.z
Copy API XML report for future comparisons:
cp tmp/compatibility-report/*.xml etc/jdiff/
Commit changes. Push.
Trigger stable build in Bamboo. Go to Carrot2 Bamboo (requires admin privileges) and trigger a stable build. If the build is successful, all distribution files should be available in the download directory. This is the "candidate" release.
Verify the distribution files Download, unpack and run each distribution file to make sure there are no obvious release blockers.
Create an annotated release tag and push changes.
git tag -a release/x.y.z -m "Release x.y.z"
Trigger stable build in Bamboo. Go to Carrot2 Bamboo (requires admin privileges) and trigger a stable build again. This is the final release.
Publish Maven artefacts. First, run:
ant maven.deploy
This pushes a release to Sonatype's staging area (appropriate Sonatype server configuration in ~/.m2/settings.xml and GPG keys in ~/.gnupg/ are required). Log in to Sonatype, close the release bundle and publish. This can be done later from the tagged revision.
Bump version number strings
Bump the version number to the next anticipated version and add the -SNAPSHOT suffix.
Commit changes.
Update JIRA Close issues scheduled for the release being made, release the version in JIRA, create a next version in JIRA.
Release on GitHub and update downloads
The staging server has the build files (/srv/vhosts/get.carrot2.org/head/); upload them to GitHub (https://github.com/carrot2/carrot2/releases) and create release news for the tag.
Update project website
Release notes
Add a page named release-[version]-notes that lists new features, major bug fixes and improvements introduced in the new release.
Release note history
Add the release date and a link to the release's JIRA issues on the release-notes page.
Circulate release news If appropriate, circulate release news to:
Carrot2 mailing lists
Update Wikipedia page If appropriate, update Carrot2 page on Wikipedia.
Consider upgrading Carrot2 in dependent projects If reasonable, upgrade Carrot2 dependency in other known projects, such as Apache Solr.
Carrot2 uses version identifiers consisting of three dot-separated numbers: product-line.major.minor. This scheme is modelled after Maven's POM versions and has the following interpretation:
Indicates the long-term product line identifier. This number will not change frequently as it reflects major changes in the internal architecture or shipped software components. Reading release notes is a must; the internal programming interfaces have very likely changed significantly.
Major revision number changes indicate addition of significant new features, performance optimizations or new front-end software components added to Carrot2. Reading release notes is highly recommended because programming interfaces may change slightly from major to major revision.
Minor revision numbers are reserved for shipped product updates and bug fixes. These may include critical bug fixes as well as patches increasing performance, but not changing the programming interfaces. Reading release notes is recommended, but a drop-in upgrade should work without any extra work.
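The numbering scheme above can be made concrete with a short parser and comparator (the VersionScheme class is illustrative, not part of Carrot2):

```java
// Parses "product-line.major.minor" identifiers (e.g. "3.17.0")
// and orders them numerically, per the scheme described above.
public class VersionScheme {
    public static int[] parse(String version) {
        // Strip a -SNAPSHOT suffix if present, then split on dots.
        String[] parts = version.replace("-SNAPSHOT", "").split("\\.");
        return new int[] { Integer.parseInt(parts[0]),
                           Integer.parseInt(parts[1]),
                           Integer.parseInt(parts[2]) };
    }

    public static int compare(String a, String b) {
        int[] va = parse(a), vb = parse(b);
        for (int i = 0; i < 3; i++) {
            if (va[i] != vb[i]) return Integer.compare(va[i], vb[i]);
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compare("3.17.0-SNAPSHOT", "3.16.1") > 0); // true
    }
}
```

Comparing components numerically (rather than as strings) matters once any component reaches two digits, e.g. 3.10.0 sorts after 3.9.0.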
The git repository is organized so that the master
branch tracks the development
of the next major revision. Bugfix branches track minor revisions
of already shipped versions. A tag is created for each shipped version. Branch
and tag names follow the naming conventions below.
The master branch is equivalent to the next major software revision being developed and is not numbered explicitly, but corresponds to branch vX.Y.0, where Y is the next major revision to be shipped. It is possible to create a minor release off the trunk directly if the commit log only includes bug fixes.
A branch named bugfix/X.Y.z tracks the product shipped as X.Y.z, where the z component is the next minor release to be shipped from this branch. Once shipped, a tag should be created.
A tag named release/X.Y.Z should be created for exactly that development branch at the time of shipment.
This is a very quick quality assurance checklist to run through before stable releases. This list also serves as a guideline for further automation of acceptance tests.
Note that this list does not contain many checks for the Carrot2 Web Application, Carrot2 Document Clustering Server and Carrot2 Java API as these are fairly well tested during builds (webtests, smoke-tests).
For each supported platform you can test, check that Carrot2 Document Clustering Workbench:
launches without errors in the error log
executes and clusters a remote search query without errors
executes and clusters a Lucene query without errors (we've had a bug that caused the Lucene directory attribute editor to disappear, hence this step).
can edit a clustering algorithm's attribute
shows both cluster visualizations
executes clustering algorithm benchmarks
Check that the Carrot2 Document Clustering Server starts up correctly from the command line on Windows and Linux. More acceptance tests are performed during builds (but those start the Carrot2 Document Clustering Server using the WAR file instead of the command line).
This section lists and describes attributes of all Carrot2 components. By changing values of these attributes, you can change the behaviour of the component. Please see Chapter 7 for information on how you pass attribute values in different Carrot2 applications.
Each attribute is described by a number of properties:
Key The unique identifier of the attribute.
Direction
Input The attribute is an input for the component, the behaviour of the component depends on its value.
Output The attribute is an output produced by the component.
Level Informs how advanced the attribute is.
Basic Attribute value should be fairly easily tunable by a person without significant experience in text clustering.
Medium Attribute value should be tunable by a person with some intuition about text clustering.
Advanced Attribute may require in-depth knowledge of the component for successful tuning.
Required If true and the attribute does not have a default value, a value must be provided for the component to perform processing.
Scope
Initialization time Attribute value will be respected only when the component is initializing; values provided at processing time will be ignored. This scope applies to the attributes that control time-consuming operations performed once per component instance (e.g. parsing of configuration files). As a result, only a handful of attributes fall into the initialization-time only scope.
Processing time Attribute values will be respected both at initialization and clustering time. Most of the attributes fall into this scope.
Please note that certain attributes can be both initialization- and processing-time. In most such cases it is advisable to provide the value at initialization time, because providing the same value at processing time may degrade performance slightly (e.g. due to re-reading configuration files).
Value type The Java type of the attribute's value.
Default value The default value of the attribute or none if there is no default value defined for the attribute.
Key: documents
Direction: Input
Level: BASIC
Description: Documents to cluster.
Required: no
Scope: Processing time
Value type: java.util.List
Default value: none
Attribute builder: ByFieldClusteringAlgorithmDescriptor.AttributeBuilder#documents()

Key: ByAttributeClusteringAlgorithm.fieldName
Direction: Input
Level: BASIC
Description: Name of the field to cluster by. Each non-null scalar field value with a distinct hash code will give rise to a single cluster, named using the value returned by org.carrot2.clustering.synthetic.ByFieldClusteringAlgorithm.buildClusterLabel(Object). If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.
Required: yes
Scope: Processing time
Value type: java.lang.String
Default value: sources
Value content: Must not be blank
Attribute builder: ByFieldClusteringAlgorithmDescriptor.AttributeBuilder#fieldName()

Key: clusters
Direction: Output
Description: Clusters created by the algorithm.
Scope: Processing time
Value type: java.util.List
Default value: none
Attribute builder: ByFieldClusteringAlgorithmDescriptor.AttributeBuilder#clusters()

Key: documents
Direction: Input
Level: BASIC
Description: Documents to cluster.
Required: no
Scope: Processing time
Value type: java.util.List
Default value: none
Attribute builder: ByUrlClusteringAlgorithmDescriptor.AttributeBuilder#documents()

Key: clusters
Direction: Output
Description: Clusters created by the algorithm.
Scope: Processing time
Value type: java.util.List
Default value: none
Attribute builder: ByUrlClusteringAlgorithmDescriptor.AttributeBuilder#clusters()
Key: BisectingKMeansClusteringAlgorithm.clusterCount
Direction: Input
Level: BASIC
Description: The number of clusters to create. The algorithm will create at most the specified number of clusters.
Required: no
Scope: Processing time
Value type: java.lang.Integer
Default value: 25
Min value: 2
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#clusterCount()

Key: BisectingKMeansClusteringAlgorithm.labelCount
Direction: Input
Level: BASIC
Description: Label count. The minimum number of labels to return for each cluster.
Required: no
Scope: Processing time
Value type: java.lang.Integer
Default value: 3
Min value: 1
Max value: 10
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#labelCount()

Key: documents
Direction: Input
Level: BASIC
Description: Documents returned by the search engine / document retrieval system or documents passed as input to the clustering algorithm.
Required: yes
Scope: Processing time
Value type: java.util.List
Default value: none
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#documents()

Key: BisectingKMeansClusteringAlgorithm.maxIterations
Direction: Input
Level: BASIC
Description: The maximum number of k-means iterations to perform.
Required: no
Scope: Processing time
Value type: java.lang.Integer
Default value: 15
Min value: 1
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#maxIterations()

Key: BisectingKMeansClusteringAlgorithm.partitionCount
Direction: Input
Level: BASIC
Description: Partition count. The number of partitions to create at each k-means clustering iteration.
Required: no
Scope: Processing time
Value type: java.lang.Integer
Default value: 2
Min value: 2
Max value: 10
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#partitionCount()

Key: BisectingKMeansClusteringAlgorithm.useDimensionalityReduction
Direction: Input
Level: BASIC
Description: Use dimensionality reduction. If true, k-means will be applied to the dimensionality-reduced term-document matrix, with the number of dimensions equal to twice the number of requested clusters. If the number of dimensions is lower than the number of input documents, reduction will not be performed. If false, k-means will be performed directly on the original term-document matrix.
Required: no
Scope: Processing time
Value type: java.lang.Boolean
Default value: true
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#useDimensionalityReduction()
Key: TermDocumentMatrixBuilder.titleWordsBoost
Direction: Input
Level: MEDIUM
Description: Title word boost. Gives more weight to words that appear in org.carrot2.core.Document.TITLE fields.
Required: no
Scope: Processing time
Value type: java.lang.Double
Default value: 2.0
Min value: 0.0
Max value: 10.0
Attribute builder: TermDocumentMatrixBuilderDescriptor.AttributeBuilder#titleWordsBoost()

Key: TermDocumentMatrixReducer.factorizationFactory
Direction: Input
Level: ADVANCED
Description: Factorization method. The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels.
Required: yes
Scope: Processing time
Value type: org.carrot2.matrix.factorization.IMatrixFactorizationFactory
Default value: org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationEDFactory
Attribute builder: TermDocumentMatrixReducerDescriptor.AttributeBuilder#factorizationFactory()

Key: TermDocumentMatrixReducer.factorizationQuality
Direction: Input
Level: ADVANCED
Description: Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering.
Required: yes
Scope: Processing time
Value type: org.carrot2.matrix.factorization.IterationNumberGuesser$FactorizationQuality
Default value: HIGH
Attribute builder: TermDocumentMatrixReducerDescriptor.AttributeBuilder#factorizationQuality()

Key: TermDocumentMatrixBuilder.maximumMatrixSize
Direction: Input
Level: ADVANCED
Description: Maximum matrix size. The maximum number of term-document matrix elements. The larger the size, the more accurate, but also more time- and memory-consuming, the clustering.
Required: no
Scope: Processing time
Value type: java.lang.Integer
Default value: 37500
Min value: 5000
Attribute builder: TermDocumentMatrixBuilderDescriptor.AttributeBuilder#maximumMatrixSize()

Key: TermDocumentMatrixBuilder.maxWordDf
Direction: Input
Level: ADVANCED
Description: Maximum word document frequency. The maximum document frequency allowed for words, as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear. This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a page header or footer) and such words dominate the cluster labels; in such cases, lowering maxWordDf may help. Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents; this can also be achieved by lowering maxWordDf.
Required: no
Scope: Processing time
Value type: java.lang.Double
Default value: 0.9
Min value: 0.0
Max value: 1.0
Attribute builder: TermDocumentMatrixBuilderDescriptor.AttributeBuilder#maxWordDf()

Key: TermDocumentMatrixBuilder.termWeighting
Direction: Input
Level: ADVANCED
Description: Term weighting. The method for calculating the weight of words in the term-document matrices.
Required: yes
Scope: Processing time
Value type: org.carrot2.text.vsm.ITermWeighting
Default value: org.carrot2.text.vsm.LogTfIdfTermWeighting
Allowed value types: other assignable value types are allowed.
Attribute builder: TermDocumentMatrixBuilderDescriptor.AttributeBuilder#termWeighting()
Key: MultilingualClustering.defaultLanguage
Direction: Input
Level: MEDIUM
Description: Default clustering language. The default language to use for documents with undefined org.carrot2.core.Document.LANGUAGE.
Required: yes
Scope: Processing time
Value type: org.carrot2.core.LanguageCode
Default value: ENGLISH
Attribute builder: MultilingualClusteringDescriptor.AttributeBuilder#defaultLanguage()

Key: MultilingualClustering.languageCounts
Direction: Output
Description: Document languages. The number of documents in each language. An empty string key means an unknown language.
Scope: Processing time
Value type: java.util.Map
Default value: none
Attribute builder: MultilingualClusteringDescriptor.AttributeBuilder#languageCounts()

Key: MultilingualClustering.languageAggregationStrategy
Direction: Input
Level: MEDIUM
Description: Language aggregation strategy. Determines how clusters generated for individual languages should be combined to form the final result. Please see org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy for the list of available options.
Required: yes
Scope: Processing time
Value type: org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
Default value: FLATTEN_MAJOR_LANGUAGE
Attribute builder: MultilingualClusteringDescriptor.AttributeBuilder#languageAggregationStrategy()

Key: MultilingualClustering.majorityLanguage
Direction: Output
Description: Majority language. If org.carrot2.text.clustering.MultilingualClustering.languageAggregationStrategy is org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy.CLUSTER_IN_MAJORITY_LANGUAGE, this attribute will provide the majority language that was used to cluster all the documents. If the majority of the documents have undefined language, this attribute will be empty and the clustering will be performed in the org.carrot2.text.clustering.MultilingualClustering.defaultLanguage.
Scope: Processing time
Value type: java.lang.String
Default value: none
Attribute builder: MultilingualClusteringDescriptor.AttributeBuilder#majorityLanguage()
Key: Tokenizer.documentFields
Direction: Input
Level: ADVANCED
Description: Textual fields of documents that should be tokenized and parsed for clustering.
Required: no
Scope: Initialization time
Value type: java.util.Collection
Default value: [title, snippet]
Attribute builder: TokenizerDescriptor.AttributeBuilder#documentFields()

Key: PreprocessingPipeline.lexicalDataFactory
Direction: Input
Level: ADVANCED
Description: Lexical data factory. Creates the lexical data to be used by the clustering algorithm, including stop word and stop label dictionaries.
Required: no
Scope: Initialization time and Processing time
Value type: org.carrot2.text.linguistic.ILexicalDataFactory
Default value: org.carrot2.text.linguistic.DefaultLexicalDataFactory
Attribute builder: BasicPreprocessingPipelineDescriptor.AttributeBuilder#lexicalDataFactory()

Key: merge-resources
Direction: Input
Level: MEDIUM
Description: Merges stop words and stop labels from all known languages. If set to false, only stop words and stop labels of the active language will be used. If set to true, stop words from all org.carrot2.core.LanguageCodes will be used together, and stop labels from all languages will be used together, regardless of the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings.
Required: no
Scope: Initialization time and Processing time
Value type: java.lang.Boolean
Default value: true
Attribute builder: DefaultLexicalDataFactoryDescriptor.AttributeBuilder#mergeResources()

Key: reload-resources
Direction: Input
Level: MEDIUM
Description: Reloads cached stop words and stop labels on every processing request. For best performance, lexical resource reloading should be disabled in production.
Required: no
Scope: Processing time
Value type: java.lang.Boolean
Default value: false
Attribute builder: DefaultLexicalDataFactoryDescriptor.AttributeBuilder#reloadResources()

Key: resource-lookup
Direction: Input
Level: ADVANCED
Description: Lexical resource lookup facade. By default, resources are sought in the current thread's context class loader. An override of this attribute is possible both at initialization time and at processing time.
Required: no
Scope: Initialization time and Processing time
Value type: org.carrot2.util.resource.ResourceLookup
Default value: org.carrot2.util.resource.ResourceLookup
Attribute builder: DefaultLexicalDataFactoryDescriptor.AttributeBuilder#resourceLookup()

Key: PreprocessingPipeline.stemmerFactory
Direction: Input
Level: ADVANCED
Description: Stemmer factory. Creates the stemmers to be used by the clustering algorithm.
Required: no
Scope: Initialization time and Processing time
Value type: org.carrot2.text.linguistic.IStemmerFactory
Default value: org.carrot2.text.linguistic.DefaultStemmerFactory
Attribute builder: BasicPreprocessingPipelineDescriptor.AttributeBuilder#stemmerFactory()

Key: PreprocessingPipeline.tokenizerFactory
Direction: Input
Level: ADVANCED
Description: Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm.
Required: no
Scope: Initialization time and Processing time
Value type: org.carrot2.text.linguistic.ITokenizerFactory
Default value: org.carrot2.text.linguistic.DefaultTokenizerFactory
Attribute builder: BasicPreprocessingPipelineDescriptor.AttributeBuilder#tokenizerFactory()
Key: CaseNormalizer.dfThreshold
Direction: Input
Level: ADVANCED
Description: Word document frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
Required: no
Scope: Processing time
Value type: java.lang.Integer
Default value: 1
Min value: 1
Max value: 100
Attribute builder: CaseNormalizerDescriptor.AttributeBuilder#dfThreshold()

Key: clusters
Direction: Output
Description: Clusters created by the clustering algorithm.
Scope: Processing time
Value type: java.util.List
Default value: none
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#clusters()

Key: BisectingKMeansClusteringAlgorithm.preprocessingPipeline
Direction: Input
Level: ADVANCED
Description: Common preprocessing tasks handler.
Required: no
Scope: Initialization time
Value type: org.carrot2.text.preprocessing.pipeline.IPreprocessingPipeline
Default value: org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline
Attribute builder: BisectingKMeansClusteringAlgorithmDescriptor.AttributeBuilder#preprocessingPipeline()
Key: LingoClusteringAlgorithm.desiredClusterCountBase
Direction: Input
Level: BASIC
Description: Desired cluster count base. Base factor used to calculate the number of clusters based on the number of documents on input. The larger the value, the more clusters will be created. The number of clusters created by the algorithm will be proportional to the cluster count base, but not in a linear way.
Required: no
Scope: Processing time
Value type: java.lang.Integer
Default value: 30
Min value: 2
Max value: 100
Attribute builder: LingoClusteringAlgorithmDescriptor.AttributeBuilder#desiredClusterCountBase()

Key: LingoClusteringAlgorithm.clusterMergingThreshold
Direction: Input
Level: MEDIUM
Description: Cluster merging threshold. The percentage overlap between two clusters' documents required for the clusters to be merged into one. Low values will result in more aggressive merging, which may lead to irrelevant documents in clusters. High values will result in fewer clusters being merged, which may lead to very similar or duplicated clusters.
Required: no
Scope: Processing time
Value type: java.lang.Double
Default value: 0.7
Min value: 0.0
Max value: 1.0
Attribute builder: ClusterBuilderDescriptor.AttributeBuilder#clusterMergingThreshold()

Key: LingoClusteringAlgorithm.scoreWeight
Direction: Input
Level: MEDIUM
Description: Balance between cluster score and size during cluster sorting. A value equal to 0.0 will cause Lingo to sort clusters based only on cluster size; a value equal to 1.0 will cause Lingo to sort clusters based only on cluster score.
Required: no
Scope: Processing time
Value type: java.lang.Double
Default value: 0.0
Min value: 0.0
Max value: 1.0
Attribute builder: LingoClusteringAlgorithmDescriptor.AttributeBuilder#scoreWeight()

Key: documents
Direction: Input
Level: BASIC
Description: Documents to cluster.
Required: yes
Scope: Processing time
Value type: java.util.List
Default value: none
Attribute builder: LingoClusteringAlgorithmDescriptor.AttributeBuilder#documents()
Key |
LingoClusteringAlgorithm.labelAssigner
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Cluster label assignment method. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.clustering.lingo.ILabelAssigner
|
Default value |
org.carrot2.clustering.lingo.UniqueLabelAssigner
|
Allowed value types | No other assignable value types are allowed. |
Attribute builder |
ClusterBuilderDescriptor.AttributeBuilder#labelAssigner()
|
Key |
LingoClusteringAlgorithm.phraseLabelBoost
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Phrase label boost. The weight of multi-word labels relative to one-word labels. Low values will result in more one-word labels being produced, higher values will favor multi-word labels. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
1.5
|
Min value |
0.0
|
Max value |
10.0
|
Attribute builder |
ClusterBuilderDescriptor.AttributeBuilder#phraseLabelBoost()
|
Key |
LingoClusteringAlgorithm.phraseLengthPenaltyStart
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Phrase length penalty start.
The phrase length at which the overlong multi-word labels should start to be penalized. Phrases of length smaller than phraseLengthPenaltyStart will not be penalized. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
8
|
Min value |
2
|
Max value |
8
|
Attribute builder |
ClusterBuilderDescriptor.AttributeBuilder#phraseLengthPenaltyStart()
|
Key |
LingoClusteringAlgorithm.phraseLengthPenaltyStop
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Phrase length penalty stop.
The phrase length at which the overlong multi-word labels should be removed completely. Phrases of length larger than phraseLengthPenaltyStop will be removed. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
8
|
Min value |
2
|
Max value |
8
|
Attribute builder |
ClusterBuilderDescriptor.AttributeBuilder#phraseLengthPenaltyStop()
|
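Taken together, phraseLengthPenaltyStart and phraseLengthPenaltyStop imply a weighting scheme: phrases shorter than the start length keep full weight, phrases longer than the stop length are removed, and weights decrease in between. The linear fall-off sketched below is an assumption for illustration; the exact internal shape may differ.

```java
public class PhrasePenalty {
    // Weight applied to a label phrase of the given length: 1.0 below the
    // penalty start, 0.0 above the penalty stop, linear fall-off in between.
    static double penaltyWeight(int length, int start, int stop) {
        if (length < start) return 1.0;  // short phrases are not penalized
        if (length > stop) return 0.0;   // overlong phrases are removed
        return (stop + 1.0 - length) / (stop + 2 - start);
    }

    public static void main(String[] args) {
        // With the default start = stop = 8, an 8-word phrase is penalized
        // but not removed outright.
        System.out.println(penaltyWeight(8, 8, 8));
    }
}
```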
Key |
GenitiveLabelFilter.enabled
|
Direction |
Input
|
Level |
BASIC
|
Description | Remove labels ending in genitive form. Removes labels that end in words in the Saxon genitive form (e.g. "Threatening the Country's"). |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
GenitiveLabelFilterDescriptor.AttributeBuilder#enabled()
|
Key |
StopWordLabelFilter.enabled
|
Direction |
Input
|
Level |
BASIC
|
Description | Remove leading and trailing stop words. Removes labels that consist of, start with or end in stop words. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
StopWordLabelFilterDescriptor.AttributeBuilder#enabled()
|
Key |
NumericLabelFilter.enabled
|
Direction |
Input
|
Level |
BASIC
|
Description | Remove numeric labels. Removes labels that consist only of numbers or start with a number. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
NumericLabelFilterDescriptor.AttributeBuilder#enabled()
|
Key |
QueryLabelFilter.enabled
|
Direction |
Input
|
Level |
BASIC
|
Description | Remove query words. Removes labels that consist only of words contained in the query. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
QueryLabelFilterDescriptor.AttributeBuilder#enabled()
|
Key |
MinLengthLabelFilter.enabled
|
Direction |
Input
|
Level |
BASIC
|
Description | Remove labels shorter than 3 characters. Removes labels whose total length in characters, including spaces, is less than 3. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
MinLengthLabelFilterDescriptor.AttributeBuilder#enabled()
|
Key |
StopLabelFilter.enabled
|
Direction |
Input
|
Level |
BASIC
|
Description | Remove stop labels. Removes labels that are declared as stop labels in the stoplabels.<lang> files. Please note that adding a long list of regular expressions to the stoplabels file may result in a noticeable performance penalty. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
StopLabelFilterDescriptor.AttributeBuilder#enabled()
|
Key |
CompleteLabelFilter.enabled
|
Direction |
Input
|
Level |
BASIC
|
Description | Remove truncated phrases. Tries to remove "incomplete" cluster labels. For example, in a collection of documents related to Data Mining, the phrase Conference on Data is incomplete in the sense that most likely it should be Conference on Data Mining or even Conference on Data Mining in Large Databases. When truncated phrase removal is enabled, the algorithm will try to remove such "incomplete" phrases and leave only the more informative variants. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
CompleteLabelFilterDescriptor.AttributeBuilder#enabled()
|
Key |
TermDocumentMatrixBuilder.titleWordsBoost
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Title word boost.
Gives more weight to words that appeared in org.carrot2.core.Document.TITLE fields. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
2.0
|
Min value |
0.0
|
Max value |
10.0
|
Attribute builder |
TermDocumentMatrixBuilderDescriptor.AttributeBuilder#titleWordsBoost()
|
Key |
CompleteLabelFilter.labelOverrideThreshold
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Truncated label threshold. Determines the strength of the truncated label filter. The lowest value means the strongest truncated label elimination, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
0.65
|
Min value |
0.0
|
Max value |
1.0
|
Attribute builder |
CompleteLabelFilterDescriptor.AttributeBuilder#labelOverrideThreshold()
|
Key |
TermDocumentMatrixReducer.factorizationFactory
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Factorization method. The method to be used to factorize the term-document matrix and create base vectors that will give rise to cluster labels. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.matrix.factorization.IMatrixFactorizationFactory
|
Default value |
org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationEDFactory
|
Allowed value types |
|
Attribute builder |
TermDocumentMatrixReducerDescriptor.AttributeBuilder#factorizationFactory()
|
Key |
TermDocumentMatrixReducer.factorizationQuality
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Factorization quality. The number of iterations of matrix factorization to perform. The higher the required quality, the more time-consuming the clustering will be. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.matrix.factorization.IterationNumberGuesser$FactorizationQuality
|
Default value |
HIGH
|
Allowed values |
|
Attribute builder |
TermDocumentMatrixReducerDescriptor.AttributeBuilder#factorizationQuality()
|
Key |
TermDocumentMatrixBuilder.maximumMatrixSize
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Maximum matrix size. The maximum number of the term-document matrix elements. The larger the size, the more accurate, but also more time- and memory-consuming, the clustering will be. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
37500
|
Min value |
5000
|
Attribute builder |
TermDocumentMatrixBuilderDescriptor.AttributeBuilder#maximumMatrixSize()
|
Key |
TermDocumentMatrixBuilder.maxWordDf
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Maximum word document frequency.
The maximum document frequency allowed for words, as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4 , words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in. This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels; in such cases, lowering maxWordDf should help. Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents; this can also be achieved by lowering maxWordDf. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
0.9
|
Min value |
0.0
|
Max value |
1.0
|
Attribute builder |
TermDocumentMatrixBuilderDescriptor.AttributeBuilder#maxWordDf()
|
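The document-frequency cut-off described above can be sketched with a naive filter. Whitespace tokenization and lower-casing are simplifying assumptions for illustration; the real pipeline tokenizes and normalizes words far more carefully.

```java
import java.util.Arrays;
import java.util.List;

public class DfFilter {
    // A word is ignored when it occurs in more than maxWordDf (a fraction
    // between 0 and 1) of the input documents.
    static boolean ignoreWord(String word, List<String> documents, double maxWordDf) {
        long df = documents.stream()
                .filter(d -> Arrays.asList(d.toLowerCase().split("\\s+")).contains(word))
                .count();
        return (double) df / documents.size() > maxWordDf;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "acme ships widgets", "acme hires staff", "new products soon");
        // "acme" appears in 2 of 3 documents, above a 0.5 cut-off.
        System.out.println(ignoreWord("acme", docs, 0.5));
    }
}
```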
Key |
TermDocumentMatrixBuilder.termWeighting
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Term weighting. The method for calculating weight of words in the term-document matrices. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.text.vsm.ITermWeighting
|
Default value |
org.carrot2.text.vsm.LogTfIdfTermWeighting
|
Allowed value types | Other assignable value types are allowed. |
Attribute builder |
TermDocumentMatrixBuilderDescriptor.AttributeBuilder#termWeighting()
|
Key |
MultilingualClustering.defaultLanguage
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Default clustering language.
The default language to use for documents with undefined org.carrot2.core.Document.LANGUAGE . |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.core.LanguageCode
|
Default value |
ENGLISH
|
Allowed values |
|
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#defaultLanguage()
|
Key |
MultilingualClustering.languageCounts
|
Direction |
Output
|
Description | Document languages. The number of documents in each language. Empty string key means unknown language. |
Scope | Processing time |
Value type |
java.util.Map
|
Default value | none |
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#languageCounts()
|
Key |
MultilingualClustering.languageAggregationStrategy
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Language aggregation strategy.
Determines how clusters generated for individual languages should be combined to form the final result. Please see org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy for the list of available options. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
|
Default value |
FLATTEN_MAJOR_LANGUAGE
|
Allowed values |
|
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#languageAggregationStrategy()
|
Key |
MultilingualClustering.majorityLanguage
|
Direction |
Output
|
Description | Majority language.
If org.carrot2.text.clustering.MultilingualClustering.languageAggregationStrategy is org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy.CLUSTER_IN_MAJORITY_LANGUAGE , this attribute will provide the majority language that was used to cluster all the documents. If the majority of the documents have undefined language, this attribute will be empty and the clustering will be performed in the org.carrot2.text.clustering.MultilingualClustering.defaultLanguage . |
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#majorityLanguage()
|
Key |
PhraseExtractor.dfThreshold
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Phrase Document Frequency threshold.
Phrases appearing in fewer than dfThreshold documents will be ignored. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
1
|
Min value |
1
|
Max value |
100
|
Attribute builder |
PhraseExtractorDescriptor.AttributeBuilder#dfThreshold()
|
Key |
Tokenizer.documentFields
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Textual fields of documents that should be tokenized and parsed for clustering. |
Required |
no
|
Scope | Initialization time |
Value type |
java.util.Collection
|
Default value |
[title, snippet]
|
Attribute builder |
TokenizerDescriptor.AttributeBuilder#documentFields()
|
Key |
DocumentAssigner.exactPhraseAssignment
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Only exact phrase assignments. Assign only documents that contain the label in its original form, including the order of words. Enabling this option will cause fewer documents to be put in clusters, which results in higher precision of assignment, but also a larger "Other Topics" group. Disabling this option will cause more documents to be put in clusters, which will make the "Other Topics" cluster smaller, but also lower the precision of cluster-document assignments. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
DocumentAssignerDescriptor.AttributeBuilder#exactPhraseAssignment()
|
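Exact-phrase assignment amounts to checking that a document's text contains the label's words contiguously and in order. The sketch below matches on lower-cased whitespace tokens, an assumption for illustration only.

```java
public class ExactPhrase {
    // True when the document's tokens contain the label's tokens as a
    // contiguous, in-order subsequence.
    static boolean containsPhrase(String documentText, String label) {
        String[] doc = documentText.toLowerCase().split("\\s+");
        String[] phrase = label.toLowerCase().split("\\s+");
        for (int i = 0; i + phrase.length <= doc.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!doc[i + j].equals(phrase[j])) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Same words in a different order do not count as an exact match.
        System.out.println(containsPhrase("mining of large data", "data mining"));
    }
}
```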
Key |
PreprocessingPipeline.lexicalDataFactory
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Lexical data factory. Creates the lexical data to be used by the clustering algorithm, including stop word and stop label dictionaries. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.text.linguistic.ILexicalDataFactory
|
Default value |
org.carrot2.text.linguistic.DefaultLexicalDataFactory
|
Attribute builder |
BasicPreprocessingPipelineDescriptor.AttributeBuilder#lexicalDataFactory()
|
Key |
merge-resources
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Merges stop words and stop labels from all known languages.
If set to false , only stop words and stop labels of the active language will be used. If set to true , stop words from all org.carrot2.core.LanguageCode s will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
DefaultLexicalDataFactoryDescriptor.AttributeBuilder#mergeResources()
|
Key |
DocumentAssigner.minClusterSize
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Determines the minimum number of documents in each cluster. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
2
|
Min value |
1
|
Max value |
100
|
Attribute builder |
DocumentAssignerDescriptor.AttributeBuilder#minClusterSize()
|
Key |
reload-resources
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Reloads cached stop words and stop labels on every processing request.
For best performance, lexical resource reloading should be disabled in production. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
DefaultLexicalDataFactoryDescriptor.AttributeBuilder#reloadResources()
|
Key |
resource-lookup
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Lexical resource lookup facade. By default, resources are sought in the current thread's context class loader. An override of this attribute is possible both at initialization time and at processing time. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.util.resource.ResourceLookup
|
Default value |
org.carrot2.util.resource.ResourceLookup
|
Attribute builder |
DefaultLexicalDataFactoryDescriptor.AttributeBuilder#resourceLookup()
|
Key |
PreprocessingPipeline.stemmerFactory
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Stemmer factory. Creates the stemmers to be used by the clustering algorithm. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.text.linguistic.IStemmerFactory
|
Default value |
org.carrot2.text.linguistic.DefaultStemmerFactory
|
Attribute builder |
BasicPreprocessingPipelineDescriptor.AttributeBuilder#stemmerFactory()
|
Key |
PreprocessingPipeline.tokenizerFactory
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.text.linguistic.ITokenizerFactory
|
Default value |
org.carrot2.text.linguistic.DefaultTokenizerFactory
|
Attribute builder |
BasicPreprocessingPipelineDescriptor.AttributeBuilder#tokenizerFactory()
|
Key |
CaseNormalizer.dfThreshold
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Word Document Frequency threshold.
Words appearing in fewer than dfThreshold documents will be ignored. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
1
|
Min value |
1
|
Max value |
100
|
Attribute builder |
CaseNormalizerDescriptor.AttributeBuilder#dfThreshold()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query that produced the documents. The query helps the algorithm create better clusters, so providing it, while optional, is desirable. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
LingoClusteringAlgorithmDescriptor.AttributeBuilder#query()
|
Key |
clusters
|
Direction |
Output
|
Description | Clusters created by the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
LingoClusteringAlgorithmDescriptor.AttributeBuilder#clusters()
|
Key |
LingoClusteringAlgorithm.preprocessingPipeline
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Common preprocessing tasks handler, contains bindable attributes. |
Required |
no
|
Scope | Initialization time |
Value type |
org.carrot2.text.preprocessing.pipeline.IPreprocessingPipeline
|
Default value |
org.carrot2.text.preprocessing.pipeline.CompletePreprocessingPipeline
|
Attribute builder |
LingoClusteringAlgorithmDescriptor.AttributeBuilder#preprocessingPipeline()
|
Key |
STCClusteringAlgorithm.documentCountBoost
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Document count boost. A factor in the calculation of the base cluster score, boosting the score depending on the number of documents found in the base cluster. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
1.0
|
Min value |
0.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#documentCountBoost()
|
Key |
STCClusteringAlgorithm.maxBaseClusters
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Maximum base clusters count. Trims the base cluster array after the N-th position for the merging phase. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
300
|
Min value |
2
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#maxBaseClusters()
|
Key |
STCClusteringAlgorithm.minBaseClusterScore
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Minimum base cluster score. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
2.0
|
Min value |
0.0
|
Max value |
10.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#minBaseClusterScore()
|
Key |
STCClusteringAlgorithm.minBaseClusterSize
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Minimum documents per base cluster. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
2
|
Min value |
2
|
Max value |
20
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#minBaseClusterSize()
|
Key |
STCClusteringAlgorithm.optimalPhraseLength
|
Direction |
Input
|
Level |
BASIC
|
Description | Optimal label length. A factor in the calculation of the base cluster score. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
3
|
Min value |
1
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#optimalPhraseLength()
|
Key |
STCClusteringAlgorithm.optimalPhraseLengthDev
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Phrase length tolerance. A factor in the calculation of the base cluster score. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
2.0
|
Min value |
0.5
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#optimalPhraseLengthDev()
|
Key |
STCClusteringAlgorithm.singleTermBoost
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Single term boost. A factor in the calculation of the base cluster score. If greater than zero, single-term base clusters are assigned this value regardless of the penalty function. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
0.5
|
Min value |
0.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#singleTermBoost()
|
Key |
STCClusteringAlgorithm.mergeStemEquivalentBaseClusters
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Merge all stem-equivalent base clusters before running the merge phase. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#mergeStemEquivalentBaseClusters()
|
Key |
STCClusteringAlgorithm.scoreWeight
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Balance between cluster score and size during cluster sorting. Value equal to 0.0 will sort clusters based only on cluster size. Value equal to 1.0 will sort clusters based only on cluster score. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
1.0
|
Min value |
0.0
|
Max value |
1.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#scoreWeight()
|
Key |
documents
|
Direction |
Input
|
Level |
BASIC
|
Description | Documents to cluster. |
Required |
yes
|
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#documents()
|
Key |
STCClusteringAlgorithm.maxPhraseOverlap
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Maximum cluster phrase overlap. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
0.6
|
Min value |
0.0
|
Max value |
1.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#maxPhraseOverlap()
|
Key |
STCClusteringAlgorithm.maxPhrases
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum phrases per label. Maximum number of phrases from base clusters promoted to the cluster's label. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
3
|
Min value |
1
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#maxPhrases()
|
Key |
STCClusteringAlgorithm.maxDescPhraseLength
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum words per label. Base clusters formed by phrases with more words than this limit are trimmed. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
4
|
Min value |
1
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#maxDescPhraseLength()
|
Key |
STCClusteringAlgorithm.mostGeneralPhraseCoverage
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Minimum general phrase coverage. Minimum phrase coverage required for a phrase to appear in the cluster description. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
0.5
|
Min value |
0.0
|
Max value |
1.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#mostGeneralPhraseCoverage()
|
Key |
STCClusteringAlgorithm.mergeThreshold
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Base cluster merge threshold. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
0.6
|
Min value |
0.0
|
Max value |
1.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#mergeThreshold()
|
Key |
STCClusteringAlgorithm.maxClusters
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum final clusters. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
15
|
Min value |
1
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#maxClusters()
|
Key |
MultilingualClustering.defaultLanguage
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Default clustering language.
The default language to use for documents with undefined org.carrot2.core.Document.LANGUAGE . |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.core.LanguageCode
|
Default value |
ENGLISH
|
Allowed values |
|
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#defaultLanguage()
|
Key |
MultilingualClustering.languageCounts
|
Direction |
Output
|
Description | Document languages. The number of documents in each language. Empty string key means unknown language. |
Scope | Processing time |
Value type |
java.util.Map
|
Default value | none |
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#languageCounts()
|
Key |
MultilingualClustering.languageAggregationStrategy
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Language aggregation strategy.
Determines how clusters generated for individual languages should be combined to form the final result. Please see org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy for the list of available options. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.text.clustering.MultilingualClustering$LanguageAggregationStrategy
|
Default value |
FLATTEN_MAJOR_LANGUAGE
|
Allowed values |
|
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#languageAggregationStrategy()
|
Key |
MultilingualClustering.majorityLanguage
|
Direction |
Output
|
Description | Majority language.
If org.carrot2.text.clustering.MultilingualClustering.languageAggregationStrategy is org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy.CLUSTER_IN_MAJORITY_LANGUAGE , this attribute will provide the majority language that was used to cluster all the documents. If the majority of the documents have undefined language, this attribute will be empty and the clustering will be performed in the org.carrot2.text.clustering.MultilingualClustering.defaultLanguage . |
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
MultilingualClusteringDescriptor.AttributeBuilder#majorityLanguage()
|
Key |
Tokenizer.documentFields
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Textual fields of documents that should be tokenized and parsed for clustering. |
Required |
no
|
Scope | Initialization time |
Value type |
java.util.Collection
|
Default value |
[title, snippet]
|
Attribute builder |
TokenizerDescriptor.AttributeBuilder#documentFields()
|
Key |
PreprocessingPipeline.lexicalDataFactory
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Lexical data factory. Creates the lexical data to be used by the clustering algorithm, including stop word and stop label dictionaries. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.text.linguistic.ILexicalDataFactory
|
Default value |
org.carrot2.text.linguistic.DefaultLexicalDataFactory
|
Attribute builder |
BasicPreprocessingPipelineDescriptor.AttributeBuilder#lexicalDataFactory()
|
Key |
merge-resources
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Merges stop words and stop labels from all known languages.
If set to false , only stop words and stop labels of the active language will be used. If set to true , stop words from all org.carrot2.core.LanguageCode s will be used together and stop labels from all languages will be used together, no matter the active language. Lexical resource merging is useful when clustering data in a mix of different languages and should increase clustering quality in such settings. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
DefaultLexicalDataFactoryDescriptor.AttributeBuilder#mergeResources()
|
Key |
reload-resources
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Reloads cached stop words and stop labels on every processing request.
For best performance, lexical resource reloading should be disabled in production. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
DefaultLexicalDataFactoryDescriptor.AttributeBuilder#reloadResources()
|
Key |
resource-lookup
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Lexical resource lookup facade. By default, resources are sought in the current thread's context class loader. An override of this attribute is possible both at initialization time and at processing time. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.util.resource.ResourceLookup
|
Default value |
org.carrot2.util.resource.ResourceLookup
|
Attribute builder |
DefaultLexicalDataFactoryDescriptor.AttributeBuilder#resourceLookup()
|
Key |
PreprocessingPipeline.stemmerFactory
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Stemmer factory. Creates the stemmers to be used by the clustering algorithm. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.text.linguistic.IStemmerFactory
|
Default value |
org.carrot2.text.linguistic.DefaultStemmerFactory
|
Attribute builder |
BasicPreprocessingPipelineDescriptor.AttributeBuilder#stemmerFactory()
|
Key |
PreprocessingPipeline.tokenizerFactory
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Tokenizer factory. Creates the tokenizers to be used by the clustering algorithm. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.text.linguistic.ITokenizerFactory
|
Default value |
org.carrot2.text.linguistic.DefaultTokenizerFactory
|
Attribute builder |
BasicPreprocessingPipelineDescriptor.AttributeBuilder#tokenizerFactory()
|
Key |
CaseNormalizer.dfThreshold
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Word Document Frequency threshold.
Words appearing in fewer than dfThreshold documents will be ignored. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
1
|
Min value |
1
|
Max value |
100
|
Attribute builder |
CaseNormalizerDescriptor.AttributeBuilder#dfThreshold()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query that produced the documents. The query helps the algorithm create better clusters, so providing it, while optional, is desirable. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#query()
|
Key |
clusters
|
Direction |
Output
|
Description | Clusters created by the algorithm. |
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#clusters()
|
Key |
STCClusteringAlgorithm.ignoreWordIfInHigherDocsPercent
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Maximum word-document ratio. A number between 0 and 1; if a word occurs in a larger fraction of snippets than this ratio, it is ignored. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Double
|
Default value |
0.9
|
Min value |
0.0
|
Max value |
1.0
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#ignoreWordIfInHigherDocsPercent()
|
Key |
STCClusteringAlgorithm.ignoreWordIfInFewerDocs
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Minimum word-document recurrences. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
2
|
Min value |
2
|
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#ignoreWordIfInFewerDocs()
|
Key |
STCClusteringAlgorithm.preprocessingPipeline
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Common preprocessing tasks handler. |
Required |
no
|
Scope | Initialization time |
Value type |
org.carrot2.text.preprocessing.pipeline.IPreprocessingPipeline
|
Default value |
org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline
|
Allowed value types | Other assignable value types are allowed. |
Attribute builder |
STCClusteringAlgorithmDescriptor.AttributeBuilder#preprocessingPipeline()
|
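The STC attributes listed above can be set programmatically through the generated descriptor's attribute builder. The following is a minimal sketch, assuming the Carrot2 3.x core JAR is on the classpath; the document titles and content are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.carrot2.clustering.stc.STCClusteringAlgorithm;
import org.carrot2.clustering.stc.STCClusteringAlgorithmDescriptor;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class StcAttributesExample {
    public static void main(String[] args) {
        // Input documents (hypothetical titles and snippets).
        List<Document> documents = new ArrayList<>();
        documents.add(new Document("Data mining", "Discovering patterns in large data sets."));
        documents.add(new Document("Text clustering", "Grouping similar documents together."));

        // Attribute map: plain keys ("documents", "query") plus typed STC
        // attributes set through the generated attribute builder.
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("documents", documents);
        attributes.put("query", "data mining"); // optional, but helps cluster labeling
        STCClusteringAlgorithmDescriptor.attributeBuilder(attributes)
            .ignoreWordIfInHigherDocsPercent(0.8) // must be within [0.0, 1.0]
            .ignoreWordIfInFewerDocs(2);          // minimum allowed value is 2

        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result = controller.process(attributes, STCClusteringAlgorithm.class);
        for (Cluster cluster : result.getClusters()) {
            System.out.println(cluster.getLabel());
        }
    }
}
```

The plain string keys correspond to the Key rows above; the attribute builder provides type-safe access to the algorithm-specific keys.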
The eTools document source searches the web using the etools.ch metasearch engine
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine/document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#documents()
|
Key |
EToolsDocumentSource.country
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Determines the country of origin for the returned search results. |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.etools.EToolsDocumentSource$Country
|
Default value |
ALL
|
Allowed values |
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#country()
|
Key |
EToolsDocumentSource.language
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Determines the language of the returned search results. |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.etools.EToolsDocumentSource$Language
|
Default value |
ENGLISH
|
Allowed values |
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#language()
|
Key |
EToolsDocumentSource.safeSearch
|
Direction |
Input
|
Level |
BASIC
|
Description | If enabled, excludes offensive content from the results. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#safeSearch()
|
Key |
EToolsDocumentSource.site
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Site URL or comma-separated list of site URLs to which the returned results should be restricted.
For example: wikipedia.org or en.wikipedia.org,de.wikipedia.org . Very large lists of site restrictions (longer than 2000 characters) may result in a processing exception. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#site()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query to perform. |
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents/search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#results()
|
Key |
start
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Index of the first document/search result to fetch. The index starts at zero. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Min value |
0
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#start()
|
Key |
SearchEngineBase.compressed
|
Direction |
Output
|
Description | Indicates whether the search engine returned a compressed result stream. |
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#compressed()
|
Key |
SearchEngineStats.pageRequests
|
Direction |
Output
|
Description | Number of individual page requests issued by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#pageRequests()
|
Key |
SearchEngineStats.queries
|
Direction |
Output
|
Description | Number of queries handled successfully by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#queries()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#resultsTotal()
|
Key |
EToolsDocumentSource.customerId
|
Direction |
Input
|
Level |
MEDIUM
|
Description | eTools customer identifier.
For commercial use of eTools, please e-mail: contact@comcepta.com to obtain your customer identifier. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value |
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#customerId()
|
Key |
EToolsDocumentSource.dataSources
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Determines which data sources to search. |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.etools.EToolsDocumentSource$DataSources
|
Default value |
ALL
|
Allowed values |
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#dataSources()
|
Key |
XmlDocumentSourceHelper.timeout
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Data transfer timeout. Specifies the data transfer timeout, in seconds. A timeout value of zero is interpreted as an infinite timeout. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
8
|
Min value |
0
|
Max value |
300
|
Attribute builder |
XmlDocumentSourceHelperDescriptor.AttributeBuilder#timeout()
|
Key |
org.carrot2.source.xml.RemoteXmlSimpleSearchEngineBase.redirectStrategy
|
Direction |
Input
|
Level |
MEDIUM
|
Description | HTTP redirect response strategy (follow or throw an error). |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.util.httpclient.HttpRedirectStrategy
|
Default value |
NO_REDIRECTS
|
Allowed values |
|
Attribute builder |
RemoteXmlSimpleSearchEngineBaseDescriptor.AttributeBuilder#redirectStrategy()
|
Key |
EToolsDocumentSource.partnerId
|
Direction |
Input
|
Level |
ADVANCED
|
Description | eTools partner identifier. If you have commercial arrangements with eTools, specify your partner id here. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value |
Carrot2
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#partnerId()
|
Key |
EToolsDocumentSource.serviceUrlBase
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Base URL for the eTools service. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value |
https://www.etools.ch/partnerSearch.do
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#serviceUrlBase()
|
Key |
EToolsDocumentSource.timeout
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Maximum time in milliseconds to wait for all data sources to return results. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
4000
|
Min value |
0
|
Attribute builder |
EToolsDocumentSourceDescriptor.AttributeBuilder#timeout()
|
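Fetching documents from eTools can be sketched with the same attribute-map pattern. This assumes the Carrot2 3.x core and the eTools source on the classpath; the enum constants follow the value types and defaults documented above:

```java
import java.util.HashMap;
import java.util.Map;

import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.ProcessingResult;
import org.carrot2.source.etools.EToolsDocumentSource;
import org.carrot2.source.etools.EToolsDocumentSourceDescriptor;

public class EToolsExample {
    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("query", "carrot2"); // required, must not be blank
        attributes.put("results", 50);      // fetch up to 50 results

        EToolsDocumentSourceDescriptor.attributeBuilder(attributes)
            .language(EToolsDocumentSource.Language.ENGLISH)
            .safeSearch(true)  // exclude offensive content
            .timeout(4000);    // milliseconds, per the attribute above

        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result = controller.process(attributes, EToolsDocumentSource.class);
        System.out.println("Fetched: " + result.getDocuments().size());
    }
}
```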
Searches the Web using Bing Search
Key |
search-mode
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
|
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
Default value |
CONSERVATIVE
|
Allowed values |
|
Attribute builder |
MultipageSearchEngineDescriptor.AttributeBuilder#searchMode()
|
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine/document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#documents()
|
Key |
Bing7DocumentSource.market
|
Direction |
Input
|
Level |
BASIC
|
Description | Language and country/region information for the request. |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.microsoft.v7.MarketOption
|
Default value |
ENGLISH_UNITED_STATES
|
Allowed values |
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#market()
|
Key |
Bing7DocumentSource.adult
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Adult search restriction (porn filter). |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.microsoft.v7.AdultOption
|
Default value | none |
Allowed values |
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#adult()
|
Key |
Bing7DocumentSource.site
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Site restriction: only results from under the given URL are returned.
Example: http://www.wikipedia.org or simply wikipedia.org . |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#site()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query to perform. |
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents/search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#results()
|
Key |
start
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Index of the first document/search result to fetch. The index starts at zero. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Min value |
0
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#start()
|
Key |
SearchEngineBase.compressed
|
Direction |
Output
|
Description | Indicates whether the search engine returned a compressed result stream. |
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#compressed()
|
Key |
SearchEngineStats.pageRequests
|
Direction |
Output
|
Description | Number of individual page requests issued by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#pageRequests()
|
Key |
SearchEngineStats.queries
|
Direction |
Output
|
Description | Number of queries handled successfully by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#queries()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#resultsTotal()
|
Key |
Bing7DocumentSource.apiKey
|
Direction |
Input
|
Level |
BASIC
|
Description | The API key used to authenticate requests.
You will have to provide your own API key; a free monthly request quota applies. By default, the value of the system property under the following key is used: |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#apiKey()
|
Key |
Bing7DocumentSource.redirectStrategy
|
Direction |
Input
|
Level |
MEDIUM
|
Description | HTTP redirect response strategy (follow or throw an error). |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.util.httpclient.HttpRedirectStrategy
|
Default value |
NO_REDIRECTS
|
Allowed values |
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#redirectStrategy()
|
Key |
Bing7DocumentSource.respectRateLimits
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Respect official guidelines concerning rate limits. If set to false, rate limits are not observed. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#respectRateLimits()
|
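A minimal Bing query sketch, assuming the Carrot2 3.x core and the Microsoft v7 source on the classpath. The API key is required and has no default; here it is read from a hypothetical environment variable:

```java
import java.util.HashMap;
import java.util.Map;

import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.ProcessingResult;
import org.carrot2.source.microsoft.v7.Bing7DocumentSource;
import org.carrot2.source.microsoft.v7.Bing7DocumentSourceDescriptor;
import org.carrot2.source.microsoft.v7.MarketOption;

public class BingExample {
    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("query", "search results clustering"); // required

        Bing7DocumentSourceDescriptor.attributeBuilder(attributes)
            .apiKey(System.getenv("BING_API_KEY")) // required; no default value
            .market(MarketOption.ENGLISH_UNITED_STATES)
            .site("wikipedia.org"); // optional site restriction

        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result = controller.process(attributes, Bing7DocumentSource.class);
        System.out.println("Fetched: " + result.getDocuments().size());
    }
}
```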
Searches news using Bing Search
Key |
search-mode
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
|
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
Default value |
CONSERVATIVE
|
Allowed values |
|
Attribute builder |
MultipageSearchEngineDescriptor.AttributeBuilder#searchMode()
|
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine/document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#documents()
|
Key |
Bing7NewsDocumentSource.freshness
|
Direction |
Input
|
Level |
BASIC
|
Description | Filter news by age. |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.microsoft.v7.Freshness
|
Default value | none |
Allowed values |
|
Attribute builder |
Bing7NewsDocumentSourceDescriptor.AttributeBuilder#freshness()
|
Key |
Bing7DocumentSource.market
|
Direction |
Input
|
Level |
BASIC
|
Description | Language and country/region information for the request. |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.microsoft.v7.MarketOption
|
Default value |
ENGLISH_UNITED_STATES
|
Allowed values |
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#market()
|
Key |
Bing7DocumentSource.adult
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Adult search restriction (porn filter). |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.microsoft.v7.AdultOption
|
Default value | none |
Allowed values |
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#adult()
|
Key |
Bing7DocumentSource.site
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Site restriction: only results from under the given URL are returned.
Example: http://www.wikipedia.org or simply wikipedia.org . |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#site()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query to perform. |
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents/search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#results()
|
Key |
start
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Index of the first document/search result to fetch. The index starts at zero. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Min value |
0
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#start()
|
Key |
SearchEngineBase.compressed
|
Direction |
Output
|
Description | Indicates whether the search engine returned a compressed result stream. |
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#compressed()
|
Key |
SearchEngineStats.pageRequests
|
Direction |
Output
|
Description | Number of individual page requests issued by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#pageRequests()
|
Key |
SearchEngineStats.queries
|
Direction |
Output
|
Description | Number of queries handled successfully by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#queries()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#resultsTotal()
|
Key |
Bing7DocumentSource.apiKey
|
Direction |
Input
|
Level |
BASIC
|
Description | The API key used to authenticate requests.
You will have to provide your own API key; a free monthly request quota applies. By default, the value of the system property under the following key is used: |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#apiKey()
|
Key |
Bing7DocumentSource.redirectStrategy
|
Direction |
Input
|
Level |
MEDIUM
|
Description | HTTP redirect response strategy (follow or throw an error). |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.util.httpclient.HttpRedirectStrategy
|
Default value |
NO_REDIRECTS
|
Allowed values |
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#redirectStrategy()
|
Key |
Bing7DocumentSource.respectRateLimits
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Respect official guidelines concerning rate limits. If set to false, rate limits are not observed. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
Bing7DocumentSourceDescriptor.AttributeBuilder#respectRateLimits()
|
Searches the PubMed medical abstracts database
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine/document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#documents()
|
Key |
PubMedDocumentSource.toolName
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Tool name, if registered. |
Required |
no
|
Scope | Initialization time |
Value type |
java.lang.String
|
Default value |
Carrot Search
|
Attribute builder |
PubMedDocumentSourceDescriptor.AttributeBuilder#toolName()
|
Key |
PubMedDocumentSource.maxResults
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Maximum results to fetch. No more than the specified number of results will be fetched from PubMed, regardless of the requested number of results. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
150
|
Min value |
1
|
Attribute builder |
PubMedDocumentSourceDescriptor.AttributeBuilder#maxResults()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query to perform. |
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents/search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#results()
|
Key |
start
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Index of the first document/search result to fetch. The index starts at zero. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Min value |
0
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#start()
|
Key |
SearchEngineBase.compressed
|
Direction |
Output
|
Description | Indicates whether the search engine returned a compressed result stream. |
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#compressed()
|
Key |
SearchEngineStats.pageRequests
|
Direction |
Output
|
Description | Number of individual page requests issued by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#pageRequests()
|
Key |
SearchEngineStats.queries
|
Direction |
Output
|
Description | Number of queries handled successfully by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#queries()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#resultsTotal()
|
Key |
PubMedDocumentSource.redirectStrategy
|
Direction |
Input
|
Level |
MEDIUM
|
Description | HTTP redirect response strategy (follow or throw an error). |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.util.httpclient.HttpRedirectStrategy
|
Default value |
NO_REDIRECTS
|
Allowed values |
|
Attribute builder |
PubMedDocumentSourceDescriptor.AttributeBuilder#redirectStrategy()
|
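A PubMed retrieval sketch under the same assumptions (Carrot2 3.x core plus the PubMed source on the classpath; the package name is assumed from the descriptor names above):

```java
import java.util.HashMap;
import java.util.Map;

import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;
import org.carrot2.source.pubmed.PubMedDocumentSource;
import org.carrot2.source.pubmed.PubMedDocumentSourceDescriptor;

public class PubMedExample {
    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("query", "heart attack"); // required
        attributes.put("results", 100);          // capped by maxResults below

        PubMedDocumentSourceDescriptor.attributeBuilder(attributes)
            .maxResults(150); // hard cap, regardless of "results"

        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result = controller.process(attributes, PubMedDocumentSource.class);
        for (Document d : result.getDocuments()) {
            System.out.println(d.getTitle());
        }
    }
}
```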
The XML document source retrieves documents from local XML files or remote XML streams. It can optionally apply an XSLT transformation to convert the XML to the required format.
Key |
documents
|
Direction |
Output
|
Description | Documents read from the XML data. |
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#documents()
|
Key |
query
|
Direction |
Input
and
Output
|
Level |
BASIC
|
Description | After processing this field may hold the query read from the XML data, if any.
For the semantics of this field on input, see org.carrot2.source.xml.XmlDocumentSource.xml . |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#query()
|
Key |
XmlDocumentSource.readAll
|
Direction |
Input
|
Level |
BASIC
|
Description | If true , all documents are read from the input XML stream, regardless of the limit set by org.carrot2.source.xml.XmlDocumentSource.results .
|
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#readAll()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | The maximum number of documents to read from the XML data if org.carrot2.source.xml.XmlDocumentSource.readAll is false . |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#results()
|
Key |
clusters
|
Direction |
Input
and
Output
|
Level |
BASIC
|
Description | If org.carrot2.source.xml.XmlDocumentSource.readClusters is true and clusters are present in the input XML, they will be deserialized and exposed to components further down the processing chain.
|
Required |
no
|
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#clusters()
|
Key |
processing-result.title
|
Direction |
Output
|
Description | The title (file name or query attribute, if present) for the search result fetched from the resource. A typical title for a processing result will be the query used to fetch documents from that source. For certain document sources the query may not be needed (on-disk XML, feed of syndicated news); in such cases, the input component should set its title properly for visual interfaces such as the workbench. |
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#title()
|
Key |
XmlDocumentSourceHelper.timeout
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Data transfer timeout. Specifies the data transfer timeout, in seconds. A timeout value of zero is interpreted as an infinite timeout. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
8
|
Min value |
0
|
Max value |
300
|
Attribute builder |
XmlDocumentSourceHelperDescriptor.AttributeBuilder#timeout()
|
Key |
XmlDocumentSource.xmlParameters
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Values for custom placeholders in the XML URL.
If the type of resource provided in the org.carrot2.source.xml.XmlDocumentSource.xml attribute is org.carrot2.util.resource.URLResourceWithParams , this map provides values for custom placeholders found in the XML URL. Keys of the map correspond to placeholder names, values of the map will be used to replace the placeholders. Please see org.carrot2.source.xml.XmlDocumentSource.xml for the placeholder syntax. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.util.Map
|
Default value |
{}
|
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#xmlParameters()
|
Key |
XmlDocumentSource.xml
|
Direction |
Input
|
Level |
BASIC
|
Description | The resource to load XML data from.
You can either create instances of org.carrot2.util.resource.IResource implementations directly or use org.carrot2.util.resource.ResourceLookup to look up org.carrot2.util.resource.IResource instances from a variety of locations. One special
Additionally, custom placeholders can be used. Values for the custom placeholders should be provided in the |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.util.resource.IResource
|
Default value | none |
Allowed value types | Other assignable value types are allowed. |
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#xml()
|
Key |
XmlDocumentSource.readClusters
|
Direction |
Input
|
Level |
BASIC
|
Description | If clusters are present in the input XML they will be read and exposed to components further down the processing chain. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#readClusters()
|
Key |
XmlDocumentSource.xsltParameters
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Parameters to be passed to the XSLT transformer. Keys of the map will be used as parameter names, values of the map as parameter values. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.util.Map
|
Default value |
{}
|
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#xsltParameters()
|
Key |
XmlDocumentSource.xslt
|
Direction |
Input
|
Level |
MEDIUM
|
Description | The resource to load the XSLT stylesheet from.
The XSLT stylesheet is optional and is useful when the source XML stream does not follow the Carrot2 format. The XSLT transformation is applied to the source XML stream, and the transformed stream is deserialized into org.carrot2.core.Document instances. To pass additional parameters to the XSLT transformer, use the |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.util.resource.IResource
|
Default value | none |
Allowed value types | Other assignable value types are allowed. |
Attribute builder |
XmlDocumentSourceDescriptor.AttributeBuilder#xslt()
|
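Reading documents from a local XML file, with an optional XSLT transform, can be sketched as follows (Carrot2 3.x on the classpath; the file names are hypothetical):

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.ProcessingResult;
import org.carrot2.source.xml.XmlDocumentSource;
import org.carrot2.source.xml.XmlDocumentSourceDescriptor;
import org.carrot2.util.resource.FileResource;

public class XmlSourceExample {
    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<>();
        XmlDocumentSourceDescriptor.attributeBuilder(attributes)
            .xml(new FileResource(new File("documents.xml")))    // required resource
            .xslt(new FileResource(new File("to-carrot2.xsl")))  // optional transform
            .readAll(true); // ignore the "results" limit, read everything

        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result = controller.process(attributes, XmlDocumentSource.class);
        System.out.println("Documents read: " + result.getDocuments().size());
    }
}
```

The XSLT attribute is only needed when the source XML does not already follow the Carrot2 format.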
Retrieves documents from an Apache Lucene index. The index directory must be available in the local file system.
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine/ document retrieval system or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#documents()
|
Key |
org.carrot2.source.lucene.SimpleFieldMapper.contextFragments
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Number of context fragments for the highlighter. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Integer
|
Default value |
3
|
Min value |
1
|
Attribute builder |
SimpleFieldMapperDescriptor.AttributeBuilder#contextFragments()
|
Key |
org.carrot2.source.lucene.SimpleFieldMapper.formatter
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Snippet formatter for the highlighter.
The highlighter is not used if null . |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
org.apache.lucene.search.highlight.Formatter
|
Default value |
org.carrot2.source.lucene.PlainTextFormatter
|
Allowed value types |
|
Attribute builder |
SimpleFieldMapperDescriptor.AttributeBuilder#formatter()
|
Key |
org.carrot2.source.lucene.SimpleFieldMapper.fragmentJoin
|
Direction |
Input
|
Level |
ADVANCED
|
Description | A string used to join context fragments when highlighting. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value |
...
|
Attribute builder |
SimpleFieldMapperDescriptor.AttributeBuilder#fragmentJoin()
|
Key |
org.carrot2.source.lucene.SimpleFieldMapper.contentField
|
Direction |
Input
|
Level |
BASIC
|
Description | Document content field name. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
SimpleFieldMapperDescriptor.AttributeBuilder#contentField()
|
Key |
org.carrot2.source.lucene.SimpleFieldMapper.titleField
|
Direction |
Input
|
Level |
BASIC
|
Description | Document title field name. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
SimpleFieldMapperDescriptor.AttributeBuilder#titleField()
|
Key |
org.carrot2.source.lucene.SimpleFieldMapper.urlField
|
Direction |
Input
|
Level |
BASIC
|
Description | Document URL field name. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
SimpleFieldMapperDescriptor.AttributeBuilder#urlField()
|
Key |
LuceneDocumentSource.fieldMapper
|
Direction |
Input
|
Level |
ADVANCED
|
Description |
IFieldMapper provides the link between Carrot2 org.carrot2.core.Document fields and Lucene index fields.
|
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
org.carrot2.source.lucene.IFieldMapper
|
Default value |
org.carrot2.source.lucene.SimpleFieldMapper
|
Allowed value types | Other assignable value types are allowed. |
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#fieldMapper()
|
Key |
org.carrot2.source.lucene.SimpleFieldMapper.searchFields
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Index search field names. If not specified, title and content fields are used. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
SimpleFieldMapperDescriptor.AttributeBuilder#searchFields()
|
Key |
LuceneDocumentSource.analyzer
|
Direction |
Input
|
Level |
MEDIUM
|
Description |
org.apache.lucene.analysis.Analyzer used at indexing time.
The same analyzer should be used for querying. |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
org.apache.lucene.analysis.Analyzer
|
Default value |
org.apache.lucene.analysis.standard.StandardAnalyzer
|
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#analyzer()
|
Key |
LuceneDocumentSource.directory
|
Direction |
Input
|
Level |
BASIC
|
Description | Search index org.apache.lucene.store.Directory .
Must be unlocked for reading. |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
org.apache.lucene.store.Directory
|
Default value | none |
Allowed value types |
|
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#directory()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | A pre-parsed org.apache.lucene.search.Query object or a String parsed using the built-in classic QueryParser over a set of search fields returned from the org.carrot2.source.lucene.LuceneDocumentSource.fieldMapper .
|
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.Object
|
Default value | none |
Allowed value types |
|
Value content | Must not be blank |
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents / search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#results()
|
Key |
LuceneDocumentSource.keepLuceneDocuments
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Keeps references to Lucene document instances in Carrot2 documents.
Please bear in mind that this has limitations.
|
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#keepLuceneDocuments()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
LuceneDocumentSourceDescriptor.AttributeBuilder#resultsTotal()
|
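The Lucene attributes above are typically set through the generated descriptor builders before processing. The following sketch shows one plausible way to wire them together; the index path and the field names ("title", "body", "url") are placeholders, and the snippet assumes the Carrot2 core and Lucene jars are on the classpath.

```java
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.store.FSDirectory;
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.ProcessingResult;
import org.carrot2.source.lucene.LuceneDocumentSource;
import org.carrot2.source.lucene.LuceneDocumentSourceDescriptor;
import org.carrot2.source.lucene.SimpleFieldMapperDescriptor;

public class LuceneSourceExample {
    public static void main(String[] args) throws Exception {
        Map<String, Object> attributes = new HashMap<>();

        // Required attributes: an unlocked index directory and a query.
        // "/path/to/index" is a placeholder for a real Lucene index location.
        LuceneDocumentSourceDescriptor.attributeBuilder(attributes)
            .directory(FSDirectory.open(Paths.get("/path/to/index")))
            .query("data mining")
            .results(50);

        // Map index fields onto Carrot2 document title/content/URL fields.
        // "title", "body" and "url" are hypothetical index field names.
        SimpleFieldMapperDescriptor.attributeBuilder(attributes)
            .titleField("title")
            .contentField("body")
            .urlField("url");

        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result = controller.process(
            attributes, LuceneDocumentSource.class, LingoClusteringAlgorithm.class);
        System.out.println(result.getClusters().size() + " clusters");
    }
}
```

Because the directory and field-mapper attributes are initialization- and processing-time scoped, they can also be supplied once when the controller is initialized rather than on every request.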
Solr document source queries an instance of Apache Solr search engine.
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine / document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#documents()
|
Key |
SolrDocumentSource.copyFields
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Copy Solr fields from the search result to Carrot2 org.carrot2.core.Document instances (as fields).
|
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#copyFields()
|
Key |
SolrDocumentSource.solrXsltAdapter
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Provides a custom XSLT stylesheet for converting from Solr's output to an XML format parsed by Carrot2. For performance reasons this attribute can be provided at initialization time only (no processing-time overrides). |
Required |
no
|
Scope | Initialization time |
Value type |
org.carrot2.util.resource.IResource
|
Default value | none |
Allowed value types | Allowed value types: Other assignable value types are allowed. |
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#solrXsltAdapter()
|
Key |
SolrDocumentSource.solrIdFieldName
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Document identifier field name (specified in Solr schema). This field is necessary to connect Solr-side clusters or highlighter output to documents. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#solrIdFieldName()
|
Key |
SolrDocumentSource.readClusters
|
Direction |
Input
|
Level |
BASIC
|
Description | If clusters are present in the Solr output they will be read and exposed to components further down the processing chain.
Note that org.carrot2.source.solr.SolrDocumentSource.solrIdFieldName is required to match document references. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#readClusters()
|
Key |
SolrDocumentSource.solrSummaryFieldName
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Summary field name. Name of the Solr field that will provide the document summary. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value |
description
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#solrSummaryFieldName()
|
Key |
SolrDocumentSource.solrTitleFieldName
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Title field name. Name of the Solr field that will provide document titles. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value |
title
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#solrTitleFieldName()
|
Key |
SolrDocumentSource.solrUrlFieldName
|
Direction |
Input
|
Level |
MEDIUM
|
Description | URL field name. Name of the Solr field that will provide document URLs. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value |
url
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#solrUrlFieldName()
|
Key |
SolrDocumentSource.useHighlighterOutput
|
Direction |
Input
|
Level |
BASIC
|
Description | If highlighter fragments are present in the Solr output they will be used (and preferred) over full field content.
This may decrease the memory required for clustering. In general, when the highlighter is used, Solr will not emit the full field content anyway (because it makes little sense). Setting this option to false disables the use of highlighter fragments. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Boolean
|
Default value |
true
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#useHighlighterOutput()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query to perform. |
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents / search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#results()
|
Key |
start
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Index of the first document / search result to fetch. The index starts at zero. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Min value |
0
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#start()
|
Key |
clusters
|
Direction |
Input
and
Output
|
Level |
BASIC
|
Description | If org.carrot2.source.solr.SolrDocumentSource.readClusters is true and clusters are present in the input XML, they will be deserialized and exposed to components further down the processing chain.
|
Required |
no
|
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#clusters()
|
Key |
SearchEngineBase.compressed
|
Direction |
Output
|
Description | Indicates whether the search engine returned a compressed result stream. |
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#compressed()
|
Key |
SearchEngineStats.pageRequests
|
Direction |
Output
|
Description | Number of individual page requests issued by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#pageRequests()
|
Key |
SearchEngineStats.queries
|
Direction |
Output
|
Description | Number of queries handled successfully by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#queries()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#resultsTotal()
|
Key |
XmlDocumentSourceHelper.timeout
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Data transfer timeout. Specifies the data transfer timeout, in seconds. A timeout value of zero is interpreted as an infinite timeout. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
8
|
Min value |
0
|
Max value |
300
|
Attribute builder |
XmlDocumentSourceHelperDescriptor.AttributeBuilder#timeout()
|
Key |
SolrDocumentSource.solrFilterQuery
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Filter query appended to org.carrot2.source.solr.SolrDocumentSource.serviceUrlBase .
|
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value |
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#solrFilterQuery()
|
Key |
org.carrot2.source.xml.RemoteXmlSimpleSearchEngineBase.redirectStrategy
|
Direction |
Input
|
Level |
MEDIUM
|
Description | HTTP redirect response strategy (follow or throw an error). |
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.util.httpclient.HttpRedirectStrategy
|
Default value |
NO_REDIRECTS
|
Allowed values |
|
Attribute builder |
RemoteXmlSimpleSearchEngineBaseDescriptor.AttributeBuilder#redirectStrategy()
|
Key |
SolrDocumentSource.serviceUrlBase
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Solr service URL base.
The URL base can contain additional Solr parameters, for example: http://localhost:8983/solr/select?fq=timestamp:[NOW-24HOUR TO NOW]
|
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value |
http://localhost:8983/solr/select
|
Attribute builder |
SolrDocumentSourceDescriptor.AttributeBuilder#serviceUrlBase()
|
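To make the interaction between serviceUrlBase, the query, and solrFilterQuery concrete, the sketch below assembles a Solr /select request URL the way such a request generally looks. This illustrates the mechanism only; the exact parameters Carrot2 appends internally may differ, and the filter value "inStock:true" is a hypothetical example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrUrlSketch {
    // Builds a Solr request URL from a service URL base (which may already
    // carry parameters), a query, an optional filter query and a row limit.
    static String buildRequestUrl(String serviceUrlBase, String query,
                                  String filterQuery, int rows) {
        StringBuilder url = new StringBuilder(serviceUrlBase);
        // Append with '&' if the base already contains parameters, '?' otherwise.
        url.append(serviceUrlBase.contains("?") ? '&' : '?');
        url.append("q=").append(URLEncoder.encode(query, StandardCharsets.UTF_8));
        if (filterQuery != null && !filterQuery.isEmpty()) {
            url.append("&fq=").append(URLEncoder.encode(filterQuery, StandardCharsets.UTF_8));
        }
        url.append("&rows=").append(rows);
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildRequestUrl(
            "http://localhost:8983/solr/select", "data mining", "inStock:true", 100));
        // http://localhost:8983/solr/select?q=data+mining&fq=inStock%3Atrue&rows=100
    }
}
```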
Open Search document source retrieves search results from search engines supporting the OpenSearch standard.
Key |
search-mode
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
|
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
Default value |
SPECULATIVE
|
Allowed values |
|
Attribute builder |
MultipageSearchEngineDescriptor.AttributeBuilder#searchMode()
|
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine / document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#documents()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query to perform. |
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents / search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#results()
|
Key |
start
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Index of the first document / search result to fetch. The index starts at zero. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Min value |
0
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#start()
|
Key |
SearchEngineBase.compressed
|
Direction |
Output
|
Description | Indicates whether the search engine returned a compressed result stream. |
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#compressed()
|
Key |
SearchEngineStats.pageRequests
|
Direction |
Output
|
Description | Number of individual page requests issued by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#pageRequests()
|
Key |
SearchEngineStats.queries
|
Direction |
Output
|
Description | Number of queries handled successfully by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#queries()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#resultsTotal()
|
Key |
OpenSearchDocumentSource.feedUrlParams
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Additional parameters to be appended to org.carrot2.source.opensearch.OpenSearchDocumentSource.feedUrlTemplate on each request.
|
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.util.Map
|
Default value | none |
Attribute builder |
OpenSearchDocumentSourceDescriptor.AttributeBuilder#feedUrlParams()
|
Key |
OpenSearchDocumentSource.feedUrlTemplate
|
Direction |
Input
|
Level |
BASIC
|
Description | URL to fetch the search feed from.
The URL template can contain variable placeholders, as defined by the OpenSearch specification, which are replaced at runtime. The format of a placeholder is ${variable} . The following variables are supported:
Example URL feed templates for public services:
|
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
OpenSearchDocumentSourceDescriptor.AttributeBuilder#feedUrlTemplate()
|
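The placeholder expansion described above can be sketched as follows. The variable names searchTerms, count and startIndex come from the OpenSearch specification (which variables a given feed actually supports depends on the service), and the example.com template is hypothetical; this illustrates the mechanism, not Carrot2's internal implementation.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeedUrlTemplateSketch {
    // Expands ${variable} placeholders in an OpenSearch URL template.
    // Unknown variables are replaced with an empty string.
    static String expand(String template, Map<String, String> variables) {
        Matcher m = Pattern.compile("\\$\\{(\\w+)\\}").matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = variables.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template =
            "http://example.com/search?q=${searchTerms}&num=${count}&start=${startIndex}";
        System.out.println(expand(template,
            Map.of("searchTerms", "carrot2", "count", "50", "startIndex", "0")));
        // http://example.com/search?q=carrot2&num=50&start=0
    }
}
```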
Key |
OpenSearchDocumentSource.maximumResults
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of results. The maximum number of results the document source can deliver. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Integer
|
Default value |
1000
|
Min value |
1
|
Attribute builder |
OpenSearchDocumentSourceDescriptor.AttributeBuilder#maximumResults()
|
Key |
OpenSearchDocumentSource.resultsPerPage
|
Direction |
Input
|
Level |
BASIC
|
Description | Results per page. The number of results per page the document source will expect the feed to return. |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Integer
|
Default value |
50
|
Min value |
1
|
Attribute builder |
OpenSearchDocumentSourceDescriptor.AttributeBuilder#resultsPerPage()
|
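Together, start, results, resultsPerPage and maximumResults determine how many page requests the source issues against the feed. The sketch below shows one plausible bucketing of a requested result window into page offsets; Carrot2's MultipageSearchEngine performs similar bucketing internally (its fetchers may run concurrently, and exact details may differ).

```java
public class PagingSketch {
    // Computes the page start offsets needed to cover the requested window
    // [start, start + results), clamped to maximumResults, when the feed
    // returns resultsPerPage results per request.
    static int[] pageStarts(int start, int results, int resultsPerPage, int maximumResults) {
        int end = Math.min(start + results, maximumResults);
        int firstPage = start / resultsPerPage;
        int lastPage = (end - 1) / resultsPerPage;
        int[] starts = new int[lastPage - firstPage + 1];
        for (int p = firstPage; p <= lastPage; p++) {
            starts[p - firstPage] = p * resultsPerPage;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Fetching 120 results at 50 per page (cap 1000): pages start at 0, 50, 100.
        for (int s : pageStarts(0, 120, 50, 1000)) System.out.print(s + " ");
        System.out.println();
    }
}
```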
Key |
OpenSearchDocumentSource.userAgent
|
Direction |
Input
|
Level |
ADVANCED
|
Description | User agent header.
The contents of the User-Agent HTTP header to use when making requests to the feed URL. If an empty or null value is provided, the following User-Agent will be sent: Rome Client (http://tinyurl.com/64t5n) Ver: UNKNOWN . |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
OpenSearchDocumentSourceDescriptor.AttributeBuilder#userAgent()
|
IDOL document source retrieves search results from Autonomy IDOL search engines supporting the OpenSearch standard.
Key |
search-mode
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Search mode defines how fetchers returned from org.carrot2.source.MultipageSearchEngine.createFetcher are called.
|
Required |
no
|
Scope | Processing time |
Value type |
org.carrot2.source.MultipageSearchEngine$SearchMode
|
Default value |
SPECULATIVE
|
Allowed values |
|
Attribute builder |
MultipageSearchEngineDescriptor.AttributeBuilder#searchMode()
|
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine / document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.Collection
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#documents()
|
Key |
query
|
Direction |
Input
|
Level |
BASIC
|
Description | Query to perform. |
Required |
yes
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Value content | Must not be blank |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents / search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#results()
|
Key |
start
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Index of the first document / search result to fetch. The index starts at zero. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Min value |
0
|
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#start()
|
Key |
SearchEngineBase.compressed
|
Direction |
Output
|
Description | Indicates whether the search engine returned a compressed result stream. |
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#compressed()
|
Key |
SearchEngineStats.pageRequests
|
Direction |
Output
|
Description | Number of individual page requests issued by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#pageRequests()
|
Key |
SearchEngineStats.queries
|
Direction |
Output
|
Description | Number of queries handled successfully by this data source. |
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value | none |
Attribute builder |
SearchEngineStatsDescriptor.AttributeBuilder#queries()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
SearchEngineBaseDescriptor.AttributeBuilder#resultsTotal()
|
Key |
IdolDocumentSource.idolServerName
|
Direction |
Input
|
Level |
BASIC
|
Description | URL of the IDOL Server. |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#idolServerName()
|
Key |
IdolDocumentSource.idolServerPort
|
Direction |
Input
|
Level |
BASIC
|
Description | IDOL Server Port. |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Integer
|
Default value |
0
|
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#idolServerPort()
|
Key |
IdolDocumentSource.xslTemplateName
|
Direction |
Input
|
Level |
ADVANCED
|
Description | IDOL XSL template name. The reference of an IDOL XSL template that outputs the results in OpenSearch format. |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#xslTemplateName()
|
Key |
IdolDocumentSource.maximumResults
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of results. The maximum number of results the document source can deliver. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#maximumResults()
|
Key |
IdolDocumentSource.minScore
|
Direction |
Input
|
Level |
BASIC
|
Description | Minimum IDOL Score. The minimum score of the results returned by IDOL. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Integer
|
Default value |
50
|
Min value |
1
|
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#minScore()
|
Key |
IdolDocumentSource.otherSearchAttributes
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Any other search attributes (separated by &) from the Autonomy Query Search API. Make sure all attributes required by the XSL template that will be applied are provided. |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#otherSearchAttributes()
|
Key |
IdolDocumentSource.resultsPerPage
|
Direction |
Input
|
Level |
ADVANCED
|
Description | Results per page. The number of results per page the document source will expect the feed to return. |
Required |
yes
|
Scope | Initialization time and Processing time |
Value type |
java.lang.Integer
|
Default value |
50
|
Min value |
1
|
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#resultsPerPage()
|
Key |
IdolDocumentSource.userAgent
|
Direction |
Input
|
Level |
ADVANCED
|
Description | User agent header.
The contents of the User-Agent HTTP header to use when making requests to the feed URL. If an empty or null value is provided, the following User-Agent will be sent: Rome Client (http://tinyurl.com/64t5n) Ver: UNKNOWN . |
Required |
no
|
Scope | Initialization time and Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#userAgent()
|
Key |
IdolDocumentSource.userName
|
Direction |
Input
|
Level |
MEDIUM
|
Description | User name to use for authentication. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
IdolDocumentSourceDescriptor.AttributeBuilder#userName()
|
Serves documents from the Ambient test set. Ambient (AMBIguous ENTries) is a data set designed for evaluating subtopic information retrieval. It consists of 44 topics, each with a set of subtopics and a list of 100 ranked documents. For more information, please see: http://credo.fub.it/ambient.
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine / document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#documents()
|
Key |
FubDocumentSource.includeDocumentsWithoutTopic
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Include documents without topics. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#includeDocumentsWithoutTopic()
|
Key |
FubDocumentSource.minTopicSize
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
1
|
Min value |
1
|
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#minTopicSize()
|
Key |
query
|
Direction |
Output
|
Description | Query to perform. |
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents / search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
100
|
Min value |
1
|
Max value |
100
|
Attribute builder |
AmbientDocumentSourceDescriptor.AttributeBuilder#results()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
AmbientDocumentSourceDescriptor.AttributeBuilder#resultsTotal()
|
Key |
AmbientDocumentSource.topic
|
Direction |
Input
|
Level |
BASIC
|
Description | Ambient Topic. The Ambient Topic to load documents from. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.source.ambient.AmbientDocumentSource$AmbientTopic
|
Default value |
AIDA
|
Allowed values |
|
Attribute builder |
AmbientDocumentSourceDescriptor.AttributeBuilder#topic()
|
Key |
FubDocumentSource.topicIds
|
Direction |
Output
|
Description | Topics and subtopics covered in the output documents.
The set is computed for the output org.carrot2.source.ambient.FubDocumentSource.documents and may vary for the same main topic depending, e.g., on the number of requested results or org.carrot2.source.ambient.FubDocumentSource.minTopicSize . |
Scope | Processing time |
Value type |
java.util.Set
|
Default value | none |
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#topicIds()
|
Serves documents from the ODP239 test set. ODP239 is a data set designed for evaluating subtopic information retrieval. It consists of 239 topics extracted from the Open Directory Project, each with a set of subtopics and a list of about 100 documents. For more information, please see: http://credo.fub.it/odp239.
Key |
documents
|
Direction |
Output
|
Description | Documents returned by the search engine / document retrieval system, or documents passed as input to the clustering algorithm. |
Scope | Processing time |
Value type |
java.util.List
|
Default value | none |
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#documents()
|
Key |
FubDocumentSource.includeDocumentsWithoutTopic
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Include documents without topics. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Boolean
|
Default value |
false
|
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#includeDocumentsWithoutTopic()
|
Key |
FubDocumentSource.minTopicSize
|
Direction |
Input
|
Level |
MEDIUM
|
Description | Minimum topic size. Documents belonging to a topic with fewer documents than minimum topic size will not be returned. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
1
|
Min value |
1
|
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#minTopicSize()
|
Key |
query
|
Direction |
Output
|
Description | Query to perform. |
Scope | Processing time |
Value type |
java.lang.String
|
Default value | none |
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#query()
|
Key |
results
|
Direction |
Input
|
Level |
BASIC
|
Description | Maximum number of documents / search results to fetch. |
Required |
no
|
Scope | Processing time |
Value type |
java.lang.Integer
|
Default value |
1000
|
Min value |
1
|
Max value |
1000
|
Attribute builder |
Odp239DocumentSourceDescriptor.AttributeBuilder#results()
|
Key |
results-total
|
Direction |
Output
|
Description | Estimated total number of matching documents. |
Scope | Processing time |
Value type |
java.lang.Long
|
Default value | none |
Attribute builder |
Odp239DocumentSourceDescriptor.AttributeBuilder#resultsTotal()
|
Key |
Odp239DocumentSource.topic
|
Direction |
Input
|
Level |
BASIC
|
Description | ODP239 Topic. The ODP239 Topic to load documents from. |
Required |
yes
|
Scope | Processing time |
Value type |
org.carrot2.source.ambient.Odp239DocumentSource$Odp239Topic
|
Default value |
ARTS_ANIMATION
|
Allowed values |
|
Attribute builder |
Odp239DocumentSourceDescriptor.AttributeBuilder#topic()
|
Key |
FubDocumentSource.topicIds
|
Direction |
Output
|
Description | Topics and subtopics covered in the output documents.
The set is computed for the output org.carrot2.source.ambient.FubDocumentSource.documents and may vary for the same main topic depending, e.g., on the number of requested results or org.carrot2.source.ambient.FubDocumentSource.minTopicSize . |
Scope | Processing time |
Value type |
java.util.Set
|
Default value | none |
Attribute builder |
FubDocumentSourceDescriptor.AttributeBuilder#topicIds()
|