Package | Description
---|---
org.carrot2.core | Definitions of Carrot2 core interfaces and their implementations.
org.carrot2.core.attribute | Attribute annotations for Carrot2 core interfaces.

Package | Description
---|---
org.carrot2.source | Base classes for implementing Carrot2 document sources.
org.carrot2.source.ambient | Serves documents from the Ambient test set.
org.carrot2.source.etools | Fetches documents from the eTools Metasearch Engine.
org.carrot2.source.idol | Fetches documents from an Autonomy IDOL Search engine with an OpenSearch-compliant feed.
org.carrot2.source.lucene | Fetches documents from a local Lucene index.
org.carrot2.source.microsoft.v7 | 
org.carrot2.source.opensearch | Fetches documents from an OpenSearch-compliant search feed.
org.carrot2.source.pubmed | Fetches documents from the PubMed medical abstracts database.
org.carrot2.source.solr | Fetches documents from the Solr search engine.
org.carrot2.source.xml | Fetches documents from XML streams.

Package | Description
---|---
org.carrot2.clustering.kmeans | Implementation of the bisecting k-means clustering algorithm.
org.carrot2.clustering.lingo | Implementation of the Lingo clustering algorithm.
org.carrot2.clustering.stc | Implementation of the STC clustering algorithm.
org.carrot2.clustering.synthetic | Synthetic clustering algorithms.

Package | Description
---|---
org.carrot2.output.metrics | Cluster quality metrics calculation utilities.

Package | Description
---|---
org.carrot2.text.analysis | Lexical analysis utilities.
org.carrot2.text.clustering | Multilingual clustering utilities.
org.carrot2.text.linguistic | Shallow linguistic processing utilities.
org.carrot2.text.linguistic.lucene | Shallow linguistic processing utilities dependent on Lucene stemmers and analyzers.
org.carrot2.text.linguistic.morfologik | Shallow linguistic processing utilities dependent on the Morfologik stemming library.
org.carrot2.text.linguistic.snowball | 
org.carrot2.text.linguistic.snowball.stemmers | 
org.carrot2.text.preprocessing | The unified input preprocessing infrastructure (term indexing, stemming, label discovery).
org.carrot2.text.preprocessing.filter | Text feature filtering utilities.
org.carrot2.text.preprocessing.pipeline | Predefined preprocessing pipeline utilities.
org.carrot2.text.suffixtree | Implementation of the suffix tree data structure.
org.carrot2.text.util | Data structures for text preprocessing.
org.carrot2.text.vsm | Vector Space Model utilities.

Package | Description
---|---
org.carrot2.matrix | 
org.carrot2.matrix.factorization | 
org.carrot2.matrix.factorization.seeding | Matrix seeding strategies.

Package | Description
---|---
org.carrot2.util | Common utility classes.
org.carrot2.util.annotations | Marker annotations.
org.carrot2.util.attribute | Attribute handling utilities.
org.carrot2.util.factory | A simple object factory.
org.carrot2.util.httpclient | Apache Commons HTTP client utilities.
org.carrot2.util.pool | A very simple unbounded pool implementation.
org.carrot2.util.resource | Resource location abstraction layer.
org.carrot2.util.simplexml | Utilities for working with the Simple XML framework.
org.carrot2.util.tests | Unit test utilities and annotations.
org.carrot2.util.xslt | XSLT handling utilities.
org.carrot2.util.xsltfilter | XSLT processor servlet filter.

Carrot2 is an Open Source Search Results Clustering Engine that can automatically organize small collections of documents, such as search results, into thematic categories. See below for example code and further resources:

- Java API JAR, JavaDocs and example code
- Other Carrot2 applications
- User and Developer Manual
- Instructions for Maven2 users
- Carrot2 project website
- Carrot2 on-line demo

You can use the Carrot2 Java API to fetch documents from various sources (public search engines, Lucene, Solr), perform clustering, serialize the results to JSON or XML, and more. Below is example code for the most common use cases. Please see the examples/ directory in the Java API distribution archive for more examples.
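
For instance, serializing results takes one call on the ProcessingResult once you have it. The sketch below is a minimal example, assuming the ProcessingResult.serialize(OutputStream) (XML) and ProcessingResult.serializeJson(Writer) (JSON) methods of the Carrot2 3.x API; check your version's JavaDoc for the exact signatures.

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.ProcessingResult;
import org.carrot2.source.etools.EToolsDocumentSource;

public class SerializingResultsSketch
{
    public static void main(String [] args) throws Exception
    {
        /* Fetch and cluster some documents. */
        final Controller controller = ControllerFactory.createSimple();
        final ProcessingResult result = controller.process("data mining", 50,
            EToolsDocumentSource.class, LingoClusteringAlgorithm.class);

        /* Serialize documents and clusters to XML. */
        final FileOutputStream xml = new FileOutputStream("result.xml");
        try
        {
            result.serialize(xml);
        }
        finally
        {
            xml.close();
        }

        /* Serialize documents and clusters to JSON. */
        final Writer json = new OutputStreamWriter(new FileOutputStream("result.json"), "UTF-8");
        try
        {
            result.serializeJson(json);
        }
        finally
        {
            json.close();
        }
    }
}
```
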
The easiest way to get started with Carrot2 is to cluster a collection of Documents. Each document can consist of a title, a summary and a content URL; the URL is used only by ByUrlClusteringAlgorithm and is ignored by other algorithms. To make the example short, the code shown below clusters only 5 documents. Use at least 20 to get reasonable clusters. If you have access to the query that generated the documents being clustered, you should also provide it to Carrot2 to get better clusters.

```java
/* A few example documents, normally you would need at least 20 for reasonable clusters. */
final String [][] data = new String [] []
{
    {
        "http://en.wikipedia.org/wiki/Data_mining",
        "Data mining - Wikipedia, the free encyclopedia",
        "Article about knowledge-discovery in databases (KDD), the practice of automatically searching large stores of data for patterns."
    },
    {
        "http://www.ccsu.edu/datamining/resources.html",
        "CCSU - Data Mining",
        "A collection of Data Mining links edited by the Central Connecticut State University ... Graduate Certificate Program. Data Mining Resources. Resources. Groups ..."
    },
    {
        "http://www.kdnuggets.com/",
        "KDnuggets: Data Mining, Web Mining, and Knowledge Discovery",
        "Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
    },
    {
        "http://en.wikipedia.org/wiki/Data-mining",
        "Data mining - Wikipedia, the free encyclopedia",
        "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
    },
    {
        "http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm",
        "Data Mining: What is Data Mining?",
        "Outlines what knowledge discovery, the process of analyzing data from different perspectives and summarizing it into useful information, can do and how it works."
    },
};

/* Prepare Carrot2 documents */
final ArrayList<Document> documents = new ArrayList<Document>();
for (String [] row : data)
{
    documents.add(new Document(row[1], row[2], row[0]));
}

/* A controller to manage the processing pipeline. */
final Controller controller = ControllerFactory.createPooling();

/*
 * Perform clustering by topic using the Lingo algorithm. Lingo can
 * take advantage of the original query, so we provide it along with the documents.
 */
final ProcessingResult byTopicClusters = controller.process(documents, "data mining",
    LingoClusteringAlgorithm.class);
final List<Cluster> clustersByTopic = byTopicClusters.getClusters();

/* Perform clustering by domain. In this case query is not useful, hence it is null. */
final ProcessingResult byDomainClusters = controller.process(documents, null,
    ByUrlClusteringAlgorithm.class);
final List<Cluster> clustersByDomain = byDomainClusters.getClusters();
```

Full source code: ClusteringDocumentList.java

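Once clustering completes, you will usually walk the cluster list to read labels and member documents. A short sketch continuing the example above (it assumes the Cluster accessors getLabel(), getAllDocuments() and getSubclusters() of the Carrot2 core API):

```java
/* Walk the clusters produced above and print a simple summary. */
for (Cluster cluster : clustersByTopic)
{
    System.out.println(cluster.getLabel()
        + " (" + cluster.getAllDocuments().size() + " documents)");

    for (Document d : cluster.getAllDocuments())
    {
        System.out.println("  - " + d.getTitle());
    }

    /* Some algorithms produce hierarchical clusters; subclusters can be walked too. */
    for (Cluster subcluster : cluster.getSubclusters())
    {
        System.out.println("  > " + subcluster.getLabel());
    }
}
```
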
Carrot2 can also fetch documents from a document source implementing IDocumentSource and cluster them using an IClusteringAlgorithm. The simplest, yet least flexible, way to do this is to use the Controller.process(String, Integer, Class...) method of the Controller. The code shown below retrieves 100 search results for the query data mining from EToolsDocumentSource and clusters them using the LingoClusteringAlgorithm.

```java
/* A controller to manage the processing pipeline. */
final Controller controller = ControllerFactory.createPooling();

/* Perform processing */
final ProcessingResult result = controller.process("data mining", 100,
    EToolsDocumentSource.class, LingoClusteringAlgorithm.class);

/* Documents fetched from the document source, clusters created by Carrot2. */
final List<Document> documents = result.getDocuments();
final List<Cluster> clusters = result.getClusters();
```

Full source code: ClusteringDataFromDocumentSources.java

If your production code needs to fetch documents from popular search engines, it is very important that you generate and use your own API key rather than Carrot2's default one. You can pass the API key, along with the query and the requested number of results, in an attribute map. The Carrot2 manual lists all supported attributes along with their keys, types and allowed values. The code shown below fetches and clusters 50 results from Bing7DocumentSource.

```java
/* A controller to manage the processing pipeline. */
final Controller controller = ControllerFactory.createPooling();

/* Prepare attributes */
final Map<String, Object> attributes = new HashMap<String, Object>();

/* Put your own API key here! */
Bing7DocumentSourceDescriptor.attributeBuilder(attributes)
    .apiKey(BingKeyAccess.getKey());

/* Query and the required number of results */
attributes.put(CommonAttributesDescriptor.Keys.QUERY, "clustering");
attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 50);

/* Perform processing */
final ProcessingResult result = controller.process(attributes,
    Bing7DocumentSource.class, STCClusteringAlgorithm.class);

/* Documents fetched from the document source, clusters created by Carrot2. */
final List<Document> documents = result.getDocuments();
final List<Cluster> clusters = result.getClusters();
```

Full source code: ClusteringDataFromDocumentSources.java

You can change the default behaviour of clustering algorithms and document sources by changing their attributes. For a complete list of available attributes, their identifiers, types and allowed values, please see the Carrot2 manual.
To pass attributes to Carrot2, put them into a Map along with the query or documents being clustered. The code shown below searches the web using Bing7DocumentSource and clusters the results using LingoClusteringAlgorithm, customized to create fewer clusters than by default.

```java
/* A controller to manage the processing pipeline. */
final Controller controller = ControllerFactory.createPooling();

/* Prepare attribute map */
final Map<String, Object> attributes = new HashMap<String, Object>();

/* Put attribute values using direct keys. */
attributes.put(CommonAttributesDescriptor.Keys.QUERY, "data mining");
attributes.put(CommonAttributesDescriptor.Keys.RESULTS, 100);
attributes.put("LingoClusteringAlgorithm.desiredClusterCountBase", 15);

/* Put your own API key here! */
attributes.put(Bing7DocumentSourceDescriptor.Keys.API_KEY, BingKeyAccess.getKey());

/* Perform processing */
final ProcessingResult result = controller.process(attributes,
    Bing7DocumentSource.class, LingoClusteringAlgorithm.class);

/* Documents fetched from the document source, clusters created by Carrot2. */
final List<Document> documents = result.getDocuments();
final List<Cluster> clusters = result.getClusters();
```

Full source code: UsingAttributes.java

As an alternative to the raw attribute map used in the previous example, you can use attribute map builders. Compared to raw string keys, builders give you compile-time checking of attribute names and value types, as well as code completion in your IDE. A possible disadvantage of attribute builders is that one algorithm's attributes can be divided across a number of builders and hence may not all be readily available in your IDE's auto-complete window. Please consult the attribute documentation in the Carrot2 manual for pointers to the appropriate builder classes and methods.
The code shown below fetches 100 results for query data mining from Bing7DocumentSource and clusters them using the LingoClusteringAlgorithm tuned to create slightly fewer clusters than by default. Please note how the API key is passed and use your own key in production deployments.

```java
/* A controller to manage the processing pipeline. */
final Controller controller = ControllerFactory.createPooling();

/* Prepare attribute map */
final Map<String, Object> attributes = new HashMap<String, Object>();

/* Put values using attribute builders */
CommonAttributesDescriptor
    .attributeBuilder(attributes)
    .query("data mining")
    .results(100);
LingoClusteringAlgorithmDescriptor
    .attributeBuilder(attributes)
    .desiredClusterCountBase(15)
    .matrixReducer()
        .factorizationQuality(FactorizationQuality.HIGH);
Bing7DocumentSourceDescriptor
    .attributeBuilder(attributes)
    .apiKey(BingKeyAccess.getKey()); // use your own key here

/* Perform processing */
final ProcessingResult result = controller.process(attributes,
    Bing7DocumentSource.class, LingoClusteringAlgorithm.class);

/* Documents fetched from the document source, clusters created by Carrot2. */
final List<Document> documents = result.getDocuments();
final List<Cluster> clusters = result.getClusters();
```

Full source code: UsingAttributes.java

Apart from clusters, some algorithms can produce additional, usually diagnostic, output. This output is present in the attribute map contained in the ProcessingResult. You can read the contents of that map directly or through the attribute map builders. The Carrot2 manual lists and describes in detail the output attributes of each component.
The code shown below clusters an example collection of Documents using the Lingo algorithm. Lingo can optionally use native platform-specific matrix computation libraries; whether such libraries were successfully loaded and used is one of the diagnostic attributes the algorithm can report. The snippet itself reads another output attribute from the result: the time taken by the clustering algorithm.

```java
/* A controller to manage the processing pipeline. */
final Controller controller = ControllerFactory.createPooling();

/* Prepare attribute map */
final Map<String, Object> attributes = new HashMap<String, Object>();
CommonAttributesDescriptor
    .attributeBuilder(attributes)
    .documents(SampleDocumentData.DOCUMENTS_DATA_MINING);
LingoClusteringAlgorithmDescriptor
    .attributeBuilder(attributes)
    .desiredClusterCountBase(15)
    .matrixReducer()
        .factorizationQuality(FactorizationQuality.HIGH);

/* Perform processing */
final ProcessingResult result = controller.process(attributes,
    LingoClusteringAlgorithm.class);

/* Clusters created by Carrot2, read processing time */
final List<Cluster> clusters = result.getClusters();
final Long clusteringTime = CommonAttributesDescriptor.attributeBuilder(
    result.getAttributes()).processingTimeAlgorithm();
```

Full source code: UsingAttributes.java

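The same information can also be read directly from the raw attribute map returned by ProcessingResult.getAttributes(), without going through a descriptor builder. A sketch follows; the literal key string below is an assumption, so please verify the exact attribute identifier in the Carrot2 manual:

```java
/* Read the output attribute directly from the raw attribute map. */
final Map<String, Object> outputAttributes = result.getAttributes();

/* "processing-time-algorithm" is an assumed key; check the manual for the exact name. */
final Object rawClusteringTime = outputAttributes.get("processing-time-algorithm");
System.out.println("Clustering time: " + rawClusteringTime + " ms");
```
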
The examples shown above used a simple controller to manage the clustering process. While the simple controller is enough for one-shot requests, for long-running applications, such as web applications, it's better to use a controller which supports pooling of processing component instances and caching of processing results.

```java
/*
 * Create the caching controller. You need only one caching controller instance
 * per application life cycle. This controller instance will cache the results
 * fetched from any document source and also clusters generated by the Lingo
 * algorithm.
 */
final Controller controller = ControllerFactory.createCachingPooling(
    IDocumentSource.class, LingoClusteringAlgorithm.class);

/*
 * Before using the caching controller, you must initialize it. On initialization,
 * you can set default values for some attributes. In this example, we'll set the
 * default results number to 50 and the API key.
 */
final Map<String, Object> globalAttributes = new HashMap<String, Object>();
CommonAttributesDescriptor
    .attributeBuilder(globalAttributes)
    .results(50);
Bing7DocumentSourceDescriptor
    .attributeBuilder(globalAttributes)
    .apiKey(BingKeyAccess.getKey()); // use your own ID here
controller.init(globalAttributes);

/*
 * The controller is now ready to perform queries. To show that the documents from
 * the document input are cached, we will perform the same query twice and measure
 * the time for each query.
 */
ProcessingResult result;
long start, duration;

final Map<String, Object> attributes;
attributes = new HashMap<String, Object>();
CommonAttributesDescriptor.attributeBuilder(attributes).query("data mining");

start = System.currentTimeMillis();
result = controller.process(attributes, Bing7DocumentSource.class,
    LingoClusteringAlgorithm.class);
duration = System.currentTimeMillis() - start;
System.out.println(duration + " ms (empty cache)");

start = System.currentTimeMillis();
result = controller.process(attributes, Bing7DocumentSource.class,
    LingoClusteringAlgorithm.class);
duration = System.currentTimeMillis() - start;
System.out.println(duration + " ms (documents and clusters from cache)");
```

Full source code: UsingCachingController.java

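A pooling or caching controller holds on to component instances and cached results for the lifetime of the application, so it should be shut down explicitly when the application stops. A minimal sketch using Controller.dispose():

```java
/*
 * On application shutdown (for a web application, e.g. in a
 * ServletContextListener.contextDestroyed() callback), release the pooled
 * component instances and cached data held by the controller.
 */
controller.dispose();
```
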
This example shows how to cluster non-English content. By default, Carrot2 assumes that the documents provided for clustering are written in English. When clustering content written in a different language, it is important to indicate that language to Carrot2, so that it can use the appropriate lexical resources (stop words, tokenizer, stemmer).

There are two ways to indicate the desired clustering language to Carrot2:
- Set the language of each document in its Document.LANGUAGE field. The language does not necessarily have to be the same for all documents on the input; Carrot2 can handle multiple languages in one document set as well. Please see the MultilingualClustering.languageAggregationStrategy attribute for more details.
- Set the fallback language. For documents with an undefined Document.LANGUAGE field, Carrot2 will assume a fallback language, which is English by default. You can change the fallback language by setting the MultilingualClustering.defaultLanguage attribute.

Additionally, some document sources set the Document.LANGUAGE field of the documents they produce based on their own language-related attributes. Currently, the following document sources support this scenario:

- Bing7DocumentSource, through the Bing7DocumentSource.market attribute,
- EToolsDocumentSource, through the EToolsDocumentSource.language attribute.

Documents fetched from other sources fall back to the language set in the MultilingualClustering.defaultLanguage attribute.

```java
/*
 * We use a Controller that reuses instances of Carrot2 processing components
 * and caches results produced by document sources.
 */
final Controller controller = ControllerFactory.createCachingPooling(IDocumentSource.class);

/*
 * In the first call, we'll cluster a document list, setting the language for each
 * document separately.
 */
final List<Document> documents = Lists.newArrayList();
for (Document document : SampleDocumentData.DOCUMENTS_DATA_MINING)
{
    documents.add(new Document(document.getTitle(), document.getSummary(),
        document.getContentUrl(), LanguageCode.ENGLISH));
}

final Map<String, Object> attributes = Maps.newHashMap();
CommonAttributesDescriptor.attributeBuilder(attributes)
    .documents(documents);
final ProcessingResult englishResult = controller.process(
    attributes, LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(englishResult);

/*
 * In the second call, we will fetch results for a Chinese query from Bing,
 * setting explicitly Bing's specific language attribute. Based on that
 * attribute, the document source will set the appropriate language for each
 * document.
 */
attributes.clear();
CommonAttributesDescriptor.attributeBuilder(attributes)
    .query("聚类" /* clustering */)
    .results(100);
Bing7DocumentSourceDescriptor.attributeBuilder(attributes)
    .market(MarketOption.CHINESE_CHINA)
    .apiKey(BingKeyAccess.getKey()); // use your own ID here!
final ProcessingResult chineseResult = controller.process(attributes,
    Bing7DocumentSource.class, LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(chineseResult);
```

Full source code: ClusteringNonEnglishContent.java

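If you cannot set Document.LANGUAGE on individual documents, the fallback language mentioned above can be changed instead. The sketch below continues the example and passes the MultilingualClustering.defaultLanguage attribute by its string key, in the direct-key style shown earlier; the literal key follows the ClassName.fieldName naming pattern used by the other keys in this document, but please verify it against the Carrot2 manual (a generated descriptor class may also be available for it). The choice of German is purely illustrative.

```java
/*
 * A sketch: cluster documents that have no per-document language set,
 * changing the fallback language from the default (English) to German.
 */
final Map<String, Object> fallbackAttributes = Maps.newHashMap();
CommonAttributesDescriptor.attributeBuilder(fallbackAttributes)
    .documents(SampleDocumentData.DOCUMENTS_DATA_MINING);

/* Direct string key for the MultilingualClustering.defaultLanguage attribute. */
fallbackAttributes.put("MultilingualClustering.defaultLanguage", LanguageCode.GERMAN);

final ProcessingResult fallbackResult = controller.process(
    fallbackAttributes, LingoClusteringAlgorithm.class);
ConsoleFormatter.displayResults(fallbackResult);
```
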