public final class Cluster extends Object
A cluster (group) of Documents. Each cluster has a human-readable label consisting of one or more phrases, a list of documents it contains and a list of its subclusters. Optionally, additional attributes can be associated with a cluster, e.g. OTHER_TOPICS. This class is not thread-safe.

Modifier and Type | Field and Description |
---|---|
static Comparator<Cluster> | BY_LABEL_COMPARATOR: Compares clusters by the natural order of their labels as returned by getLabel(). |
static Comparator<Cluster> | BY_REVERSED_SCORE_AND_LABEL_COMPARATOR: Compares clusters first by their score as returned by SCORE and then by labels as returned by getLabel(). |
static Comparator<Cluster> | BY_REVERSED_SIZE_AND_LABEL_COMPARATOR: Compares clusters first by their size as returned by size() and then by labels as returned by getLabel(). |
static Comparator<Cluster> | BY_SCORE_COMPARATOR: Compares clusters by score as returned by SCORE. |
static Comparator<Cluster> | BY_SIZE_COMPARATOR: Compares clusters by size as returned by size(). |
static String | OTHER_TOPICS: Indicates that the cluster is an Other Topics cluster. |
static Comparator<Cluster> | OTHER_TOPICS_AT_THE_END: A comparator that puts OTHER_TOPICS clusters at the end of the list. |
static String | OTHER_TOPICS_LABEL: Default label for the Other Topics cluster. |
static String | SCORE: Score of this cluster that indicates the clustering algorithm's beliefs on the quality of this cluster. |

Constructor and Description |
---|
Cluster(): Creates a Cluster with an empty label, no documents and no subclusters. |
Cluster(Integer id, String phrase, Document... documents): Same as Cluster(String,Document...) but allows specifying the cluster identifier. |
Cluster(String phrase, Document... documents): Creates a Cluster with the provided phrase to be used as the cluster's label and documents contained in the cluster. |

Modifier and Type | Method and Description |
---|---|
Cluster | addDocument(Document document): Adds a single document to this cluster. Optimized for a single document instead of a vararg. |
Cluster | addDocuments(Document... documents): Adds documents to this cluster. |
Cluster | addDocuments(Iterable<Document> documents): Adds documents to this cluster. |
Cluster | addPhrases(Iterable<String> phrases): Adds phrases to the description of this cluster. |
Cluster | addPhrases(String... phrases): Adds phrases to the description of this cluster. |
Cluster | addSubcluster(Cluster cluster): Adds a subcluster to this cluster. |
Cluster | addSubclusters(Cluster... subclusters): Adds subclusters to this cluster. |
Cluster | addSubclusters(Iterable<Cluster> clusters): Adds subclusters to this cluster. |
static Cluster | appendOtherTopics(List<Document> allDocuments, List<Cluster> clusters): If there are unclustered documents, appends the "Other Topics" group to the clusters. |
static Cluster | appendOtherTopics(List<Document> allDocuments, List<Cluster> clusters, String label): If there are unclustered documents, appends the "Other Topics" group to the clusters. |
static void | assignClusterIds(Collection<Cluster> clusters): Assigns sequential identifiers to the provided clusters (and their sub-clusters). |
static Cluster | buildOtherTopics(List<Document> allDocuments, List<Cluster> clusters): Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters. |
static Cluster | buildOtherTopics(List<Document> allDocuments, List<Cluster> clusters, String label): Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters. |
static Comparator<Cluster> | byReversedWeightedScoreAndSizeComparator(double scoreWeight): Returns a comparator that compares clusters based on the aggregation of their size and score. |
static Cluster | find(int id, Collection<Cluster> clusters): Locates the first cluster that has id equal to id. |
static List<Cluster> | flatten(Collection<Cluster> hierarchical): Flattens a hierarchy of clusters into a flat list. |
List<Document> | getAllDocuments(): Returns all documents contained in this cluster and (recursively) all documents from this cluster's subclusters. |
List<Document> | getAllDocuments(Comparator<Document> comparator): Returns all documents in this cluster ordered according to the provided comparator. |
<T> T | getAttribute(String key): Returns the attribute associated with this cluster under the provided key. |
Map<String,Object> | getAttributes(): Returns all attributes of this cluster. |
List<Document> | getDocuments(): Returns all documents contained in this cluster. |
Integer | getId(): Internal identifier of this cluster within the ProcessingResult. |
String | getLabel(): Formats this cluster's label. |
List<String> | getPhrases(): Returns all phrases describing this cluster. |
Double | getScore(): Returns this cluster's "score" field. |
List<Cluster> | getSubclusters(): Returns all subclusters of this cluster. |
boolean | isOtherTopics(): Returns true if this cluster is the OTHER_TOPICS cluster. |
void | remapDocumentReferences(IdentityHashMap<Document,Document> docMapping): An extremely dodgy method that remaps Document references inside this cluster. |
<T> Cluster | removeAttribute(String key): Unconditionally removes an attribute from this cluster, if it exists. |
<T> Cluster | setAttribute(String key, T value): Associates an attribute with this cluster. |
Cluster | setOtherTopics(boolean isOtherTopics): Sets the OTHER_TOPICS attribute of this cluster. |
Cluster | setScore(Double score): Sets this cluster's SCORE field. |
int | size(): Returns the size of the cluster calculated as the number of unique documents it contains, including its subclusters. |
String | toString() |
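
public static final String OTHER_TOPICS
Indicates that the cluster is an Other Topics cluster. Type of this attribute is Boolean.

public static final String OTHER_TOPICS_LABEL
Default label for the Other Topics cluster.

public static final String SCORE
Score of this cluster that indicates the clustering algorithm's beliefs on the quality of this cluster. Type of this attribute is Double.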

public static final Comparator<Cluster> BY_SIZE_COMPARATOR
Compares clusters by size as returned by size(). Clusters with more documents are larger.

public static final Comparator<Cluster> BY_SCORE_COMPARATOR
Compares clusters by score as returned by SCORE. Clusters with a higher score are larger.

public static final Comparator<Cluster> BY_LABEL_COMPARATOR
Compares clusters by the natural order of their labels as returned by getLabel().

public static final Comparator<Cluster> BY_REVERSED_SIZE_AND_LABEL_COMPARATOR
Compares clusters first by their size as returned by size() and then by labels as returned by getLabel(). In case of equal sizes, the natural order of the labels decides.
Please note: this is a reversed comparator, so "larger" clusters end up nearer the beginning of the list being sorted (which is usually the order in which applications want to display clusters).

public static final Comparator<Cluster> BY_REVERSED_SCORE_AND_LABEL_COMPARATOR
Compares clusters first by their score as returned by SCORE and then by labels as returned by getLabel(). In case of equal scores, the natural order of the labels decides.
Please note: this is a reversed comparator, so "larger" clusters end up nearer the beginning of the list being sorted (which is usually the order in which applications want to display clusters).

public static final Comparator<Cluster> OTHER_TOPICS_AT_THE_END
A comparator that puts OTHER_TOPICS clusters at the end of the list. In other words, to this comparator an OTHER_TOPICS cluster is "bigger" than a non-OTHER_TOPICS cluster.
Note: This comparator is designed for use in combination with other comparators, such as BY_REVERSED_SIZE_AND_LABEL_COMPARATOR. If you only need to partition a list of clusters into regular and Other Topics ones, this is better done in linear time without resorting to Collections.sort(List).
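
The linear-time partition mentioned in the note can be sketched as follows. This is only an illustration, not library code: `OtherTopicsPartition`, `Item` and `otherTopicsLast` are hypothetical names, and `Item` is a minimal stand-in for Cluster that carries just a label and the Other Topics flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OtherTopicsPartition {
    /** Hypothetical stand-in for Cluster: a label plus the Other Topics flag. */
    static final class Item {
        final String label;
        final boolean otherTopics;

        Item(String label, boolean otherTopics) {
            this.label = label;
            this.otherTopics = otherTopics;
        }
    }

    /** Single pass, stable: regular items keep their order, Other Topics items go last. */
    static List<Item> otherTopicsLast(List<Item> items) {
        List<Item> regular = new ArrayList<>();
        List<Item> other = new ArrayList<>();
        for (Item item : items) {
            if (item.otherTopics) {
                other.add(item);
            } else {
                regular.add(item);
            }
        }
        regular.addAll(other);
        return regular;
    }

    public static void main(String[] args) {
        List<Item> result = otherTopicsLast(Arrays.asList(
            new Item("Other Topics", true),
            new Item("Data Mining", false),
            new Item("Web Search", false)));
        for (Item item : result) {
            System.out.println(item.label);
        }
    }
}
```

With the real class, the flag would presumably come from isOtherTopics(); the two-list approach visits each element once and avoids the O(n log n) cost of a sort.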

public Cluster()
Creates a Cluster with an empty label, no documents and no subclusters.

public Cluster(String phrase, Document... documents)
Creates a Cluster with the provided phrase to be used as the cluster's label and documents contained in the cluster.
phrase - the phrase to form the cluster's label
documents - documents contained in the cluster

public Cluster(Integer id, String phrase, Document... documents)
Same as Cluster(String,Document...) but allows specifying the cluster identifier.

public String getLabel()
Formats this cluster's label. The phrases that make up the label are available from getPhrases().

public List<String> getPhrases()
Returns all phrases describing this cluster.

public List<Cluster> getSubclusters()
Returns all subclusters of this cluster.

public List<Document> getDocuments()
Returns all documents contained in this cluster.
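
public List<Document> getAllDocuments()
Returns all documents contained in this cluster and (recursively) all documents from this cluster's subclusters. Documents returned by getDocuments() come first, followed by documents from subclusters.

public List<Document> getAllDocuments(Comparator<Document> comparator)
Returns all documents in this cluster ordered according to the provided comparator. See Document for common comparators.

public Cluster addPhrases(String... phrases)
Adds phrases to the description of this cluster.
phrases - to be added to the description of this cluster

public Cluster addPhrases(Iterable<String> phrases)
Adds phrases to the description of this cluster.
phrases - to be added to the description of this cluster

public Cluster addDocuments(Document... documents)
Adds documents to this cluster.
documents - to be added to this cluster

public Cluster addDocument(Document document)
Adds a single document to this cluster. Optimized for a single document instead of a vararg.
See also: addDocuments(Document...)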

public Cluster addDocuments(Iterable<Document> documents)
Adds documents to this cluster.
documents - to be added to this cluster

public Cluster addSubclusters(Cluster... subclusters)
Adds subclusters to this cluster.
subclusters - to be added to this cluster

public Cluster addSubcluster(Cluster cluster)
Adds a subcluster to this cluster.
See also: addSubclusters(Cluster...)

public Cluster addSubclusters(Iterable<Cluster> clusters)
Adds subclusters to this cluster.
clusters - to be added to this cluster

public Cluster setScore(Double score)
Sets this cluster's SCORE field.
score - score to set

public <T> T getAttribute(String key)
Returns the attribute associated with this cluster under the provided key. If there is no attribute under the provided key, null will be returned.
key - of the attribute
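
public <T> Cluster setAttribute(String key, T value)
Associates an attribute with this cluster.
key - for the attribute
value - for the attribute

public <T> Cluster removeAttribute(String key)
Unconditionally removes an attribute from this cluster, if it exists.

public Map<String,Object> getAttributes()
Returns all attributes of this cluster.

public int size()
Returns the size of the cluster calculated as the number of unique documents it contains, including its subclusters.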
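
To make the "unique documents, including subclusters" rule concrete, here is a minimal sketch. It is not the library's implementation: `ClusterSizeSketch` and `Node` are hypothetical, and plain string identifiers stand in for Document instances (how the real class decides that two documents are the same is not stated here).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClusterSizeSketch {
    /** Hypothetical stand-in for Cluster; document ids stand in for Document objects. */
    static final class Node {
        final List<String> documentIds = new ArrayList<>();
        final List<Node> subclusters = new ArrayList<>();
    }

    /** Counts each document once, even if it appears in several (sub)clusters. */
    static int size(Node node) {
        Set<String> unique = new HashSet<>();
        collect(node, unique);
        return unique.size();
    }

    private static void collect(Node node, Set<String> into) {
        into.addAll(node.documentIds);
        for (Node sub : node.subclusters) {
            collect(sub, into);
        }
    }

    public static void main(String[] args) {
        Node parent = new Node();
        parent.documentIds.add("d1");
        parent.documentIds.add("d2");

        Node child = new Node();
        child.documentIds.add("d2"); // also present in the parent
        child.documentIds.add("d3");
        parent.subclusters.add(child);

        System.out.println(size(parent)); // prints 3 (d1, d2, d3)
    }
}
```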

public Integer getId()
Internal identifier of this cluster within the ProcessingResult. This identifier is assigned dynamically after clusters are passed to ProcessingResult.
See also: ProcessingResult

public boolean isOtherTopics()
Returns true if this cluster is the OTHER_TOPICS cluster.

public Cluster setOtherTopics(boolean isOtherTopics)
Sets the OTHER_TOPICS attribute of this cluster.
isOtherTopics - if true, this cluster will be marked as an Other Topics cluster.

public static Comparator<Cluster> byReversedWeightedScoreAndSizeComparator(double scoreWeight)
Returns a comparator that compares clusters based on the aggregation of their size and score. If scoreWeight is 0.0, the order depends only on cluster sizes. If scoreWeight is 1.0, the order depends only on cluster scores. For scoreWeight values between 0.0 and 1.0, the higher the scoreWeight, the more cluster scores contribute to the order. In case of a tie on the aggregated cluster size and score, clusters are compared by the natural order of their labels.
Please note: this is a reversed comparator, so "larger" clusters end up nearer the beginning of the list being sorted (which is usually the order in which applications want to display clusters).
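
The page does not state the aggregation formula, so the following sketch assumes one that matches the described behaviour: a weighted geometric mean, size^(1 - scoreWeight) * score^scoreWeight. `WeightedOrderSketch`, `Item` and `aggregate` are hypothetical names, and the library may combine size and score differently.

```java
import java.util.Comparator;

public class WeightedOrderSketch {
    /** Hypothetical stand-in for Cluster with a label, a size and a score. */
    static final class Item {
        final String label;
        final int size;
        final double score;

        Item(String label, int size, double score) {
            this.label = label;
            this.size = size;
            this.score = score;
        }
    }

    /** Assumed aggregation: only size counts at weight 0.0, only score at weight 1.0. */
    static double aggregate(Item item, double scoreWeight) {
        return Math.pow(item.size, 1.0 - scoreWeight) * Math.pow(item.score, scoreWeight);
    }

    /** Reversed order on the aggregate (larger first), ties broken by label. */
    static Comparator<Item> byReversedWeightedScoreAndSize(double scoreWeight) {
        if (scoreWeight < 0.0 || scoreWeight > 1.0) {
            throw new IllegalArgumentException("scoreWeight must be between 0.0 and 1.0");
        }
        Comparator<Item> byAggregate =
            Comparator.comparingDouble(item -> aggregate(item, scoreWeight));
        return byAggregate.reversed().thenComparing(item -> item.label);
    }

    public static void main(String[] args) {
        Item big = new Item("big", 9, 1.0);
        Item strong = new Item("strong", 2, 5.0);
        // Negative result: "big" sorts first when only size matters.
        System.out.println(byReversedWeightedScoreAndSize(0.0).compare(big, strong) < 0);
        // Positive result: "strong" sorts first when only score matters.
        System.out.println(byReversedWeightedScoreAndSize(1.0).compare(big, strong) > 0);
    }
}
```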

public static void assignClusterIds(Collection<Cluster> clusters)
Assigns sequential identifiers to the provided clusters (and their sub-clusters). If a cluster already has an identifier, that identifier will not be changed, but all clusters must have unique identifiers.
clusters - Clusters to assign identifiers to.
IllegalArgumentException - if the provided clusters contain non-unique identifiers.

public static List<Cluster> flatten(Collection<Cluster> hierarchical)
Flattens a hierarchy of clusters into a flat list.
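
Flattening a hierarchy can be illustrated with a short recursive sketch. The traversal order is an assumption (each cluster is followed immediately by its own subclusters); the page does not say which order the library uses. `FlattenSketch` and `Node` are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class FlattenSketch {
    /** Hypothetical stand-in for Cluster: a label and a list of subclusters. */
    static final class Node {
        final String label;
        final List<Node> subclusters = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** Pre-order walk: every node is emitted before its subclusters. */
    static List<Node> flatten(Collection<Node> hierarchical) {
        List<Node> flat = new ArrayList<>();
        for (Node node : hierarchical) {
            flat.add(node);
            flat.addAll(flatten(node.subclusters));
        }
        return flat;
    }

    public static void main(String[] args) {
        Node a = new Node("A");
        Node b = new Node("B");
        Node c = new Node("C");
        Node d = new Node("D");
        a.subclusters.add(b);
        b.subclusters.add(c);

        List<Node> top = new ArrayList<>();
        top.add(a);
        top.add(d);
        for (Node node : flatten(top)) {
            System.out.println(node.label);
        }
    }
}
```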

public static Cluster find(int id, Collection<Cluster> clusters)
Locates the first cluster that has id equal to id. The search includes all the clusters in the input and their sub-clusters. The first cluster with a matching identifier is returned, or null if no such cluster could be found.

public static Cluster buildOtherTopics(List<Document> allDocuments, List<Cluster> clusters)
Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters.
allDocuments - all documents to check against
clusters - list of clusters with assigned documents

public static Cluster buildOtherTopics(List<Document> allDocuments, List<Cluster> clusters, String label)
Builds an "Other Topics" cluster that groups those documents from allDocuments that were not referenced in any cluster in clusters.
allDocuments - all documents to check against
clusters - list of clusters with assigned documents
label - label for the "Other Topics" group

public static Cluster appendOtherTopics(List<Document> allDocuments, List<Cluster> clusters)
If there are unclustered documents, appends the "Other Topics" group to the clusters.
See also: buildOtherTopics(List, List)

public static Cluster appendOtherTopics(List<Document> allDocuments, List<Cluster> clusters, String label)
If there are unclustered documents, appends the "Other Topics" group to the clusters.
See also: buildOtherTopics(List, List, String)
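
The build and append steps for the "Other Topics" group can be sketched as below. This is an illustration under stated assumptions, not the library's code: `OtherTopicsSketch`, `Doc` and `Node` are hypothetical stand-ins, documents are matched by object identity, and documents referenced only in subclusters are treated as clustered. The real methods may differ on each of these points.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class OtherTopicsSketch {
    /** Hypothetical stand-in for Document. */
    static final class Doc {
        final String title;

        Doc(String title) {
            this.title = title;
        }
    }

    /** Hypothetical stand-in for Cluster. */
    static final class Node {
        final String label;
        final List<Doc> documents = new ArrayList<>();
        final List<Node> subclusters = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** Groups documents from allDocuments that no cluster (or subcluster) references. */
    static Node buildOtherTopics(List<Doc> allDocuments, List<Node> clusters, String label) {
        // Identity-based set: an assumption about how documents are matched.
        Set<Doc> referenced = Collections.newSetFromMap(new IdentityHashMap<Doc, Boolean>());
        for (Node cluster : clusters) {
            collectReferenced(cluster, referenced);
        }
        Node other = new Node(label);
        for (Doc doc : allDocuments) {
            if (!referenced.contains(doc)) {
                other.documents.add(doc);
            }
        }
        return other;
    }

    /** Appends the group to clusters only if some documents remain unclustered. */
    static Node appendOtherTopics(List<Doc> allDocuments, List<Node> clusters, String label) {
        Node other = buildOtherTopics(allDocuments, clusters, label);
        if (!other.documents.isEmpty()) {
            clusters.add(other);
        }
        return other;
    }

    private static void collectReferenced(Node cluster, Set<Doc> into) {
        into.addAll(cluster.documents);
        for (Node sub : cluster.subclusters) {
            collectReferenced(sub, into);
        }
    }

    public static void main(String[] args) {
        Doc d1 = new Doc("first");
        Doc d2 = new Doc("second");
        List<Doc> all = new ArrayList<>();
        all.add(d1);
        all.add(d2);

        Node topic = new Node("Topic");
        topic.documents.add(d1);
        List<Node> clusters = new ArrayList<>();
        clusters.add(topic);

        Node other = appendOtherTopics(all, clusters, "Other Topics");
        System.out.println(other.documents.size()); // prints 1 (only "second" is unclustered)
        System.out.println(clusters.size());        // prints 2
    }
}
```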

public void remapDocumentReferences(IdentityHashMap<Document,Document> docMapping)
An extremely dodgy method that remaps Document references inside this cluster. This operation is allowed only when the cluster has not been assigned an ID yet (so theoretically before the ProcessingResult has been published). While there are theoretically other ways to achieve the same result (copying the entire set of clusters), this is the most memory- and CPU-efficient way.
Only documents from this cluster are remapped; subclusters need to be processed separately.