public static class TermDocumentMatrixBuilderDescriptor.AttributeBuilder extends Object

Attribute map builder for the TermDocumentMatrixBuilder component. You can use this
builder as a type-safe alternative to populating the attribute map using attribute keys.

Modifier and Type | Field and Description |
---|---|
Map<String,Object> | map: The attribute map populated by this builder. |
Modifier | Constructor and Description |
---|---|
protected | AttributeBuilder(Map<String,Object> map): Creates a builder backed by the provided map. |
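
To illustrate what "type-safe alternative" means here, the following is a minimal, self-contained sketch rather than the Carrot2 class itself. `SketchAttributeBuilder` is a hypothetical stand-in, and the key strings such as `"TermDocumentMatrixBuilder.maxWordDf"` are an assumed format; the real keys are defined by the descriptor.

```java
import java.util.HashMap;
import java.util.Map;

public class AttributeBuilderSketch {
    /** Hypothetical stand-in for the generated builder; not the Carrot2 class. */
    static class SketchAttributeBuilder {
        final Map<String, Object> map;

        SketchAttributeBuilder(Map<String, Object> map) {
            this.map = map;
        }

        SketchAttributeBuilder maxWordDf(double value) {
            // Assumed key format; the real key comes from the descriptor.
            map.put("TermDocumentMatrixBuilder.maxWordDf", value);
            return this;
        }

        SketchAttributeBuilder titleWordsBoost(double value) {
            // Assumed key format, as above.
            map.put("TermDocumentMatrixBuilder.titleWordsBoost", value);
            return this;
        }
    }

    public static void main(String[] args) {
        // Untyped: a misspelled key or a wrong value type still compiles.
        Map<String, Object> byKeys = new HashMap<>();
        byKeys.put("TermDocumentMatrixBuilder.maxWordDf", 0.9);

        // Type-safe: the compiler checks method names and argument types.
        Map<String, Object> byBuilder = new HashMap<>();
        new SketchAttributeBuilder(byBuilder).maxWordDf(0.9).titleWordsBoost(2.0);

        System.out.println(byBuilder);
    }
}
```

Both approaches fill the same kind of map; the builder only moves the spelling and type checks from run time to compile time.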
public TermDocumentMatrixBuilderDescriptor.AttributeBuilder titleWordsBoost(double value)
Boost applied to words that appear in Document.TITLE fields.

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder titleWordsBoost(org.carrot2.util.attribute.IObjectFactory<? extends Double> value)
Boost applied to words that appear in Document.TITLE fields.

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder maximumMatrixSize(int value)
public TermDocumentMatrixBuilderDescriptor.AttributeBuilder maximumMatrixSize(org.carrot2.util.attribute.IObjectFactory<? extends Integer> value)
public TermDocumentMatrixBuilderDescriptor.AttributeBuilder maxWordDf(double value)
Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such a case, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

See also: TermDocumentMatrixBuilder.maxWordDf
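
The effect of the threshold can be sketched with a small, self-contained example. This is not the Carrot2 implementation; the class `MaxWordDfSketch`, the method `retainedWords`, and the sample documents are hypothetical, and the sketch assumes that document frequency means the fraction of documents containing a word at least once.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MaxWordDfSketch {
    /**
     * Returns the words whose document frequency (fraction of documents
     * containing the word) does not exceed maxWordDf.
     */
    static Set<String> retainedWords(List<List<String>> documents, double maxWordDf) {
        Map<String, Integer> docCounts = new HashMap<>();
        for (List<String> doc : documents) {
            // Count each word at most once per document.
            for (String word : new HashSet<>(doc)) {
                docCounts.merge(word, 1, Integer::sum);
            }
        }
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            double df = (double) e.getValue() / documents.size();
            if (df <= maxWordDf) {
                retained.add(e.getKey());
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        // "acme" stands in for a company name repeated in every document.
        List<List<String>> docs = List.of(
            List.of("acme", "data", "mining"),
            List.of("acme", "web", "search"),
            List.of("acme", "data", "clustering"),
            List.of("acme", "search", "engine"),
            List.of("acme", "text", "mining"));

        System.out.println(retainedWords(docs, 1.0)); // all eight words kept
        System.out.println(retainedWords(docs, 0.9)); // "acme" (df 1.0) dropped
        System.out.println(retainedWords(docs, 0.3)); // only words found in a single document kept
    }
}
```

With the threshold at 0.9 only the ubiquitous word is removed, while a low threshold such as 0.3 keeps only rare words, which matches the "very specific clusters" use described above.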
public TermDocumentMatrixBuilderDescriptor.AttributeBuilder maxWordDf(org.carrot2.util.attribute.IObjectFactory<? extends Double> value)
Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such a case, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

See also: TermDocumentMatrixBuilder.maxWordDf
public TermDocumentMatrixBuilderDescriptor.AttributeBuilder termWeighting(ITermWeighting value)
See also: TermDocumentMatrixBuilder.termWeighting

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder termWeighting(Class<?> clazz)
See also: TermDocumentMatrixBuilder.termWeighting

public TermDocumentMatrixBuilderDescriptor.AttributeBuilder termWeighting(org.carrot2.util.attribute.IObjectFactory<? extends ITermWeighting> value)
See also: TermDocumentMatrixBuilder.termWeighting