@Bindable(prefix="TermDocumentMatrixBuilder") public class TermDocumentMatrixBuilder extends Object
Builds a term-document matrix from a PreprocessingContext.

Modifier and Type | Field and Description |
---|---|
static String | MATRIX_MODEL: Group name. |
int | maximumMatrixSize: Maximum matrix size. |
double | maxWordDf: Maximum word document frequency. |
ITermWeighting | termWeighting: Term weighting. |
double | titleWordsBoost: Title word boost. |
Constructor and Description |
---|
TermDocumentMatrixBuilder() |
Modifier and Type | Method and Description |
---|---|
void | buildTermDocumentMatrix(VectorSpaceModelContext vsmContext): Builds a term-document matrix from data provided in the context and stores the result there. |
void | buildTermPhraseMatrix(VectorSpaceModelContext context): Builds a term-phrase matrix in the same space as the main term-document matrix. |
public static final String MATRIX_MODEL
Group name.
@Input @Processing @Attribute @Level(value=MEDIUM) @Group(value="Labels") public double titleWordsBoost
Title word boost. Gives more weight to words that appear in Document.TITLE fields.
@Input @Processing @Attribute @IntRange(min=5000) @Internal(configuration=true) @Level(value=ADVANCED) @Group(value="Matrix model") public int maximumMatrixSize
Maximum matrix size.
@Input @Processing @Attribute @Level(value=ADVANCED) @Group(value="Matrix model") public double maxWordDf
Maximum word document frequency. Words whose document frequency, as a fraction of all documents, is larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.
This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such a case, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.
Another useful application of this attribute is generating only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to very low values, e.g. 0.1 or 0.05.
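The effect of maxWordDf can be illustrated with a minimal, library-independent sketch. It is not the builder's actual implementation: the class MaxWordDfSketch, the method retainedWords, and the sample documents are hypothetical names and data introduced only for illustration, and documents are assumed to be already tokenized into lowercase words.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the documented API.
public class MaxWordDfSketch {

    /**
     * Returns the words whose document frequency, expressed as a fraction
     * of all documents, does not exceed maxWordDf (first-seen order).
     */
    static List<String> retainedWords(List<List<String>> documents, double maxWordDf) {
        // Count the number of documents each word occurs in (once per document).
        Map<String, Integer> documentFrequency = new LinkedHashMap<>();
        for (List<String> document : documents) {
            for (String word : new LinkedHashSet<>(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }

        // Keep only the words that are not "too common".
        List<String> retained = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            double fraction = (double) entry.getValue() / documents.size();
            if (fraction <= maxWordDf) {
                retained.add(entry.getKey());
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        // "acme" plays the role of a company name repeated in every document.
        List<List<String>> documents = List.of(
            List.of("acme", "search", "engine"),
            List.of("acme", "cluster", "labels"),
            List.of("acme", "search", "results"),
            List.of("acme", "matrix"),
            List.of("acme", "labels"));

        // 0.9 drops only "acme", which appears in 100% of the documents.
        System.out.println(retainedWords(documents, 0.9));
        // 0.25 additionally drops "search" and "labels" (40% each).
        System.out.println(retainedWords(documents, 0.25));
    }
}
```

With a threshold of 1.0 every word is retained, which matches the description of 1.0 above.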
@Input @Processing @Attribute @Required @Level(value=ADVANCED) @Group(value="Matrix model") public ITermWeighting termWeighting
Term weighting.
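public void buildTermDocumentMatrix(VectorSpaceModelContext vsmContext)
Builds a term-document matrix from data provided in the context and stores the result there.
public void buildTermPhraseMatrix(VectorSpaceModelContext context)
Builds a term-phrase matrix in the same space as the main term-document matrix. VectorSpaceModelContext.termPhraseMatrix may remain null after this call.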
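To make the idea of a term-document matrix concrete, the following self-contained sketch builds a small one by hand. It does not use or reproduce this class: TermDocumentMatrixSketch, Doc, vocabulary and build are hypothetical names, raw term counts stand in for whatever ITermWeighting is configured, and applying titleWordsBoost as a simple multiplier on words that occur in the title is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration; not part of the documented API.
public class TermDocumentMatrixSketch {

    /** A tokenized document with separate title and body fields. */
    static final class Doc {
        final List<String> title;
        final List<String> body;

        Doc(List<String> title, List<String> body) {
            this.title = title;
            this.body = body;
        }
    }

    /** Collects all distinct words in first-seen order; each becomes a matrix row. */
    static List<String> vocabulary(List<Doc> docs) {
        LinkedHashSet<String> terms = new LinkedHashSet<>();
        for (Doc doc : docs) {
            terms.addAll(doc.title);
            terms.addAll(doc.body);
        }
        return new ArrayList<>(terms);
    }

    /** Rows are terms, columns are documents, cells are (boosted) term counts. */
    static double[][] build(List<Doc> docs, List<String> terms, double titleWordsBoost) {
        double[][] matrix = new double[terms.size()][docs.size()];
        for (int d = 0; d < docs.size(); d++) {
            Doc doc = docs.get(d);
            // Raw term frequency over title and body (assumed weighting).
            List<String> allWords = new ArrayList<>(doc.title);
            allWords.addAll(doc.body);
            for (String word : allWords) {
                int row = terms.indexOf(word);
                if (row >= 0) {
                    matrix[row][d] += 1.0;
                }
            }
            // Assumed boost: multiply once for each distinct title word.
            for (String word : new LinkedHashSet<>(doc.title)) {
                int row = terms.indexOf(word);
                if (row >= 0) {
                    matrix[row][d] *= titleWordsBoost;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc(List.of("data", "mining"),
                    List.of("mining", "large", "data", "sets", "data")),
            new Doc(List.of("web", "search"),
                    List.of("search", "results", "clustering")));
        List<String> terms = vocabulary(docs);
        double[][] matrix = build(docs, terms, 2.0);
        for (int t = 0; t < terms.size(); t++) {
            System.out.println(terms.get(t) + " " + java.util.Arrays.toString(matrix[t]));
        }
    }
}
```

In this toy data, "data" occurs three times in the first document and appears in its title, so with a boost of 2.0 its cell is 6.0, while body-only words such as "large" keep their raw count of 1.0.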