Package org.carrot2.text.vsm
Class TermDocumentMatrixBuilder
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.vsm.TermDocumentMatrixBuilder
All Implemented Interfaces:
AcceptingVisitor
public class TermDocumentMatrixBuilder extends AttrComposite
Builds a term-document matrix based on the provided PreprocessingContext.
Field Summary
Fields
Modifier and Type  Field               Description
AttrDouble         boostedFieldWeight  Gives more weight to words that appeared in title fields.
AttrStringArray    boostFields         Specifies a list of field names that are boosted by the boostedFieldWeight attribute.
AttrInteger        maximumMatrixSize   Maximum term-document matrix size.
AttrDouble         maxWordDf           Maximum word document frequency.
TermWeighting      termWeighting       Term weighting.
Constructor Summary
Constructors
Constructor                  Description
TermDocumentMatrixBuilder()
Method Summary
Modifier and Type  Method                                                       Description
void               buildTermDocumentMatrix(VectorSpaceModelContext vsmContext)  Builds a term-document matrix from data provided in the context and stores the result there.
void               buildTermPhraseMatrix(VectorSpaceModelContext context)       Builds a term-phrase matrix in the same space as the main term-document matrix.
Field Details
boostedFieldWeight
Gives more weight to words that appeared in title fields. The larger the value, the stronger the boost that title words receive.
boostFields
Specifies a list of field names that are boosted by the boostedFieldWeight attribute. Content of the fields listed in this attribute can be given more weight during clustering.
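The interaction between boostFields and boostedFieldWeight can be pictured with a minimal sketch. It is not the library's implementation: the class FieldBoostSketch, the method weightedCounts and the rule "each occurrence in a boosted field counts boostedFieldWeight times" are all assumptions made for illustration, and the code uses only the Java standard library.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Carrot2.
class FieldBoostSketch {
    /**
     * Counts word occurrences across all fields of one document.
     * An occurrence in a boosted field contributes boostedFieldWeight,
     * an occurrence in any other field contributes 1.0 (assumed rule).
     */
    static Map<String, Double> weightedCounts(Map<String, List<String>> fields,
                                              Set<String> boostFields,
                                              double boostedFieldWeight) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            double perOccurrence =
                boostFields.contains(field.getKey()) ? boostedFieldWeight : 1.0;
            for (String word : field.getValue()) {
                weights.merge(word, perOccurrence, Double::sum);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = Map.of(
            "title", List.of("matrix", "builder"),
            "body", List.of("matrix", "size", "size"));
        // "title" is the only boosted field; its words count double here.
        System.out.println(weightedCounts(doc, Set.of("title"), 2.0));
    }
}
```

With a weight of 2.0, a word seen once in the title and once in the body ends up with the same total as a word seen three times in the body, which is the kind of effect the two attributes describe.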
maximumMatrixSize
Maximum term-document matrix size. Determines the maximum number of elements in the term-document matrix. The larger the size, the more accurate the clustering, but the more time and memory it consumes.
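Because the matrix has one column per document, a cap on the number of elements bounds how many term rows fit. The sketch below shows only that arithmetic. The class MatrixSizeSketch and the method maxTermRows are hypothetical, and whether the builder enforces the limit this way is an assumption.

```java
// Hypothetical illustration only; none of these names come from Carrot2.
class MatrixSizeSketch {
    /**
     * Largest number of term rows such that rows * documentCount
     * does not exceed the element budget (integer division rounds down).
     */
    static int maxTermRows(int maximumMatrixSize, int documentCount) {
        if (documentCount <= 0) {
            throw new IllegalArgumentException("documentCount must be positive");
        }
        return maximumMatrixSize / documentCount;
    }

    public static void main(String[] args) {
        // An arbitrary example budget of 37500 elements and 250 documents.
        System.out.println(maxTermRows(37500, 250));
    }
}
```

In this example the budget leaves room for 150 term rows. Doubling the number of documents under the same budget halves the number of terms that can be represented.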
maxWordDf
Maximum word document frequency. The maximum document frequency allowed for words, expressed as a fraction of all documents. Words with a document frequency larger than maxWordDf are ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents are ignored. A value of 1.0 means that all words are taken into account, no matter how many documents they appear in.
This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.
Another useful application of this attribute is generating only very specific clusters, that is, clusters containing small numbers of documents. This can be achieved by setting maxWordDf to very low values, such as 0.1 or 0.05.
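The document-frequency cut-off described for maxWordDf can be sketched with the Java standard library alone. The class MaxWordDfSketch and the method retainedWords are hypothetical names, and representing each document as a set of its distinct words is an assumption of this sketch, not a description of the library's internals.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Carrot2.
class MaxWordDfSketch {
    /**
     * Returns the words whose document frequency, as a fraction of all
     * documents, does not exceed maxWordDf. Each document is given as
     * the set of distinct words it contains.
     */
    static Set<String> retainedWords(List<Set<String>> documents, double maxWordDf) {
        // Count in how many documents each word appears.
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (Set<String> document : documents) {
            for (String word : document) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        // Keep only words at or below the allowed fraction.
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            double fraction = entry.getValue() / (double) documents.size();
            if (fraction <= maxWordDf) {
                retained.add(entry.getKey());
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        List<Set<String>> documents = List.of(
            Set.of("acme", "java"),
            Set.of("acme", "java"),
            Set.of("acme", "rust"),
            Set.of("acme", "rust"),
            Set.of("acme", "rust"));
        // "acme" appears in every document, much like a header or footer word.
        System.out.println(retainedWords(documents, 0.9));
        System.out.println(retainedWords(documents, 0.4));
    }
}
```

In this sample, a threshold of 0.9 drops only the word present in every document, while 0.4 also drops the word present in three of the five documents.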
termWeighting
Term weighting. The method for calculating weight of words in the term-document matrices.
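As an illustration of what a word-weighting method computes, the sketch below implements TF-IDF, one widely used scheme. Which schemes the TermWeighting type actually offers is not stated here, so the class TfIdfSketch, the method weight and the exact formula are assumptions rather than the library's behavior.

```java
// Hypothetical illustration only; none of these names come from Carrot2.
class TfIdfSketch {
    /**
     * Weight of a word in one document: its term frequency in that
     * document times the natural log of (documentCount / documentFrequency).
     * Words found in every document get a weight of zero.
     */
    static double weight(int termFrequency, int documentFrequency, int documentCount) {
        if (documentFrequency <= 0 || documentFrequency > documentCount) {
            throw new IllegalArgumentException(
                "documentFrequency must be between 1 and documentCount");
        }
        return termFrequency * Math.log(documentCount / (double) documentFrequency);
    }

    public static void main(String[] args) {
        // A word used twice in a document and present in 5 of 10 documents.
        System.out.println(weight(2, 5, 10));
        // A word present in all 10 documents carries no weight.
        System.out.println(weight(2, 10, 10));
    }
}
```

Each such weight would fill one cell of a term-document matrix: the row is the word, the column is the document.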
Constructor Details
TermDocumentMatrixBuilder
public TermDocumentMatrixBuilder()
Method Details
buildTermDocumentMatrix
Builds a term-document matrix from data provided in the context and stores the result there.
buildTermPhraseMatrix
Builds a term-phrase matrix in the same space as the main term-document matrix. If the processing context contains no phrases, VectorSpaceModelContext.termPhraseMatrix will remain null.