org.carrot2.text.vsm.TermDocumentMatrixBuilder

All Implemented Interfaces: AcceptingVisitor

public class TermDocumentMatrixBuilder
extends AttrComposite

Builds a term document matrix based on the provided PreprocessingContext.

Field Summary

Fields
Modifier and Type	Field	Description
`AttrDouble`	`boostedFieldWeight`	Gives more weight to words that appeared in title fields.
`AttrStringArray`	`boostFields`	Specifies a list of field names that are boosted by the `boostedFieldWeight` attribute.
`AttrInteger`	`maximumMatrixSize`	Maximum term-document matrix size.
`AttrDouble`	`maxWordDf`	Maximum word document frequency.
`TermWeighting`	`termWeighting`	Term weighting.

Fields inherited from class org.carrot2.attrs.AttrComposite

attributes

Constructor Summary

Constructors

Constructor	Description
`TermDocumentMatrixBuilder()`

Method Summary

Modifier and Type	Method	Description
`void`	`buildTermDocumentMatrix(VectorSpaceModelContext vsmContext)`	Builds a term-document matrix from data provided in `vsmContext` and stores the result there.
`void`	`buildTermPhraseMatrix(VectorSpaceModelContext context)`	Builds a term-phrase matrix in the same space as the main term-document matrix.

Methods inherited from class org.carrot2.attrs.AttrComposite

accept

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- boostedFieldWeight
  
  public final AttrDouble boostedFieldWeight
  
  Gives more weight to words that appeared in title fields. The larger the value, the stronger the boost the title words will receive.
- boostFields
  
  public AttrStringArray boostFields
  
  Specifies a list of field names that are boosted by the boostedFieldWeight attribute. Content of fields provided in this attribute can be given more weight during clustering.
- maximumMatrixSize
  
  public final AttrInteger maximumMatrixSize
  
  Maximum term-document matrix size. Determines the maximum number of elements in the term-document matrix. A larger size makes clustering more accurate, but also more time- and memory-consuming.
- maxWordDf
  
  public final AttrDouble maxWordDf
  
  Maximum word document frequency. The maximum document frequency allowed for words, as a fraction of all documents. Words with a document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.
  This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.
  Another useful application of this attribute is when there is a need to generate only very specific clusters, that is, clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values: 0.1 or 0.05.
- termWeighting
  
  public TermWeighting termWeighting
  
  Term weighting. The method for calculating the weight of words in the term-document matrices.
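
The maxWordDf rule described above can be illustrated with a small, self-contained sketch. It is not the Carrot2 implementation: the class `MaxWordDfSketch` and the method `keptWords` are hypothetical names introduced here, and documents are assumed to be plain lists of already tokenized words. The sketch only shows the stated rule, namely that a word is ignored when the fraction of documents containing it exceeds maxWordDf.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a hypothetical helper, not part of Carrot2.
class MaxWordDfSketch {
  /**
   * Returns the words whose document frequency, expressed as a fraction
   * of all documents, does not exceed maxWordDf.
   */
  static Set<String> keptWords(List<List<String>> documents, double maxWordDf) {
    Map<String, Integer> df = new HashMap<>();
    for (List<String> doc : documents) {
      // Count each word at most once per document.
      for (String word : new HashSet<>(doc)) {
        df.merge(word, 1, Integer::sum);
      }
    }
    Set<String> kept = new TreeSet<>();
    for (Map.Entry<String, Integer> e : df.entrySet()) {
      double fraction = (double) e.getValue() / documents.size();
      if (fraction <= maxWordDf) {
        kept.add(e.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // "acme" plays the role of a company name repeated in every document.
    List<List<String>> docs = List.of(
        List.of("acme", "data", "mining"),
        List.of("acme", "text", "clustering"),
        List.of("acme", "data", "clustering"),
        List.of("acme", "search"),
        List.of("acme", "web", "search"));
    System.out.println(keptWords(docs, 0.4)); // "acme" (5 of 5 documents) is dropped
    System.out.println(keptWords(docs, 1.0)); // every word is kept
  }
}
```

With a threshold of 0.4, words found in exactly 40% of the documents are still kept, which matches the "more than 40%" wording above; only the word present in every document is filtered out.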
Constructor Details
- TermDocumentMatrixBuilder
  
  public TermDocumentMatrixBuilder()
Method Details
- buildTermDocumentMatrix
  
  public void buildTermDocumentMatrix(VectorSpaceModelContext vsmContext)
  
  Builds a term-document matrix from data provided in vsmContext and stores the result there.
- buildTermPhraseMatrix
  
  public void buildTermPhraseMatrix(VectorSpaceModelContext context)
  
  Builds a term-phrase matrix in the same space as the main term-document matrix. If the processing context contains no phrases, VectorSpaceModelContext.termPhraseMatrix will remain null.
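
To make the idea of a term-document matrix and its size limit concrete, here is a minimal, self-contained sketch. It does not use or reproduce Carrot2 code: the class `TermDocumentMatrixSketch` is hypothetical, raw term counts stand in for whatever termWeighting would compute, field boosting is ignored, and the way the element limit is turned into a row limit (maximum size divided by the number of documents, keeping the most frequent terms) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical class, not part of Carrot2.
class TermDocumentMatrixSketch {
  final List<String> terms; // row labels
  final double[][] matrix;  // matrix[row][column] = count of the term in the document

  TermDocumentMatrixSketch(List<List<String>> documents, int maximumMatrixSize) {
    int docCount = documents.size();

    // Total number of occurrences of each term across all documents.
    Map<String, Integer> total = new HashMap<>();
    for (List<String> doc : documents) {
      for (String word : doc) {
        total.merge(word, 1, Integer::sum);
      }
    }

    // Assumption: the most frequent terms are kept, ties broken alphabetically.
    List<String> ordered = new ArrayList<>(total.keySet());
    ordered.sort(
        Comparator.comparingInt((String t) -> total.get(t))
            .reversed()
            .thenComparing(Comparator.<String>naturalOrder()));

    // Assumption: rows * columns must not exceed maximumMatrixSize.
    int maxRows = docCount == 0 ? 0 : Math.max(0, maximumMatrixSize / docCount);
    terms = List.copyOf(ordered.subList(0, Math.min(maxRows, ordered.size())));

    // One row per kept term, one column per document.
    matrix = new double[terms.size()][docCount];
    for (int d = 0; d < docCount; d++) {
      for (String word : documents.get(d)) {
        int row = terms.indexOf(word);
        if (row >= 0) {
          matrix[row][d] += 1.0;
        }
      }
    }
  }
}
```

In this sketch, three documents and a limit of 6 elements leave room for two term rows, so the least frequent terms are left out of the matrix. This is the trade-off that maximumMatrixSize describes: a larger limit admits more terms at a higher time and memory cost.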
