How do we decide the number of dimensions for Latent Semantic Analysis?


Question

I have been working on latent semantic analysis lately. I have implemented it in java by making use of the Jama package.

Here is the code:

    Matrix vtranspose;
    a = new Matrix(termdoc);
    termdoc = a.getArray();
    a = a.transpose();
    // Decompose the transposed term-document matrix: a = U * S * V^T
    SingularValueDecomposition sv = new SingularValueDecomposition(a);
    u = sv.getU();
    v = sv.getV();
    s = sv.getS();
    vtranspose = v.transpose(); // obtained as a result of the SVD

    uarray = u.getArray();
    sarray = s.getArray();
    varray = vtranspose.getArray();
    if (semantics.maketerms.nodoc > 50)
    {
        // Truncate to the first 50 singular values/vectors (rank-50 approximation)
        sarray_mod = new double[50][50];
        uarray_mod = new double[uarray.length][50];
        varray_mod = new double[50][varray.length];
        move(sarray, 50, 50, sarray_mod);
        move(uarray, uarray.length, 50, uarray_mod);
        move(varray, 50, varray.length, varray_mod);
        e = new Matrix(uarray_mod);
        f = new Matrix(sarray_mod);
        g = new Matrix(varray_mod);
        Matrix temp = e.times(f);
        result = temp.times(g);
    }
    else
    {
        // Too few documents to truncate: reconstruct with all dimensions
        Matrix temp = u.times(s);
        result = temp.times(vtranspose);
    }
    result = result.transpose();
    results = result.getArray();

    return results;

But how do we determine the number of dimensions? Is there a method to determine the number of dimensions to which the system should be reduced to obtain best results? What other parameters do we consider for effective performance of LSA?

Answer

On the choice of the number of dimensions:

1) http://en.wikipedia.org/wiki/Latent_semantic_indexing


Another challenge to LSI has been the alleged difficulty in determining the optimal number of dimensions to use for performing the SVD. As a general rule, fewer dimensions allow for broader comparisons of the concepts contained in a collection of text, while a higher number of dimensions enable more specific (or more relevant) comparisons of concepts. The actual number of dimensions that can be used is limited by the number of documents in the collection. Research has demonstrated that around 300 dimensions will usually provide the best results with moderate-sized document collections (hundreds of thousands of documents) and perhaps 400 dimensions for larger document collections (millions of documents). However, recent studies indicate that 50-1000 dimensions are suitable depending on the size and nature of the document collection.
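As a sketch of what that truncation looks like in practice, the rank-k approximation can be written in a few lines of NumPy. NumPy is used here purely for illustration (the Jama code in the question does the same thing by copying arrays by hand), and the function name and toy matrix are hypothetical:

```python
import numpy as np

def rank_k_approximation(termdoc, k):
    """Reduce a term-document matrix to k latent dimensions via truncated SVD.

    Mirrors the manual truncation in the Jama code above: keep only the
    first k singular values and the corresponding columns/rows of U and V^T.
    """
    u, s, vt = np.linalg.svd(termdoc, full_matrices=False)
    # Rebuild the matrix from the k largest singular triplets only.
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Toy 4x3 term-document matrix (rows = terms, columns = documents)
a = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
approx = rank_k_approximation(a, 2)
print(approx.shape)  # same shape as the input, but rank at most 2
```

With k equal to the full rank, the reconstruction is (numerically) exact; smaller k trades detail for the broader concept comparisons described above.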

Checking the amount of variance in the data after computing the SVD can be used to determine the optimal number of dimensions to retain. The variance contained in the data can be viewed by plotting the singular values (S) in a scree plot. Some LSI practitioners select the dimensionality associated with the knee of the curve as the cut-off point for the number of dimensions to retain. Others argue that some quantity of the variance must be retained, and the amount of variance in the data should dictate the proper dimensionality to retain. Seventy percent is often mentioned as the amount of variance in the data that should be used to select the optimal dimensionality for recomputing the SVD.
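The 70% rule of thumb quoted above can be sketched as a small helper that accumulates the squared singular values, assuming NumPy. The function name, the threshold default, and the toy singular values are illustrative, not part of any library:

```python
import numpy as np

def dimensions_for_variance(singular_values, threshold=0.70):
    """Smallest k such that the first k squared singular values account
    for at least `threshold` of the total variance.

    The 0.70 default follows the heuristic mentioned in the quote above.
    """
    energies = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energies) / energies.sum()
    # searchsorted finds the first index whose cumulative share >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

s = [5.0, 3.0, 1.0, 0.5, 0.1]   # singular values, largest first
k = dimensions_for_variance(s, 0.70)
print(k)
```

Plotting `energies` (or `cumulative`) against the index gives exactly the scree plot described above, with the knee visible where the values flatten out.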




2) http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?showall=1


The trick in using SVD is in figuring out how many dimensions or "concepts" to use when approximating the matrix. Too few dimensions and important patterns are left out, too many and noise caused by random word choices will creep back in. The SVD algorithm is a little involved, but fortunately Python has a library function that makes it simple to use. By adding the one line method below to our LSA class, we can factor our matrix into 3 other matrices. The U matrix gives us the coordinates of each word on our "concept" space, the Vt matrix gives us the coordinates of each document in our "concept" space, and the S matrix of singular values gives us a clue as to how many dimensions or "concepts" we need to include.

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)

In order to choose the right number of dimensions to use, we can make a histogram of the square of the singular values. This graphs the importance each singular value contributes to approximating our matrix. Here is the histogram in our example.


For large collections of documents, the number of dimensions used is in the 100 to 500 range. In our little example, since we want to graph it, we’ll use 3 dimensions, throw out the first dimension, and graph the second and third dimensions.

The reason we throw out the first dimension is interesting. For documents, the first dimension correlates with the length of the document. For words, it correlates with the number of times that word has been used in all documents. If we had centered our matrix, by subtracting the average column value from each column, then we would use the first dimension. As an analogy, consider golf scores. We don’t want to know the actual score, we want to know the score after subtracting it from par. That tells us whether the player made a birdie, bogie, etc.
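The effect of centering described above can be checked numerically. The toy matrix below is hypothetical; it just gives every column a large shared component (analogous to document length), so the first singular value of the raw matrix dominates, while centering spreads the variance across dimensions:

```python
import numpy as np

# Toy count matrix where every column shares a large constant component,
# the way document length dominates the first dimension in LSA.
a = np.array([[10., 11., 10.],
              [11., 10., 11.],
              [10., 10., 10.],
              [11., 11., 10.]])

# Centering: subtract each column's mean, as the passage above suggests.
centered = a - a.mean(axis=0)

s_raw = np.linalg.svd(a, compute_uv=False)
s_centered = np.linalg.svd(centered, compute_uv=False)

# Uncentered: the first singular value dwarfs the rest, because it
# mostly encodes the shared mean ("par", in the golf analogy).
print(s_raw[0] / s_raw[1])
# Centered: the leading dimension no longer dominates.
print(s_centered[0] / s_centered[1])
```

This is why, without centering, the first dimension is thrown away before graphing: it carries the mean, not the interesting deviations from it.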




3) Landauer, T.K., Foltz, P.W., &amp; Laham, D. (1998). 'Introduction to Latent Semantic Analysis', Discourse Processes, 25, 259-284.

