LSA - Latent Semantic Analysis - How to code it in PHP?

Question

I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.

Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?

I don't want to use any external libraries. I already have an implementation of the Singular Value Decomposition (SVD).

  1. Extract all words from the given text.
  2. Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
  3. Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight.
  4. Do the Singular Value Decomposition (SVD).
  5. Use the values in the matrix S (SVD) to do the dimension reduction (how?).
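As an illustrative sketch of steps 1–3 (in Python rather than PHP, with a made-up toy corpus; a real pipeline would read documents from your database):

```python
import math
from collections import Counter

# Hypothetical toy corpus; in practice these come from your database.
docs = ["the cat sat on the mat", "the dog sat", "cat and dog play"]

# Step 1: extract words (naive whitespace tokenization).
tokenized = [d.split() for d in docs]

# Step 3: rows = unique words, columns = documents, values = raw counts.
vocab = sorted({w for doc in tokenized for w in doc})
counts = [Counter(doc) for doc in tokenized]
M = [[c[w] for c in counts] for w in vocab]  # w rows, d columns

# Step 2 (optional): tf-idf weights instead of raw counts.
N = len(docs)
df = {w: sum(1 for c in counts if c[w] > 0) for w in vocab}
W = [[c[w] * math.log(N / df[w]) for c in counts] for w in vocab]
```

The same logic translates directly to PHP with `str_word_count`-style tokenization and nested arrays in place of the list comprehensions.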

I hope you can help me. Thank you very much in advance!

Answer

LSA links:

  • Landauer (co-creator) article on LSA
  • the R-project lsa user guide

Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.

Assumptions:

  • Your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.

M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tf-idf, or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says to keep stopwords and not to stem, but yes to tf-idf).

U,Sigma,V = singular_value_decomposition(M)

U:  w x w
Sigma:  min(w,d) length vector, or a w x d matrix with the diagonal filled in the first min(w,d) spots with the singular values
V:  d x d matrix

Thus U * Sigma * V = M  
#  you might have to do some transposes depending on how your SVD code 
#  returns U and V.  verify this so that you don't go crazy :)
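As a sketch of that sanity check (using NumPy as an illustrative stand-in for your own SVD routine, with a hypothetical 4-word by 3-document matrix):

```python
import numpy as np

# Hypothetical 4-word x 3-document count matrix.
M = np.array([[2.0, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])

# NumPy returns Sigma as a 1-D vector of singular values in descending
# order, and Vt (V already transposed) -- exactly the kind of convention
# difference the comment above warns about.
U, Sigma, Vt = np.linalg.svd(M, full_matrices=False)

# Verify the factorization so that you don't go crazy:
assert np.allclose(U @ np.diag(Sigma) @ Vt, M)
assert all(Sigma[i] >= Sigma[i + 1] for i in range(len(Sigma) - 1))
```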

Then the dimensionality reduction.... the actual LSA paper suggests a good approximation for the basis is to keep enough vectors such that their singular values are more than 50% of the total of the singular values.

More succinctly... (pseudocode)

def reduced_rank(Sigma):
    # Sigma: singular values, in descending order.
    # Keep enough of them to cover more than 50% of their total.
    s1 = sum(Sigma)
    total = 0
    for ii in range(len(Sigma)):
        total += Sigma[ii]
        if total > 0.5 * s1:
            return ii + 1  # number of singular values kept (the new rank)

This returns the rank of the new basis, which was min(d,w) before; we now approximate with this smaller rank.

(here, ' -> prime, not transpose)

We create new matrices: U',Sigma', V', with sizes w x ii, ii x ii, and ii x d.
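Continuing the NumPy-based sketch (the matrix and the chosen rank k = 2 are hypothetical), the truncation is just slicing:

```python
import numpy as np

# Hypothetical 4-word x 3-document matrix.
M = np.array([[2.0, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
U, Sigma, Vt = np.linalg.svd(M, full_matrices=False)

k = 2  # rank chosen by the 50%-of-singular-values heuristic (assumed here)

# U': w x k, Sigma': k x k, V': k x d (Vt already holds V transposed).
U_k = U[:, :k]
Sigma_k = np.diag(Sigma[:k])
Vt_k = Vt[:k, :]

M_k = U_k @ Sigma_k @ Vt_k  # rank-k approximation of M
```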

That's the essence of the LSA algorithm.

This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick, for example, the top 3 words for each document in it. Whether this yields more than simple tf-idf is a matter of some debate.
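One way such a cosine search can be sketched (again in Python with hypothetical data; the query-folding step `q^T * U' * inv(Sigma')` is the usual LSA recipe, not something the asker's code is assumed to have):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-word x 3-document matrix, reduced to rank 2.
M = np.array([[2.0, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
U, Sigma, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], Sigma[:k], Vt[:k, :]

# Fold a (made-up) query word-count vector into the reduced space.
q = np.array([1.0, 0, 0, 1])  # counts over the same w words
q_hat = q @ U_k / S_k         # q^T * U' * inv(Sigma')

# Documents live in the columns of V'; compare each to the folded query.
docs_reduced = Vt_k.T         # one row per document, k dimensions
sims = [cosine(q_hat, d) for d in docs_reduced]
best = int(np.argmax(sims))   # index of the most similar document
```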

To me, LSA performs poorly on real-world data sets because of polysemy, and on data sets with too many topics. Its mathematical / probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't make sense for word counts).

Your mileage will definitely vary.

Tagging using LSA (one method!)

  1. Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic

  2. By hand, look over the U' matrix, and come up with terms that describe each "topic". For example, if the biggest parts of that vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array, or list. This step should be reasonable since the number of vectors will be finite.

  3. Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
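The last step can be sketched like this (everything here is hypothetical: a made-up reduced term-topic matrix U', hand-assigned labels, and one document's word counts):

```python
import numpy as np

# Hypothetical reduced term-topic matrix U' (w = 4 words, k = 2 topics).
U_k = np.array([[0.9, 0.1],
                [0.7, 0.2],
                [0.1, 0.8],
                [0.2, 0.9]])

# Hand-assigned labels for each topic column (step 2 above).
topics = {0: "New York City", 1: "baseball"}

v1 = np.array([3.0, 1, 0, 1])  # word-count vector for one document

# Project the document onto each topic direction.
scores = v1 @ U_k

# Take the (up to) 3 strongest topics and report their labels.
top = np.argsort(scores)[::-1][:3]
labels = [topics[i] for i in top if i in topics]
```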
