Algorithm to find related words in a text
Question
I would like to take a word (e.g. "Apple") and process a text (or maybe several), and come up with related terms. For example: process a document for "Apple" and find that iPod, iPhone, and Mac are terms related to "Apple".
Any ideas on how to solve this?
Recommended answer
As a starting point: your question relates to text mining.
There are two ways: a statistical approach, and one from natural language processing (NLP).
I do not know much about NLP, but I can say something about the statistical approach:
You need some vector space representation of your documents; see:
- Vector space model (http://en.wikipedia.org/wiki/Vector_space_model)
- Document-term matrix (http://en.wikipedia.org/wiki/Document-term_matrix)
- tf-idf (http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
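To make the vector space idea concrete, here is a minimal sketch in plain Python. It builds a tf-idf weighted document-term matrix and ranks terms by the cosine similarity of their columns. The toy corpus, the whitespace tokenizer, and the function names (`tfidf_matrix`, `related_terms`) are assumptions made for illustration; a real setup would need proper tokenization, stop-word removal, and far more documents.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); real use needs many more documents.
docs = [
    "apple iphone launch",
    "apple mac update",
    "apple ipod iphone",
    "banana orange fruit",
]

def tfidf_matrix(docs):
    """Return (vocab, rows), where rows[d][t] is the tf-idf weight of term t in document d."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_terms(word, vocab, rows):
    """Rank all other terms by cosine similarity of their document vectors."""
    idx = vocab.index(word)
    columns = list(zip(*rows))  # one vector over documents per term
    target = columns[idx]
    scores = [(t, cosine(target, columns[i])) for i, t in enumerate(vocab) if i != idx]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

vocab, rows = tfidf_matrix(docs)
for term, score in related_terms("apple", vocab, rows)[:3]:
    print(f"{term}: {score:.3f}")
```

On this toy corpus "iphone" ranks first for "apple" because the two share two documents, while terms that never co-occur with "apple" score zero. Note that with this plain idf, a term occurring in every document gets weight zero, which is one reason a smoothed variant is often used in practice.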
In order to learn semantics (that is, that different words can mean the same thing, or that one word can have different meanings), you need a large text corpus to learn from. As I said, this is a statistical approach, so you need lots of samples. See http://www.daviddlewis.com/resources/testcollections/
Maybe you already have lots of documents from the context you are going to use. That is the best situation.
You have to retrieve latent factors from this corpus. The most common methods are:
- LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
- PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
- Non-negative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
- Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
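As an illustration of the first method in the list, the following is a minimal LSA sketch using NumPy. It takes a truncated SVD of a raw term-document count matrix and compares terms in the reduced latent space. The corpus, the choice of `k=2`, and the helper names (`term_document_matrix`, `lsa_term_vectors`, `related_terms`) are assumptions for this example, not a prescribed implementation.

```python
import numpy as np

# Toy corpus (hypothetical) with two clearly separated topics.
docs = [
    "apple iphone mac",
    "apple ipod iphone",
    "apple mac ipod",
    "banana orange juice",
    "banana orange fruit",
]

def term_document_matrix(docs):
    """Return (vocab, matrix) with one row per term and one column per document."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenized):
        for tok in doc:
            matrix[index[tok], j] += 1.0
    return vocab, matrix

def lsa_term_vectors(matrix, k):
    """Project terms into a k-dimensional latent space via truncated SVD."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    # Singular values come back in descending order; keep the k largest.
    return u[:, :k] * s[:k]

def related_terms(word, vocab, vectors, top_n=3):
    """Return the top_n terms closest to `word` by cosine similarity in latent space."""
    idx = vocab.index(word)
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[idx]
    order = [i for i in np.argsort(-sims) if i != idx]
    return [(vocab[i], float(sims[i])) for i in order[:top_n]]

vocab, matrix = term_document_matrix(docs)
vectors = lsa_term_vectors(matrix, k=2)
print(related_terms("apple", vocab, vectors))
```

Because the two topics in this corpus share no vocabulary, the two latent dimensions separate them cleanly and "iphone", "ipod", and "mac" come out as the terms closest to "apple". Real corpora are much noisier, so the number of dimensions `k` and the term weighting (for example tf-idf instead of raw counts) need to be tuned.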
These methods involve a lot of math. Either you dig into it, or you have to find good libraries.
I can recommend the following books:
- http://www.oreilly.de/catalog/9780596529321/toc.html
- http://www.oreilly.de/catalog/9780596516499/index.html