Algorithm to find related words in a text

Question

I would like to take a word (e.g. "Apple") and process a text (or maybe several). I'd like to come up with related terms. For example: process a document for "Apple" and find that iPod, iPhone, and Mac are terms related to "Apple".

Any idea on how to solve this?

Recommended answer

As a starting point: your question relates to text mining.

There are two ways: a statistical approach, and one from natural language processing (NLP).

I do not know much about NLP, but I can say something about the statistical approach:

  1. You need some vector space representation of your documents, see:
     • http://en.wikipedia.org/wiki/Vector_space_model
     • http://en.wikipedia.org/wiki/Document-term_matrix
     • http://en.wikipedia.org/wiki/Tf%E2%80%93idf
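As a rough illustration of such a vector space representation, the following sketch builds a small tf-idf document-term matrix using only the Python standard library. The function names (`tokenize`, `tfidf_matrix`), the sample documents, and the particular weighting (relative term frequency times log(N / document frequency)) are assumptions made for this example; tf-idf has several variants and real libraries differ in the details.

```python
import math
from collections import Counter

def tokenize(text):
    """Very naive tokenizer (an assumption): split on whitespace, strip punctuation, lowercase."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]

def tfidf_matrix(docs):
    """Return (vocab, matrix), where matrix[d][t] is the tf-idf weight
    of term vocab[t] in document d."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(w for toks in tokenized for w in set(toks))
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        matrix.append([(counts[w] / total) * math.log(n_docs / df[w]) for w in vocab])
    return vocab, matrix

# Hypothetical sample documents.
docs = [
    "Apple released the iPhone",
    "Apple sells the Mac",
    "The banana is a fruit",
]
vocab, matrix = tfidf_matrix(docs)
for doc, row in zip(docs, matrix):
    weights = {w: round(x, 3) for w, x in zip(vocab, row) if x > 0}
    print(doc, "->", weights)
```

Note that a term occurring in every document (here "the") gets weight 0 under this weighting, which is the intended effect of the idf factor: such terms do not help to tell documents apart.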

To learn semantics, that is, that different words can mean the same thing or that one word can have different meanings, you need a large text corpus to learn from. As I said, this is a statistical approach, so you need lots of samples. Some test collections are listed at http://www.daviddlewis.com/resources/testcollections/

Maybe you already have lots of documents from the context you are going to work in. That is the best situation.

You have to retrieve latent factors from this corpus. The most common methods are:

  • LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
  • PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
  • nonnegative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
  • latent dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
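
To make the idea of latent factors a bit more concrete, here is a minimal sketch of the first method, LSA, applied to the "Apple" example from the question. It is only an illustration under several assumptions: the toy corpus is made up, raw counts are used instead of tf-idf, all function names are hypothetical, and the truncated SVD is approximated with plain power iteration to stay dependency-free (in practice you would use a library SVD routine).

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def term_document_counts(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    return vocab, [[toks.count(w) for toks in tokenized] for w in vocab]

def top_singular_directions(A, k, iters=300, seed=0):
    """Approximate the top-k left singular vectors and singular values of A
    by power iteration with deflation on the symmetric matrix A * A^T.
    Assumes k does not exceed the rank of A."""
    n = len(A)
    S = [[dot(A[i], A[j]) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    vectors, sigmas = [], []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            for u in vectors:  # keep v orthogonal to directions already found
                proj = dot(u, v)
                v = [a - proj * b for a, b in zip(v, u)]
            w = [dot(row, v) for row in S]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in S])
        vectors.append(v)
        sigmas.append(math.sqrt(eigenvalue))
    return vectors, sigmas

def lsa_term_vectors(docs, k=2):
    """Embed every term in a k-dimensional latent space."""
    vocab, A = term_document_counts(docs)
    vectors, sigmas = top_singular_directions(A, k)
    # Term i is embedded as (sigma_1 * u_1[i], ..., sigma_k * u_k[i]).
    return {w: [s * u[i] for u, s in zip(vectors, sigmas)] for i, w in enumerate(vocab)}

def cosine(u, v):
    denom = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / denom if denom else 0.0

def related_terms(word, embeddings, top_n=3):
    """Rank all other terms by cosine similarity to `word` in the latent space."""
    target = embeddings[word]
    scores = [(other, cosine(target, vec)) for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical toy corpus: three "tech" documents and two "fruit" documents.
docs = [
    "apple iphone ipod mac",
    "apple iphone mac",
    "apple ipod mac",
    "banana orange fruit",
    "banana fruit juice",
]
embeddings = lsa_term_vectors(docs, k=2)
for term, score in related_terms("apple", embeddings):
    print(term, round(score, 3))
```

On this tiny, cleanly separated corpus the two latent dimensions line up with the two groups of documents, so the terms ranked closest to "apple" are iphone, ipod and mac. Real corpora are far noisier, which is why a large number of documents is needed.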

These methods involve a lot of math. Either you dig into it, or you have to find good libraries.

I can recommend the following books:

  • http://www.oreilly.de/catalog/9780596529321/toc.html
  • http://www.oreilly.de/catalog/9780596516499/index.html
