How to find common phrases in a large body of text

Problem description

I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example, say we have three sentences like the following:

  • The dog jumped over the woman.
  • The dog jumped into the car.
  • The dog jumped up the stairs.

From the above example I would want to extract "the dog jumped", as it is the most common phrase in the text. At first I thought, "oh, let's use a directed graph [with repeated nodes]":

directed graph: http://img.skitch.com/20091218-81ii2femnfgfipd9jtdg32m74f.png

EDIT: Apologies, I made a mistake while making this diagram: "over", "into", and "up" should all link back to "the".

I was going to maintain a count of how many times a word occurred in each node object ("the" would be 6; "dog" and "jumped", 3; etc.), but despite many other problems, the main one comes up when we add a few more examples like the following (please ignore the bad grammar :-)):

  • Dog jumps up and down.
  • Dog jumped like no dog had ever jumped before.
  • Dog jumped up happily.

We now have a problem, since "dog" would start a new root node (at the same level as "the") and we would not identify "dog jumped" as now being the most common phrase. So now I am thinking maybe I could use an undirected graph to map the relationships between all the words and eventually pick out the common phrases, but I'm not sure how this would work either, as you lose the important ordering relationship between the words.
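
To make the failure concrete, here is a rough sketch (mine, not from the original post) of the sentence-anchored counting structure described above. It uses a flat dictionary keyed by word-path prefixes instead of explicit node objects, and the last three example sentences are translated back from this page, so their wording is approximate:

```python
from collections import defaultdict

sentences = [
    "The dog jumped over the woman.",
    "The dog jumped into the car.",
    "The dog jumped up the stairs.",
    "Dog jumps up and down.",
    "Dog jumped like no dog had ever jumped before.",
    "Dog jumped up happily.",
]

def build_prefix_counts(sents):
    """Count how many sentences pass through each word path from the sentence start.

    Each key is a tuple of words reachable from a sentence's first word, i.e. the
    same information carried by a node (and its path) in the diagram above.
    """
    counts = defaultdict(int)
    for s in sents:
        words = s.lower().rstrip(".").split()
        for i in range(1, len(words) + 1):
            counts[tuple(words[:i])] += 1
    return counts

counts = build_prefix_counts(sentences)
print(counts[("the", "dog", "jumped")])  # 3 -- only the sentences starting with "the"
print(counts[("dog", "jumped")])         # 2 -- sentence-initial "dog jumped" sits under a separate root
```

No single entry reaches 5 for "dog jumped", which is exactly the root-splitting problem: the phrase's occurrences are divided between the "the" subtree and the "dog" subtree.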

So, does anyone have any general ideas on how to identify common phrases in a large body of text, and what data structure should I use?

Thanks, Ben

Recommended answer

Check out this related question: What techniques/tools are there for discovering common phrases in chunks of text? This is also related to the longest common substring problem.
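
As a small, hedged illustration of the word-level version of that longest-common-substring idea (my sketch, not part of the original answer), a standard dynamic-programming pass over two word lists recovers the longest shared run of consecutive words:

```python
def longest_common_word_run(a, b):
    """Longest run of consecutive words shared by word lists a and b (classic LCS-substring DP)."""
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return a[best_end - best:best_end]

print(longest_common_word_run(
    "the dog jumped over the woman".split(),
    "the dog jumped into the car".split(),
))  # ['the', 'dog', 'jumped']
```

Pairwise comparison only covers two sentences at a time, though, so for a large corpus the frequency-based techniques from the linked question tend to be more practical.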

I've posted this before, but I use R for all of my data-mining tasks and it's well suited to this kind of analysis. In particular, look at the tm package. Here are some relevant links:

  • Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) newsgroup postings from 2006.
  • Package homepage: http://cran.r-project.org/web/packages/tm/index.html
  • Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

More generally, there are a large number of text-mining packages listed in the Natural Language Processing task view on CRAN.
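
The linked question and the tm package both ultimately revolve around counting n-grams (fixed-length word windows) and ranking them by frequency. The post contains no code, so here is a minimal sketch of that idea, written in Python rather than R; sliding a window over every position side-steps the sentence-start anchoring problem from the question:

```python
from collections import Counter

sentences = [
    "The dog jumped over the woman.",
    "The dog jumped into the car.",
    "The dog jumped up the stairs.",
    "Dog jumps up and down.",
    "Dog jumped like no dog had ever jumped before.",
    "Dog jumped up happily.",
]

def ngrams(words, n):
    """Yield every contiguous n-word window in a list of words."""
    return (tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Count every 2- and 3-word window across all sentences, wherever it occurs.
counts = Counter()
for s in sentences:
    words = s.lower().rstrip(".").split()
    for n in (2, 3):
        counts.update(ngrams(words, n))

print(counts.most_common(3))
# ("dog", "jumped") leads with 5 occurrences; ("the", "dog") and
# ("the", "dog", "jumped") follow with 3 each.
```

In practice you would tokenize more carefully and filter out n-grams made entirely of stop words, which is the sort of preprocessing the tm package is designed to help with.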
