How to find common phrases in a large body of text

Views: 143 Published: 2017/4/3 11:52:09 Tags: data-structures graph data-mining text-analysis


Problem description


I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example, say we have three sentences like the following:

The dog jumped over the woman.
The dog jumped into the car.
The dog jumped up the stairs.

From the above example I would want to extract "the dog jumped", as it is the most common phrase in the text. At first I thought, "oh, let's use a directed graph [with repeated nodes]":

directed graph http://img.skitch.com/20091218-81ii2femnfgfipd9jtdg32m74f.png

EDIT: Apologies, I made a mistake while making this diagram: "over", "into" and "up" should all link back to "the".

I was going to keep a count in each node object of how many times its word occurred ("the" would be 6; "dog" and "jumped", 3; etc.). Among many other problems, the main one comes up when we add a few more examples like these (please ignore the bad grammar :-)):

Dog jumped up and down.
Dog jumped like no dog had ever jumped before.
Dog jumped happily.

We now have a problem, since "dog" would start a new root node (at the same level as "the") and we would not identify "dog jumped" as now being the most common phrase. So now I am thinking maybe I could use an undirected graph to map the relationships between all the words and eventually pick out the common phrases, but I'm not sure how this would work either, as you lose the important relationship of order between the words.
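
One way around the root-node problem, sketched here as an illustration rather than as anything proposed in the thread, is to skip the graph and count every run of consecutive words (word n-grams) wherever it occurs in a sentence. The helper names `tokenize`, `ngrams` and `count_phrases`, and the choice of 2- and 3-word phrases, are assumptions made for this example.

```python
from collections import Counter
import re

# The six example sentences from the question.
sentences = [
    "The dog jumped over the woman.",
    "The dog jumped into the car.",
    "The dog jumped up the stairs.",
    "Dog jumped up and down.",
    "Dog jumped like no dog had ever jumped before.",
    "Dog jumped happily.",
]


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", sentence.lower())


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_phrases(sentences, min_n=2, max_n=3):
    """Count all n-grams of length min_n..max_n, never crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


phrase_counts = count_phrases(sentences)
print(phrase_counts.most_common(1))  # [(('dog', 'jumped'), 6)]
```

On these six sentences "dog jumped" is counted six times and "the dog jumped" three times, so the phrase is found no matter where it starts in a sentence, and word order is preserved because each key is an ordered tuple.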

So does anyone have any general ideas on how to identify common phrases in a large body of text, and what data structure I would use?

Thanks, Ben

Solution

Check out this related question: What techniques/tools are there for discovering common phrases in chunks of text? This is also related to the longest common substring problem.
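
To illustrate the connection to the longest common substring problem, here is a minimal sketch of the standard dynamic-programming approach applied to words rather than characters. The function name `longest_common_run` and the choice to compare just two sentences are assumptions for this example; it is not taken from the linked question.

```python
def longest_common_run(a, b):
    """Return the longest run of consecutive words shared by token lists a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word in both lists.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


first = "the dog jumped over the woman".split()
second = "the dog jumped into the car".split()
shared = longest_common_run(first, second)
print(" ".join(shared))  # the dog jumped
```

This compares only one pair of sentences in time proportional to the product of their lengths, so for a large corpus it would more plausibly serve as a building block than as the whole solution.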

I've posted this before, but I use R for all of my data-mining tasks and it's well suited to this kind of analysis. In particular, look at the tm package. Here are some relevant links:

Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of postings to the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

More generally, there are a large number of text mining packages listed in the Natural Language Processing task view on CRAN.
