How to extract common / significant phrases from a series of text entries

Question

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching).

My example is any review on Yelp.com that shows 3 snippets from hundreds of reviews of a given restaurant, in the format:

"Try the hamburger" (in 44 reviews)

e.g., the "Review Highlights" section of this page:

http://www.yelp.com/biz/sushi-gen-los-angeles/

I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here.

Answer

I suspect you don't just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.

To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest pointwise mutual information (PMI). That is, you want to find the word pairs that co-occur together much more often than you would expect them to by chance.
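For reference (a standard definition, not spelled out in the original answer), the PMI of a word pair $(x, y)$ compares its observed joint probability with what independence would predict:

$$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

Pairs that co-occur far more often than chance get large positive scores, which is why PMI surfaces distinctive phrases rather than merely frequent ones.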

The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()  # analogous measures for trigrams

# change this to read in your data
# (the sample Genesis corpus may need a one-time nltk.download('genesis'))
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 bigrams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)
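
To plug in your own data, replace the Genesis corpus with a flat token stream built from your entries. Below is a minimal sketch, not from the original answer, assuming the HTML rows have already been fetched from MySQL into a Python list and that BeautifulSoup is available for tag stripping; the variable names and sample rows are illustrative:

import nltk
from bs4 import BeautifulSoup
from nltk.collocations import BigramCollocationFinder

# placeholder data standing in for rows fetched from the MySQL database
rows = [
    "<p>Try the hamburger, it was amazing.</p>",
    "<p>The hamburger here is the best in town.</p>",
]

# strip HTML tags and build one flat, lowercased token stream
# (nltk.word_tokenize may need a one-time nltk.download('punkt'))
words = []
for html in rows:
    text = BeautifulSoup(html, "html.parser").get_text()
    words.extend(w.lower() for w in nltk.word_tokenize(text) if w.isalpha())

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # drop pairs seen fewer than 2 times; tune for your corpus
print(finder.nbest(bigram_measures.pmi, 10))

The frequency filter matters in practice: PMI strongly favors rare events, so without a minimum count the top results tend to be one-off typos and proper names rather than phrases like "try the hamburger".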
