How to extract common / significant phrases from a series of text entries

Question

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not just the single most common phrase and, ideally, without enforcing word-for-word matching).

My example is any review page on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format:

"Try the hamburger" (in 44 reviews)

e.g., the "Review Highlights" section of this page:

I have NLTK installed and have played around with it a bit, but I am honestly overwhelmed by the options. This seems like a rather common problem, and I haven't been able to find a straightforward solution by searching here.

Recommended answer

I suspect you don't just want the most common phrases, but rather the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.

To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest pointwise mutual information (PMI). That is, you want to find the words that co-occur much more often than you would expect them to by chance.
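To make the idea concrete, PMI for a word pair is commonly defined as log2(P(x, y) / (P(x) P(y))). The following is a minimal sketch of that definition using only the standard library. The function name `bigram_pmi`, the `min_count` parameter, and the toy token list are made up for illustration, and the probability estimates are a simplification that may differ in detail from what NLTK computes.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = max(n_words - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated PMI scores
        p_xy = count / n_bigrams        # estimated probability of the pair
        p_x = unigrams[w1] / n_words    # estimated probability of each word
        p_y = unigrams[w2] / n_words
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy, made-up review text (not real data).
tokens = ("try the hamburger . the fries were fine . "
          "try the hamburger with the fries . the service was slow").split()
scores = bigram_pmi(tokens, min_count=2)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

A pair scores high when it occurs together more often than the individual word frequencies would predict, which is why a frequency filter is usually applied first: a pair seen only once can otherwise get a misleadingly high score.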

The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:

import nltk
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
)

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()  # for the analogous trigram finder

# change this to read in your data
# (the sample corpus must be downloaded once with nltk.download('genesis'))
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# print the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))
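Since the entries in the question are raw HTML, the "read in your data" step would need to strip markup and split the text into words first. Below is a rough sketch of one way to do that with the standard library alone. The helper names `_TextExtractor` and `html_to_tokens`, the regular expression used for tokenizing, and the sample `rows` are all assumptions for illustration; in practice the rows would come from the MySQL query, and a dedicated tokenizer may work better.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and the contents of script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_tokens(raw_html):
    """Strip markup from one HTML entry and return lowercase word tokens."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Crude tokenizer: runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text)


# Hypothetical rows standing in for the HTML fetched from the database.
rows = [
    "<p>Try the <b>hamburger</b>!</p>",
    "<div>The hamburger&amp;fries combo is great.</div>",
]
tokens = [tok for row in rows for tok in html_to_tokens(row)]
print(tokens)
```

The resulting `tokens` list could then be passed to `BigramCollocationFinder.from_words` in place of the genesis corpus words. Note that flattening all rows into one list creates a spurious bigram at each boundary between two entries; with a frequency filter applied this is unlikely to matter much, but it is worth keeping in mind.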
