Tag generation from small text content (such as tweets)

Question

I asked a similar question earlier, but I have since noticed that I have a big constraint: I am working on small text sets, such as user tweets, to generate tags (keywords).

It also seems that the accepted suggestion (the point-wise mutual information algorithm) is meant to work on bigger documents.

With this constraint (working on a small set of texts), how can I generate tags?

Regards

Recommended answer

Two-stage approach for multiword tags

You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. With this approach, n is the total number of multiword tags generated for the whole dataset.

For the first stage, you could use the NLTK code posted here. The second stage needs only a simple for loop over all the tweets. If speed is a concern, however, you could use PyLucene to quickly find the tweets that contain each collocation.
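The two stages can be sketched roughly as follows. This is a minimal standard-library stand-in, not the NLTK code the answer refers to: the function names (`tokenize`, `top_collocations`, `tag_tweets`), the regex tokenizer, the `min_count` threshold, and the choice of bigram PMI as the "interestingness" score are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase words, keeping a leading # or @.
    return re.findall(r"[#@]?\w+", text.lower())


def top_collocations(tweets, n, min_count=2):
    """Stage 1: rank bigrams over the pooled tweets by PMI, keep the top n."""
    unigrams, bigrams = Counter(), Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        unigrams.update(tokens)
        # Pair tokens within one tweet only, so no bigram spans two tweets.
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # assumed cut-off to drop one-off pairs
            continue
        p_pair = count / total_bi
        p_w1 = unigrams[w1] / total_uni
        p_w2 = unigrams[w2] / total_uni
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    # Highest PMI first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    return ranked[:n]


def tag_tweets(tweets, collocations):
    """Stage 2: tag each tweet with the collocations that occur in it."""
    tagged = []
    for tweet in tweets:
        tokens = tokenize(tweet)
        present = set(zip(tokens, tokens[1:]))
        tagged.append([" ".join(c) for c in collocations if c in present])
    return tagged
```

Calling `tag_tweets(tweets, top_collocations(tweets, n))` returns one list of multiword tags per tweet. For a large collection, the loop in `tag_tweets` is the part that an index such as PyLucene could replace.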

Tweet-level PMI for single-word tags

As also suggested here, for single-word tags you could calculate the point-wise mutual information of each individual word and the tweet itself, i.e.

PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]

Again, this roughly tells you how much less (or more) surprised you are to come across the term in the specific tweet as opposed to coming across it in the larger collection. You could then tag the tweet with the few terms that have the highest PMI with it.
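One way to turn the PMI formula into code is sketched below. The answer does not say how to estimate the probabilities, so the estimates here are an assumption: all three are token-count ratios over the whole collection, with P(term, tweet) as the term's count inside the tweet, P(term) as its count in the collection, and P(tweet) as the tweet's length, each divided by the total number of tokens. The function name `tweet_pmi` is hypothetical.

```python
import math
from collections import Counter


def tweet_pmi(tokenized_tweets):
    """Return one {term: PMI(term, tweet)} dict per tweet.

    Assumes maximum-likelihood estimates from raw token counts.
    """
    corpus_counts = Counter(t for tweet in tokenized_tweets for t in tweet)
    total = sum(corpus_counts.values())
    results = []
    for tweet in tokenized_tweets:
        p_tweet = len(tweet) / total
        scores = {}
        for term, count in Counter(tweet).items():
            p_joint = count / total
            p_term = corpus_counts[term] / total
            scores[term] = math.log(p_joint / (p_term * p_tweet))
        results.append(scores)
    return results
```

Under these estimates a term that appears only in one tweet always gets the highest possible score for that tweet, which is why the frequency filter described in the next section matters.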

General adjustments for tweets

Some changes you might want to make when tagging tweets include:

  • Only use a word or collocation as a tag for a tweet if it also occurs in a certain number or percentage of other tweets. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet and nowhere else, e.g. misspellings and keyboard noise like #@$#@$%!.

  • Scale the number of tags with the length of each tweet. You might be able to extract 2 or 3 interesting tags from longer tweets, but for a short 2-word tweet you probably don't want to use every single word and collocation to tag it. It's probably worth experimenting with different cut-offs for how many tags to extract given the tweet length.
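Both adjustments above can be combined in a small selection step, sketched here. The function name `select_tags` and every threshold (`min_tweets`, `tokens_per_tag`, `max_tags`) are hypothetical starting points to experiment with, not values from the answer. The input `tweet_scores` is assumed to be one `{term: score}` dict per tweet, for example PMI scores.

```python
from collections import Counter


def select_tags(tweet_scores, tokenized_tweets,
                min_tweets=2, tokens_per_tag=4, max_tags=3):
    """Pick tags per tweet, filtering rare terms and scaling with length."""
    # Document frequency: number of tweets each term occurs in.
    doc_freq = Counter(
        term for tweet in tokenized_tweets for term in set(tweet)
    )
    tagged = []
    for scores, tweet in zip(tweet_scores, tokenized_tweets):
        # Assumed budget: one tag per `tokens_per_tag` tokens,
        # at least 1 and at most `max_tags`.
        budget = min(max_tags, max(1, len(tweet) // tokens_per_tag))
        # The count includes the tweet itself, so min_tweets=2 means
        # the term must occur in at least one other tweet.
        candidates = [t for t in scores if doc_freq[t] >= min_tweets]
        candidates.sort(key=lambda t: (-scores[t], t))
        tagged.append(candidates[:budget])
    return tagged
```

With the defaults, an 8-token tweet gets at most 2 tags and a 2- or 3-token tweet gets at most 1, and a term seen in only one tweet is never used as a tag regardless of its score.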
