Tag generation from small text content (such as tweets)

Question

I asked a similar question earlier, but I have since noticed that I have a big constraint: I am working on small text sets, such as user tweets, to generate tags (keywords).

It also seems that the accepted suggestion (the point-wise mutual information algorithm) is meant to work on bigger documents.

With this constraint (working on a small set of texts), how can I generate tags?

Thanks

Recommended answer

Two-stage approach for multiword tags

You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. Using this approach, n would be the total number of multiword tags that would be generated for the whole dataset.

For the first stage, you could use the NLTK code posted here. The second stage could be accomplished with just a simple for loop over all the tweets. However, if speed is a concern, you could use pylucene to quickly find the tweets that contain each collocation.
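
The two stages above can be sketched in plain Python, without NLTK or pylucene. This is only an illustration under stated assumptions: the `tokenize`, `top_collocations`, and `tag_tweets` helpers are hypothetical names, collocations are restricted to adjacent word pairs scored by PMI, the tokenizer is a naive whitespace split, and bigrams are counted inside each tweet (rather than across one concatenated document) so that no pair spans two tweets.

```python
import math
from collections import Counter


def tokenize(tweet):
    # Naive lowercase whitespace tokenizer (an assumption; a tweet-aware
    # tokenizer would handle hashtags, mentions and punctuation better).
    return tweet.lower().split()


def top_collocations(tweets, n, min_count=2):
    """Stage 1: score bigrams by PMI over the pooled tweets, return the n best."""
    unigrams, bigrams = Counter(), Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        unigrams.update(tokens)
        # Pairs are formed within a tweet, so none spans a tweet boundary.
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs get inflated PMI scores
        p_pair = count / total_bi
        p_w1 = unigrams[w1] / total_uni
        p_w2 = unigrams[w2] / total_uni
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    # Highest PMI first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda pair: (-scores[pair], pair))[:n]


def tag_tweets(tweets, collocations):
    """Stage 2: a simple loop tagging each tweet with the collocations in it."""
    tagged = []
    for tweet in tweets:
        tokens = tokenize(tweet)
        pairs = set(zip(tokens, tokens[1:]))
        tagged.append([" ".join(c) for c in collocations if c in pairs])
    return tagged


tweets = [
    "machine learning is fun",
    "i love machine learning",
    "coffee is fun",
]
collocations = top_collocations(tweets, n=2)
tags = tag_tweets(tweets, collocations)
for tweet, tweet_tags in zip(tweets, tags):
    print(tweet, "->", tweet_tags)
```

Here `n` plays the role described above: it caps the number of multiword tags for the whole dataset, and the `min_count` cut-off is an arbitrary value worth tuning.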

Tweet-level PMI for single-word tags

As also suggested here, for single-word tags you could calculate the point-wise mutual information of each individual word and the tweet itself, i.e.

PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]

Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific document, as opposed to coming across it in the larger collection. You could then tag the tweet with the few terms that have the highest PMI with the tweet.
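
A minimal sketch of the tweet-level PMI follows. The answer does not say how to estimate the probabilities, so the estimates here are assumptions: with N the total number of tokens in the collection, P(term, tweet) is the term's count in the tweet divided by N, P(term) is its count in the whole collection divided by N, and P(tweet) is the tweet's length divided by N. The `pmi_tags` function name and the `k` parameter are hypothetical.

```python
import math
from collections import Counter


def pmi_tags(tweets, k=2):
    """Return, for each tweet, the k words with the highest PMI(term, tweet)."""
    docs = [tweet.lower().split() for tweet in tweets]
    collection_counts = Counter(token for doc in docs for token in doc)
    total = sum(collection_counts.values())
    results = []
    for doc in docs:
        scores = {}
        for term, count in Counter(doc).items():
            p_joint = count / total                   # P(term, tweet)
            p_term = collection_counts[term] / total  # P(term)
            p_tweet = len(doc) / total                # P(tweet)
            scores[term] = math.log(p_joint / (p_term * p_tweet))
        # Highest PMI first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


tweets = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
]
tags = pmi_tags(tweets, k=2)
for tweet, tweet_tags in zip(tweets, tags):
    print(tweet, "->", tweet_tags)
```

In this toy collection, words that appear in only one tweet ("cat", "mat") score highest, while a word spread evenly across all tweets ("the") gets a PMI of zero.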

General changes for tweets

Some changes you might want to make when tagging tweets include:

  • Only use a word or collocation as a tag for a tweet if it also occurs in at least a certain number or percentage of other tweets. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet and are not seen anywhere else, e.g. misspellings and keyboard noise like #@$#@$%!.

  • Scale the number of tags used with the length of each tweet. You might be able to extract 2 or 3 interesting tags for longer tweets. But for a short 2-word tweet, you probably don't want to use every single word and collocation to tag it. It's probably worth experimenting with different cut-offs for how many tags to extract given the tweet length.
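
Both adjustments can be sketched as small helpers applied to candidate tags before they are assigned. The function names and the thresholds (`min_tweets=2`, one tag per four words, a cap of three tags) are hypothetical values chosen only for illustration; the right cut-offs would have to be found by experiment.

```python
from collections import Counter


def document_frequency(tweets):
    """Count how many tweets each word appears in."""
    df = Counter()
    for tweet in tweets:
        df.update(set(tweet.lower().split()))
    return df


def filter_candidates(candidates, df, min_tweets=2):
    """Keep only candidate tags seen in at least min_tweets tweets."""
    return [term for term in candidates if df[term] >= min_tweets]


def max_tags(num_tokens, words_per_tag=4, cap=3):
    """Allow roughly one tag per words_per_tag tokens, between 1 and cap."""
    return min(cap, max(1, num_tokens // words_per_tag))


tweets = [
    "python rocks",
    "learning python today with frends #@$#@$%!",
    "python and coffee today",
]
df = document_frequency(tweets)
candidates = ["#@$#@$%!", "frends", "python", "today"]
filtered = filter_candidates(candidates, df)
limit = max_tags(len(tweets[1].split()))
print(filtered[:limit])
```

The misspelling and the keyboard noise are dropped because each appears in only one tweet, and the six-word tweet is then limited to a single tag.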
