What's a good set of heuristics for threading tweets?

Question

Everyone knows, if you want to thread emails you use Jamie Zawinski's algorithm. But it's a new century, and there's a new messaging service.

What's the best algorithm for threading status updates posted on Twitter?

Things I'd definitely like it to cope with:

  • The easy part: using in_reply_to_status_id, in_reply_to_user_id and in_reply_to_screen_name. (Incidentally, finding proper documentation of these values would be useful in itself! Such documentation isn't obviously linked to from here, for example.)

  • Good heuristics for inferring a "reply" relationship from messages that mention a user with the @ convention but aren't explicitly in reply to a particular message. These "mentions" are now provided in the "entities" element of statuses if you request it. These heuristics might take into account (a) the time between two status updates, (b) whether there are subsequent replies between the two users, etc. (Replies that consist of an old-style retweet with an additional comment, as mentioned by user85509 below, are just an instance of this style of reply.)

  • Conversations that take place between more than two users.

  • Working with a set of tweets given to the algorithm, or all tweets on Twitter.

... but perhaps you can think of more.

Recommended answer

Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.

The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. These relationships you can be nearly 100% sure about. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
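As a rough illustration of this first step, here is a minimal Python sketch. It assumes each tweet is a dict with `id` and `in_reply_to_status_id` keys, and that `fetch_status` is a hypothetical callable you supply to look up a status that isn't in the original set (for example, a wrapper around whatever API client you use). Since each status has at most one parent, the resulting graphs are in practice trees.

```python
from collections import defaultdict


def build_threads(tweets, fetch_status=None):
    """Link tweets into reply trees via in_reply_to_status_id.

    tweets: iterable of dicts with "id" and "in_reply_to_status_id".
    fetch_status: hypothetical callable(status_id) -> tweet dict or None,
        used to pull in parents that are missing from the original set.
    Returns (by_id, children, roots).
    """
    by_id = {t["id"]: t for t in tweets}

    # Follow reply links out of the original set, adding missing parents.
    pending = list(by_id.values())
    while pending:
        tweet = pending.pop()
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id is None or parent_id in by_id:
            continue
        parent = fetch_status(parent_id) if fetch_status else None
        if parent is not None:
            by_id[parent_id] = parent
            pending.append(parent)

    # Build parent -> children edges; tweets with no known parent are roots.
    children = defaultdict(list)
    roots = []
    for tweet in by_id.values():
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id in by_id:
            children[parent_id].append(tweet["id"])
        else:
            roots.append(tweet["id"])
    return by_id, children, roots
```

Each entry in `roots` is then the start of one conversation, and `children` can be walked to print or store the thread.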

Beyond that easy step, one has to deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line that one can match on, so this is inevitably going to be very error-prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by mentions in that tweet, and then train a classifier to guess the best option, including a "no reply" option.

To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is: 1

@a @b no it isn't lol  RT @c Yes, absolutely. /cc @stephenfry

... you would create a feature vector for the relationship between this update and every update with an earlier date in the timelines of @a, @b, @c, and @stephenfry for the last week (say), and one between that update and a special "no reply" update. You can put whatever you like into each feature vector, but I would at least suggest adding:

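  • The time elapsed between the two updates: presumably replies are more likely to be to recent updates.
  • The proportion of the way through the tweet, in words, at which the mention occurs. For example, if it's the first word this would score 0, which is probably more indicative of a reply than a mention later in the update.
  • The number of followers of the mentioned user: celebrities are presumably more likely to be mentioned as spam.
  • The length of the longest common substring between the updates, which might indicate direct quoting.
  • Is the mention preceded by "/cc" or some other marker indicating that this isn't a direct reply to that person?
  • The following/followed ratio of the author of the original update.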
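To make the candidate generation and the features above more concrete, here is a small Python sketch. The data layout is an assumption made purely for illustration: each tweet is a dict with `text`, `created_at` (a `datetime`) and `mentions` (screen names taken from the entities), `timelines` maps a screen name to that user's updates, and `followers` maps a screen name to a follower count. The following/followed ratio is left out for brevity.

```python
from datetime import timedelta
from difflib import SequenceMatcher

NO_REPLY = None  # stands in for the special "no reply" update


def candidate_parents(tweet, timelines, window=timedelta(days=7)):
    """Yield (mentioned_user, earlier_update) pairs, then the no-reply option."""
    for name in tweet["mentions"]:
        for other in timelines.get(name, []):
            age = tweet["created_at"] - other["created_at"]
            if timedelta(0) < age <= window:
                yield name, other
    yield None, NO_REPLY


def _words(text):
    # Lower-case words with trailing/leading punctuation removed.
    return [w.strip(".,:;!?").lower() for w in text.split()]


def mention_position(text, name):
    """Word index of the mention divided by word count (0.0 = first word)."""
    words = _words(text)
    target = "@" + name.lower()
    for i, word in enumerate(words):
        if word == target:
            return i / len(words)
    return 1.0  # mention not found in the text


def preceded_by_cc(text, name):
    """True if the word right before the mention is "/cc" or "cc"."""
    words = _words(text)
    target = "@" + name.lower()
    return any(w == target and i > 0 and words[i - 1] in ("/cc", "cc")
               for i, w in enumerate(words))


def features(tweet, name, parent, followers):
    """Feature dict for the pair (tweet, candidate parent by user `name`)."""
    a, b = tweet["text"], parent["text"]
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    elapsed = tweet["created_at"] - parent["created_at"]
    return {
        "seconds_elapsed": elapsed.total_seconds(),
        "mention_position": mention_position(a, name),
        "mentioned_followers": followers.get(name, 0),
        "longest_common_substring": match.size,
        "preceded_by_cc": preceded_by_cc(a, name),
    }
```

For the example tweet above, the mention of @a sits at position 0, while @stephenfry comes last and is flagged as preceded by "/cc", so the two mentions end up with quite different feature values.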

The more of these one can come up with the better, since the classifier will only use those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.
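The selection step around the classifier might look like the sketch below. Note that `toy_score` is not a random forest: it is a hand-written stand-in, with arbitrary weights, for the probability a trained classifier would output. Treating "no reply" as a fixed threshold score is also just one possible way of modelling that option, assumed here for illustration.

```python
def toy_score(f):
    """Stand-in for a trained classifier's 'is a reply' probability.

    Arbitrary illustrative weights: recent updates and long shared
    substrings score higher. Replace with a real model's output.
    """
    hours = f["seconds_elapsed"] / 3600.0
    recency = 1.0 / (1.0 + hours)
    quoting = min(f["longest_common_substring"], 20) / 20.0
    return 0.5 * recency + 0.5 * quoting


def choose_parent(candidates, score=toy_score, no_reply_score=0.5):
    """Pick the best-scoring candidate id, or None for "no reply".

    candidates: iterable of (status_id, feature_dict) pairs.
    """
    best_id, best = None, no_reply_score
    for status_id, feats in candidates:
        s = score(feats)
        if s > best:
            best_id, best = status_id, s
    return best_id
```

With a real classifier, `score` would wrap the model's prediction for one feature vector, and the rest of the selection logic could stay the same.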

Next one needs a training set. This can be small at first - just enough to get a service that identifies conversations up-and-running. To this basic service, one would have to add a nice interface for correcting mismatched or falsely linked updates, so that users can correct them. Using this data one can build a bigger training set and a more accurate classifier.

1 ... which is probably typical of the level of discourse on Twitter ;)
