What's a good set of heuristics for threading tweets?

Question

Everyone knows, if you want to thread emails you use Jamie Zawinski's algorithm. But it's a new century, and there's a new messaging service.

What's the best algorithm for threading status updates posted on Twitter?

Things to consider:

  • The easy part: using in_reply_to_status_id, in_reply_to_user_id and in_reply_to_screen_name. (Incidentally, finding proper documentation of these values would be useful in itself! Such documentation isn't obviously linked to from here, for example.)

  • Good heuristics for inferring a "reply" relationship from messages that mention a user with the @ convention but aren't explicitly in reply to a particular message. These "mentions" are now provided in the "entities" element of statuses if you request it. These heuristics might take into account (a) the time between two status updates, (b) whether there are subsequent replies between the two users, etc. (Replies that consist of an old-style retweet with an additional comment, as mentioned by user85509 below, are just an instance of this style of reply.)

  • Conversations that take place between more than two users.

  • Working with a set of tweets given to the algorithm, or all tweets on Twitter.

... but perhaps you can think of more.

Recommended answer

Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.

The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. These relationships you can be nearly 100% sure about. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
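As a rough illustration of that first step, here is a minimal Python sketch. It assumes each tweet is a dict with an "id" and an optional "in_reply_to_status_id"; `fetch_tweet` is a hypothetical callback (not a real Twitter API call) standing in for whatever lookup you use to pull in parents that aren't in the original set.

```python
from collections import defaultdict


def build_reply_forest(tweets, fetch_tweet=None):
    """Group tweets into reply trees by following in_reply_to_status_id.

    tweets: iterable of dicts with "id" and optional "in_reply_to_status_id".
    fetch_tweet: hypothetical callback (status id -> tweet dict or None) used
    to pull in parents that are missing from the original set.
    Returns (by_id, children, roots).
    """
    by_id = {t["id"]: t for t in tweets}

    # Follow links out of the original set, adding each parent we can fetch.
    pending = list(by_id.values())
    while pending:
        tweet = pending.pop()
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id is None or parent_id in by_id or fetch_tweet is None:
            continue
        parent = fetch_tweet(parent_id)
        if parent is not None:
            by_id[parent_id] = parent
            pending.append(parent)

    # A tweet whose parent is unknown (or absent) becomes the root of a tree.
    children = defaultdict(list)
    roots = []
    for tweet in by_id.values():
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id in by_id:
            children[parent_id].append(tweet["id"])
        else:
            roots.append(tweet["id"])
    return by_id, dict(children), sorted(roots)
```

Since each update has at most one explicit parent, every connected component here is a tree, which is a special case of the directed acyclic graphs described above.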

Beyond that easy step, one has to deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line that one can match on - this is inevitably going to be very error-prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by the mentions in a tweet, and then train a classifier to guess the best option, including a "no reply" option.

To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is: 1

@a @b no it isn't lol  RT @c Yes, absolutely. /cc @stephenfry

... you would create a feature vector for the relationship between this update and every earlier update in the timelines of @a, @b, @c, and @stephenfry from the last week (say), plus one between that update and a special "no reply" update. Then you have to decide what goes into each feature vector - you can add whatever you like, but I would at least suggest:
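To make the "every possible relationship" enumeration concrete, here is a small Python sketch. The data layout is an assumption for illustration: each update is a dict with "id", "text" and "created_at" (a datetime), and `timelines` is a hypothetical mapping from lower-cased screen name to that user's updates. The regular expression is a crude stand-in for the mentions you would normally read from the "entities" element.

```python
import re
from datetime import timedelta

MENTION_RE = re.compile(r"@(\w+)")  # crude mention matcher, for illustration only
NO_REPLY = None  # stands for the special "no reply" option


def candidate_parents(update, timelines, window=timedelta(days=7)):
    """List the possible parents of an update lacking in_reply_to_status_id.

    update: dict with "id", "text" and "created_at" (a datetime).
    timelines: hypothetical mapping of lower-cased screen name -> list of
    that user's updates (same dict shape as `update`).
    Returns a list of candidate status IDs, starting with NO_REPLY.
    """
    mentioned = {name.lower() for name in MENTION_RE.findall(update["text"])}
    earliest = update["created_at"] - window
    candidates = [NO_REPLY]
    for name in sorted(mentioned):
        for other in timelines.get(name, []):
            # Only updates posted before this one, within the time window.
            if earliest <= other["created_at"] < update["created_at"]:
                candidates.append(other["id"])
    return candidates
```

Each returned candidate then gets its own feature vector, and the classifier picks one of them (or "no reply").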

  • The time that elapsed between the two updates - presumably replies are more likely to be to recent updates.
  • How far through the tweet the mention occurs, as a proportion of its words. e.g. if the mention is the first word, this would be a score of 0, which is probably more likely to indicate a reply than a mention later in the update.
  • The number of followers of the mentioned user - celebrities are presumably more likely to be spam-mentioned.
  • The length of the longest common substring between the updates, which might indicate direct quoting.
  • Is the mention preceded by "/cc" or other signifiers that indicate that this isn't directly a reply to that person?
  • The following / followed ratio for the author of the original update.
  • etc.
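The features listed above could be computed along these lines. This is only a sketch under stated assumptions: updates are dicts with "text" and "created_at" (a datetime), user dicts carry "screen_name", "followers_count" and "friends_count", and "the author of the original update" is read here as the author of the update being classified. The special "no reply" candidate is not handled and would need its own encoding.

```python
from difflib import SequenceMatcher


def mention_index(text, screen_name):
    """Word index of the first mention of screen_name, or None if absent."""
    target = "@" + screen_name.lower()
    for i, word in enumerate(text.lower().split()):
        if word.strip(".,:;!?") == target:
            return i
    return None


def mention_position(text, screen_name):
    """Fraction of the way through the tweet, in words; 0.0 means first word."""
    idx = mention_index(text, screen_name)
    return 1.0 if idx is None else idx / len(text.split())


def preceded_by_cc(text, screen_name):
    """True if the word just before the mention is "/cc" or "cc"."""
    idx = mention_index(text, screen_name)
    if idx is None or idx == 0:
        return False
    return text.lower().split()[idx - 1] in ("/cc", "cc")


def longest_common_substring(a, b):
    """Length of the longest run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def feature_vector(update, candidate, mentioned_user, author):
    """Features for the relationship "update is a reply to candidate"."""
    name = mentioned_user["screen_name"]
    return [
        (update["created_at"] - candidate["created_at"]).total_seconds(),
        mention_position(update["text"], name),
        mentioned_user["followers_count"],
        longest_common_substring(update["text"], candidate["text"]),
        1.0 if preceded_by_cc(update["text"], name) else 0.0,
        # Following/followed ratio; max() guards against division by zero.
        author["friends_count"] / max(author["followers_count"], 1),
    ]
```

Vectors like these, one per candidate pair, are what you would feed to the classifier as training and prediction rows.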

The more of these one can come up with the better, since the classifier will only use those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.

Next, one needs a training set. This can be small at first - just enough to get a service that identifies conversations up and running. To this basic service, one would have to add a nice interface so that users can correct mismatched or falsely linked updates. Using this data one can build a bigger training set and a more accurate classifier.

1 ... which might be typical of the level of discourse on Twitter ;)
