How to calculate the similarity of two users in Twitter


Problem description


I am working on a data mining project. My company has given me dummy Twitter customer data for 6 million users, and I have been asked to find the similarity between any two users. Can anyone give me some ideas on how to deal with such a large community data set? Thanks in advance.

Problem: I use tweets and hashtags (hashtags are the words highlighted by the user) as the two criteria for measuring the similarity between two users. The number of users is large, and each user may have millions of hashtags and tweets. Can anyone tell me a good way to quickly calculate the similarity between two users? I have tried using TF-IDF for this, but it seems infeasible. Does anyone have an efficient algorithm or a good idea that would let me quickly find the similarities between all users?

For example:
User A's hashtags = {cat, bull, cow, chicken, duck}
User B's hashtags = {cat, chicken, cloth}
User C's hashtags = {lenovo, Hp, Sony}

Clearly, C has no relation to A, so calculating their similarity is a waste of time. We could filter out all such unrelated users before calculating any similarity. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as the criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just directly calculate the similarity between A and every other user? Which algorithm would be fastest and best suited to this problem?

Recommended answer

If we want groups, then we are talking about clustering. There are many clustering algorithms to choose from.
Most clustering algorithms rely on a distance measure. The simplest one we can imagine is the Euclidean (straight-line) distance, but in a more complex space we have to look for something better. Since a distance measure involves only two entities, it is not hard to come up with one. Finding the most suitable one is the hard part!
In your case, a possible measure could be the number of common hashtags, also taking into account how many hashtags each person has. So first of all, you need the distribution of the number of hashtags assigned to each person. That tells you how much weight to give the total number of tags compared to the number of common tags. Something like this:
- let's say the maximum number of hashtags per person is 20, and the distribution of this number is linear across the sample
- so the most distant people are those who have 20 tags each, none of them in common
- I think the distance between people who have only 1 tag each, and different ones, has to be less than that maximum
- the next step closer is people who have 1 tag each, the same one
- the closest are those who have 20 tags each, all in common
- but the number of tags is less important than the number of common tags, since the tag count is low and linear.
So you should work out a calculation that assigns a number to this logic.
You will probably also need to test several clustering methods and measures until you are satisfied with the result.
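
The ordering described in the bullets can be turned into a number in many ways. The following is only a minimal Python sketch of one possibility, not the answerer's formula: the function name `tag_distance`, the cap `max_tags=20` and the weight `common_weight` are all assumptions made for illustration. It counts differing tags as pushing users apart and common tags (weighted more heavily) as pulling them together, then rescales to the range 0 to 1.

```python
def tag_distance(tags_a, tags_b, max_tags=20, common_weight=2.0):
    """Hypothetical distance in [0, 1] between two users' hashtag sets.

    Assumes nobody has more than max_tags hashtags.
    0.0 -> both have max_tags tags and share all of them
    1.0 -> both have max_tags tags and share none of them
    """
    a, b = set(tags_a), set(tags_b)
    common = len(a & b)      # tags both users have
    different = len(a ^ b)   # tags only one of them has

    # Differing tags increase the raw score, common tags decrease it.
    raw = different - common_weight * common

    # Rescale using the smallest and largest raw scores possible.
    lowest = -common_weight * max_tags   # max_tags tags, all common
    highest = 2 * max_tags               # max_tags tags each, none common
    return (raw - lowest) / (highest - lowest)


if __name__ == "__main__":
    twenty_a = {f"a{i}" for i in range(20)}
    twenty_b = {f"b{i}" for i in range(20)}
    print(tag_distance(twenty_a, twenty_b))   # 20-20 tags, none common
    print(tag_distance({"cat"}, {"dog"}))     # 1-1 tags, different
    print(tag_distance({"cat"}, {"cat"}))     # 1-1 tags, common
    print(tag_distance(twenty_a, twenty_a))   # 20 tags, all common
```

With the default weight, the four cases come out in the order listed in the bullets, from most distant to closest. Raising `common_weight` makes common tags count for more relative to the total tag count, which is the knob the last bullet talks about.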

This might be useful: http://msdn.microsoft.com/en-us/library/ms174879(v=sql.105).aspx


ldaneil305

I would probably insert the hashtags into a database. From there you can run queries to find out how similar other users are to a given user.

I would create 3 simple tables as follows:

UserTable
    UserID
    UserName

HashtagTable
    HashtagID
    Hashtag

UserHashtag
    UserID
    HashtagID

You need to decide what the threshold for user similarity is. Is it two common tags or three?

Try the query below. You'll have to loop through each of your users to find similar users, but it does appear to work. I'm sure there are faster ways...

SELECT UserID,
       COUNT(HashtagID) AS CommonTags
FROM UserHashtag
WHERE HashtagID IN (
                    SELECT HashtagID
                    FROM UserHashtag
                    WHERE UserID = 1
                   )           --Get all the tags that belong to this user.
  AND UserID != 1              --Don't match the current user
GROUP BY UserID                --One result row per candidate user
HAVING COUNT(HashtagID) > 2    --For 3 or more matches
ORDER BY COUNT(HashtagID) DESC --Most similar users first
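
To try this kind of query without setting up a database server, it can be run against an in-memory SQLite database from Python. This is only a sketch under stated assumptions: the column types, the numeric IDs given to users A, B and C, and the helper names `build_demo_db` and `similar_users` are invented for illustration, and the clauses are arranged the way SQLite expects (one row per user, `GROUP BY` before `HAVING`). The threshold is passed in as a parameter instead of being hard-coded.

```python
import sqlite3


def build_demo_db():
    """Create the three tables in memory and load users A, B and C from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE UserTable (UserID INTEGER PRIMARY KEY, UserName TEXT);
        CREATE TABLE HashtagTable (HashtagID INTEGER PRIMARY KEY, Hashtag TEXT UNIQUE);
        CREATE TABLE UserHashtag (UserID INTEGER, HashtagID INTEGER,
                                  PRIMARY KEY (UserID, HashtagID));
    """)
    users = {
        1: ("A", ["cat", "bull", "cow", "chicken", "duck"]),
        2: ("B", ["cat", "chicken", "cloth"]),
        3: ("C", ["lenovo", "Hp", "Sony"]),
    }
    for user_id, (name, tags) in users.items():
        conn.execute("INSERT INTO UserTable VALUES (?, ?)", (user_id, name))
        for tag in tags:
            # Store each distinct hashtag once, then link it to the user.
            conn.execute(
                "INSERT OR IGNORE INTO HashtagTable (Hashtag) VALUES (?)", (tag,))
            (tag_id,) = conn.execute(
                "SELECT HashtagID FROM HashtagTable WHERE Hashtag = ?", (tag,)
            ).fetchone()
            conn.execute("INSERT INTO UserHashtag VALUES (?, ?)", (user_id, tag_id))
    return conn


def similar_users(conn, user_id, min_common):
    """Return (UserID, CommonTags) rows for users sharing at least min_common tags."""
    query = """
        SELECT UserID, COUNT(HashtagID) AS CommonTags
        FROM UserHashtag
        WHERE HashtagID IN (SELECT HashtagID FROM UserHashtag WHERE UserID = ?)
          AND UserID != ?
        GROUP BY UserID
        HAVING COUNT(HashtagID) >= ?
        ORDER BY CommonTags DESC
    """
    return conn.execute(query, (user_id, user_id, min_common)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(similar_users(conn, 1, 2))  # users sharing 2 or more tags with user A
```

For 6 million users, an index on `UserHashtag (HashtagID)` would probably matter a great deal, since the subquery result is matched against that column for every row.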






Good luck!

Hogan


Due to the large number of users, the number of hashtags per user can also be very large. How can I quickly sort/compute the hashtag similarity between two users? Let's say the similarity is simply: (2 * the number of common hashtags) / (the total number of hashtags of A + the total number of hashtags of B). In fact, the main problem here is a sorting problem.
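
The formula above is commonly known as the Dice coefficient. One way to avoid both sorting and comparing every pair of users is an inverted index from hashtag to users: only users who share at least one hashtag with A are ever scored, which also gives the pre-filtering asked about in the question. The following Python sketch is an illustration only, assuming the data fits in memory as a dict of sets; the function names are hypothetical and nobody in the thread proposed this exact approach.

```python
from collections import defaultdict


def dice_similarity(tags_a, tags_b):
    """(2 * common hashtags) / (hashtags of A + hashtags of B), using hash sets."""
    if not tags_a and not tags_b:
        return 0.0
    return 2 * len(tags_a & tags_b) / (len(tags_a) + len(tags_b))


def build_inverted_index(user_tags):
    """Map each hashtag to the set of users who have used it."""
    index = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            index[tag].add(user)
    return index


def similar_to(user, user_tags, index):
    """Score only users sharing at least one hashtag with `user`, best first."""
    shared = defaultdict(int)
    for tag in user_tags[user]:
        for other in index[tag]:
            if other != user:
                shared[other] += 1   # count common hashtags per candidate
    own = len(user_tags[user])
    scores = {other: 2 * n / (own + len(user_tags[other]))
              for other, n in shared.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    user_tags = {
        "A": {"cat", "bull", "cow", "chicken", "duck"},
        "B": {"cat", "chicken", "cloth"},
        "C": {"lenovo", "Hp", "Sony"},
    }
    index = build_inverted_index(user_tags)
    print(similar_to("A", user_tags, index))  # C never appears as a candidate
```

Set intersection on hashed values needs no sorting, and unrelated users such as C are skipped without any comparison. Very popular hashtags would make some index entries huge, so in practice they may need to be capped or down-weighted.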

