Find Jaccard distance of tweets and cluster in Kmeans


Question

This is a follow-up to a problem I've been working on for a while. I have two questions. The first concerns an algorithm that works on two tweets, which I revised to measure 10 tweets. I'm wondering what my revision is actually measuring. I get a result, but I want it to measure the Jaccard distances of several tweets, not just return one value. Since it returns one value, I think it's just adding everything up. The second question is about my attempt to create a for loop and assign clusters.

I'm trying to find the Jaccard distance between the tweets in a dataset, then cluster those tweets with the K-means algorithm.

This is where I'm retrieving the data from: http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json

This is what I have so far:

install.packages("jsonlite")
library(jsonlite)

# Download the JSON file and read it into a data frame.
# The file is treated as one JSON object per line, so the lines are
# joined with commas and wrapped in [ ] to form a single JSON array.
tf <- tempfile()
download.file("http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json", tf)
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(tf), collapse = ",")))

# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL

# Works: compares two tweets and measures the Jaccard distance

tweetText <- list(tweet1 = tweet.features$text[1], tweet2 = tweet.features$text[2])

jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))

  list(i=i, u=u, j=i/u)
}

jaccard_i(tweetText[[1]], tweetText[[2]])

All of that measures the Jaccard distance of two specified tweets, which is great.

But now I'm trying to modify it to compare the distances between several tweets, this time using 10 random tweets retrieved with R's sample() function.

# Generates two sets of 5 random tweets
tweetText <- list(sample(tweet.features$text, replace = FALSE, size = 5), sample(tweet.features$text, replace = FALSE, size = 5))

jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))

  list(i=i, u=u, j=i/u)
}

jaccard_i(tweetText[[1]], tweetText[[2]])

This gives me a result, but it can't be correct.

I'm trying to build an algorithm that measures all the tweets, compares their Jaccard distances, and then clusters them with K-means based on those distances.

So for another attempt, I thought I would write a for loop.

I decided to make 10 cluster centers from 10 random tweets:

c <- sample(tweet.features$text, replace = FALSE, size = 10)

Then I wrote a for loop, hoping to measure the tweets against those centers and store the results in an array that I could use to assign clusters:

#Algorithm attempt
for(i in tweet.features$text){
  for (j in c){
    i <- length(intersect(i, j))
    u <- length(union(i, j))
    j = i/u
  }
  #assign(my.array)
}

I don't believe that's doing anything useful, but it's an attempt to create a loop to measure the Jaccard distance.

I'm sorry that this is a loaded question. Any help would be appreciated, as I'm a bit lost.

Recommended answer

In your first function, you unlist the lists of words from your tweets, so tw1 and tw2 each hold one pooled list of words, and you cannot use them for a tweet-by-tweet Jaccard. You can fix this by removing the unlist: tw1 and tw2 are then lists of term vectors, one per tweet, and you can compare them pair by pair using mapply, as follows.

jaccard_i <- function(tw1, tw2){
  tw1 <- strsplit(tw1, " |\\.")
  tw2 <- strsplit(tw2, " |\\.")
  i <- mapply(function(tw1, tw2) {
    length(intersect(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  u <- mapply(function(tw1, tw2) {
    length(union(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  list(i=i, u=u, j=i/u)
}

A silly example:

> tw1 = c("we yes you no", "we are the people")
> tw2= c("we are the people", "we yes you no")
> tweetText = list(tw1, tw2)
> jaccard_i(tweetText[[1]], tweetText[[2]])
$i
[1] 1 1

$u
[1] 7 7

$j
[1] 0.1428571 0.1428571
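To see why the earlier unlist version returned a single value for two sets of tweets, it may help to run both variants on the same tiny input. The sketch below is illustrative only: `jaccard_pooled` and `jaccard_pairwise` are hypothetical names, and the sample tweets are the made-up ones from the example above. Note also that `i/u` is the Jaccard similarity; the Jaccard distance is `1 - i/u`.

```r
# Pooled version: unlist() merges the words of all tweets in each set,
# so the result is one similarity between two bags of words.
jaccard_pooled <- function(tw1, tw2) {
  w1 <- unlist(strsplit(tw1, " |\\."))
  w2 <- unlist(strsplit(tw2, " |\\."))
  length(intersect(w1, w2)) / length(union(w1, w2))
}

# Pairwise version: compares tw1[k] with tw2[k], one value per position.
jaccard_pairwise <- function(tw1, tw2) {
  mapply(function(a, b) length(intersect(a, b)) / length(union(a, b)),
         strsplit(tw1, " |\\."), strsplit(tw2, " |\\."))
}

set1 <- c("we yes you no", "we are the people")
set2 <- c("we are the people", "we yes you no")

pooled   <- jaccard_pooled(set1, set2)    # 1: both sets contain the same 7 words
pairwise <- jaccard_pairwise(set1, set2)  # 1/7 for each pair
distance <- 1 - pairwise                  # Jaccard distance = 1 - similarity
```

The pooled value hides the fact that the individual pairs barely overlap, which is why the single number from the original function was misleading.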

As for the second part of your question, the double loop, an easy way to start addressing it would be something like this:

tw = c("we yes you no", "we are the people")
lapply(tw, function(tweet1) {
  lapply(tw, function(tweet2) {
    jaccard_i(tweet1, tweet2)
  })
})

The result looks like this:

[[1]]
[[1]][[1]]
[[1]][[1]]$i
[1] 4

[[1]][[1]]$u
[1] 4

[[1]][[1]]$j
[1] 1


[[1]][[2]]
[[1]][[2]]$i
[1] 1

[[1]][[2]]$u
[1] 7

[[1]][[2]]$j
[1] 0.1428571



[[2]]
[[2]][[1]]
[[2]][[1]]$i
[1] 1

[[2]][[1]]$u
[1] 7

[[2]][[1]]$j
[1] 0.1428571


[[2]][[2]]
[[2]][[2]]$i
[1] 4

[[2]][[2]]$u
[1] 4

[[2]][[2]]$j
[1] 1

You should of course skip the values on the diagonal, where each tweet is compared with itself. As I said, this is just a starting point.
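Building on the double loop, one possible next step is to collect the values into a distance matrix and assign each tweet to its nearest center, which corresponds to the assignment step of a K-means-style procedure. This is a sketch under assumptions, not a complete K-means: `jaccard_dist`, `dist_matrix` and `assign_clusters` are hypothetical helper names, the tweets are toy data, and the centers are simply chosen tweets, as in the question. Base R's `kmeans()` works on numeric coordinates rather than text or a custom distance, so it is not used here.

```r
# Jaccard distance between two single tweets: 1 - |intersection| / |union|.
jaccard_dist <- function(a, b) {
  wa <- unlist(strsplit(a, " |\\."))
  wb <- unlist(strsplit(b, " |\\."))
  1 - length(intersect(wa, wb)) / length(union(wa, wb))
}

# Distance matrix: one row per tweet, one column per center.
dist_matrix <- function(tweets, centers) {
  m <- sapply(centers, function(ctr) {
    sapply(tweets, function(tw) jaccard_dist(tw, ctr))
  })
  matrix(m, nrow = length(tweets), ncol = length(centers))
}

# Assignment step: index of the nearest center for each tweet.
assign_clusters <- function(tweets, centers) {
  apply(dist_matrix(tweets, centers), 1, which.min)
}

tweets  <- c("we yes you no", "we are the people",
             "yes you no", "the people are here")
centers <- tweets[1:2]  # toy centers; the question samples 10 random tweets

d        <- dist_matrix(tweets, centers)
clusters <- assign_clusters(tweets, centers)  # 1 2 1 2
```

A full K-means-style loop would then recompute each center and repeat until the assignments stop changing. One option is to pick, within each cluster, the tweet with the smallest total distance to the others; that update rule is an assumption here, not something stated in the answer.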

Hope it helps.
