Find Jaccard distance of tweets and cluster in Kmeans
Question
This is a follow-up question to a problem I've been working on for a while. I have two questions. The first concerns an algorithm that works on two tweets, which I revised to measure 10 tweets. I'm wondering what my revision is actually measuring. I get a result, but I want it to measure the Jaccard distances between several tweets, not just return one value. Since it's returning one value, I think it's just adding everything up. The other question is about my attempt to create a for loop and assign clusters.
I'm trying to find the Jaccard distance between a dataset of tweets, then cluster those tweets with the Kmeans algorithm.
This is where I'm retrieving the data from: http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
This is what I have so far:
install.packages("jsonlite")
library(jsonlite)
# Download the JSON file and read it into a data frame.
# The lines are joined with commas and wrapped in [ ] so they parse as one JSON array.
tf <- tempfile()
download.file("http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json", tf)
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(tf), collapse = ",")))
# Get rid of the geo column
tweet.features <- json_alldata
tweet.features$geo <- NULL
# *Works. Compares two tweets and measures Jaccard distance
tweetText <- list(tweet1 = tweet.features$text[1], tweet2 = tweet.features$text[2])
jaccard_i <- function(tw1, tw2){
  # Split each tweet into words on spaces and periods
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
All of that measures the Jaccard distance between two specified tweets, which is great.
But now I'm trying to modify it to compare the distances between several tweets. This time I use 10 random tweets, which I retrieved with the sample function in R.
# Generates two sets of 5 random tweets
tweetText <- list(sample(tweet.features$text, replace = FALSE, size = 5),
                  sample(tweet.features$text, replace = FALSE, size = 5))
jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
This gives me results, but it can't be correct.
I'm trying to build an algorithm that can measure all the tweets, compare their jaccard distance, then cluster based on the Jaccard distance with Kmeans.
So for another attempt, I thought to make a For Loop.
I decided to make 10 cluster centers from 10 random tweets:
c <- sample(tweet.features$text, replace = FALSE, size = 10)
Then I wrote a for loop, hoping to measure the tweets, which I figured I could then assign to an array and cluster:
#Algorithm attempt
for(i in tweet.features$text){
  for (j in c){
    i <- length(intersect(i, j))
    u <- length(union(i, j))
    j = i/u
  }
  #assign(my.array)
}
I don't believe that's doing anything useful, but it's an attempt to create a loop to measure the Jaccard distance.
I'm sorry that this is a loaded question. Any help would be appreciated as I'm a bit lost.
Recommended answer
In your first function, you call unlist on the lists of words in your tweets, so tw1 and tw2 each hold one global list of words, and you cannot use them for a tweet-by-tweet Jaccard. You could fix this by removing the unlist; then tw1 and tw2 are lists of word vectors, one per tweet, and you can compare them pairwise using mapply. Something like the following:
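jaccard_i <- function(tw1, tw2){
  # Without unlist, strsplit returns one word vector per tweet
  tw1 <- strsplit(tw1, " |\\.")
  tw2 <- strsplit(tw2, " |\\.")
  # mapply pairs the k-th tweet of tw1 with the k-th tweet of tw2
  i <- mapply(function(tw1, tw2) {
    length(intersect(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  u <- mapply(function(tw1, tw2) {
    length(union(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  # j = i/u is the Jaccard similarity; the Jaccard distance is 1 - j
  list(i=i, u=u, j=i/u)
}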
A silly example:
> tw1 = c("we yes you no", "we are the people")
> tw2= c("we are the people", "we yes you no")
> tweetText = list(tw1, tw2)
> jaccard_i(tweetText[[1]], tweetText[[2]])
$i
[1] 1 1
$u
[1] 7 7
$j
[1] 0.1428571 0.1428571
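Note that j = i/u above is the Jaccard similarity; the Jaccard distance is 1 minus that value. The following is a minimal sketch of a per-pair distance, not a definitive implementation. The helper names `tokenize` and `jaccard_dist` are made up for illustration, and unlike `jaccard_i` the sketch drops empty strings produced by the split and treats two empty tweets as identical, both of which are assumptions.

```r
# Hypothetical helper: split one tweet into its set of unique terms,
# using the same separators (space or period) as jaccard_i.
tokenize <- function(tweet) {
  terms <- unlist(strsplit(tweet, " |\\."))
  unique(terms[terms != ""])
}

# Hypothetical helper: Jaccard distance between two single tweets,
# defined as 1 - |intersection| / |union|.
jaccard_dist <- function(tweet1, tweet2) {
  a <- tokenize(tweet1)
  b <- tokenize(tweet2)
  u <- length(union(a, b))
  if (u == 0) return(0)  # assumption: two empty tweets count as identical
  1 - length(intersect(a, b)) / u
}

tw1 <- c("we yes you no", "we are the people")
tw2 <- c("we are the people", "we yes you no")
# Compare the tweets pairwise: tw1[1] with tw2[1], tw1[2] with tw2[2]
d <- mapply(jaccard_dist, tw1, tw2, USE.NAMES = FALSE)
print(d)
```

Each pair shares one word out of seven distinct words, so both distances come out as 1 - 1/7.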
As for the second part of your question, the double loop, an easy way to start addressing it would be something like this:
tw = c("we yes you no", "we are the people")
lapply(tw, function(tweet1) {
  lapply(tw, function(tweet2) {
    jaccard_i(tweet1, tweet2)
  })
})
The result looks like this:
[[1]]
[[1]][[1]]
[[1]][[1]]$i
[1] 4
[[1]][[1]]$u
[1] 4
[[1]][[1]]$j
[1] 1
[[1]][[2]]
[[1]][[2]]$i
[1] 1
[[1]][[2]]$u
[1] 7
[[1]][[2]]$j
[1] 0.1428571
[[2]]
[[2]][[1]]
[[2]][[1]]$i
[1] 1
[[2]][[1]]$u
[1] 7
[[2]][[1]]$j
[1] 0.1428571
[[2]][[2]]
[[2]][[2]]$i
[1] 4
[[2]][[2]]$u
[1] 4
[[2]][[2]]$j
[1] 1
Of course, you should skip the values on the diagonal, where each tweet is compared with itself. As I said, this is just a starting point.
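Building on that starting point, the distances from every tweet to a set of centers can be collected into a matrix, and each tweet can then be assigned to its nearest center, which is the assignment step of a K-means-style loop. This is only a sketch under stated assumptions: `jaccard_dist`, `dist_to_centers` and `cluster` are illustrative names, the centers are picked by hand rather than with sample, the tweets are assumed to be non-empty, and similarity is converted to distance as 1 - i/u. Keep in mind that `stats::kmeans` works on numeric coordinates and minimizes squared Euclidean distances, so it cannot take Jaccard distances between tweets directly; using actual tweets as centers, as in a medoid-style approach, is one possible workaround.

```r
# Hypothetical helper: Jaccard distance (1 - similarity) between two tweets.
# Assumes both tweets are non-empty strings.
jaccard_dist <- function(tweet1, tweet2) {
  a <- unique(unlist(strsplit(tweet1, " |\\.")))
  b <- unique(unlist(strsplit(tweet2, " |\\.")))
  1 - length(intersect(a, b)) / length(union(a, b))
}

tweets  <- c("we yes you no", "we are the people",
             "you are the people", "yes you no")
centers <- tweets[c(1, 2)]  # assumption: first two tweets act as cluster centers

# Rows are tweets, columns are centers; entry [r, k] is the distance
# from tweet r to center k.
dist_to_centers <- sapply(centers, function(ctr) {
  sapply(tweets, function(tw) jaccard_dist(tw, ctr), USE.NAMES = FALSE)
}, USE.NAMES = FALSE)

# Assignment step: index of the nearest center for every tweet.
cluster <- apply(dist_to_centers, 1, which.min)
print(dist_to_centers)
print(cluster)
```

A full loop would then pick a new center for each cluster (for example, the tweet with the smallest total distance to the other members) and repeat the assignment until the clusters stop changing.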
Hope that helps.