searchTwitter() in twitteR package for R (2.15.2) - high number of duplicate tweets
Question
I'm trying to create a data frame of Twitter usernames associated with a keyword by pulling from the Twitter REST API. But queries using searchTwitter() on many search terms (e.g. #rstats), even for large samples like n = 1000, return a high proportion (>90%) of duplicate tweets.
A specific example is:
tweets <- searchTwitter("#rstats", n = 1000)
tweets.df <- do.call("rbind", lapply(tweets, as.data.frame))
df.undup <- df[duplicated(tweets.df) == FALSE,]
dim(df.undup)
I'm wondering whether this is caused by limits on pagination when the search term is relatively scarce?
Recommended answer
First of all, should the 3rd line in your code be df.undup <- tweets.df[duplicated(tweets.df) == FALSE,]? As written, it indexes df, which is never defined in the snippet.
I guess you're getting fewer than 1000 tweets when you run the above code (I got 604, and the result of dim(df.undup) is 604 10). So the problem, I guess, is not that duplicates are present, but that fewer than 1000 tweets are returned.
If you look at the created date, the oldest tweets are from 14th March (a week ago). The Twitter API usually doesn't allow access to tweets more than 7-9 days old. I guess that's why you're getting fewer tweets.
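A minimal sketch of how the age of the returned tweets could be inspected. It does not call the Twitter API: the small tweets.df below is a hypothetical stand-in with made-up timestamps, and it assumes the real data frame has a created column holding POSIXct times, which may differ depending on the twitteR version.

```r
# Hypothetical stand-in for tweets.df; real data would come from searchTwitter().
# The timestamps are invented purely for illustration.
tweets.df <- data.frame(
  text = c("a", "b", "c"),
  created = as.POSIXct(
    c("2013-03-14 10:00:00", "2013-03-18 09:30:00", "2013-03-21 16:45:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Oldest and newest tweet in the result set.
oldest <- min(tweets.df$created)
newest <- max(tweets.df$created)

# Time span covered by the results, in days.
span.days <- as.numeric(difftime(newest, oldest, units = "days"))

print(oldest)
print(span.days)
```

If the span on real data stays at roughly a week however large n is, that would be consistent with the search window, rather than duplication, limiting the result count.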
To check, see if dim(tweets.df) and dim(df.undup) return the same thing.
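The check above can be sketched as follows. This is only an illustration on a hypothetical four-row data frame (one row deliberately repeated), not output from searchTwitter(); the column names id, screenName and text are assumptions about what the real data frame contains.

```r
# Hypothetical stand-in for tweets.df, with one exact duplicate row (id "2").
tweets.df <- data.frame(
  id = c("1", "2", "2", "3"),
  screenName = c("alice", "bob", "bob", "carol"),
  text = c("first", "second", "second", "third"),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each fully identical row.
df.undup <- tweets.df[!duplicated(tweets.df), ]

# Compare dimensions before and after deduplication.
print(dim(tweets.df))
print(dim(df.undup))

# Number of duplicate rows that were dropped.
n.dup <- nrow(tweets.df) - nrow(df.undup)

# Distinct usernames, which is what the question is ultimately after.
users <- unique(df.undup$screenName)
```

If the two dim() calls agree on real data, no rows were duplicated and the shortfall relative to n comes from the number of tweets available, not from repeats.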