Circumvent errors in loop function (used to extract data from Twitter)
Question
I created a loop function that extracts tweets using the search API at a certain interval (let's say every 5 minutes). The function does what it is supposed to do: connect to Twitter, extract tweets that contain a certain keyword, and save them in a CSV file. However, occasionally (2-3 times a day) the loop stops because of one of these two errors:
- Error in htmlTreeParse(URL, useInternal = TRUE) : error in creating parser for http://search.twitter.com/search.atom?q= 6.95322e-310tst&rpp=100&page=10
- Error in UseMethod("xmlNamespaceDefinitions") : no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
I hope you can help me deal with these errors by answering some of my questions:
- What causes these errors to occur?
- How can I adjust my code to avoid these errors?
- How can I force the loop to keep running when an error is encountered (e.g. by using the try function)?
My function (based on several scripts found online) is as follows:
library(XML) # htmlTreeParse

twitter.search <- "Keyword"
QUERY <- URLencode(twitter.search)

# Set time between loop iterations (in seconds)
d_time <- 300
number_of_times <- 3000

for (i in 1:number_of_times) {
  tweets <- NULL
  tweet.count <- 0
  page <- 1
  read.more <- TRUE

  while (read.more) {
    # construct Twitter search URL
    URL <- paste('http://search.twitter.com/search.atom?q=', QUERY, '&rpp=100&page=', page, sep = '')
    # fetch remote URL and parse
    XML <- htmlTreeParse(URL, useInternal = TRUE, error = function(...){})

    # Extract list of "entry" nodes
    entry <- getNodeSet(XML, "//entry")
    read.more <- (length(entry) > 0)
    if (read.more) {
      # note: use j here, not i -- reusing i would clobber the outer loop counter
      for (j in 1:length(entry)) {
        subdoc <- xmlDoc(entry[[j]]) # put entry in separate object to manipulate
        published <- unlist(xpathApply(subdoc, "//published", xmlValue))
        published <- gsub("Z", " ", gsub("T", " ", published))
        # Convert from GMT to local time
        time.gmt <- as.POSIXct(published, "GMT")
        local.time <- format(time.gmt, tz = "Europe/Amsterdam")
        title <- unlist(xpathApply(subdoc, "//title", xmlValue))
        author <- unlist(xpathApply(subdoc, "//author/name", xmlValue))
        tweet <- paste(local.time, " @", author, ": ", title, sep = "")
        entry.frame <- data.frame(tweet, author, local.time, stringsAsFactors = FALSE)
        tweet.count <- tweet.count + 1
        rownames(entry.frame) <- tweet.count
        tweets <- rbind(tweets, entry.frame)
      }
      page <- page + 1
      read.more <- (page <= 15) # Seems to be a 15-page limit
    }
  }

  names(tweets)
  # top 15 tweeters
  # sort(table(tweets$author), decreasing=TRUE)[1:15]

  write.table(tweets, file = paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"), sep = ";")
  Sys.sleep(d_time)
} # end for
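To address the last question directly: the fetch-and-parse step can be wrapped so that a failure skips the current iteration instead of killing the loop. A minimal, self-contained sketch of the pattern, where `risky_step()` is a made-up stand-in for the `htmlTreeParse()` call:

```r
# risky_step() stands in for the fetch-and-parse step; it fails when i == 2
risky_step <- function(i) {
  if (i == 2) stop("simulated parse error")
  i * 10
}

results <- c()
for (i in 1:4) {
  # on error, tryCatch returns NULL instead of stopping the loop
  out <- tryCatch(risky_step(i), error = function(e) {
    message("Iteration ", i, " failed: ", conditionMessage(e))
    NULL
  })
  if (is.null(out)) next  # skip this iteration and carry on
  results <- c(results, out)
}
results  # 10 30 40
```

In the function above, the same wrapper around `htmlTreeParse(URL, useInternal = TRUE)` (returning NULL and then breaking out of the while loop) would let the outer for loop reach `Sys.sleep()` and simply retry on the next interval.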
Answer
Here's my solution, using try, to a similar problem with the Twitter API.
I was asking the Twitter API for the number of followers for each person in a long list of Twitter users. When a user's account is protected I would get an error, and before I put in the try function the loop would break. Using try allowed the loop to keep working by skipping on to the next person on the list.
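The mechanics in isolation: on failure, try() returns an object of class "try-error" instead of raising a condition, so the loop can test for it and move on. A minimal sketch, where stop() just stands in for a failing API call:

```r
# stop() stands in for an API call that fails (e.g. a protected account)
result <- try(stop("protected account"), silent = TRUE)

if (inherits(result, "try-error")) {
  # inherits() is a slightly safer test than class(result) == "try-error",
  # since an object can carry more than one class
  message("Skipping this user")
}
```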
Here's the setup:
# load library
library(twitteR)

# Search Twitter for your term
s <- searchTwitter('#rstats', n = 1500)
# convert search results to a data frame
df <- do.call("rbind", lapply(s, as.data.frame))
# extract the usernames
users <- unique(df$screenName)
users <- sapply(users, as.character)
# make a data frame for the loop to work with
users.df <- data.frame(users = users,
                       followers = "", stringsAsFactors = FALSE)
And here's the loop with try to handle errors while populating users$followers with follower counts obtained from the Twitter API:
for (i in 1:nrow(users.df)) {
  # tell the loop to skip a user if their account is protected
  # or some other error occurs
  result <- try(getUser(users.df$users[i])$followersCount, silent = TRUE)
  if (class(result) == "try-error") next
  # store the follower count (reuse result rather than calling getUser again)
  users.df$followers[i] <- result
  # pause for 60 s between iterations to
  # avoid exceeding the Twitter API request limit
  print('Sleeping for 60 seconds...')
  Sys.sleep(60)
}
#
# Now inspect users.df to see the follower data