Circumvent errors in loop function (used to extract data from Twitter)


Problem description

I created a loop function that extracts tweets using the search API at a fixed interval (say, every 5 minutes). The function does what it is supposed to do: it connects to Twitter, extracts tweets that contain a certain keyword, and saves them in a CSV file. However, occasionally (2-3 times a day) the loop stops because of one of these two errors:

  • Error in htmlTreeParse(URL, useInternal = TRUE) : error in creating parser for http://search.twitter.com/search.atom?q= 6.95322e-310tst&rpp=100&page=10

  • Error in UseMethod("xmlNamespaceDefinitions") : no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

I hope you can help me deal with these errors by answering some of my questions:

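  • What causes these errors to occur?
  • How can I adjust my code to avoid them?
  • How can I force the loop to keep running when an error occurs (for example, by using the try function)?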

My function (based on several scripts found online) is as follows:

    library(XML)   # htmlTreeParse

    twitter.search <- "Keyword"

    QUERY <- URLencode(twitter.search)

    # Set time loop (in seconds)
    d_time = 300
    number_of_times = 3000

    for(i in 1:number_of_times){

      tweets <- NULL
      tweet.count <- 0
      page <- 1
      read.more <- TRUE

      while (read.more)
      {
        # construct Twitter search URL
        URL <- paste('http://search.twitter.com/search.atom?q=',QUERY,'&rpp=100&page=', page, sep='')
        # fetch remote URL and parse
        XML <- htmlTreeParse(URL, useInternal=TRUE, error = function(...){})

        # Extract list of "entry" nodes
        entry     <- getNodeSet(XML, "//entry")

        read.more <- (length(entry) > 0)
        if (read.more)
        {
          # inner index renamed to j so it does not shadow the outer loop's i
          for (j in 1:length(entry))
          {
            subdoc     <- xmlDoc(entry[[j]])   # put entry in separate object to manipulate

            published  <- unlist(xpathApply(subdoc, "//published", xmlValue))

            published  <- gsub("Z"," ", gsub("T"," ",published) )

            # Convert from GMT to local (Europe/Amsterdam) time
            time.gmt   <- as.POSIXct(published,"GMT")
            local.time <- format(time.gmt, tz="Europe/Amsterdam")

            title  <- unlist(xpathApply(subdoc, "//title", xmlValue))

            author <- unlist(xpathApply(subdoc, "//author/name",  xmlValue))

            tweet  <-  paste(local.time, " @", author, ":  ", title, sep="")

            entry.frame <- data.frame(tweet, author, local.time, stringsAsFactors=FALSE)
            tweet.count <- tweet.count + 1
            rownames(entry.frame) <- tweet.count
            tweets <- rbind(tweets, entry.frame)
          }
          page <- page + 1
          read.more <- (page <= 15)   # Seems to be 15 page limit
        }
      }

      names(tweets)

      # top 15 tweeters
      #sort(table(tweets$author),decreasing=TRUE)[1:15]

      write.table(tweets, file=paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"), sep = ";")

      Sys.sleep(d_time)

    } # end for (outer timing loop)

Recommended answer

Here's my solution, using try, to a similar problem with the Twitter API.

I was asking the Twitter API for the number of followers of each person in a long list of Twitter users. Before I added the try function, a user with a protected account would trigger an error and the loop would break. Using try lets the loop keep working by skipping on to the next person on the list.
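As a minimal, self-contained illustration of that skipping behaviour, here is a sketch that needs no Twitter access. The function `flaky_lookup` and the user names are hypothetical stand-ins, not part of twitteR; the function simply fails for one name so that the effect of try and next can be seen.

```r
# Hypothetical stand-in for an API call: fails for a "protected" account
flaky_lookup <- function(user) {
  if (user == "protected_user") stop("account is protected")
  nchar(user)  # dummy value standing in for a follower count
}

users <- c("alice", "protected_user", "bob")
followers <- rep(NA_integer_, length(users))

for (i in seq_along(users)) {
  # try() returns an object of class "try-error" instead of stopping the loop
  result <- try(flaky_lookup(users[i]), silent = TRUE)
  if (inherits(result, "try-error")) next  # skip this user, keep looping
  followers[i] <- result
}

print(followers)
```

The failing entry is left as NA, so skipped users can still be identified after the loop finishes.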

Here's the setup I used:

# load library
library(twitteR)
#
# Search Twitter for your term
s <- searchTwitter('#rstats', n=1500) 
# convert search results to a data frame
df <- do.call("rbind", lapply(s, as.data.frame)) 
# extract the usernames
users <- unique(df$screenName)
users <- sapply(users, as.character)
# make a data frame for the loop to work with 
users.df <- data.frame(users = users, 
                       followers = "", stringsAsFactors = FALSE)

And here's the loop with try to handle errors while populating users.df$followers with follower counts obtained from the Twitter API:

for (i in seq_len(nrow(users.df))) 
    {
    # ask for the follower count; try() captures the error instead of
    # stopping the loop if the account is protected or the request fails
    result <- try(getUser(users.df$users[i])$followersCount, silent = TRUE)
    # tell the loop to skip this user if an error occurred
    if (inherits(result, "try-error")) next
    # store the count already retrieved (no second API request needed)
    users.df$followers[i] <- result
    # tell the loop to pause for 60 s between iterations to 
    # avoid exceeding the Twitter API request limit
    print('Sleeping for 60 seconds...')
    Sys.sleep(60)
    }
#
# Now inspect users.df to see the follower data
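
The same idea could be applied to the fetch-and-parse step in the question's loop. The sketch below rests on an assumption: that the second error arises because a failed parse yields NULL, which is then passed on to getNodeSet. The names `fetch_with_retry` and `fake_fetch` are hypothetical. In the real script the function passed in would wrap the htmlTreeParse call, and the stub here only simulates a source that fails twice before succeeding.

```r
# Hypothetical helper: call fetch(url), retrying on error, and return NULL
# if every attempt fails so the caller can skip this page.
fetch_with_retry <- function(fetch, url, attempts = 3, wait = 0) {
  for (k in seq_len(attempts)) {
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(wait)  # pause before the next attempt
  }
  NULL
}

# Stand-in for the real fetch: fails twice, then succeeds
calls <- 0
fake_fetch <- function(url) {
  calls <<- calls + 1
  if (calls < 3) stop("error in creating parser")
  paste("parsed", url)
}

doc <- fetch_with_retry(fake_fetch, "http://example.com/feed")
if (is.null(doc)) {
  message("fetch failed, skipping this page")
} else {
  print(doc)
}
```

Checking for NULL before going any further lets the outer loop move on to the next page or the next interval instead of stopping.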
