R: scrape a blog for title, number of comments and 'likes'
Question
I'm trying to use R to grab some information from a few blogs. The data I'd like to grab is:
1) Date posted
2) Blog Post Title
3) Number of Comments
4) Number of Facebook likes.
This blog here has all the fields I'm looking to collect.
Ideally I'd like a data frame that looks like this:
Post_Date   CommentCount  FB_Likes  Title
2012-12-05  1             629       The James and Claudia Kripalu Workshop– The Daily Practice: Finding Success From Within
...         ...           ...       ...
Is there a way to do this in R? It seems like something that might be doable with RCurl, but I'm not too familiar with html/xml/js/etc.
So far this is what I have:
library(RCurl)
library(XML)
xmlTreeParse(getURI("http://www.jamesaltucher.com"))
When I run this, I get errors saying that the opening and closing brackets don't match.
NOTE: These are not my blogs so I don't have admin access to the blog or their FB account.
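The bracket-mismatch errors are what a strict XML parser reports when it is handed ordinary HTML, which often leaves tags unclosed. Here is a minimal sketch of the difference, using a made-up HTML string (an assumption, not the blog's real markup) and the XML package's htmlParse in place of xmlTreeParse:

```r
library(XML)

# Hypothetical HTML: <br> and <p> are never closed, which browsers accept
# but which is not well-formed XML.
html <- "<html><body><h2>Title<br></h2><p>First paragraph</body></html>"

# Strict XML parsing stops with a tag-mismatch error; catch it and return NULL.
strict <- tryCatch(xmlParse(html, asText = TRUE), error = function(e) NULL)

# The HTML parser is lenient and repairs the unclosed tags.
lenient <- htmlParse(html, asText = TRUE)

# XPath queries work on the repaired tree.
heading <- xpathSApply(lenient, "//h2", xmlValue)
print(is.null(strict))
print(heading)
```

The same XPath functions work on the tree that htmlParse returns, so only the parsing call has to change.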
Recommended answer
Getting the Facebook like count is hard; I would be interested to see a solution for that. The code below extracts the titles and the dates, and I clean up the dates with gsub to get a tidy format.
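library(XML)
library(RCurl)

# Download the page and parse it with the lenient HTML parser
url.link <- 'http://www.jamesaltucher.com/'
blog <- getURL(url.link)
blog <- htmlParse(blog, encoding = "UTF-8")

# Post titles: the link text inside each article's <h2>
titles <- xpathSApply(blog, "//*[@class='article']/h2/a", xmlValue)

# Post dates: keep the text between "on" and "Post" in each text node of the <h2> span
dates <- xpathSApply(blog, "//*[@class='article']/h2/span/text()",
                     function(x) gsub('.*on(.*)Post.*', '\\1', xmlValue(x)))

# Drop the leftover 'Posted by ' text nodes, which the pattern does not match
dates <- dates[dates != 'Posted by ']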
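To get from the two vectors to the data frame asked for in the question, something along these lines might work. It is only a sketch: the inline HTML and the "Posted on Month D, YYYY" date text are hypothetical stand-ins for the blog's real markup, and the comment and Facebook like counts are left out because their markup is not shown here.

```r
library(XML)

# Hypothetical markup imitating the structure the XPath queries assume.
html <- '
<div class="article">
  <h2><a href="#">First post</a><span>Posted on December 5, 2012</span></h2>
</div>
<div class="article">
  <h2><a href="#">Second post</a><span>Posted on December 4, 2012</span></h2>
</div>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

titles    <- xpathSApply(doc, "//*[@class='article']/h2/a", xmlValue)
raw_dates <- xpathSApply(doc, "//*[@class='article']/h2/span", xmlValue)

# Pull out month name, day and year, then build ISO dates.
# month.name is a base R constant, so this does not depend on the locale.
m <- regmatches(raw_dates,
                regexec("([A-Za-z]+) ([0-9]+), ([0-9]{4})", raw_dates))
iso <- vapply(m, function(p) {
  sprintf("%s-%02d-%02d", p[4], match(p[2], month.name), as.integer(p[3]))
}, character(1))

posts <- data.frame(Post_Date = as.Date(iso),
                    Title = titles,
                    stringsAsFactors = FALSE)
print(posts)
```

On the real page the regular expression would need adjusting to whatever text the date span actually contains.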