R: scrape a blog for title, number of comments, and 'likes'

Question

I'm trying to use R to grab some information from a few blogs. The data I'd like to grab is:

1) Date posted
2) Blog Post Title
3) Number of Comments
4) Number of Facebook likes.

This blog here has all the fields I'm looking to collect.

Ideally I'd like a data frame that looks like this:

Post_Date      CommentCount       FB_Likes   Title
2012-12-05          1                 629      The James and Claudia Kripalu Workshop– The Daily Practice: Finding Success From Within
  ...              ...                ...          ...

Is there a way to do this in R? It seems like something that might be doable with RCurl but I'm not too familiar with html/xml/js/etc.

So far this is what I have:

library(RCurl)
library(XML)
xmlTreeParse(getURI("http://www.jamesaltucher.com"))

When I run this, I get errors saying that the opening and closing tags don't match.

NOTE: These are not my blogs so I don't have admin access to the blog or their FB account.

Recommended answer

Getting the Facebook like counts is hard, and I would be interested to see a solution for that part. The code below covers the titles and dates; I clean up the dates with gsub to get a tidy format.

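library(XML)
library(RCurl)

url.link <- 'http://www.jamesaltucher.com/'
# Download the raw HTML, then parse it with the lenient HTML parser
# (xmlTreeParse expects well-formed XML, which most web pages are not).
blog <- getURL(url.link)
blog <- htmlParse(blog, encoding = "UTF-8")

# Post titles: the link text inside each article's <h2>
titles <- xpathSApply(blog, "//*[@class='article']/h2/a", xmlValue)

# Post dates: keep only the text between "on" and "Post" in each span text node
dates <- xpathSApply(blog, "//*[@class='article']/h2/span/text()",
                     function(x) gsub('.*on(.*)Post.*', '\\1', xmlValue(x)))
# Drop the leftover "Posted by " text nodes, which carry no date
dates <- dates[dates != 'Posted by ']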
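To get from these two vectors to the data frame the question asks for, something along the following lines might work. This is only a sketch: it runs on a small inline HTML string (sample.html) that imitates the structure the XPath queries above assume, and both that markup and the month/day/year date format are assumptions, not what the real page necessarily contains. Comment counts and Facebook likes are left out because the answer does not extract them.

```r
library(XML)

# Hypothetical markup mimicking the structure the XPath queries assume;
# the real page may differ.
sample.html <- paste0(
  '<html><body>',
  '<div class="article"><h2><a href="#">First post</a>',
  '<span>Posted by <a href="#">James</a> on 12/05/2012 Post a comment</span></h2></div>',
  '<div class="article"><h2><a href="#">Second post</a>',
  '<span>Posted by <a href="#">James</a> on 11/28/2012 Post a comment</span></h2></div>',
  '</body></html>')

doc <- htmlParse(sample.html, asText = TRUE)

# Same queries as in the answer
titles <- xpathSApply(doc, "//*[@class='article']/h2/a", xmlValue)
dates  <- xpathSApply(doc, "//*[@class='article']/h2/span/text()",
                      function(x) gsub('.*on(.*)Post.*', '\\1', xmlValue(x)))

# Trim whitespace, then drop the "Posted by" text nodes that carry no date
dates <- trimws(dates)
dates <- dates[dates != "Posted by"]

# Assumed date format (month/day/year); adjust to whatever the page uses
posts <- data.frame(Post_Date = as.Date(dates, format = "%m/%d/%Y"),
                    Title     = titles,
                    stringsAsFactors = FALSE)
print(posts)
```

The two vectors line up only if every article yields exactly one title and one date, so it is worth checking that length(titles) equals length(dates) before building the data frame on a real page.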
