How can I read and parse the contents of a webpage in R


Question

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. How can I do that?

Solution

I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to run regular expressions on HTML, so you will definitely want to parse this with the XML package.

Here's an example to get you started:

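library(RCurl)
library(XML)
# download the page as a single string
webpage <- getURL("http://www.haaretz.com/")
# split the string into a vector of lines
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# build an internal document tree, silently ignoring HTML parse errors
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables: extract the text of every <table> element
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))   # one element per line
x <- gsub("\t","",x)             # remove tabs
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)   # trim leading/trailing whitespace
x <- x[!(x %in% c("", "|"))]     # drop empty lines and lone "|" separators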

This results in a character vector of mostly just webpage text (along with some JavaScript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"           
[4] "  Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()" 
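
Since the live page changes over time and may no longer contain tables, here is a minimal offline sketch of the same idea: parse, select nodes with XPath, then tidy the text. The HTML string is a made-up stand-in for a downloaded page (it only borrows a few strings from the output above). `htmlParse(..., asText = TRUE)` and `trimws()` are substitutions for the `htmlTreeParse()` call and the `sub()` regex in the answer, not what the answer itself uses.

```r
library(XML)

# Hypothetical HTML standing in for a downloaded page (an assumption for
# illustration, not the real haaretz.com markup)
html <- paste0(
  "<html><body>",
  "<table><tr><td>\tSubscribe to Print Edition  \n|\nIsrael Time: 16:48 (EST+7)</td></tr></table>",
  "<script>function chkSearch() {}</script>",
  "<table><tr><td>  Make Haaretz your homepage</td></tr></table>",
  "</body></html>"
)

# Parse the string itself (asText = TRUE) instead of treating it as a file or URL
doc <- htmlParse(html, asText = TRUE)

# Text content of every <table> element, one string per table
x <- xpathSApply(doc, "//table", xmlValue)

# Split into lines, trim surrounding whitespace, drop empty and "|" entries
x <- unlist(strsplit(x, "\n"))
x <- trimws(x)
x <- x[!(x %in% c("", "|"))]
print(x)
```

Because the `<script>` element in this sample sits outside the tables, its contents are not selected by `//table`. On a real page, scripts nested inside tables would still come through, which is why some JavaScript shows up in the output above.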
