How can I read and parse the contents of a webpage in R
Problem Description
I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.
Solution

Not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to run regular expressions over HTML, so you will definitely want to parse it with the XML package.
Here's an example to get you started:
require(RCurl)
require(XML)

webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)

# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)

# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t", "", x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)
x <- x[!(x %in% c("", "|"))]
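The parse-then-XPath step above can be tried without any network access by feeding htmlTreeParse an inline HTML snippet instead of a fetched page. This is only a sketch: the tiny table below is invented for illustration, standing in for whatever getURL would return.

```r
library(XML)

# A tiny inline HTML document standing in for the fetched page
# (invented for illustration; a live page would come from getURL)
html <- "<html><body><table><tr><td>cell one</td><td>cell two</td></tr></table></body></html>"

# Parse into an internal document, silently tolerating malformed markup
doc <- htmlTreeParse(html, error = function(...) {}, useInternalNodes = TRUE)

# Pull the text content of every table cell via XPath
cells <- xpathSApply(doc, "//td", xmlValue)
print(cells)  # "cell one" "cell two"
```

The same pattern scales to the real page: only the XPath expression changes (e.g. "//*/table" to grab whole tables rather than individual cells).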
This results in a character vector of mostly just webpage text (along with some javascript):
> head(x)
[1] "Subscribe to Print Edition"               "Fri., December 04, 2009 Kislev 17, 5770"  "Israel Time: 16:48 (EST+7)"
[4] "Make Haaretz your homepage"               "/*check the search form*/"                "function chkSearch()"
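The regular-expression clean-up at the end of the answer is easier to follow on a small hand-made vector, where you can see what each step removes. The sample strings here are invented for illustration; the pipeline itself is exactly the one from the answer.

```r
# Sample strings mimicking raw xmlValue output (invented for illustration):
# stray tabs, padding spaces, a bare pipe, and an embedded newline
x <- c("  \tSubscribe to Print Edition\t ", "|", "line one\nline two")

x <- unlist(strsplit(x, "\n"))    # split embedded newlines into separate elements
x <- gsub("\t", "", x)            # drop tab characters
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)  # trim outer whitespace
x <- x[!(x %in% c("", "|"))]      # discard empty strings and bare table separators

print(x)  # "Subscribe to Print Edition" "line one" "line two"
```

Note the perl = TRUE in the trimming step: the lazy quantifier .*? in the capture group needs Perl-compatible regex semantics to leave the trailing whitespace outside the match.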