How can I read and parse the contents of a webpage in R
Problem Description
I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.
Solution

Not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to run regular expressions over HTML, so you will definitely want to parse it with the XML package.
Here's an example to get you started:
require(RCurl)
require(XML)

webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)

# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)

# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t", "", x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)
x <- x[!(x %in% c("", "|"))]
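The parse-then-XPath step above can be tried without any network access by feeding htmlTreeParse an inline HTML snippet instead of a fetched page. This is only a sketch: the tiny table below is invented for illustration, standing in for whatever getURL would return.

```r
library(XML)

# A tiny inline HTML document standing in for the fetched page
# (invented for illustration; a live page would come from getURL)
html <- "<html><body><table><tr><td>cell one</td><td>cell two</td></tr></table></body></html>"

# Parse into an internal document, silently tolerating malformed markup
doc <- htmlTreeParse(html, error = function(...) {}, useInternalNodes = TRUE)

# Pull the text content of every table cell via XPath
cells <- xpathSApply(doc, "//td", xmlValue)
print(cells)  # "cell one" "cell two"
```

The same pattern scales to the real page: only the XPath expression changes (e.g. "//*/table" to grab whole tables rather than individual cells).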
This results in a character vector of mostly just webpage text (along with some javascript):
> head(x)
[1] "Subscribe to Print Edition"               "Fri., December 04, 2009 Kislev 17, 5770"  "Israel Time: 16:48 (EST+7)"
[4] "Make Haaretz your homepage"               "/*check the search form*/"                "function chkSearch()"
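The regular-expression clean-up at the end of the answer is easier to follow on a small hand-made vector, where you can see what each step removes. The sample strings here are invented for illustration; the pipeline itself is exactly the one from the answer.

```r
# Sample strings mimicking raw xmlValue output (invented for illustration):
# stray tabs, padding spaces, a bare pipe, and an embedded newline
x <- c("  \tSubscribe to Print Edition\t ", "|", "line one\nline two")

x <- unlist(strsplit(x, "\n"))    # split embedded newlines into separate elements
x <- gsub("\t", "", x)            # drop tab characters
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)  # trim outer whitespace
x <- x[!(x %in% c("", "|"))]      # discard empty strings and bare table separators

print(x)  # "Subscribe to Print Edition" "line one" "line two"
```

Note the perl = TRUE in the trimming step: the lazy quantifier .*? in the capture group needs Perl-compatible regex semantics to leave the trailing whitespace outside the match.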