How to isolate a single element from a scraped web page in R
Problem description
I want to use R to scrape this page: (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html ) and others, to get the goal scorers and times.
So far, this is what I've got:
require(RCurl)
require(XML)
theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
webpage <- getURL(theURL, header=FALSE, verbose=TRUE)
webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpagecont, error=function(...){}, useInternalNodes = TRUE)
and the pagetree object now contains a pointer to my parsed html (I think). The part I want is:
<div class="cont"><ul>
<div class="bold medium">Goals scored</div>
<li>Philipp LAHM (GER) 6', </li>
<li>Paulo WANCHOPE (CRC) 12', </li>
<li>Miroslav KLOSE (GER) 17', </li>
<li>Miroslav KLOSE (GER) 61', </li>
<li>Paulo WANCHOPE (CRC) 73', </li>
<li>Torsten FRINGS (GER) 87'</li>
</ul></div>
But I'm now lost as to how to isolate them, and frankly xpathSApply and xpathApply confuse the beejeebies out of me!
So, does anyone know how to formulate a command to suck out the element contained within the <div class="cont"> tags?
Solution
These questions are very helpful when dealing with web scraping and XML in R:
- Scraping html tables into R data frames using the XML package
- How to transform XML data into a data.frame?
With regard to your particular example, while I'm not sure what you want the output to look like, this gets the "goals scored" text as a character vector:
library(XML)
theURL <- "http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
# Parse the page straight from the URL
fifa.doc <- htmlParse(theURL)
# Text content of every div with class 'cont'
fifa <- xpathSApply(fifa.doc, "//*/div[@class='cont']", xmlValue)
goals.scored <- grep("Goals scored", fifa, value=TRUE)
The xpathSApply function applies xmlValue to every node that matches the XPath expression and returns the results as a vector. Note how I'm looking for a div with class='cont'. Class values are frequently a good way to pick elements out of an HTML document, because pages tend to use them to label blocks by their role.
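Since xpathApply and xpathSApply were the confusing part, here is a small sketch that can be run without fetching the page. It parses a simplified, hand-written fragment modeled on the markup in the question (the html string and the variable names scorers_list and scorers are assumptions for illustration) and selects the li nodes directly instead of the whole div:

```r
library(XML)

# Simplified stand-in for the real page (hand-written, an assumption for illustration)
html <- "<html><body><div class='cont'>
<div class='bold medium'>Goals scored</div>
<ul>
<li>Philipp LAHM (GER) 6', </li>
<li>Paulo WANCHOPE (CRC) 12', </li>
<li>Torsten FRINGS (GER) 87'</li>
</ul></div></body></html>"

doc <- htmlParse(html, asText = TRUE)

# xpathApply returns a list with one element per matching node
scorers_list <- xpathApply(doc, "//div[@class='cont']//li", xmlValue)

# xpathSApply runs the same query but simplifies the result to a character vector
scorers <- xpathSApply(doc, "//div[@class='cont']//li", xmlValue)

# Drop surrounding whitespace and the trailing comma separators
scorers <- sub(",\\s*$", "", trimws(scorers))
print(scorers)
```

Selecting the li nodes gives one string per goal, so there is no need to split on ", " or strip the "Goals scored" label afterwards. Whether the same XPath works on the live page depends on its actual markup.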
You can clean this up however you want:
> gsub("Goals scored", "", strsplit(goals.scored, ", ")[[1]])
[1] "Philipp LAHM (GER) 6'" "Paulo WANCHOPE (CRC) 12'" "Miroslav KLOSE (GER) 17'" "Miroslav KLOSE (GER) 61'" "Paulo WANCHOPE (CRC) 73'"
[6] "Torsten FRINGS (GER) 87'"
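If you want scorers and times as separate columns rather than raw strings, each entry can be split with a regular expression. This is a minimal sketch in base R, assuming every entry has the form NAME (TEAM) MINUTE' as in the output above. The goals vector is typed in by hand, and the pattern and column names are my own choices; entries in any other format would need a looser pattern.

```r
# Cleaned-up entries, copied from the output above
goals <- c("Philipp LAHM (GER) 6'", "Paulo WANCHOPE (CRC) 12'",
           "Miroslav KLOSE (GER) 17'", "Miroslav KLOSE (GER) 61'",
           "Paulo WANCHOPE (CRC) 73'", "Torsten FRINGS (GER) 87'")

# Pattern: player name, three-letter team code in parentheses, minute followed by '
pattern <- "^(.*) \\(([A-Z]{3})\\) ([0-9]+)'$"

# One row per goal; each column pulls out one capture group
goal.table <- data.frame(
  player = sub(pattern, "\\1", goals),
  team   = sub(pattern, "\\2", goals),
  minute = as.integer(sub(pattern, "\\3", goals)),
  stringsAsFactors = FALSE
)
print(goal.table)
```

From there, table(goal.table$player) counts goals per player.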