Web-Scraping with R


Problem description

I'm having some problems scraping data from a website. First, I have not a lot of experience with webscraping... My intended plan is to scrape some data using R from the following website: http://spiderbook.com/company/17495/details?rel=300795

Especially, I want to extract the links to the articles on this site.

My current idea:

library(XML)
xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
sourcesChar <- sapply(sources, saveXML)  # nodes as character strings
sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " ")))
sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"', x))

But this doesn't bring up the intended information. Some help would be really appreciated here! Thanks!

Best, Christoph

Recommended answer

You picked a tough problem to learn on.

This site uses javascript to load the article information. In other words, the link loads a set of scripts which run when the page loads to grab the information (from a database, probably) and insert it into the DOM. htmlParse(...) just grabs the base html and parses that. So the links you want are simply not present.
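
As a quick sanity check (a minimal sketch using only the XML package, and assuming the class="doclink" selector discussed below), you can confirm that the article anchors are absent from the statically served HTML:

library(XML)
static <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
length(getNodeSet(static, '//a[@class="doclink"]'))  # expected 0 -- the anchors are injected by javascript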

AFAIK the only way around this is to use the RSelenium package. This package essentially allows you to pass the base html through what looks like a browser simulator, which does run the scripts. The problem with RSelenium is that you not only need to download the package, but also a "Selenium Server". This link has a nice introduction to RSelenium.

Once you've done that, inspection of the source in a browser shows that the article links are all in the href attribute of anchor tags which have class=doclink. This is straightforward to extract using xPath. NEVER NEVER NEVER use regex to parse XML.

library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer()        # download Selenium Server, if not already present
startServer()           # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open()            # open connection
remDr$navigate(url)     # grab and process the page (including scripts)
doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"                                                                                    
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
# [7] "http://www.calcharge.org/2014/07/"                                                                                    
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
