How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?


Problem description

There are good answers on SO about how to use readHTMLTable from the XML package, and I did that with regular http pages; however, I am not able to solve my problem with https pages.

I am trying to read the table on this website (url string):

library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)

But I get this error: File https://ned.nih.gov/search/Vi...does not exist.

I tried to get past the https problem with this (first 2 lines below), based on a solution found via Google (e.g. here: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/).

This trick helps to see more of the page, but any attempts to extract the table are not working. Any advice appreciated. I need the table fields like Organization, Organizational Title, Manager.

 #attempt to get past the https problem 
 library(RCurl)  # getURL() is provided by RCurl
 raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
 head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
...
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...

Recommended answer

The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.

library("httr")
library("XML")

# Define certicificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile)
)

# Use regex to extract the desired table
x <- text_content(page)
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(tab)

Results:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â
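
As the answer says, there are probably easier ways of doing this. One possibility (only a sketch, not tested against the live page) is to skip the regex step entirely: parse the full response text with htmlParse(..., asText = TRUE), so the string is treated as HTML rather than as a file name, and let readHTMLTable() return every table on the page keyed by its id. The id ctl00_ContentPlaceHolder_dvPerson is taken from the result above; everything else reuses the code from the answer.

library("httr")
library("XML")

cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
page <- GET(
  "https://ned.nih.gov/", 
  path = "search/ViewDetails.aspx", 
  query = "NIHID=0010121048",
  config(cainfo = cafile)
)

# Treat the response text as HTML (asText = TRUE), not as a file path;
# passing the raw string without this is what triggered the
# "File ... does not exist" errors in the question.
doc <- htmlParse(text_content(page), asText = TRUE)

# readHTMLTable() on a parsed document returns a named list of all tables;
# pick the person-details table by the id seen in the output above.
tables <- readHTMLTable(doc)
tables[["ctl00_ContentPlaceHolder_dvPerson"]]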

Get httr here: http://cran.r-project.org/web/packages/httr/index.html

Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
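
If you would rather stay with RCurl alone (the route the question started on), the same CA bundle can be passed to getURL() and the downloaded string parsed directly. This is only a sketch along the lines of the RCurl FAQ, not part of the original answer.

library("RCurl")
library("XML")

url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Fetch the https page with the CA bundle shipped with RCurl.
raw <- getURL(url, followlocation = TRUE, cainfo = cafile)

# Parse the downloaded string (asText = TRUE, as above) and list all tables.
readHTMLTable(htmlParse(raw, asText = TRUE))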
