How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?
Question
There are good answers on SO about how to use readHTMLTable from the XML package, and I did that with regular http pages; however, I am not able to solve my problem with https pages.
I am trying to read the table on this website (url string):
library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)
But I get this error: File https://ned.nih.gov/search/Vi... does not exist.
I tried to get past the https problem with the first 2 lines below, based on solutions found via Google (like here: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/).
This trick helps to see more of the page, but any attempts to extract the table are not working. Any advice is appreciated. I need the table fields like Organization, Organizational Title, and Manager.
#attempt to get past the https problem
library(RCurl)  # getURL() below comes from RCurl
raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html;
...
h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...
Answer
The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.
Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it (one possible shortcut is sketched after the results below).
library("httr")
library("XML")
# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page
page <- GET(
"https://ned.nih.gov/",
path="search/ViewDetails.aspx",
query="NIHID=0010121048",
config(cainfo = cafile)
)
# Use regex to extract the desired table
x <- text_content(page)
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)
# Parse the table
readHTMLTable(tab)
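The sub() call cuts the <table class="grid"> element out of the raw text before parsing; handing the full page straight to readHTMLTable() is presumably what caused the trouble mentioned above.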
Results:
$ctl00_ContentPlaceHolder_dvPerson
V1 V2
1 Legal Name: Dr Francis S Collins
2 Preferred Name: Dr Francis Collins
3 E-mail: francis.collins@nih.gov
4 Location: BG 1 RM 1261 CENTER DR BETHESDA MD 20814
5 Mail Stop:
6 Phone: 301-496-2433
7 Fax:
8 IC: OD (Office of the Director)
9 Organization: Office of the Director (HNA)
10 Classification: Employee
11 TTY:
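As for an easier way: one likely culprit in the question's failed attempt is that htmlParse() treats a plain character string as a file name unless asText = TRUE is passed, which would explain the "File ... does not exist" error. Below is a minimal sketch under that assumption, reusing getURL() from RCurl as in the question; it is untested against the live site.
# A possible shortcut: fetch with RCurl, then parse the string directly
library(RCurl)
library(XML)
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
# Fetch the page over https (htmlParse() alone cannot negotiate SSL)
raw <- getURL(url, followlocation = TRUE, cainfo = cafile)
# asText = TRUE tells htmlParse() the input is HTML content, not a file path
doc <- htmlParse(raw, asText = TRUE)
tables <- readHTMLTable(doc)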
Get httr here: http://cran.r-project.org/web/packages/httr/index.html
Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html