Scrape password-protected website in R

Question

I'm trying to scrape data from a password-protected website in R. Reading around, it seems that the httr and RCurl packages are the best options for scraping with password authentication (I've also looked into the XML package).

The website I'm trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2

Here are my two attempts (replacing "username" with my username and "password" with my password):

#This returns "Status: 200" without the data from the page:
library(httr)
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))

#This returns the non-password protected preview (i.e., not the full page):
library(XML)
library(RCurl)
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))

I have looked at other relevant posts (links below), but can't figure out how to apply their answers to my case.

How to download a compressed file from an SSL page that requires cookies, using R

How to webscrape secured pages in R (https links) (using readHTMLTable from the XML package)?

Reading information from a password-protected site

R - RCurl scrape data from a password-protected site

http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold

Answer

I don't have an account to test with, but maybe this will work:

library(httr)
library(XML)

handle <- handle("http://subscribers.footballguys.com") 
path   <- "amember/login.php"

# fields found in the login form.
login <- list(
  amember_login = "username"
 ,amember_pass  = "password"
 ,amember_redirect_url = 
   "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)
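
The field names used above (amember_login, amember_pass, amember_redirect_url) come from the site's login form. If they are unclear or change, one way to check them is to fetch the form and list its input names; a rough, untested sketch, assuming the login page itself is reachable without authentication:

# Fetch the login page through the same handle and list the <input> names on it
login_page <- GET(handle = handle, path = path)
doc <- htmlParse(content(login_page, as = "text"), asText = TRUE)
xpathSApply(doc, "//form//input", xmlGetAttr, "name")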

Now, the response object might hold what you need (or maybe you can directly query the page of interest after the login request; I am not sure the redirect will work, but it is a field in the web form), and handle might be re-used for subsequent requests. Can't test it; but this works for me in many situations.
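
If the redirect in the form does not bring back the projections page itself, the same handle can be re-used to request it, since the session cookies set by the login POST are stored on the handle. A minimal, untested sketch (the path and query string are taken from the URL in the question):

# Re-use the authenticated handle; cookies from the login POST travel with it
proj_page <- GET(handle = handle,
                 path   = "myfbg/myviewprojections.php",
                 query  = list(projector = "2"))
stop_for_status(proj_page)   # raise an error if the request did not succeed
# proj_page can then be parsed the same way as `response` below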

You can use XML to parse the table:

> readHTMLTable(content(response))[[1]][1:5,]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
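
If the parse works, the first table can be kept as an ordinary data frame. A small follow-up sketch (untested; the FantPt column name is taken from the output above, and the file name is arbitrary):

proj <- readHTMLTable(content(response))[[1]]
proj$FantPt <- as.numeric(as.character(proj$FantPt))  # parsed values arrive as factor/character
proj <- proj[order(-proj$FantPt), ]                   # sort by projected fantasy points
write.csv(proj, "footballguys_projections.csv", row.names = FALSE)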
