Using rvest or httr to log in to non-standard forms on a webpage


Question

I am trying to use rvest to scrape a webpage that requires logging in through a form with an email address and password.

rm(list=ls())
library(rvest)

### Trying to sign into a form using email/password 

url       <-"http://www.perfectgame.org/"   ## page to spider
pgsession <-html_session(url)               ## create session
pgform    <-html_form(pgsession)[[1]]       ## pull form from session

set_values(pgform, `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com") 
set_values(pgform, `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")

submit_form(pgsession,pgform,submit=`ctl00$Header2$HeaderTop1$Button1`)

This gives me the following error message:

Error in submit_request(form, submit) : 
  object 'ctl00$Header2$HeaderTop1$Button1' not found

If I submit the form without specifying the submit parameter, I get this:

Submitting with 'ctl00$Header2$HeaderTop1$Button1'
Error in function (type, msg, asError = TRUE)  : <url> malformed

I also tried passing the parameters directly to httr, as described in the question "How can I POST a simple HTML form in R?", but the "submit" parameter did not accept the submit button name with backticks (``), with quotation marks, or without any quotes:

library(httr)

url <- "http://www.perfectgame.org/Rankings/Players/Default.aspx?gyear=2015&num=500"

fd <- list(
    submit = `ctl00$Header2$HeaderTop1$Button1`,
    `ctl00$Header2$HeaderTop1$tbUsername`  = "myemail@gmail.com",
    `ctl00$Header2$HeaderTop1$tbPassword`  = "mypassword")

resp<-POST(url, body=fd, encode="form")
content(resp) 

Any ideas for how I can log in from an R session and scrape the data that is behind the login wall?

Recommended answer

Your rvest code isn't storing the modified form: set_values() returns a new form rather than changing pgform in place, so in your example you're submitting the original pgform without the values filled in. Try:

library(rvest)

url       <-"http://www.perfectgame.org/"   ## page to spider
pgsession <-html_session(url)               ## create session
pgform    <-html_form(pgsession)[[1]]       ## pull form from session

# Note the new variable assignment 

filled_form <- set_values(pgform,
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com", 
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")

submit_form(pgsession,filled_form)
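Why the assignment matters can be checked without touching the network. R functions work on copies of their arguments, so a function that fills in a field hands back a new object and leaves the original alone. The sketch below uses a plain list and a hypothetical helper, set_field(), as stand-ins for a form and set_values(); it only illustrates the behaviour and is not rvest's implementation.

```r
# A plain list standing in for a parsed form (illustrative only, not an rvest object)
form <- list(username = "", password = "")

# Hypothetical helper that mimics how set_values() behaves:
# it modifies its local copy of the form and returns that copy.
set_field <- function(f, name, value) {
  f[[name]] <- value
  f
}

# Result not stored: the original `form` is left unchanged
set_field(form, "username", "myemail@gmail.com")
form$username

# Result stored: the new object carries the value
filled <- set_field(form, "username", "myemail@gmail.com")
filled$username
```

The first call corresponds to the two bare set_values() calls in the question, and the second to the filled_form assignment.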

With the filled form stored and submitted, I now see a 200 status code response instead of an error. Because the desired submit button appears to be the first submit button in the form, we don't need to pass it as an argument. If we did need to, we would pass its name as a string (straight quotes, not backticks).
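The difference between backticks and straight quotes is plain R behaviour and can be seen in isolation. The following is a minimal sketch using only base R; the variable names button, result and fields are made up for the illustration.

```r
# In straight quotes, the button name is a string, i.e. ordinary data.
button <- "ctl00$Header2$HeaderTop1$Button1"
is.character(button)

# In backticks, the same text in value position is a variable name,
# so R tries to look it up. No such variable exists, so evaluation fails
# with the "object ... not found" error seen in the question.
result <- tryCatch(
  `ctl00$Header2$HeaderTop1$Button1`,
  error = function(e) e
)
inherits(result, "error")

# Backticks on the left of `=` only name a list element, which is why
# they work for the username and password fields.
fields <- list(`ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com")
names(fields)
```

Applied to the code in this thread, that suggests a call of the form submit_form(pgsession, filled_form, submit = "ctl00$Header2$HeaderTop1$Button1") if the button ever has to be named explicitly.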
