How to submit a login form in the rvest package without a button argument
Question
I am trying to scrape a web page that requires authentication, using html_session() and html_form() from the rvest package. I found this example provided by Hadley Wickham, but I am not able to adapt it to my case.
united <- html_session("http://www.united.com/")
account <- united %>% follow_link("Account")
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "GY797363",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = password)
account <- account %>%
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
In my case, I can't find the field names to set in the form, so I tried passing the user name and password directly: set_values("email","password")
I also don't know how to refer to the submit button, so I tried: submit_form(account,login)
The error I got for the submit_form function is: Error in names(submits)[[1]] : subscript out of bounds
Any idea on how to go about this is appreciated. Thank you
Recommended answer
Currently, this is the same problem as open issue #159 in the rvest package, which arises when not all fields in a form have a type value. This bug may be fixed in a future release.
However, we can work around the issue by monkey patching the underlying function rvest:::submit_request.
The core problem is the helper function is_submit. Originally, it is defined like this:
is_submit <- function(x) tolower(x$type) %in% c("submit",
"image", "button")
As logical as this is, however, it fails in two scenarios:
- There is no type element.
- The type element is NULL.
Both of these happen to occur on the United login form. We can resolve this by adding two checks inside the function.
custom.submit_request <- function(form, submit = NULL) {
  # Patched helper: treat a field with a missing or NULL type as "not a submit".
  is_submit <- function(x) {
    if (!exists("type", x) || is.null(x$type)) {
      return(FALSE)
    }
    tolower(x$type) %in% c("submit", "image", "button")
  }

  # Collect every field that can act as a submission target.
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  # Default to the first submit field when none is named.
  if (is.null(submit)) {
    submit <- names(submits)[[1]]
    message("Submitting with '", submit, "'")
  }
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.\n",
         "Possible values: ", paste0(names(submits), collapse = ", "),
         call. = FALSE)
  }
  other_submits <- setdiff(names(submits), submit)

  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET",
            call. = FALSE)
    method <- "GET"
  }

  url <- form$url
  # Keep only fields that carry a value, and drop the unused submit buttons.
  fields <- form$fields
  fields <- Filter(function(x) length(x$value) > 0, fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  # pluck() here is rvest's internal helper, as in the original function.
  values <- pluck(fields, "value")
  names(values) <- names(fields)

  list(method = method, encode = form$enctype, url = url, values = values)
}
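To see what the two added checks change, the helper can be tried in isolation. The following is a minimal sketch in base R, not part of rvest: the fields list is hypothetical and only mimics the shape of form$fields, and the names is_submit_original and is_submit_patched are made up for this illustration.

```r
# Hypothetical field entries, shaped like the elements of form$fields.
fields <- list(
  user   = list(name = "user", type = "text", value = ""),
  token  = list(name = "token", value = "abc"),             # no type element
  hidden = list(name = "hidden", type = NULL, value = ""),  # type is NULL
  signin = list(name = "signin", type = "Submit", value = "Sign In")
)

# Original helper: yields a zero-length logical, not FALSE, when type is absent.
is_submit_original <- function(x) {
  tolower(x$type) %in% c("submit", "image", "button")
}

# Patched helper with the same two checks used in custom.submit_request.
is_submit_patched <- function(x) {
  if (!exists("type", x) || is.null(x$type)) {
    return(FALSE)
  }
  tolower(x$type) %in% c("submit", "image", "button")
}

length(is_submit_original(fields$token))  # zero-length result
is_submit_patched(fields$token)           # a single FALSE
is_submit_patched(fields$hidden)          # a single FALSE
names(Filter(is_submit_patched, fields))  # only the submit field remains
```

Because the patched helper always returns exactly one logical value per field, Filter() gets one answer for every element of the list, which is what it needs to pick out the submit fields reliably.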
To monkey patch, we need to use the R.utils package (install it via install.packages("R.utils") if you don't have it).
library(R.utils)
reassignInPackage("submit_request", "rvest", custom.submit_request)
From there, we can issue our own request.
account <- account %>%
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
This works!
(Well, "works" is a misnomer. Because United employs more aggressive authentication requirements, including checking for known browsers, this results in a 301 Unauthorized. However, it fixes the error.)
A full reproducible example involves a couple of other minor code changes:
library(magrittr)
library(rvest)
url <- "https://www.united.com/web/en-US/apps/account/account.aspx"
account <- html_session(url)
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "USER",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = "PASS")
account <- account %>%
  submit_form(login, "ctl00$ContentInfo$SignInSecure")