How to submit login form in Rvest package w/o button argument


Question

I am trying to scrape a web page that requires authentication, using html_session() and html_form() from the rvest package. I found this example provided by Hadley Wickham, but I have not been able to adapt it to my case.

united <- html_session("http://www.united.com/")
account <- united %>% follow_link("Account")
login <- account %>%
         html_nodes("form") %>%
         extract2(1) %>%
         html_form() %>%
         set_values(
                `ctl00$ContentInfo$SignIn$onepass$txtField` = "GY797363",
                `ctl00$ContentInfo$SignIn$password$txtPassword` = password)
account <- account %>% 
submit_form(login, "ctl00$ContentInfo$SignInSecure")

In my case, I can't find the field names to set in the form, so I am trying to pass the user name and password directly: set_values("email","password")

I also don't know how to refer to the submit button, so I tried: submit_form(account,login)

The error I got from the submit_form function is: Error in names(submits)[[1]] : subscript out of bounds

Any ideas on how to approach this are appreciated. Thank you.

Recommended answer

Currently, this is the same problem as the open issue #159 in the rvest package: submission fails when not all fields in a form have a type value. This bug may be fixed in a future release.

However, we can work around the issue by monkey patching the underlying function rvest:::submit_request.

The core problem is the helper function is_submit. Originally, it is defined like this:

is_submit <- function(x) tolower(x$type) %in% c("submit", 
        "image", "button")

Logical as this is, it fails in two scenarios:

  1. There is no type element.
  2. The type element is NULL.
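
To see why these two cases matter, here is a minimal base R sketch. The `fields` list is a hypothetical stand-in for `form$fields` (the field names are made up for illustration), and the explanation of `Filter` reflects base R's behavior rather than anything stated in the rvest issue.

```r
# The original helper, copied from above.
is_submit <- function(x) tolower(x$type) %in% c("submit", "image", "button")

# Hypothetical form fields: one with no type element, one with type = NULL,
# and one genuine submit button.
fields <- list(
  email    = list(name = "email"),
  password = list(name = "password", type = NULL),
  login    = list(name = "login", type = "submit")
)

# For a missing or NULL type, tolower(NULL) is character(0), so the helper
# returns a zero-length logical instead of FALSE.
no_type_result <- is_submit(fields$email)
print(length(no_type_result))

# Filter() unlists the per-field results, which drops the zero-length ones,
# so the remaining TRUE no longer lines up with the field it came from.
submits <- Filter(is_submit, fields)
print(names(submits))
```

With the zero-length results dropped, the selected field is no longer the real submit button, which is one way the lookup of `names(submits)` can go wrong later on.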

Both of these happen to occur on the United login form. We can resolve this by adding two checks inside the function.

custom.submit_request <- function (form, submit = NULL) 
{
  is_submit <- function(x) {
    # A field with a missing or NULL type can never be a submit target.
    if (!exists("type", x) || is.null(x$type)) {
      return(FALSE)
    }
    tolower(x$type) %in% c("submit", "image", "button")
  } 
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  if (is.null(submit)) {
    submit <- names(submits)[[1]]
    message("Submitting with '", submit, "'")
  }
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.\n", "Possible values: ",
         paste0(names(submits), collapse = ", "), call. = FALSE)
  }
  other_submits <- setdiff(names(submits), submit)
  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET", 
            call. = FALSE)
    method <- "GET"
  }
  url <- form$url
  fields <- form$fields
  fields <- Filter(function(x) length(x$value) > 0, fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  values <- pluck(fields, "value")
  names(values) <- names(fields)
  list(method = method, encode = form$enctype, url = url, values = values)
}
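
The guarded predicate can be checked in isolation before patching anything. The sketch below is a variation, not the answer's exact code: `is_submit_safe` is a hypothetical name, it uses `x[["type"]]` (which returns NULL for both a missing and a NULL element) in place of the `exists()` check, and the `fields` list is made up for illustration.

```r
# Variation on the guarded helper: a single NULL check covers both the
# "no type element" and the "type is NULL" cases.
is_submit_safe <- function(x) {
  type <- x[["type"]]
  if (is.null(type)) {
    return(FALSE)
  }
  tolower(type) %in% c("submit", "image", "button")
}

# Hypothetical stand-in for form$fields.
fields <- list(
  email    = list(name = "email"),
  password = list(name = "password", type = NULL),
  login    = list(name = "login", type = "Submit")
)

# Every field now yields exactly one TRUE/FALSE, so Filter() keeps the
# right element.
submits <- Filter(is_submit_safe, fields)
print(names(submits))
```

Because each field now yields a length-one logical, `Filter` selects only the `login` field here.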

To monkey patch, we need to use the R.utils package (install it via install.packages("R.utils") if you don't have it).

library(R.utils)

reassignInPackage("submit_request", "rvest", custom.submit_request)
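
If installing R.utils is not an option, a similar patch can be sketched with base R's `utils::assignInNamespace`. This is an alternative technique, not what the answer uses. `patch_submit_request` is a hypothetical helper name, and the sketch assumes the installed rvest version still has an internal function called `submit_request`; it returns FALSE without changing anything when that is not the case.

```r
# Sketch: replace an internal function of a package using base R only.
patch_submit_request <- function(replacement, pkg = "rvest",
                                 name = "submit_request") {
  # Do nothing if the package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  ns <- asNamespace(pkg)
  # Do nothing if this version of the package has no such function.
  if (!exists(name, envir = ns, inherits = FALSE)) {
    return(FALSE)
  }
  # Let the replacement see the package's other internal helpers.
  environment(replacement) <- ns
  utils::assignInNamespace(name, replacement, ns = pkg)
  TRUE
}

# Intended usage, with custom.submit_request defined as above:
# patch_submit_request(custom.submit_request)
```

Setting the replacement's environment to the package namespace is what lets internal helpers called inside it resolve after patching.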

From there, we can issue our own request.

account <- account %>% 
     submit_form(login, "ctl00$ContentInfo$SignInSecure")

This works!

(Well, "works" is a misnomer. Because United employs more aggressive authentication requirements, including a check for known browsers, this results in a 301 Unauthorized. However, it fixes the error.)

A full reproducible example involves a couple of other minor code changes:

library(magrittr)
library(rvest)

url <- "https://www.united.com/web/en-US/apps/account/account.aspx"
account <- html_session(url)
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "USER",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = "PASS")
account <- account %>% 
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
