How to submit a login form in the rvest package without a button argument

Question

I am trying to scrape a web page that requires authentication, using html_session() and html_form() from the rvest package. I found this example provided by Hadley Wickham, but I am not able to adapt it to my case.

united <- html_session("http://www.united.com/")
account <- united %>% follow_link("Account")
login <- account %>%
         html_nodes("form") %>%
         extract2(1) %>%
         html_form() %>%
         set_values(
                `ctl00$ContentInfo$SignIn$onepass$txtField` = "GY797363",
                `ctl00$ContentInfo$SignIn$password$txtPassword` = password)
account <- account %>% 
submit_form(login, "ctl00$ContentInfo$SignInSecure")

In my case, I can't find the field names to set in the form, so I tried passing the user name and password directly: set_values("email","password")

I also don't know how to refer to the submit button, so I tried: submit_form(account,login)

The error I got from the submit_form function is: Error in names(submits)[[1]] : subscript out of bounds

Any ideas on how to go about this are appreciated. Thank you.

Recommended answer

Currently, this is the same problem as open issue #159 in the rvest package: submission breaks when not all fields in a form have a type value. This bug may be fixed in a future release.
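For context, the error message from the question can be reproduced in base R without rvest. This is a minimal sketch, assuming the failure comes from an empty list of detected submit fields; the `fields` list is a made-up stand-in for a parsed form, not rvest's actual data structure.

```r
# Hypothetical stand-in for a parsed form: two named fields, no submit control.
fields <- list(
  email    = list(type = "text",     value = ""),
  password = list(type = "password", value = "")
)

# Keep only fields whose type is exactly "submit"; none qualify here.
submits <- Filter(function(f) identical(f$type, "submit"), fields)
length(submits)  # 0

# Asking for the first name of an empty named list raises an error.
result <- tryCatch(names(submits)[[1]], error = function(e) e)
inherits(result, "error")  # TRUE
```

When no submit field is detected, there is no first name to fall back on, which is the situation the checks in the patched function are meant to handle.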

However, we can work around the issue by monkey patching the underlying function rvest:::submit_request.

The core problem is the helper function is_submit. Originally, it is defined like this:

is_submit <- function(x) tolower(x$type) %in% c("submit", 
        "image", "button")

As logical as this is, it fails in two scenarios:

  1. There is no type element.
  2. The type element is NULL.
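Both scenarios can be checked in base R. The sketch below reuses the is_submit definition quoted above; the two field lists are made-up examples, not taken from the United form.

```r
# The helper as quoted above.
is_submit <- function(x) tolower(x$type) %in% c("submit", "image", "button")

no_type   <- list(name = "email", value = "")   # scenario 1: no type element
null_type <- list(name = "email", type = NULL)  # scenario 2: type is NULL

is_submit(list(type = "Submit"))  # TRUE
is_submit(no_type)                # logical(0)
is_submit(null_type)              # logical(0)
```

In both cases the helper returns a zero-length logical instead of a single FALSE. Filter() expects one TRUE or FALSE per field, so these empty results can throw off which fields end up selected.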

Both of these happen to occur on the United login form. We can resolve this by adding two checks inside the function.

custom.submit_request <- function (form, submit = NULL) 
{
  is_submit <- function(x) {
    # Treat a field with a missing or NULL type as a non-submit field.
    if (!exists("type", x) || is.null(x$type)) {
      return(FALSE)
    }
    tolower(x$type) %in% c("submit", "image", "button")
  } 
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  if (is.null(submit)) {
    submit <- names(submits)[[1]]
    message("Submitting with '", submit, "'")
  }
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.\n", "Possible values: ", 
         paste0(names(submits), collapse = ", "), call. = FALSE)
  }
  other_submits <- setdiff(names(submits), submit)
  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET", 
            call. = FALSE)
    method <- "GET"
  }
  url <- form$url
  fields <- form$fields
  fields <- Filter(function(x) length(x$value) > 0, fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  values <- pluck(fields, "value")
  names(values) <- names(fields)
  list(method = method, encode = form$enctype, url = url, values = values)
}
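The effect of the guard can be tried on its own, without rvest or a network connection. This is a minimal sketch: `is_submit_safe` is a simplified stand-in for the inner helper above (it relies on is.null() alone, since `x$type` is NULL whether the element is absent or explicitly NULL), and `fields` is a made-up form.

```r
# Simplified version of the guarded helper (hypothetical name).
is_submit_safe <- function(x) {
  if (is.null(x$type)) {
    return(FALSE)  # missing or NULL type: not a submit field
  }
  tolower(x$type) %in% c("submit", "image", "button")
}

# Made-up fields: one without a type, one with a NULL type, one real submit.
fields <- list(
  user  = list(name = "user",  value = ""),
  pass  = list(name = "pass",  type = NULL, value = ""),
  login = list(name = "login", type = "submit", value = "Sign in")
)

submits <- Filter(is_submit_safe, fields)
names(submits)  # "login"
```

Because every field now yields exactly one TRUE or FALSE, Filter() picks out only the real submit field.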

To monkey patch, we need the R.utils package (install it via install.packages("R.utils") if you don't have it).

library(R.utils)

reassignInPackage("submit_request", "rvest", custom.submit_request)

From there, we can issue our own request.

account <- account %>% 
     submit_form(login, "ctl00$ContentInfo$SignInSecure")

That works!

(Well, "works" is a misnomer. Because United employs more aggressive authentication requirements, including a check for known browsers, this results in a 301 Unauthorized. However, it fixes the error.)

A full reproducible example involves a couple of other minor code changes:

library(magrittr)
library(rvest)

url <- "https://www.united.com/web/en-US/apps/account/account.aspx"
account <- html_session(url)
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "USER",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = "PASS")
account <- account %>% 
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
