R - form web scraping with rvest

Problem Description

First, I'd like to take a moment to thank the SO community; you have helped me many times in the past without my even needing to create an account.

My current problem involves web scraping with R. Not my strong point.

I would like to scrape http://www.cbs.dtu.dk/services/SignalP/

What I have tried:

    library(rvest)
    url <- "http://www.cbs.dtu.dk/services/SignalP/"
    seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"

    session <- rvest::html_session(url)
    form <- rvest::html_form(session)[[2]]
    form <- rvest::set_values(form, `SEQPASTE` = seq)
    form_res_cbs <- rvest::submit_form(session, form)
    #rvest prints out:
    Submitting with 'trunc'

    rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
    #output:
    "Configuration error"

    rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
    #output:
    "Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled parameter 'NULL' in form "

I am unsure what the unhandled parameter is. Is the problem the submit button? I cannot seem to force it:

    form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
    #rvest prints out:
    Error: Unknown submission name 'submit'.
    Possible values: trunc

Is the problem that submit$name is NULL?

    form[["fields"]][[23]]
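
For what it's worth, listing every field's name and type helps locate the submit control. A quick sketch (assuming the pre-1.0 rvest form structure, where each field is a list carrying name and type elements):

    # Sketch: list each form field's name and type (assumes the old-style
    # rvest form object, where fields are lists with $name and $type)
    sapply(form$fields, function(f) f$type)
    sapply(form$fields, function(f) f$name)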

I tried defining the fake submit button as suggested here: Submit form with no submit button in rvest
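
That workaround amounts to grafting a hand-built submit field onto the form, roughly like this (a sketch; the element names and the "input" class are assumptions about older rvest internals):

    # Sketch of the fake-submit workaround: append a hand-built submit
    # field to the form (the field structure is an assumption for old rvest)
    fake_submit <- list(name = "submit", type = "submit", value = "submit",
                        checked = NULL, disabled = NULL, readonly = NULL,
                        required = FALSE)
    class(fake_submit) <- "input"
    form[["fields"]][["submit"]] <- fake_submit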

No luck.

I am open to solutions using rvest or RCurl/httr; I would like to avoid using RSelenium.

Thanks to hrbrmstr's awesome answer, I was able to build a function for this task. It is available in the ragp package: https://github.com/missuse/ragp

Answer

Well, this is doable. But it's going to require elbow grease.

This:

    library(rvest)
    library(httr)
    library(tidyverse)

    POST(
      url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
      encode = "form",
      body = list(
        `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
        `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
        `orgtype` = "euk",
        `Dcut-type` = "default",
        `Dcut-noTM` = "0.45",
        `Dcut-TM` = "0.50",
        `graphmode` = "png",
        `format` = "summary",
        `minlen` = "",
        `method` = "best",
        `trunc` = ""
      ),
      verbose()
    ) -> res

Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.
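
Before digging into the response, a quick sanity check that the POST succeeded can save some head-scratching (optional, plain httr):

    stop_for_status(res)   # httr: raises an R error if the request failed
    status_code(res)       # should be 200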

Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.

That page has the query id, which can be extracted via:

    content(res, as = "parsed") %>%
      html_nodes("input[name='jobid']") %>%
      html_attr("value") -> jobid

Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.

    GET(
      url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
      query = list(
        jobid = jobid,
        wait = "20"
      ),
      verbose()
    ) -> res2
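
That single GET assumes the 20-second wait is enough. If the job runs longer, one alternative is a small polling loop; a sketch, where treating the presence of the jobid <input> as "still on the wait page" is my assumption about the intermediary page:

    # Sketch: poll until the wait page stops appearing (detecting it via
    # its jobid <input> is an assumption about the page structure)
    repeat {
      res2 <- GET(
        url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
        query = list(jobid = jobid, wait = "20")
      )
      pg <- content(res2, as = "parsed")
      if (length(html_nodes(pg, "input[name='jobid']")) == 0) break
      Sys.sleep(10)  # give the job more time before asking again
    }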

Either way, that grabs the final results page:

    library(htmltools)  # html_print() and HTML() come from htmltools

    html_print(HTML(content(res2, as = "text")))

You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
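
For instance, the tables and the image URLs can be pulled out along these lines (a sketch; it assumes the results page uses plain <table> and <img> elements):

    pg <- read_html(content(res2, as = "text"))

    # Parse any HTML tables on the results page into data frames
    tables <- html_nodes(pg, "table") %>% html_table(fill = TRUE)

    # Collect image links (e.g. the PNG plot) and resolve them to full URLs
    img_urls <- html_nodes(pg, "img") %>%
      html_attr("src") %>%
      xml2::url_absolute("http://www.cbs.dtu.dk/")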

To do all this, I used burpsuite to intercept a browser session and then my burrp R package to inspect the results. You can also visually inspect in burpsuite and build things more manually.
