html_form from rvest doesn't recognise form

Problem description

I am trying to scrape the content of this website with rvest (not the linked papers/abstracts, just the number, title, authors, etc.).

By default, the page displays only the 2016 papers, and scraping the 2016 data was no problem. I was hoping the URL would change after switching "2016" to "all years", but it stays the same, so I resorted to html_form. Inspecting the page source, I found that the relevant input name is filteryear.

R code:

library(rvest)
rdc <- html_session("https://sfb649.wiwi.hu-berlin.de/fedc/discussionPapers_formular_content.php")
form <- html_form(rdc)
form <- set_values(form, filteryear = "all years")
#Error: Unknown field names: filteryear
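
When this error appears, one way to see what rvest actually parsed is to list the field names of every form it found, since html_form returns one form object per <form> element when given a page. The sketch below is only an illustration: it runs on a small, well-formed stand-in (toy_html is hypothetical, not the real page), and it assumes each form object exposes its inputs as a named list in $fields. Running the same inspection on the real session would show which fields, if any, were detected.

```r
library(rvest)

# Hypothetical, well-formed stand-in for the page's form (NOT the real page).
toy_html <- "
<html><body>
<form action='search.php' method='post'>
  <select name='filterTypeName'>
    <option value='AUTHORS'>Author</option>
  </select>
  <input type='text' name='filterName'>
  <select name='filteryear'>
    <option value='2016'>2016</option>
    <option value='all'>all years</option>
  </select>
  <input type='submit' value='Search' name='B1'>
</form>
</body></html>"

# html_form() on a whole document returns a list with one entry per <form>.
forms <- html_form(read_html(toy_html))

# One character vector of field names per detected form.
field_names <- lapply(forms, function(f) names(f$fields))
print(field_names)

# Which form(s), if any, contain the field we want to set?
has_year <- vapply(field_names, function(n) "filteryear" %in% n, logical(1))
print(which(has_year))
```

If no form reports the field, the parser did not attach that input to any form, and setting it through the form object cannot work.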

So apparently filteryear is not part of the form. With my limited HTML knowledge, I am fairly sure the markup below tells me that the form consists of three inputs: filterTypeName, filterName and filteryear.

HTML from the page source:

<form action='discussionPapers_formular_content.php' method='post'>
  <select name='filterTypeName'>
    <option value='AUTHORS'>Author</option>
    <option value='PROJECT'>Project Code</option>
    ...
    <option value='JEL'>JEL</option
  </select> </td>                            # Is this </td> the problem?!
  <td valign='baseline'> 
    <input type='text' size='35' name='filterName' >
  </td> 
  <td valign='baseline'>
    <select name='filteryear'>
       <option value='2005'>2005</option>
       ...
       <option value='2016'>2016</option>
       <option value='all'>all years</option>
    </select>
  </td>                                    
  <td valign='baseline'>
     &nbsp;&nbsp;<INPUT type='submit' value='Search' name='B1'></INPUT>
  </td></tr>                                   
</form>

Why is html_form not recognising this form completely? And, more importantly, is there a way to solve this problem?

Recommended answer

I couldn't get html_form to work, but you can simply submit the form manually with httr::POST as follows:

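library(rvest)
library(httr)

# Send the form fields directly as a form-encoded POST request
res <- POST("https://sfb649.wiwi.hu-berlin.de/fedc/discussionPapers_formular_content.php",
     body = list(filterTypeName = "filterTypeName:AUTHORS",
                 filteryear = "all",   # option value behind "all years"
                 B1 = "Search"), encode = "form")

# Parse the response and extract every table on the page
out <- read_html(res) %>% html_table(fill=TRUE)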

I guess the 7th table is the one you want:

> dim(out[[7]])
[1] 805  10

> head(out[[7]])

        X1                                                                                     X2
1 2016-049                                                                              Q3-D3-LSA
2 2016-048                                     Unraveling of Cooperation in Dynamic Collaboration
3 2016-047                                                            Time Varying Quantile Lasso
4 2016-046                                                           Credit Rating Score Analysis
5 2016-045                                          Information Acquisition and Liquidity Dry-Ups
6 2016-044 Dynamic Contracting with Long-Term Consequences: Optimal CEO Compensation and Turnover
                                                                  X3  X4         X5                       X6
1                               Lukas Borke and   Wolfgang K. Härdle  B1 15.11.2016        C87,   C88,   G17
2                                                        Suvi Vasama  A8 07.11.2016        C73,   D83,   O31
3      Lenka Zbonakova,  Wolfgang Karl H\177ardle and   Weining Wang  B1 07.11.2016 C21,   G01,   G20,   G32
4 Wolfgang Karl H\177ärdle,  Phoon Kok Fai and   David Lee Kuo Chuen  B1 02.11.2016 C01,   G00,   G17,   G24
5                                 Philipp Koenig and   David Pothier C10 26.10.2016        D82,   G01,   G12
6                                                        Suvi Vasama  A8 26.10.2016        C73,   D82,   D86
  X7 X8 X9 X10
1 NA NA NA  NA
2 NA NA NA  NA
3 NA NA NA  NA
4 NA NA NA  NA
5 NA NA NA  NA
6 NA NA NA  NA
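
Hard-coding the index 7 depends on the page layout, and the printed result has four columns (X7 to X10) that show only NA in these first rows. A possible tidy-up, sketched here on a hypothetical stand-in for out (the toy data frames are assumptions, not real results), is to pick the table with the most rows and drop the columns that are entirely NA:

```r
# Hypothetical stand-in for `out`: a list of data frames as html_table() returns.
out <- list(
  data.frame(X1 = "Search"),
  data.frame(
    X1 = c("2016-049", "2016-048", "2016-047"),
    X2 = c("Q3-D3-LSA",
           "Unraveling of Cooperation in Dynamic Collaboration",
           "Time Varying Quantile Lasso"),
    X7 = NA
  ),
  data.frame(X1 = c("a", "b"))
)

# Assumption: the paper listing is the table with the most rows.
n_rows <- vapply(out, nrow, integer(1))
papers <- out[[which.max(n_rows)]]

# Drop columns in which every value is NA.
all_na <- vapply(papers, function(col) all(is.na(col)), logical(1))
papers <- papers[, !all_na, drop = FALSE]

print(dim(papers))
```

Whether "largest table" is the right rule for the real page is an assumption worth checking against dim(out[[7]]) before relying on it.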
