Harvesting data via a drop-down list in R

Question

I am trying to harvest data from this website (it is in Czech):

http://www.lkcr.cz/seznam-lekaru-426.html

I need to go through every possible combination of "Okres" (region) and "Obor" (specialization).

I tried rvest, but it does not seem to detect any drop-down list: html_form returns a list of length 0.

So, as I am still a newbie in R, how can I "ask" the webpage to show me the results for each new combination?

Thanks

JH

Recommended answer

I'd use the following:

library(rvest)
library(dplyr)
library(tidyr)

# Download and parse the search page
pg <- read_html("http://www.lkcr.cz/seznam-lekaru-426.html")

# "Obor" (specialization): every <option> under the <select> named filterObor
obor <- html_nodes(pg, "select[name='filterObor'] > option")
obor_df <- tibble(   # tibble() replaces the deprecated data_frame()
  value=html_attr(obor, "value"),   # rvest accessors; no separate xml2 attach needed
  option=html_text(obor)
)

glimpse(obor_df)
## Observations: 115
## Variables: 2
## $ value  <chr> "", "16", "107", "17", "1", "19", "20", "21", "22", "29...
## $ option <chr> "", "alergologie a klinická imunologie", "algeziologie"...

# "Okres" (region): every <option> under the <select> named filterOkresId
okres <- html_nodes(pg, "select[name='filterOkresId'] > option")
okres_df <- tibble(
  value=html_attr(okres, "value"),
  option=html_text(okres)
)

glimpse(okres_df)
## Observations: 78
## Variables: 2
## $ value  <chr> "", "3201", "3202", "3701", "3702", "3703", "3801", "37...
## $ option <chr> "", "Benešov", "Beroun", "Blansko", "Brno-město", "Brno...

Selecting each <select> element by its name keeps this working in case the field order ever changes (plus it's good to get familiar with targeting nodes with CSS selectors and XPath selectors).

You still need to iterate over each pair (you can do that with nested purrr::map calls; I personally probably wouldn't use expand.grid or tidyr::complete for this).
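
A minimal sketch of that nested iteration, under some assumptions: the option values are a small sample copied from the glimpse output above (the blank first option is left out), and fetch_pair is a hypothetical placeholder that only records the pair where a real request would go.

```r
library(purrr)

# Small stand-ins for okres_df$value and obor_df$value (sample values only;
# in practice drop the blank "" option before iterating).
okres_values <- c("3201", "3202")
obor_values  <- c("16", "107")

# Hypothetical placeholder for the real request: it just records the pair.
fetch_pair <- function(okres, obor) {
  data.frame(okres = okres, obor = obor, stringsAsFactors = FALSE)
}

# Outer map walks the regions, inner map walks the specializations.
pairs <- map(okres_values, function(ok) {
  map(obor_values, function(ob) fetch_pair(ok, ob))
})

# Flatten one level of nesting, then bind into one row per combination.
result <- do.call(rbind, unlist(pairs, recursive = FALSE))
```

With the full option lists this is roughly 114 x 77 requests, so a pause between calls (for example Sys.sleep) is probably a good idea.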

But...

You're going to have issues submitting the form with rvest, since the site uses JavaScript to do some data processing before submitting.

You should use Chrome and open up Developer Tools to see which fields actually get submitted, and probably switch to using httr::POST. If you have trouble with that, you should open a new question on SO.
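
A rough sketch of what the httr::POST route might look like. Everything specific here is an assumption to be checked in Developer Tools: the target URL is simply the page from the question, and the field names are copied from the <select> name attributes, whereas the site's JavaScript may rename or add fields before submitting. build_body and submit_pair are hypothetical helper names.

```r
# ASSUMPTION: the form posts back to the same page it is served from.
form_url <- "http://www.lkcr.cz/seznam-lekaru-426.html"

# ASSUMPTION: field names taken from the <select> name attributes; the real
# payload seen in the browser's Network tab may differ.
build_body <- function(okres_id, obor_id) {
  list(filterOkresId = okres_id, filterObor = obor_id)
}

# Send one form-encoded POST and return the response body as text.
submit_pair <- function(okres_id, obor_id) {
  resp <- httr::POST(form_url,
                     body = build_body(okres_id, obor_id),
                     encode = "form")
  httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

The returned text can be passed to read_html() and parsed with the same selector approach used for the drop-down lists.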
