Scrape values from HTML select/option tags in R


Problem description

I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.go.ke) using R. I've managed to fetch the HTML and parse it, but I'm now a little unsure how to extract the bits I actually need!

Using the XML library (with httr for the request), I fetch the page with this code:

library(httr)  # GET(), content()
library(XML)   # htmlTreeParse()

majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
majidata_html <- htmlTreeParse(content(majidata_get, as="text"))

This leaves me with a (Large) XMLDocumentContent object. There is a drop-down list on the web page and I want to scrape the values from it (the names and ID numbers of different towns). The bits I want to extract are the number in each <option value="XXX"> and the name in capital letters that follows it.

<div class="regiondata">
       <div id="town_data">
        <select id="town" name="town" onchange="town_data(this.value);">
         <option value="0" selected="selected">[SELECT TOWN]</option>
         <option value="611">AHERO</option>
         <option value="635">AKALA</option>
         <option value="625">AWASI</option>
         <option value="628">AWENDO</option>
         <option value="749">BAHATI</option>
         <option value="327">BANGALE</option>

Ideally, I'd like to have these in a data.frame where the first column is the number and second column is the name e.g.

ID       Name
611      AHERO
635      AKALA
625      AWASI

etc.

I'm not really sure where to go from here. I had thought to use regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it's better/more efficient to use XPath. I'm not really sure where to start with this, though, other than thinking I need to use xpathApply somehow.

Solution

The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.
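As a minimal sketch of the original request alone (an ID/Name data frame), the CSS-selector approach can be tried offline on an inline copy of the fragment from the question. The `snippet` string and the `towns` name are illustrative assumptions, and the closing tags were added so the fragment parses; the live page may be structured differently.

```r
library(rvest)

# Inline copy of the <select> fragment from the question, with closing
# tags added (assumption: the live page may differ).
snippet <- '<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[SELECT TOWN]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>'

# read_html() accepts literal HTML as well as a URL
page <- read_html(snippet)

# every <option> inside the element with id "town"
opts <- html_nodes(page, "#town option")

towns <- data.frame(ID   = html_attr(opts, "value"),
                    Name = html_text(opts),
                    stringsAsFactors = FALSE)

# drop the "[SELECT TOWN]" placeholder, whose value is "0"
towns <- towns[towns$ID != "0", ]
rownames(towns) <- NULL

print(towns)
```

Filtering on the placeholder's value rather than dropping the first row by position is a design choice here; it avoids silently discarding a real town if the placeholder is ever missing.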

UPDATED: Incorporates the second request (see comments below)

library(rvest)
library(dplyr)

# gets data from the second popup
# returns a data frame of town_id, town_name, area_id, area_name
addArea <- function(town_id, town_name) {

  # make the AJAX URL and grab the data
  url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                 town_id)
  subunits <- read_html(url)  # read_html() replaces the deprecated html()

  # reformat into a data frame with the town data
  data.frame(town_id=town_id,
             town_name=town_name,
             area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
             area_name=subunits %>% html_nodes("option") %>% html_text(),
             stringsAsFactors=FALSE)[-1,]

}

# get data from the first popup and put it into a data frame
majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                   town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                   stringsAsFactors=FALSE)[-1,]

# pass in the name and id to our addArea function and make the result into
# a data frame with all the data (town and area)
combined <- do.call("rbind.data.frame",
                    mapply(addArea, maji$town_id,  maji$town_name,
                           SIMPLIFY=FALSE, USE.NAMES=FALSE))

# row names aren't super-important, but let's keep them tidy
rownames(combined) <- NULL

str(combined)

## 'data.frame':    1964 obs. of  4 variables:
##  $ town_id  : chr  "611" "635" "625" "628" ...
##  $ town_name: chr  "AHERO" "AKALA" "AWASI" "AWENDO" ...
##  $ area_id  : chr  "60603030101" "60107050201" "60603020101" "61103040101" ...
##  $ area_name: chr  "AHERO" "AKALA" "AWASI" "ANINDO" ...


head(combined)

##   town_id town_name     area_id area_name
## 1     611     AHERO 60603030101     AHERO
## 2     635     AKALA 60107050201     AKALA
## 3     625     AWASI 60603020101     AWASI
## 4     628    AWENDO 61103040101    ANINDO
## 5     628    AWENDO 61103050401      SARE
## 6     749    BAHATI 73101010101    BAHATI
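
Since the question asks about xpathApply, here is a rough sketch of the same extraction using the XML package and XPath instead of rvest. It runs on an inline copy of the fragment from the question rather than the live site; the `snippet`, `xp` and `towns_xp` names are illustrative assumptions, and the closing tags were added so the fragment parses.

```r
library(XML)

# Inline copy of the <select> fragment from the question, with closing
# tags added (assumption: the live page may differ).
snippet <- '<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[SELECT TOWN]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>'

# htmlParse() is htmlTreeParse() with useInternalNodes = TRUE,
# which XPath queries require
doc <- htmlParse(snippet, asText = TRUE)

# all <option> nodes under the "town" select, skipping the placeholder
xp <- "//select[@id='town']/option[@value!='0']"

towns_xp <- data.frame(ID   = xpathSApply(doc, xp, xmlGetAttr, "value"),
                       Name = xpathSApply(doc, xp, xmlValue),
                       stringsAsFactors = FALSE)

print(towns_xp)
```

Against the real page, the parsed result of `htmlParse(content(majidata_get, as="text"), asText = TRUE)` would presumably take the place of `doc`, assuming the drop-down still has the id `town`.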
