R: Webscraping various <div>-classes into lists with (sub-)elements


Question

I use rvest to scrape this website. It contains data in such a form (simplified):

<div class="editor-type">Editors</div>
<div class="editor">
  <div class="editor-name"><h3>Otto Heath</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor">
  <div class="editor-name"><h3>Kathrin Smets</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>

<div class="editor-type">Associate Editor</div>
<div class="editor">
  <div class="editor-name"><h3>Rosa Dassonville</h3></div>
  <span class="editor-affiliation">University of Montreal</span>
</div>
<div class="editor">
  <div class="editor-name"><h3>Matthias Wagner</h3></div>
  <span class="editor-affiliation">University of Wagner</span>
</div>

<div class="editor-type">Editorial Assistant</div>
<div class="editor">
  <div class="editor-name"><h3>Markus Polacko</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>

I can easily scrape editor-type and editor-name into respective lists, e.g. like this:

library("rvest")
webpage <- read_html(url("https://www.journals.elsevier.com/electoral-studies/editorial-board"))
editorial_types <- webpage %>%
  html_nodes(xpath = "//div[@class='editor-type']")
editor_names <- webpage %>%
  html_nodes(xpath = "//div[@class='editor']/descendant::div[@class='editor-name']")

However, I want to combine them into a single list. It should contain elements of editor-type (Editors, Associate Editors, etc) and sub-elements with the respective editor-name, perhaps like this:

list_of_editors
[[1]] Editors
[1] Otto Heath
[2] Kathrin Smets

[[2]] Associate Editor
[1] Rosa Dassonville
[2] Matthias Wagner

[[3]] Editorial Assistant
[1] Markus Polacko 

How can I achieve that?

Solution

This was a bit tricky because the page is a flat list of titles and names rather than a hierarchical one. The strategy is to find all of the child nodes, pick out the nodes that contain a title, and then extract the names from the nodes that lie between consecutive title nodes.
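Before touching the real page, the index arithmetic can be tried on a toy vector. This is only a minimal sketch in base R: `node_class` and `node_text` are hypothetical stand-ins for the classes and text of the child nodes, filled with the simplified names from the question.

```r
# Hypothetical stand-in for the flat sequence of child nodes:
# one class and one text value per node, in document order.
node_class <- c("editor-type", "editor", "editor",
                "editor-type", "editor", "editor",
                "editor-type", "editor")
node_text  <- c("Editors", "Otto Heath", "Kathrin Smets",
                "Associate Editor", "Rosa Dassonville", "Matthias Wagner",
                "Editorial Assistant", "Markus Polacko")

# Positions of the title nodes, plus a sentinel one past the last node
# so that the final group also has an upper bound.
title_pos <- which(node_class == "editor-type")
bounds    <- c(title_pos, length(node_class) + 1)

# For each title, keep the entries strictly between it and the next title.
# seq_len() yields an empty index when a title has nothing under it.
grouped <- lapply(seq_along(title_pos), function(i) {
  idx <- bounds[i] + seq_len(bounds[i + 1] - bounds[i] - 1)
  node_text[idx]
})
names(grouped) <- node_text[title_pos]
grouped
```

The code for the real page follows the same pattern, with the class and text vectors replaced by the scraped nodes.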

library(rvest)
library(dplyr)

#read the document
webpage <- read_html("https://www.journals.elsevier.com/electoral-studies/editorial-board")

#find parent Node
pubeditors <- webpage %>% html_nodes("div.publication-editors")

#get the children Nodes
editorsnodes <- html_children(pubeditors)

#find nodes with the Position title
titlesnodesnum <- which(html_attr(editorsnodes, "class") =="publication-editor-type")
#create vector of title
titles <- editorsnodes[titlesnodesnum] %>% html_text() %>% trimws()

#include the last node in the list
titlesnodesnum <- c(titlesnodesnum, length(editorsnodes)+1) #identify the last record

#find names between subcategory nodes
answer <- lapply(2:length(titlesnodesnum), function(n){
   start <- titlesnodesnum[n-1]+1  #starting node in subcategory
   end <- titlesnodesnum[n]-1      #ending node in subcategory
   if (start > end) return(character(0))  #title with no editors under it
   editorsnodes[start:end] %>% html_nodes("div.publication-editor-name") %>% html_text() %>% trimws()
})

#rename the list
names(answer) <- titles
answer

$Editors
[1] "Oliver Heath" "Kaat Smets"  

$`Associate Editor`
[1] "Ruth Dassonneville" "Markus Wagner"     

$`Editorial Assistant`
[1] "Matt Polacko"

$`Editorial Board`
 [1] "Eva Anduiza"            "Paolo Bellucci"         "Amanda Bittner"         "Andre Blais"            "Damien Bol"            
 [6] "Shaun Bowler"           "Barry Burden"           "David Butler"           "Rosie Campbell"         "Miguel Carreras"       
[11] "Harold D Clarke"        "Brian Crisp"            "Ruth Dassonneville"     "Martin Elff"            "Geoffrey Evans"        
[16] "Steve Fisher"           "Rob Ford"               "Aina Gallego"           "Thomas Gschwend"        "Carolien van Ham"      
[21] "Chris Hanretty"         "Elina Kestilä-Kekkonen" "Ann-Kristin Kölln"      "Mona Krewel"            "Matthew Lebo"          
[26] "Michael Lewis-Beck"     "Ian McAllister"         "Caitlin Milazzo"        "Andreas Murr"           "Anja Neundorf"         
[31] "Sergi Pardos"           "Charles Pattie"         "Mikael Persson"         "Stephanie Reher"        "Jason Reifler"         
[36] "Robert Rohrschneider"   "Eline de Rooij"         "Jan Rovny"              "Shane Singh"            "Mary Stegmaier"        
[41] "Laura Stephenson"       "Rune Stubager"          "Nick Vivyan"            "Herbert Weisberg"       "Christopher Wlezien"   
[46] "Georgios Xezonakis"     "Elizabeth Zechmeister"  "Adam Ziegfeld"   
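The same grouping can also be written without explicit start and end indices, by counting how many titles have been seen so far and splitting on that counter. The sketch below is one possible variant, not the method used above. It runs on an inline HTML fragment so that it does not depend on the live page. The fragment is made up for illustration: it reuses the class names from the selectors above and the simplified names from the question.

```r
library(rvest)

# Made-up fragment for illustration; only the class names come from the answer.
html <- '<div class="publication-editors">
  <div class="publication-editor-type">Editors</div>
  <div class="publication-editor"><div class="publication-editor-name"><h3>Otto Heath</h3></div></div>
  <div class="publication-editor"><div class="publication-editor-name"><h3>Kathrin Smets</h3></div></div>
  <div class="publication-editor-type">Associate Editor</div>
  <div class="publication-editor"><div class="publication-editor-name"><h3>Rosa Dassonville</h3></div></div>
</div>'

page <- read_html(html)
kids <- page %>% html_nodes("div.publication-editors") %>% html_children()

# TRUE for title nodes; %in% avoids NA for children without a class attribute.
is_title <- html_attr(kids, "class") %in% "publication-editor-type"
titles   <- kids[is_title] %>% html_text() %>% trimws()

# One name per non-title child (NA if a child has no name element).
editor_names <- kids[!is_title] %>%
  html_node("div.publication-editor-name") %>%
  html_text() %>%
  trimws()

# cumsum() numbers each child by the most recent title that precedes it.
group  <- cumsum(is_title)[!is_title]
result <- split(editor_names,
                factor(group, levels = seq_along(titles), labels = titles))
result
```

Because the factor levels are fixed to the titles, a title with no editors under it still appears in the result as an empty character vector.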

  
