R: Webscraping various <div>-classes into lists with (sub-)elements

Views: 37  Posted: 2021/7/14 18:37:31  Tags: r xpath web-scraping rvest


Problem description

I am using rvest to scrape this website. It contains data in a form like this (simplified):



<div class="editor-type">Editors</div>
<div class="editor">
  <div class="editor-name"><h3>Otto Heath</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor">
  <div class="editor-name"><h3>Kathrin Smets</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor-type">Associate Editor</div>
<div class="editor">
  <div class="editor-name"><h3>Rosa Dassonville</h3></div>
  <span class="editor-affiliation">University of Montreal</span>
</div>
<div class="editor">
  <div class="editor-name"><h3>Matthias Wagner</h3></div>
  <span class="editor-affiliation">University of Wagner</span>
</div>
<div class="editor-type">Editorial Assistant</div>
<div class="editor">
  <div class="editor-name"><h3>Markus Polacko</h3></div>
  <span class="editor-affiliation">Royal Holloway University of London</span>
</div>

library("rvest")

webpage <- read_html(url("https://www.journals.elsevier.com/electoral-studies/editorial-board"))

editorial_types <- webpage %>%
  html_nodes(xpath = "//div[@class='editor-type']")

editor_names <- webpage %>%
  html_nodes(xpath = "//div[@class='editor']/descendant::div[@class='editor-name']")

library(rvest)
library(dplyr)

#read the document
webpage <- read_html("https://www.journals.elsevier.com/electoral-studies/editorial-board")

#find parent Node
pubeditors <- webpage %>% html_nodes("div.publication-editors")

#get the children Nodes
editorsnodes <- html_children(pubeditors)

#find nodes with the Position title
titlesnodesnum <- which(html_attr(editorsnodes, "class") == "publication-editor-type")

#create vector of title
titles <- editorsnodes[titlesnodesnum] %>% html_text() %>% trimws()

#include the last node in the list
titlesnodesnum <- c(titlesnodesnum, length(editorsnodes) + 1)  #identify the last record

#find names between subcategory nodes
answer <- lapply(2:length(titlesnodesnum), function(n) {
  start <- titlesnodesnum[n - 1] + 1  #starting node in subcategory
  end <- titlesnodesnum[n] - 1        #ending node in subcategory
  names <- editorsnodes[start:end] %>%
    html_nodes("div.publication-editor-name") %>%
    html_text() %>%
    trimws()
})

#rename the list
names(answer) <- titles

answer

$Editors
[1] "Oliver Heath" "Kaat Smets"

$`Associate Editor`
[1] "Ruth Dassonneville" "Markus Wagner"

$`Editorial Assistant`
[1] "Matt Polacko"

$`Editorial Board`
 [1] "Eva Anduiza" "Paolo Bellucci" "Amanda Bittner" "Andre Blais" "Damien Bol"
 [6] "Shaun Bowler" "Barry Burden" "David Butler" "Rosie Campbell" "Miguel Carreras"
[11] "Harold D Clarke" "Brian Crisp" "Ruth Dassonneville" "Martin Elff" "Geoffrey Evans"
[16] "Steve Fisher" "Rob Ford" "Aina Gallego" "Thomas Gschwend" "Carolien van Ham"
[21] "Chris Hanretty" "Elina Kestilä-Kekkonen" "Ann-Kristin Kölln" "Mona Krewel" "Matthew Lebo"
[26] "Michael Lewis-Beck" "Ian McAllister" "Caitlin Milazzo" "Andreas Murr" "Anja Neundorf"
[31] "Sergi Pardos" "Charles Pattie" "Mikael Persson" "Stephanie Reher" "Jason Reifler"
[36] "Robert Rohrschneider" "Eline de Rooij" "Jan Rovny" "Shane Singh" "Mary Stegmaier"
[41] "Laura Stephenson" "Rune Stubager" "Nick Vivyan" "Herbert Weisberg" "Christopher Wlezien"
[46] "Georgios Xezonakis" "Elizabeth Zechmeister" "Adam Ziegfeld"
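
The grouping of names under each position title can also be expressed in XPath alone, without indexing child nodes. The following is only a minimal sketch and not part of the original answer. It assumes the simplified markup from the question, where every title and every editor block are siblings under one parent. The wrapping `div.editors` and the variable names `html`, `page`, `types` and `editors_by_type` are hypothetical, and the live site uses different class names (`publication-editor-type` and so on), so the selectors would need adjusting there.

```r
library(rvest)

# Simplified markup from the question, inlined so the sketch runs offline.
# The outer div.editors wrapper is an assumption made for this example.
html <- '<div class="editors">
  <div class="editor-type">Editors</div>
  <div class="editor"><div class="editor-name"><h3>Otto Heath</h3></div>
    <span class="editor-affiliation">Royal Holloway University of London</span></div>
  <div class="editor"><div class="editor-name"><h3>Kathrin Smets</h3></div>
    <span class="editor-affiliation">Royal Holloway University of London</span></div>
  <div class="editor-type">Associate Editor</div>
  <div class="editor"><div class="editor-name"><h3>Rosa Dassonville</h3></div>
    <span class="editor-affiliation">University of Montreal</span></div>
  <div class="editor"><div class="editor-name"><h3>Matthias Wagner</h3></div>
    <span class="editor-affiliation">University of Wagner</span></div>
  <div class="editor-type">Editorial Assistant</div>
  <div class="editor"><div class="editor-name"><h3>Markus Polacko</h3></div>
    <span class="editor-affiliation">Royal Holloway University of London</span></div>
</div>'

page <- read_html(html)

# Position titles, in document order.
types <- page %>%
  html_nodes(xpath = "//div[@class='editor-type']") %>%
  html_text() %>%
  trimws()

# An editor block belongs to the i-th title when exactly i title nodes
# precede it among its siblings.
editors_by_type <- lapply(seq_along(types), function(i) {
  xp <- sprintf(
    "//div[@class='editor'][count(preceding-sibling::div[@class='editor-type']) = %d]//div[@class='editor-name']",
    i
  )
  page %>% html_nodes(xpath = xp) %>% html_text() %>% trimws()
})
names(editors_by_type) <- types

editors_by_type
```

Counting preceding title siblings keeps the selection logic inside the XPath expression, at the cost of running one query per title. The index-based approach above makes a single pass over the children instead.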
