Scraping an XML document (nested URL structure)
Problem description
I have a problem scraping information from a certain XML document (http://www.bundestag.de/xml/mdb/index.xml).
<mdbUebersicht>
<dokumentInfo>
<dokumentURL/>
<dokumentStand/>
</dokumentInfo>
<deleteRestore>
<deleteFlag>0</deleteFlag>
<deleteDate>20131202170000</deleteDate>
</deleteRestore>
<mdbs>
<mdb fraktion="Die Linke">
<mdbID status="Aktiv">1627</mdbID>
<mdbName status="Aktiv">Aken, Jan van</mdbName>
<mdbBioURL>
http://www.bundestag.de/abgeordnete18/biografien/A/aken_jan/258124
</mdbBioURL>
<mdbInfoXMLURL>
http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml
</mdbInfoXMLURL>
<mdbInfoXMLURLMitmischen>/biografien/A/aken_jan.xml</mdbInfoXMLURLMitmischen>
<mdbLand>Hamburg</mdbLand>
<mdbFotoURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/177/265/83abda4f387877a2b5eeedbfd81e8eba/Yc/aken_jan_gross.jpg
</mdbFotoURL>
<mdbFotoGrossURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/316/475/83abda4f387877a2b5eeedbfd81e8eba/Uq/aken_jan_gross.jpg
</mdbFotoGrossURL>
<mdbFotoLastChanged>24.10.2016</mdbFotoLastChanged>
<mdbFotoChangedDateTime>24.10.2016 12:17</mdbFotoChangedDateTime>
<lastChanged>30.09.2016</lastChanged>
<changedDateTime>30.09.2016 12:38</changedDateTime>
</mdb>
The document contains short biographical details for many different people. Among other things, it contains URLs to other XML documents, which contain more detailed biographies.
I tried the following to get the information:
First, I get all URLs of the sub-documents from the main document:
mdb_url <- xml_text(xml_find_all(xmlDocu, "//mdbInfoXMLURL"))
Then I implemented a for loop that downloads all the XML files into my working directory:
for (url in mdb_url) {
download.file(url, destfile = basename(url))
}
Afterwards I want to get a list of the files...
files <- list.files(pattern = "\\.xml$") # pattern is a regex: escape the dot and anchor at the end
... and extract a specific node from every XML document:
Bio1 <- files[1]
xmlfile <- read_xml(Bio1)
mdb_ausschuss1 <- xml_text(xml_find_all(xmlfile, "//gremiumName"))
Now my problem is: how can I do this for all XML files in the list? I haven't been able to write a working loop or script for that task.
Recommended answer
library(xml2)
library(httr)
library(rvest)
library(tools)
library(tidyverse)
Get the URL list from the main site XML:
URL <- "http://www.bundestag.de/xml/mdb/index.xml"
doc <- read_xml(URL)
xml_find_all(doc, "//mdbInfoXMLURL") %>% xml_text() -> mdb_urls
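In the sample listing in the question, each URL sits on its own line inside <mdbInfoXMLURL>, so the extracted text might carry surrounding whitespace. Whether the live feed really contains that whitespace is an assumption here (it may only be display formatting); passing trim = TRUE to xml_text() is harmless either way. A minimal sketch on an inline stand-in snippet:

```r
library(xml2)

# Inline stand-in for the index document. Assumption: the real feed may or
# may not wrap the URL in newlines the way the listing in the question does.
snippet <- read_xml(paste0(
  "<mdbs><mdb><mdbInfoXMLURL>\n",
  "http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml\n",
  "</mdbInfoXMLURL></mdb></mdbs>"
))

nodes <- xml_find_all(snippet, "//mdbInfoXMLURL")

raw_urls     <- xml_text(nodes)               # keeps the surrounding newlines
trimmed_urls <- xml_text(nodes, trim = TRUE)  # strips leading/trailing whitespace

# The trimmed value is safe to pass to basename() or to a download function.
basename(trimmed_urls)
```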
Create a place to store them:
dir.create("docs")
Write them to disk (I'm only grabbing 10 of them since I don't need the data; you do :-)
Note that write_disk() will not overwrite an existing path unless told to (overwrite = TRUE), so this is a great way to do poor-man's caching. The flip side is that it raises an error when the file already exists, so if you place this in a reproducible script, you'll have to wrap it in try()/tryCatch().
walk(mdb_urls[1:10], ~GET(., write_disk(file.path("docs", basename(.)))))
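One way the try/catch wrapping mentioned above could look is sketched here. fetch_if_missing() is a hypothetical helper, not part of httr; it skips files that are already on disk and turns a failed download into a status string instead of stopping the whole run. The fetcher is passed in as an argument so the caching logic can be tried offline with a stand-in.

```r
# Hypothetical helper (not part of httr): download `url` into `dir` unless a
# file with that name is already there, and never let one failure stop the run.
fetch_if_missing <- function(url, dir,
                             fetch = function(url, path) {
                               httr::GET(url, httr::write_disk(path))
                             }) {
  path <- file.path(dir, basename(url))
  if (file.exists(path)) return("cached")  # poor-man's cache hit
  tryCatch({
    fetch(url, path)
    "downloaded"
  }, error = function(e) {
    message("Failed: ", url, " (", conditionMessage(e), ")")
    "failed"
  })
}

# Offline demonstration with a stand-in fetcher that just writes a file.
demo_dir <- tempfile("docs_demo_")
dir.create(demo_dir)
fake_fetch <- function(url, path) writeLines("<mdb/>", path)
demo_url <- "http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml"

first  <- fetch_if_missing(demo_url, demo_dir, fetch = fake_fetch)
second <- fetch_if_missing(demo_url, demo_dir, fetch = fake_fetch)

# A fetcher that throws is reported as "failed" rather than aborting.
broken <- fetch_if_missing("http://example.invalid/x.xml", demo_dir,
                           fetch = function(url, path) stop("no network"))
```

With the real fetcher it could be called as vapply(mdb_urls[1:10], fetch_if_missing, character(1), dir = "docs"). Note that GET() does not raise an error for HTTP error statuses such as 404, so a stricter version might also check the response before trusting the saved file.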
Get the list of files:
fils <- list.files("docs", pattern="\\.xml$", full.names=TRUE)
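A note on the pattern argument: list.files() interprets it as a regular expression, not a shell glob, so an unescaped dot matches any character. The sketch below compares a loose pattern with an escaped, end-anchored one using grepl() on a few made-up file names (the names are illustrative only):

```r
# Made-up file names, for illustration only.
candidates <- c("aken_jan.xml", "notes_xml", "aken_jan.xml.bak", "index.XML")

# Unescaped dots match any character, and nothing anchors the match to the end.
loose <- grepl(".*.xml", candidates)

# Escaped dot plus `$` keeps only names that really end in ".xml".
strict <- grepl("\\.xml$", candidates)

data.frame(candidates, loose, strict)
```

Matching is case-sensitive by default; list.files() has an ignore.case argument if upper-case extensions such as "index.XML" should be picked up as well.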
Turn it into a data frame:
pb <- progress_estimated(length(fils)) # use a progress bar
map_df(fils, function(x) {
  pb$tick()$print() # increment the progress bar
  gremium_doc <- read_xml(x) # read in the file
  # find all the `gremiumName`s. If there are none, make the value `NA`
  xml_find_all(gremium_doc, "//gremiumName") %>% xml_text() -> g_names
  if (length(g_names) == 0) g_names <- NA_character_
  # make a tidy data frame
  data_frame(gremium=file_path_sans_ext(basename(x)), name=g_names)
}) -> df
Proof that it works:
glimpse(df)
## Observations: 33
## Variables: 2
## $ gremium <chr> "aken_jan", "aken_jan", "aken_jan", "aken_jan", "alban...
## $ name <chr> "Auswärtiger Ausschuss", "Gremium nach § 23c Absatz 8 ...
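For readers who would rather not depend on the tidyverse, the same per-file extraction can be sketched with lapply() and rbind(). This is an alternative formulation, not the code used above. The two demo files, their contents, and the committee names are made up so the example runs offline, and the first column is called member here (a hypothetical name) because it holds the file's base name.

```r
library(xml2)

# Two made-up biography files: one with committees, one without any.
demo_dir <- tempfile("bio_demo_")
dir.create(demo_dir)
writeLines(
  "<mdb><gremiumName>Committee A</gremiumName><gremiumName>Committee B</gremiumName></mdb>",
  file.path(demo_dir, "member_a.xml")
)
writeLines("<mdb><mdbLand>Hamburg</mdbLand></mdb>",
           file.path(demo_dir, "member_b.xml"))

# Read one file and return one row per gremiumName (or a single NA row).
extract_gremien <- function(path) {
  doc <- read_xml(path)
  g_names <- xml_text(xml_find_all(doc, "//gremiumName"))
  if (length(g_names) == 0) g_names <- NA_character_
  data.frame(
    member = tools::file_path_sans_ext(basename(path)),
    name = g_names,
    stringsAsFactors = FALSE
  )
}

fils <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)
result <- do.call(rbind, lapply(fils, extract_gremien))
result
```

Keeping the NA row for files without any gremiumName means every downloaded file stays visible in the result, which makes it easier to spot members with no committee entries.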