Scraping an XML document (nested URL structure)


Question

I have a problem scraping information from a certain XML document (http://www.bundestag.de/xml/mdb/index.xml):

<mdbUebersicht>
  <dokumentInfo>
    <dokumentURL/>
    <dokumentStand/>
  </dokumentInfo>
  <deleteRestore>
    <deleteFlag>0</deleteFlag>
    <deleteDate>20131202170000</deleteDate>
  </deleteRestore>
  <mdbs>
    <mdb fraktion="Die Linke">
      <mdbID status="Aktiv">1627</mdbID>
      <mdbName status="Aktiv">Aken, Jan van</mdbName>
      <mdbBioURL>
        http://www.bundestag.de/abgeordnete18/biografien/A/aken_jan/258124
      </mdbBioURL>
      <mdbInfoXMLURL>
        http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml
      </mdbInfoXMLURL>
      <mdbInfoXMLURLMitmischen>/biografien/A/aken_jan.xml</mdbInfoXMLURLMitmischen>
      <mdbLand>Hamburg</mdbLand>
      <mdbFotoURL>
        http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/177/265/83abda4f387877a2b5eeedbfd81e8eba/Yc/aken_jan_gross.jpg
      </mdbFotoURL>
      <mdbFotoGrossURL>
        http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/316/475/83abda4f387877a2b5eeedbfd81e8eba/Uq/aken_jan_gross.jpg
      </mdbFotoGrossURL>
      <mdbFotoLastChanged>24.10.2016</mdbFotoLastChanged>
      <mdbFotoChangedDateTime>24.10.2016 12:17</mdbFotoChangedDateTime>
      <lastChanged>30.09.2016</lastChanged>
      <changedDateTime>30.09.2016 12:38</changedDateTime>
    </mdb>

The document contains short biographical details for many different people. Among other things, it contains URLs to other XML documents that hold a more detailed biography.

I tried the following to get the information.

First, I get all the URLs for the different sub-documents from the main document:

xmlDocu <- read_xml("http://www.bundestag.de/xml/mdb/index.xml")
mdb_url <- xml_text(xml_find_all(xmlDocu, "//mdbInfoXMLURL"))

Then I implemented a for loop that downloads all the XML files into my directory:

for (url in mdb_url) {
  download.file(url, destfile = basename(url))
}

Afterwards I want to get a list of the files ...

files <- list.files(pattern = ".xml")

... in order to get a specific node of every XML doc:

Bio1 <- files[1]

xmlfile <- read_xml(Bio1)

mdb_ausschuss1 <- xml_text(xml_find_all(xmlfile, "//gremiumName"))

Now I have the problem: how can I do this for all the XML files in the list? I haven't been able to write a working loop or script for that task.

Answer

library(xml2)
library(httr)
library(rvest)
library(tools)
library(tidyverse)

Get the URL list from the main site XML:

URL <- "http://www.bundestag.de/xml/mdb/index.xml"
doc <- read_xml(URL)
xml_find_all(doc, "//mdbInfoXMLURL") %>% xml_text() -> mdb_urls
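
A quick sanity check on the result can't hurt (my addition, not part of the original answer):

length(mdb_urls)   # how many biography URLs were found
head(mdb_urls, 3)  # peek at the first few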

Create a place to store them:

dir.create("docs")

Write them to disk (I'm only grabbing 10 of them since I don't need the data, but you do :-).

Note that write_disk() will not overwrite an existing path unless told to, so this is a great way to do poor man's caching. If you place this in a reproducible script, you'll have to wrap the call in tryCatch() or similar, as sketched after the next snippet.

walk(mdb_urls[1:10], ~GET(., write_disk(file.path("docs", basename(.)))))
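
A minimal sketch of that wrapping, assuming purrr's safely() (base tryCatch() works just as well); safe_get is purely an illustrative name:

# wrap GET so an error (e.g. the file already exists on disk)
# is captured in the result instead of aborting the walk
safe_get <- safely(function(u) GET(u, write_disk(file.path("docs", basename(u)))))
walk(mdb_urls[1:10], safe_get)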

Get a list of the files:

fils <- list.files("docs", pattern = "\\.xml$", full.names = TRUE)

Turn it all into a data frame:

pb <- progress_estimated(length(fils)) # use a progress bar
map_df(fils, function(x) {

  pb$tick()$print() # increment the progress bar

  gremium_doc <- read_xml(x) # read in the file

  # find all the `gremiumName`s. If there are none, make the value `NA`
  xml_find_all(gremium_doc, "//gremiumName") %>% xml_text() -> g_names
  if (length(g_names) == 0) g_names <- NA_character_

  # make a tidy data frame
  data_frame(gremium=file_path_sans_ext(basename(x)), name=g_names)

}) -> df

Proof it works:

glimpse(df)
## Observations: 33
## Variables: 2
## $ gremium <chr> "aken_jan", "aken_jan", "aken_jan", "aken_jan", "alban...
## $ name    <chr> "Auswärtiger Ausschuss", "Gremium nach § 23c Absatz 8 ...
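
As a hypothetical follow-up (not part of the original answer), the tidy shape makes summaries one-liners, e.g. counting committee entries per MdB with dplyr's count():

df %>% count(gremium, sort = TRUE)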
