How can I extract info from an XML page with R

Question

I'm trying to get all the info from this page: http://ws.parlament.ch/affairs/19110758/?format=xml

First I download the file to destfile and then parse it with xmlParse().

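library(XML)
destfile <- tempfile(fileext = ".xml")  # assumed; the original does not show how destfile is defined
download.file(url = "http://ws.parlament.ch/affairs/19110758/?format=xml", destfile = destfile)
file <- xmlParse(destfile)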

I now want to extract all the information I need, for example the title and the ID number. I tried something like this:

title <- xpathSApply(file, "//h2", xmlValue)

But this only gives me an error: unable to find an inherited method for function ‘saveXML’ for signature ‘"XMLDocument"’

The next thing I tried was:

library(plyr)

test <- ldply(xmlToList(file), function(x) data.frame(x[names(x) != "id"]))

This gives me a data.frame with some info, but I lose information such as the id (which is the most important).

I'd like to get a data.frame with exactly one row per affair, containing all the information for that affair, such as id, updated, additionalIndexing, affairType, etc.

With this, it works (example for id):

infofile <- xmlRoot(file)

nodes <- getNodeSet(file, "//affair/id")
id <- as.numeric(lapply(nodes, function(x) xmlSApply(x, xmlValue)))
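
The id approach above could probably be extended to several fields at once. The following is only a sketch under assumptions: it assumes each affair is an <affair> element whose fields are simple child elements, the inline sample document and its values are made up for illustration (only the element names come from this question), and affair_to_row is a hypothetical helper name.

```r
library(XML)

# Made-up sample standing in for the parsed response; only the element
# names (id, updated, affairType) are taken from the question.
sample_xml <- "<affair>
  <id>19110758</id>
  <updated>2014-01-01</updated>
  <affairType>Motion</affairType>
</affair>"
doc <- xmlParse(sample_xml, asText = TRUE)

# Hypothetical helper: turn one affair node into a one-row data.frame
# holding the requested fields (NA when a field is missing).
affair_to_row <- function(node, fields) {
  values <- lapply(fields, function(f) {
    v <- xpathSApply(node, paste0("./", f), xmlValue)
    if (length(v) == 0) NA_character_ else paste(v, collapse = "; ")
  })
  names(values) <- fields
  as.data.frame(values, stringsAsFactors = FALSE)
}

fields <- c("id", "updated", "affairType")
rows <- lapply(getNodeSet(doc, "//affair"), affair_to_row, fields = fields)
affairs <- do.call(rbind, rows)  # one row per <affair> node
```

Fields that contain nested elements in the real response would have their text concatenated by xmlValue(), so they may need their own XPath expressions.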

Recommended answer

This will get you to your XML:

library(XML)
library(RCurl)
library(httr)

srcXML <- getURL("http://ws.parlament.ch/affairs/19110758/?format=xml", 
            .opts=c(user_agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"),
              verbose()))

myXMLFile <- xmlTreeParse(substr(srcXML,4,nchar(srcXML)))

I would have used just GET() from httr, but it doesn't seem to pass the user-agent along well (I need to test it when I'm not behind a proxy to be sure what the specific error is). I also used substr() because there are a few weird characters at the front of the response that cause the xmlTreeParse() call to error out.
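
The substr() workaround hard-codes how many stray characters to skip. A slightly more general variant might look like the sketch below. It assumes the junk is everything before the first "<" (possibly a byte order mark, though that is a guess), and it swaps in xmlParse() for xmlTreeParse(), since the XPath functions in the XML package expect an internal document rather than the R-level tree that xmlTreeParse() returns by default. The sample string stands in for srcXML and its content is made up.

```r
library(XML)

# Made-up stand-in for srcXML: three stray characters followed by the XML.
srcXML <- paste0(
  "\u00ef\u00bb\u00bf",
  "<affair><id>19110758</id><title>Sample</title></affair>"
)

# Remove anything that precedes the first "<"
cleanXML <- sub("^[^<]*", "", srcXML)

# xmlParse() returns an internal document that xpathSApply() can query
doc <- xmlParse(cleanXML, asText = TRUE)
id <- as.numeric(xpathSApply(doc, "//affair/id", xmlValue))
```

With an internal document in hand, the getNodeSet() and xpathSApply() calls from the question should work without the saveXML error.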
