Can't Extract Data from XML using R


Question

I know this topic has been discussed extensively here. I've found quite a few questions on the same thing, but I still can't figure out how to parse this XML file. I'm using R and I want to pull the longitude and latitude from the file.

I'm using this data and this guide but can't seem to make it work.

Here is what I did:

require(XML)  
data <- xmlParse("http://www.donatingplasma.org/index.php?option=com_storelocator&format=feed&searchall=1&Itemid=166&catid=-1&tagid=-1&featstate=0")
xml_data <- xmlToList(data)



That all works fine. The XML file is now a "large list." When I try to extract the latitude and longitude, I'm lost. I tried:

location <- as.list(xml_data[["marker"]][["lat"]])

and got a list with a single element.

How would I go about pulling the latitude and longitude from this XML data?

Sample of the data structure

<markers>
<limited>0</limited>
<marker>
<name>ADMA BioCenters</name>
<category>IQPP Certified</category>
<markertype>
/media/com_storelocator/markers/100713214004000000jl_marker2.png
</markertype>
<featured>false</featured>
<address>
6290 Jimmy Carter Boulevard, Suite 208, Norcross, Georgia 30071
</address>
<lat>33.9290629</lat>
<lng>-84.2204952</lng>
<distance>0</distance>
<fulladdress>
<![CDATA[
<p><img style="margin-left: auto; margin-right: auto;" src="images/jl_marker2.png" alt="jl marker2" width="22" height="22" />IQPP Certified</p>
]]>
</fulladdress>
<phone>678-495-5800</phone>
<url>http://www.atlantaplasma.com</url>
<email/>
<facebook/>
<twitter/>
<tags>
<![CDATA[ ]]>
</tags>
<custom1 name="Custom Field 1">
<![CDATA[ ]]>
</custom1>
<custom2 name="Custom Field 2">
<![CDATA[ ]]>
</custom2>
<custom3 name="Custom Field 3">
<![CDATA[ ]]>
</custom3>
<custom4 name="Custom Field 4">
<![CDATA[ ]]>
</custom4>
<custom5 name="Custom Field 5">
<![CDATA[ ]]>
</custom5>


Recommended answer

Use xpathSApply on the original XML rather than going through the list.

lat <- xpathSApply(data, '//marker/lat', xmlValue)
long <- xpathSApply(data, '//marker/lng', xmlValue)

Result:

> head(cbind(lat, long))
     lat          long         
[1,] "33.9290629" "-84.2204952"
[2,] "48.3097292" "14.299297"  
[3,] "41.6134569" "-87.514584" 
[4,] "41.5878273" "-87.3369907"
[5,] "39.98504"   "-83.004705" 
[6,] "43.2056277" "-86.2708023"
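
To illustrate why the list-based attempt in the question returns only one value, here is a minimal sketch on a small inline document. The two-marker XML string and the variable names are made up for illustration and only mirror the structure of the sample above. After xmlToList, every marker becomes a list element named "marker", and `[[` with a name returns only the first match.

```r
library(XML)

# Hypothetical two-marker document mirroring the sample structure
xml_text <- paste0(
  "<markers><limited>0</limited>",
  "<marker><name>A</name><lat>33.9290629</lat><lng>-84.2204952</lng></marker>",
  "<marker><name>B</name><lat>48.3097292</lat><lng>14.299297</lng></marker>",
  "</markers>"
)
doc <- xmlParse(xml_text, asText = TRUE)
lst <- xmlToList(doc)

# `[[` with a name matches only the first element called "marker"
first_lat <- lst[["marker"]][["lat"]]

# Selecting by name keeps every marker, so sapply can visit them all
markers <- lst[names(lst) == "marker"]
lat_from_list <- unname(sapply(markers, function(m) m[["lat"]]))

# The XPath route reaches the same nodes without the intermediate list
lat_from_xpath <- unname(xpathSApply(doc, "//marker/lat", xmlValue))
```

Both routes should give the same character vector here; the XPath version simply avoids converting the whole document to a list first.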

Based on @Martin Morgan's comment, I thought it would be good to benchmark different strategies here:

> microbenchmark(xpathSApply(data, '//marker/lat', xmlValue),
                 sapply(data["//marker/lat"], xmlValue),
                 sapply(data["//marker/lat"], as, "numeric"))
Unit: milliseconds
                                        expr       min        lq   median       uq      max neval
 xpathSApply(data, "//marker/lat", xmlValue)  67.03714  97.57796 100.1633 102.1815 213.3031   100
      sapply(data["//marker/lat"], xmlValue)  72.73847 103.63095 106.1037 108.2251 132.6314   100
 sapply(data["//marker/lat"], as, "numeric") 257.16364 346.13708 389.3025 394.3669 598.3736   100

Clearly, the last strategy is the least efficient, which makes sense because it invokes type conversion on each node. But that is not a completely fair test, since the last expression yields numeric output while the first two yield character output. Hence a second test:

> microbenchmark(as.numeric(xpathSApply(data, '//marker/lat', xmlValue)), 
                 as.numeric(sapply(data["//marker/lat"], xmlValue)), 
                 sapply(data["//marker/lat"], as, "numeric"))
Unit: milliseconds
                                                    expr       min        lq    median       uq      max neval
 as.numeric(xpathSApply(data, "//marker/lat", xmlValue))  60.29744  80.08186  97.94924 100.9548 189.0797   100
      as.numeric(sapply(data["//marker/lat"], xmlValue))  59.45891  85.47169 103.68015 106.5882 124.5708   100
             sapply(data["//marker/lat"], as, "numeric") 210.92816 339.54831 384.28481 392.0001 481.4498   100

Again, xpathSApply and sapply (with an XPath extraction) yield very similar results. So a modified version of Martin's first solution:

lat <- as.numeric(sapply(data["//marker/lat"], xmlValue))

may be the best strategy here.
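
As a rough sketch of how that strategy might be applied to both coordinates at once, the snippet below collects latitude and longitude into a numeric data frame. The inline XML string and the helper name get_num are assumptions made for illustration, not part of the original answer, and the approach assumes every marker has both a lat and a lng child so that the two vectors line up.

```r
library(XML)

# Hypothetical two-marker document mirroring the sample structure
xml_text <- paste0(
  "<markers><limited>0</limited>",
  "<marker><name>A</name><lat>33.9290629</lat><lng>-84.2204952</lng></marker>",
  "<marker><name>B</name><lat>48.3097292</lat><lng>14.299297</lng></marker>",
  "</markers>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: extract node text by XPath, then convert once
get_num <- function(path) as.numeric(sapply(doc[path], xmlValue))

coords <- data.frame(
  lat = get_num("//marker/lat"),
  lng = get_num("//marker/lng")
)
```

Converting the whole character vector with a single as.numeric call matches the faster variants in the benchmarks above, as opposed to coercing each node individually.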
