Can't Extract Data from XML using R
Problem description
So I know this topic has been discussed extensively on here. I've found quite a few questions on the same thing but still can't figure out how to parse this XML file. I'm using R and I want to pull the longitude and latitude from the file.
I'm using this data and this guide but can't seem to make it work.
Here's what I've done:
require(XML)
data <- xmlParse("http://www.donatingplasma.org/index.php?option=com_storelocator&format=feed&searchall=1&Itemid=166&catid=-1&tagid=-1&featstate=0")
xml_data <- xmlToList(data)
That all works fine. The XML file is now a "large list." When I try to extract the latitude and longitude, I'm lost. I tried:
location <- as.list(xml_data[["marker"]][["lat"]])
and got back a list of length 1.
How would I go about pulling the latitude and longitude from this XML data?
Sample of the data structure
<markers>
<limited>0</limited>
<marker>
<name>ADMA BioCenters</name>
<category>IQPP Certified</category>
<markertype>
/media/com_storelocator/markers/100713214004000000jl_marker2.png
</markertype>
<featured>false</featured>
<address>
6290 Jimmy Carter Boulevard, Suite 208, Norcross, Georgia 30071
</address>
<lat>33.9290629</lat>
<lng>-84.2204952</lng>
<distance>0</distance>
<fulladdress>
<![CDATA[
<p><img style="margin-left: auto; margin-right: auto;" src="images/jl_marker2.png" alt="jl marker2" width="22" height="22" />IQPP Certified</p>
]]>
</fulladdress>
<phone>678-495-5800</phone>
<url>http://www.atlantaplasma.com</url>
<email/>
<facebook/>
<twitter/>
<tags>
<![CDATA[ ]]>
</tags>
<custom1 name="Custom Field 1">
<![CDATA[ ]]>
</custom1>
<custom2 name="Custom Field 2">
<![CDATA[ ]]>
</custom2>
<custom3 name="Custom Field 3">
<![CDATA[ ]]>
</custom3>
<custom4 name="Custom Field 4">
<![CDATA[ ]]>
</custom4>
<custom5 name="Custom Field 5">
<![CDATA[ ]]>
</custom5>
Recommended answer
Use xpathSApply on the original XML rather than going through the list.
lat <- xpathSApply(data, '//marker/lat', xmlValue)
long <- xpathSApply(data, '//marker/lng', xmlValue)
Result:
> head(cbind(lat, long))
lat long
[1,] "33.9290629" "-84.2204952"
[2,] "48.3097292" "14.299297"
[3,] "41.6134569" "-87.514584"
[4,] "41.5878273" "-87.3369907"
[5,] "39.98504" "-83.004705"
[6,] "43.2056277" "-86.2708023"
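A likely reason the list approach returned a single value: xmlToList() produces a list in which every marker element carries the same name, "marker", and `[[` with a name returns only the first match. The sketch below illustrates this on a small inline document. The two-marker XML string and the names xml_text, doc, xml_list and first_only are made up for illustration and are not the live feed.

```r
library(XML)

# Hypothetical two-marker document mirroring the structure in the question
xml_text <- '<markers>
  <limited>0</limited>
  <marker><name>A</name><lat>33.9290629</lat><lng>-84.2204952</lng></marker>
  <marker><name>B</name><lat>48.3097292</lat><lng>14.299297</lng></marker>
</markers>'

doc <- xmlParse(xml_text, asText = TRUE)

# The converted list has several elements named "marker";
# [["marker"]] picks out only the first one
xml_list <- xmlToList(doc)
first_only <- xml_list[["marker"]][["lat"]]

# An XPath query visits every matching node instead
lat <- xpathSApply(doc, "//marker/lat", xmlValue)
lng <- xpathSApply(doc, "//marker/lng", xmlValue)
```

Here first_only holds just the first marker's latitude, while lat and lng hold one value per marker.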
Based on @Martin Morgan's comment, I thought it would be good to benchmark different strategies here:
> microbenchmark(xpathSApply(data, '//marker/lat', xmlValue),
sapply(data["//marker/lat"], xmlValue),
sapply(data["//marker/lat"], as, "numeric"))
Unit: milliseconds
expr min lq median uq max neval
xpathSApply(data, "//marker/lat", xmlValue) 67.03714 97.57796 100.1633 102.1815 213.3031 100
sapply(data["//marker/lat"], xmlValue) 72.73847 103.63095 106.1037 108.2251 132.6314 100
sapply(data["//marker/lat"], as, "numeric") 257.16364 346.13708 389.3025 394.3669 598.3736 100
Clearly, the last strategy is the least efficient, which makes sense because it invokes type conversion on each node. But that makes it not a completely fair test, since the last expression yields numeric output while the first two yield character output. Thus a second test:
> microbenchmark(as.numeric(xpathSApply(data, '//marker/lat', xmlValue)),
as.numeric(sapply(data["//marker/lat"], xmlValue)),
sapply(data["//marker/lat"], as, "numeric"))
Unit: milliseconds
expr min lq median uq max neval
as.numeric(xpathSApply(data, "//marker/lat", xmlValue)) 60.29744 80.08186 97.94924 100.9548 189.0797 100
as.numeric(sapply(data["//marker/lat"], xmlValue)) 59.45891 85.47169 103.68015 106.5882 124.5708 100
sapply(data["//marker/lat"], as, "numeric") 210.92816 339.54831 384.28481 392.0001 481.4498 100
Again, using either xpathSApply or sapply (with an XPath extraction) yields really similar results. So a modified version of Martin's first solution:
lat <- as.numeric(sapply(data["//marker/lat"], xmlValue))
may be the best strategy here.
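As a rough sketch of how that strategy could be extended to both coordinates, the two numeric vectors can be combined into a data frame. The inline XML string and the names xml_text, doc and coords are assumptions for illustration. The sketch also assumes every marker has both a lat and a lng child; otherwise the columns would differ in length and data.frame() would fail.

```r
library(XML)

# Hypothetical stand-in for the parsed feed
xml_text <- paste0(
  "<markers>",
  "<marker><lat>33.9290629</lat><lng>-84.2204952</lng></marker>",
  "<marker><lat>48.3097292</lat><lng>14.299297</lng></marker>",
  "</markers>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# Extract the text of each node, then convert the whole vector at once
coords <- data.frame(
  lat = as.numeric(sapply(doc["//marker/lat"], xmlValue)),
  lng = as.numeric(sapply(doc["//marker/lng"], xmlValue))
)
```

Converting with a single as.numeric() call on the character vector avoids the per-node type conversion that was slow in the benchmarks.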