Parsing Google Earth KML file in Python (lxml, namespaces)

Question

I am trying to parse a .kml file in Python using the lxml module (after failing to make this work in BeautifulSoup, which I use for HTML).

As this is my first time doing this, I followed the official tutorial, and all went well until I tried to construct an iterator to extract my data by iterating from the root:

from lxml import etree
tree = etree.parse('kmlfile')

Here is the example from the tutorial I am trying to emulate:

If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:

for element in root.getiterator("child"):
    print(element.tag, '-', element.text)

I would like to get all data under 'Placemark', so I tried

for i in tree.getiterator("Placemark"):
    print(i, type(i))

which doesn't give me anything. What does work is:

for i in tree.getiterator("{http://www.opengis.net/kml/2.2}Placemark"):
    print(i, type(i))

I don't understand how this comes about. The www.opengis.net URL is listed in the tag at the beginning of the document (kml xmlns="http://www.opengis.net/kml/2.2"...), but I don't understand:

  • how the part in {} relates to my specific example at all

  • why it is different from the tutorial

Any help is much appreciated!

Recommended answer

Here is my solution. The most important thing to do is to read the explanation of namespaces posted by Tomalak. It is a really good description of namespaces and easy to understand.

We are going to use XPath to navigate the XML document. Its notation is similar to file system paths, where parents and descendants are separated by slashes (/). Note that some commands are different in the lxml implementation of XPath.

###Problem

Our goal is to extract the city name: the content of <name>, which sits under <Placemark>. Here's the relevant XML:

<Placemark> <name>CITY NAME</name> 

The XPath equivalent to the non-functional code I posted above is:

tree = etree.parse('kml document')
result = tree.xpath('//Placemark/name/text()')

The text() part is needed to get the text contained in the location //Placemark/name.

Now this doesn't work, as Tomalak pointed out, because the names of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the default namespace. It does not show up on the individual tags in the document (which confused me), but it is defined at the beginning of the XML document like this:

xmlns="http://www.opengis.net/kml/2.2"
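To see how the default namespace ends up in the tag names, here is a minimal sketch. It assumes an inline, made-up KML snippet instead of a real file, and it uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree reports tags in the same {namespace}name form.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal KML document used only for illustration.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>CITY NAME</name></Placemark>
  </Document>
</kml>"""

root = ET.fromstring(KML)

# The parser expands every unprefixed tag with the default namespace,
# even though the tags are written without it in the document.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

Every printed tag starts with the namespace URI in curly brackets, which is why a search for the bare name Placemark finds nothing.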

###Solution

We can supply namespaces to xpath by setting the namespaces argument:

xpath(X, namespaces={prefix: namespace})

This is easy enough for namespaces that have actual prefixes. In this document, for instance, there is <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>, where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
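As a rough illustration of the prefixed case, the sketch below looks up the gx element through a prefix-to-URI dictionary. The KML snippet is made up, and the lookup uses findall() from the standard library's xml.etree.ElementTree (a stand-in chosen so the example runs without lxml); lxml's findall() accepts the same namespaces mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical KML fragment that declares both the default and the gx namespace.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2"
     xmlns:gx="http://www.google.com/kml/ext/2.2">
  <Placemark>
    <name>CITY NAME</name>
    <Point>
      <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
      <coordinates>0,0,0</coordinates>
    </Point>
  </Placemark>
</kml>"""

root = ET.fromstring(KML)

# Map the prefix used in the search path to the namespace URI.
ns = {"gx": "http://www.google.com/kml/ext/2.2"}
modes = [e.text for e in root.findall(".//gx:altitudeMode", ns)]
print(modes)
```

The prefix in the dictionary only has to match the prefix in the search path; it is the URI that identifies the namespace.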

However, XPath does not understand what a default namespace is (cf. the docs). Therefore, we need to trick it, as Tomalak suggested: we invent a prefix for the default namespace and add it to our search terms. We can just call it kml, for instance. This piece of code actually does the trick:

tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
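The same trick can be tried end to end on a small inline document. This is only a sketch under stated assumptions: the two city names are invented, and findall() from the standard library's xml.etree.ElementTree stands in for lxml's xpath() so that nothing needs to be installed. Because findall() returns elements rather than text nodes, the text is read from each element instead of using text().

```python
import xml.etree.ElementTree as ET

# Hypothetical KML document with two placemarks.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>First city</name></Placemark>
    <Placemark><name>Second city</name></Placemark>
  </Document>
</kml>"""

root = ET.fromstring(KML)

# Without a namespace the search matches nothing.
unqualified = root.findall(".//Placemark/name")

# Invent the prefix "kml" for the default namespace and use it in the path.
ns = {"kml": "http://www.opengis.net/kml/2.2"}
names = [e.text for e in root.findall(".//kml:Placemark/kml:name", ns)]

print(unqualified)
print(names)
```

The prefix kml is arbitrary; any other name works as long as the dictionary and the path agree.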

The tutorial mentions that there is also ETXPath, which works just like XPath except that one writes the namespaces out in curly brackets instead of defining them in a dictionary. Thus, the input would be of the style {http://www.opengis.net/kml/2.2}Placemark.
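The curly-bracket style can also be used directly in iter() and findtext(), which is what made the loop in the question work. The sketch below is not ETXPath itself; it assumes a made-up document and uses the standard library's xml.etree.ElementTree, whose iter() and findtext() accept the same {namespace}name notation as their lxml counterparts. KML_NS is a helper name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI wrapped in curly brackets, ready to be prepended to tag names.
KML_NS = "{http://www.opengis.net/kml/2.2}"

# Hypothetical KML document with two placemarks.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>First city</name></Placemark>
    <Placemark><name>Second city</name></Placemark>
  </Document>
</kml>"""

root = ET.fromstring(KML)

# Iterate over fully qualified Placemark elements and read each child name.
names = [
    placemark.findtext(KML_NS + "name")
    for placemark in root.iter(KML_NS + "Placemark")
]
print(names)
```

Writing the URI out in every path is more verbose than a prefix dictionary, but it avoids inventing a prefix for the default namespace.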
