XML to CSV in Python


Question

I'm having a lot of trouble converting an XML file to a CSV in Python. I've looked at many forums, tried both lxml and xmlutils.xml2csv, but I can't get it to work. It's GPS data from a Garmin GPS device.

Here's what my XML file looks like, shortened of course:

<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
  <trk>
      <name>2013-12-03T21:08:56Z</name>
      <trkseg>
          <trkpt lat="45.4852855" lon="-122.6347885">
              <ele>0.0000000</ele>
              <time>2013-12-03T21:08:56Z</time>
          </trkpt>
          <trkpt lat="45.4852961" lon="-122.6347926">
              <ele>0.0000000</ele>
              <time>2013-12-03T21:09:00Z</time>
          </trkpt>
          <trkpt lat="45.4852982" lon="-122.6347897">
              <ele>0.2000000</ele>
              <time>2013-12-03T21:09:01Z</time>
          </trkpt>
      </trkseg>
  </trk>
</gpx>

There are several trk tags in my massive XML file, but I can manage to separate them out -- they represent different "segments" or trips on the GPS device. All I want is a CSV file that looks something like this:

LAT         LON         TIME         ELE
45.4...     -122.6...   2013-12...   0.00...
...         ...         ...          ...

Here's the code I have so far:

## Call libraries
import csv
from xmlutils.xml2csv import xml2csv

inputs = "myfile.xml"
output = "myfile.csv"

converter = xml2csv(inputs, output)
converter.convert(tag="WHATEVER_GOES_HERE_RENDERS_EMPTY_CSV")

Here is an alternative attempt. It outputs a CSV file with no data, just the headers lat and lon.

import csv
import lxml.etree

x = '''
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
<trk>
  <name>2013-12-03T21:08:56Z</name>
  <trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:08:56Z</time>
    </trkpt>
    <trkpt lat="45.4852961" lon="-122.6347926">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:09:00Z</time>
    </trkpt>
    <trkpt lat="45.4852982" lon="-122.6347897">
      <ele>0.2000000</ele>
      <time>2013-12-03T21:09:01Z</time>
    </trkpt>
  </trkseg>
</trk>
</gpx>
'''

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('lat', 'lon'))
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('trkpt'):
        row = trkpt.get('lat'), trkpt.get('lon')
        writer.writerow(row)

How can I do this?

Recommended answer

This is a namespaced XML document. Therefore you need to address the nodes using their respective namespaces.

The namespaces used in the document are defined at the top:

xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
xmlns="http://www.topografix.com/GPX/1/1"

So the first namespace is mapped to the short form tc2, and would be used in an element like <tc2:foobar/>. The last one, which doesn't have a short form after the xmlns, is called the default namespace, and it applies to all elements in the document that don't explicitly use a namespace prefix, so it applies to your <trkpt /> elements as well.
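
As a small illustration of the point above, the sketch below parses a cut-down version of the GPX sample and shows that the parser stores the default namespace inside each tag name. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made only so the snippet has no third-party dependency; lxml uses the same {namespace}tag notation). The names SAMPLE, GPX_NS, unqualified and qualified are made up for this example.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"

# A shortened copy of the GPX document from the question.
SAMPLE = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    "<trk><trkseg>"
    '<trkpt lat="45.4852855" lon="-122.6347885"><ele>0.0000000</ele></trkpt>'
    "</trkseg></trk>"
    "</gpx>"
)

root = ET.fromstring(SAMPLE)

# The namespace URI becomes part of the tag, wrapped in curly braces.
root_tag = root.tag
print(root_tag)  # {http://www.topografix.com/GPX/1/1}gpx

# A bare tag name matches nothing, because no element is literally "trkpt".
unqualified = list(root.iter("trkpt"))

# The namespace-qualified name matches the track point.
qualified = list(root.iter("{%s}trkpt" % GPX_NS))

print(len(unqualified), len(qualified))  # 0 1
```

This is why the lxml attempt in the question wrote only the header row: the loop over root.iter('trkpt') never ran.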

So you need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select the trkpt elements.

In order to also get time and elevation, you can use trkpt.find() to access these elements below the trkpt node, and then element.text to retrieve those elements' text content (as opposed to attributes like lat and lon). Also, because the time and ele elements also use the default namespace you'll have to use the {namespace}element syntax again to select those nodes.
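
One thing to watch for: find() returns None when a child element is missing, so calling .text on the result would raise AttributeError. The sketch below is one possible way to guard against that, using findtext() with a default value together with a prefix-to-URI mapping instead of repeating the full {namespace} string. It uses the standard library's xml.etree.ElementTree (lxml offers find/findtext with a namespaces argument as well, as far as I know). The prefix gpx, the helper point_fields, and the second track point without an ele element are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# The prefix "gpx" is an arbitrary local alias; only the URI has to match.
NSMAP = {"gpx": "http://www.topografix.com/GPX/1/1"}

# Second point deliberately has no <ele> child (illustrative, not real data).
SAMPLE = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    "<trk><trkseg>"
    '<trkpt lat="45.4852855" lon="-122.6347885">'
    "<ele>0.0000000</ele><time>2013-12-03T21:08:56Z</time>"
    "</trkpt>"
    '<trkpt lat="45.4852961" lon="-122.6347926">'
    "<time>2013-12-03T21:09:00Z</time>"
    "</trkpt>"
    "</trkseg></trk>"
    "</gpx>"
)


def point_fields(trkpt):
    """Return (lat, lon, ele, time) for one trkpt; missing children become ''."""
    return (
        trkpt.get("lat"),  # attributes are read with get()
        trkpt.get("lon"),
        trkpt.findtext("gpx:ele", default="", namespaces=NSMAP),
        trkpt.findtext("gpx:time", default="", namespaces=NSMAP),
    )


root = ET.fromstring(SAMPLE)
rows = [point_fields(p) for p in root.iterfind(".//gpx:trkpt", NSMAP)]
for row in rows:
    print(row)
```

Whether an empty string is the right placeholder for a missing value depends on how the CSV will be consumed.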

So you could use something like this:

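import csv
import lxml.etree

NS = 'http://www.topografix.com/GPX/1/1'
header = ('lat', 'lon', 'ele', 'time')

# x is the XML string defined in the question's code.
# On Python 3, lxml rejects str input that carries an encoding declaration,
# and the declaration must be the first thing in the document, so strip the
# leading newline and pass bytes.
root = lxml.etree.fromstring(x.strip().encode('utf-8'))

# newline='' lets the csv module control line endings (Python 3).
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for trkpt in root.iter('{%s}trkpt' % NS):
        # lat and lon are attributes of the trkpt element
        lat = trkpt.get('lat')
        lon = trkpt.get('lon')
        # ele and time are child elements, so read their text content
        ele = trkpt.find('{%s}ele' % NS).text
        time = trkpt.find('{%s}time' % NS).text

        row = lat, lon, ele, time
        writer.writerow(row)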
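
Since the question mentions a massive XML file, a streaming variant may be worth considering. The following is only a sketch under a few assumptions: it swaps in the standard library's xml.etree.ElementTree.iterparse instead of lxml, it reads from a file path rather than a string, and the function name gpx_to_csv and its parameters are hypothetical. It writes empty strings for missing ele or time children.

```python
import csv
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"
TRKPT_TAG = "{%s}trkpt" % GPX_NS


def gpx_to_csv(gpx_path, csv_path):
    """Write one CSV row per trkpt element; return the number of rows written."""
    count = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(("lat", "lon", "ele", "time"))
        # iterparse yields each element once its closing tag has been read,
        # so the children of a trkpt are complete when we see it.
        for _event, elem in ET.iterparse(gpx_path, events=("end",)):
            if elem.tag != TRKPT_TAG:
                continue
            writer.writerow((
                elem.get("lat"),
                elem.get("lon"),
                elem.findtext("{%s}ele" % GPX_NS, default=""),
                elem.findtext("{%s}time" % GPX_NS, default=""),
            ))
            count += 1
            # Drop the point's children and attributes to reduce memory use;
            # the emptied element itself stays attached to its parent.
            elem.clear()
    return count
```

With the file names from the question, a call would look like gpx_to_csv("myfile.xml", "myfile.csv"). Points from all trk elements end up in the same CSV, in document order.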

For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces. Also see GPS eXchange Format for some details on the .gpx format.
