XML to CSV in Python
Question
I'm having a lot of trouble converting an XML file to a CSV in Python. I've looked at many forums, tried both lxml and xmlutils.xml2csv, but I can't get it to work. It's GPS data from a Garmin GPS device.
Here's what my XML file looks like, shortened of course:
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
  <trk>
    <name>2013-12-03T21:08:56Z</name>
    <trkseg>
      <trkpt lat="45.4852855" lon="-122.6347885">
        <ele>0.0000000</ele>
        <time>2013-12-03T21:08:56Z</time>
      </trkpt>
      <trkpt lat="45.4852961" lon="-122.6347926">
        <ele>0.0000000</ele>
        <time>2013-12-03T21:09:00Z</time>
      </trkpt>
      <trkpt lat="45.4852982" lon="-122.6347897">
        <ele>0.2000000</ele>
        <time>2013-12-03T21:09:01Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>
There are several trk tags in my massive XML file, but I can manage to separate them out -- they represent different "segments" or trips on the GPS device. All I want is a CSV file that plots something like this:
LAT LON TIME ELE
45.4... -122.6... 2013-12... 0.00...
... ... ... ...
Here's the code I have so far:
## Call libraries
import csv
from xmlutils.xml2csv import xml2csv
inputs = "myfile.xml"
output = "myfile.csv"
converter = xml2csv(inputs, output)
converter.convert(tag="WHATEVER_GOES_HERE_RENDERS_EMPTY_CSV")
Here is an alternative attempt. It only outputs a CSV file with no data, just the headers lat and lon.
import csv
import lxml.etree
x = '''
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
  <trk>
    <name>2013-12-03T21:08:56Z</name>
    <trkseg>
      <trkpt lat="45.4852855" lon="-122.6347885">
        <ele>0.0000000</ele>
        <time>2013-12-03T21:08:56Z</time>
      </trkpt>
      <trkpt lat="45.4852961" lon="-122.6347926">
        <ele>0.0000000</ele>
        <time>2013-12-03T21:09:00Z</time>
      </trkpt>
      <trkpt lat="45.4852982" lon="-122.6347897">
        <ele>0.2000000</ele>
        <time>2013-12-03T21:09:01Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>
'''
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('lat', 'lon'))
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('trkpt'):
        row = trkpt.get('lat'), trkpt.get('lon')
        writer.writerow(row)
How can I do this?
Recommended answer
This is a namespaced XML document. Therefore you need to address the nodes using their respective namespaces.
The namespaces used in the document are defined at the top:
xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
xmlns="http://www.topografix.com/GPX/1/1"
So the first namespace is mapped to the short form tc2, and would be used in an element like <tc2:foobar/>. The last one, which doesn't have a short form after the xmlns, is called the default namespace, and it applies to all elements in the document that don't explicitly use a namespace, so it applies to your <trkpt /> elements as well.
Therefore you need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select these elements.
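To make the effect of the default namespace concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; both use the same {namespace}tag naming convention. The cut-down document and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"

# A cut-down document with the same default namespace as the question's file.
doc = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    '<trk><trkseg>'
    '<trkpt lat="45.4852855" lon="-122.6347885"/>'
    '<trkpt lat="45.4852961" lon="-122.6347926"/>'
    '</trkseg></trk>'
    '</gpx>'
)

root = ET.fromstring(doc)

# The parser stores the namespace URI inside the tag name itself.
print(root.tag)

# A bare tag name matches nothing, which is why the CSV came out empty.
unqualified = list(root.iter("trkpt"))

# The fully qualified name matches every track point.
qualified = list(root.iter("{%s}trkpt" % GPX_NS))

print(len(unqualified), len(qualified))
```

Printing root.tag on your own file is a quick way to check which namespace URI the elements actually carry.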
In order to also get time and elevation, you can use trkpt.find() to access these elements below the trkpt node, and then element.text to retrieve those elements' text content (as opposed to attributes like lat and lon). Also, because the time and ele elements also use the default namespace, you'll have to use the {namespace}element syntax again to select those nodes.
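As an alternative to spelling out {namespace}element each time, find() and findtext() also accept a prefix-to-URI mapping through their namespaces argument. The sketch below again uses the standard library's xml.etree.ElementTree so it runs on its own; the prefix "gpx" and the sample document are assumptions chosen for this example, not something defined in the original file.

```python
import xml.etree.ElementTree as ET

# The prefix "gpx" is an arbitrary local alias chosen for this sketch;
# only the URI has to match the document.
NSMAP = {"gpx": "http://www.topografix.com/GPX/1/1"}

doc = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    '<trk><trkseg>'
    '<trkpt lat="45.4852982" lon="-122.6347897">'
    '<ele>0.2000000</ele>'
    '<time>2013-12-03T21:09:01Z</time>'
    '</trkpt>'
    '</trkseg></trk>'
    '</gpx>'
)

root = ET.fromstring(doc)
trkpt = root.find(".//gpx:trkpt", NSMAP)

# lat is an attribute, so it is read with get().
lat = trkpt.get("lat")

# ele and time are child elements, so their values come from the text content.
ele = trkpt.find("gpx:ele", NSMAP).text
time = trkpt.findtext("gpx:time", namespaces=NSMAP)

print(lat, ele, time)
```

Whether the mapping or the explicit {namespace}element form reads better is mostly a matter of taste; both select the same nodes.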
So you could use something like this:
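import csv
import lxml.etree

NS = 'http://www.topografix.com/GPX/1/1'
header = ('lat', 'lon', 'ele', 'time')

# newline='' lets the csv module control line endings (Python 3)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    # x is the XML string from the question. On Python 3, lxml needs bytes
    # when the document carries an encoding declaration, and the declaration
    # must be the very first thing in the input, hence strip() and encode().
    root = lxml.etree.fromstring(x.strip().encode('utf-8'))
    for trkpt in root.iter('{%s}trkpt' % NS):
        # lat and lon are attributes of <trkpt>
        lat = trkpt.get('lat')
        lon = trkpt.get('lon')
        # ele and time are namespaced child elements
        ele = trkpt.find('{%s}ele' % NS).text
        time = trkpt.find('{%s}time' % NS).text
        row = lat, lon, ele, time
        writer.writerow(row)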
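Since the question mentions several trk elements in one file, the same idea can be extended to tag each row with its track name. The following is only a sketch under assumptions: gpx_to_csv is a hypothetical helper, the "track" column and the sample data are made up for illustration, and it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages. It uses findtext() with a default so that a track point lacking ele or time yields an empty cell instead of raising an AttributeError.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def gpx_to_csv(xml_text, out):
    """Write one CSV row per track point (hypothetical helper for illustration)."""
    root = ET.fromstring(xml_text)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(("track", "lat", "lon", "ele", "time"))
    for trk in root.iter("{%s}trk" % NS["gpx"]):
        # Use the track's <name> to tell the trips apart.
        name = trk.findtext("gpx:name", default="", namespaces=NS)
        for trkpt in trk.iter("{%s}trkpt" % NS["gpx"]):
            writer.writerow((
                name,
                trkpt.get("lat"),
                trkpt.get("lon"),
                # A missing child element becomes an empty cell.
                trkpt.findtext("gpx:ele", default="", namespaces=NS),
                trkpt.findtext("gpx:time", default="", namespaces=NS),
            ))


# Made-up sample with two tracks; the second point has no <ele>.
sample = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    '<trk><name>trip-1</name><trkseg>'
    '<trkpt lat="45.4852855" lon="-122.6347885">'
    '<ele>0.0000000</ele><time>2013-12-03T21:08:56Z</time></trkpt>'
    '</trkseg></trk>'
    '<trk><name>trip-2</name><trkseg>'
    '<trkpt lat="45.4852961" lon="-122.6347926">'
    '<time>2013-12-03T21:09:00Z</time></trkpt>'
    '</trkseg></trk>'
    '</gpx>'
)

buffer = io.StringIO()
gpx_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open('output.csv', 'w', newline='') in place of the StringIO buffer.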
For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces. Also see GPS eXchange Format for some details on the .gpx format.