Structure of XML file is preventing me from reading it with Python

Problem description

I'm setting up a Python script that asks for a list of input XML files, all in the same format, and reads out a specific line from each file.

Everything works as I want it to, except that I get an error when reading from the XML file because of the content of the file itself.

I got the script to work by editing the XML file, but that is not a solution for me because I need this script to run on thousands of files.

Here is the code I'm using:

import os
import tkinter as tk
from tkinter import filedialog
import xml.etree.ElementTree as ET


root = tk.Tk()
root.withdraw()

file_path = filedialog.askopenfilenames()

tup=0

count = len(file_path)

for i in range(len(file_path)):
    filename = os.path.basename(file_path[tup])
    print('file =',os.path.basename(' '.join(file_path)))
    tree = ET.parse(file_path[tup])
    root = tree.getroot()
    for child in root:
        data = child.tag
        print(data)
    for data in root.findall(data):
        name = data.find('subdata2').text
        print('ID =', name)
    tup +=1

And here is an example of the XML:

<?xml version="1.0"?>
<Data xmlns="link">
    <subdata1 id = "something">
        <subdata2>data
            <subdata3>data</subdata3>
        </subdata2>
    </subdata1>
</Data>

The problem comes from the text attached to the root element, xmlns="link". It changes the tag of subdata1 from

subdata1

to

{link}subdata1

This in turn changes the output from:

ID = data

to:

Traceback (most recent call last):
  File "debug.py", line 25, in <module>
    name = data.find('subdata2').text
AttributeError: 'NoneType' object has no attribute 'text'
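
The behaviour can be reproduced on the sample XML with the standard library alone. The following is a minimal sketch; the names SAMPLE and first_child are illustrative and do not appear in the original script.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, with the default namespace "link".
SAMPLE = """<?xml version="1.0"?>
<Data xmlns="link">
    <subdata1 id = "something">
        <subdata2>data
            <subdata3>data</subdata3>
        </subdata2>
    </subdata1>
</Data>"""

root = ET.fromstring(SAMPLE)
first_child = root[0]

# ElementTree stores the namespace as part of every tag name.
print(first_child.tag)  # {link}subdata1

# An unqualified lookup finds nothing, so calling .text on the
# result raises the AttributeError shown in the traceback.
print(first_child.find('subdata2'))  # None

# A namespace-qualified lookup finds the element.
print(first_child.find('{link}subdata2').text.strip())  # data
```

The unqualified name 'subdata2' simply does not match any tag once a default namespace is declared, which is why removing xmlns by hand made the script work.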

Is there another way of extracting the data from this XML file that doesn't involve modifying the file itself?

Recommended answer

You can strip the namespaces from the parsed XML tree instead of from the XML file itself.

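# file_path here is a single path, e.g. one entry of the tuple
# returned by filedialog.askopenfilenames()
tree = ET.iterparse(file_path)
for _, el in tree:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
root = tree.root  # set once the iterator has been fully consumed
for child in root:
    # ... (REST OF CODE)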
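
For a version that can be run as-is, here is a minimal sketch of the same idea applied to the sample XML from the question. The helper name parse_without_namespaces and the in-memory SAMPLE document are assumptions made for illustration; in the real script the argument would be one of the selected file paths.

```python
import io
import xml.etree.ElementTree as ET


def parse_without_namespaces(source):
    """Parse XML from a path or file object and drop the namespace from every tag.

    Hypothetical helper, not part of ElementTree.
    """
    it = ET.iterparse(source)  # yields ('end', element) pairs by default
    for _, el in it:
        if '}' in el.tag:
            # '{link}subdata1' -> 'subdata1'
            el.tag = el.tag.split('}', 1)[1]
    return it.root  # available after the iterator is exhausted


# Sample document from the question, held in memory for the demo.
SAMPLE = b"""<?xml version="1.0"?>
<Data xmlns="link">
    <subdata1 id = "something">
        <subdata2>data
            <subdata3>data</subdata3>
        </subdata2>
    </subdata1>
</Data>"""

root = parse_without_namespaces(io.BytesIO(SAMPLE))
for child in root:
    print(child.tag)
    print('ID =', child.find('subdata2').text.strip())
```

With the namespaces removed from the parsed tree, the original unqualified lookup find('subdata2') works again, and the files on disk stay untouched.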

Read more here.

Another option, if you don't mind a lack of speed but want ultimate simplicity, is untangle. Since your XML files are apparently all structured the same way, this might be convenient for you.

import untangle

root = untangle.parse(file_path)
print(root.Data.subdata1['id'])
print(root.Data.subdata1.subdata2.cdata)

I also forgot my favorite option: xmltodict converts XML into Python OrderedDict objects (plain dicts in recent releases).

import xmltodict

with open(xmlPath, 'rb') as fd:
    xmlDict = xmltodict.parse(fd)
print(xmlDict['Data']['subdata1']['@id'])
print(xmlDict['Data']['subdata1']['subdata2']['#text'])

As you can see, namespaces won't be an issue. And if you are familiar with Python dicts, it will be very simple to iterate through the result and find what you want.
