Undefined entity error while using ElementTree
Problem description
I have a set of XML files that I need to read and format into a single CSV file. In order to read from the XML files, I have used the solution mentioned here.
My code is as follows:
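from os import listdir
# Note: xml.etree.cElementTree was removed in Python 3.9;
# on current versions, import xml.etree.ElementTree instead.
import xml.etree.cElementTree as et

files = listdir(".../blogs/")
for i in range(len(files)):
    # fname = ".../blogs/" + files[i]
    f = open(".../blogs/" + files[i], 'r')
    contents = f.read()
    tree = et.fromstring(contents)  # the ParseError is raised here
    for el in tree.findall('post'):
        post = el.text
    f.close()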
This gives me the error "cElementTree.ParseError: undefined entity" at the line tree=et.fromstring(contents). Oddly enough, when I run each of the commands in the Python command-line interpreter (without the for-loop, though), it runs perfectly.
In case you want an idea of the XML:
<Blog>
<date> some date </date>
<post> some blog post </post>
</Blog>
So what is causing this error, and how come it doesn't run from the Python file, but runs from the command line?
Update: After reading this link, I checked files[0] and found that the '&' symbol occurs a few times. I think that might be causing the problem. When I ran the same commands on the command line, I used a random file to read from.
Recommended answer
As I mentioned in the update, there were some symbols ('&' in particular) that I suspected might be causing the problem. The error didn't come up when I ran the same lines on the command line because the file I picked at random happened not to contain any such characters.
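For context, ElementTree reports "undefined entity" when the text contains an entity reference (for example `&nbsp;`) that is neither one of XML's five predefined entities nor declared in a DTD. A bare `&` with no entity name is typically reported as a "not well-formed" error instead. The sketch below reproduces the error on a made-up sample string and shows one possible workaround that is not part of the original answer: escape the unsupported ampersands before parsing, then decode HTML entity names afterwards. The helper `parse_lenient`, the pattern `_BAD_AMP`, and the sample data are assumptions for illustration only.

```python
import html
import re
import xml.etree.ElementTree as et

# Matches any '&' that does NOT start a predefined XML entity or a
# numeric character reference (hypothetical helper pattern).
_BAD_AMP = re.compile(r'&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)')

def parse_lenient(contents):
    """Escape unsupported '&' occurrences, then parse with ElementTree."""
    return et.fromstring(_BAD_AMP.sub('&amp;', contents))

# Made-up sample resembling the blog files in the question.
sample = ("<Blog><date> some date </date>"
          "<post>Tom &amp; Jerry&nbsp;&copy; R&D</post></Blog>")

try:
    et.fromstring(sample)
except et.ParseError as err:
    print("strict parse failed:", err)

tree = parse_lenient(sample)
# After parsing, names such as &nbsp; survive as literal text;
# html.unescape turns them into the corresponding characters.
posts = [html.unescape(el.text) for el in tree.findall('post')]
print(posts)
```

This keeps ElementTree in charge of the document structure, at the cost of treating every unknown entity as plain text. If the files contain other markup problems, this approach will not help.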
Since I mainly required the content between the <post> and </post> tags, I created my own parser (as was suggested in the link mentioned in the update).
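from os import listdir

blog_dir = ".../blogs/"
open_tag, close_tag = '<post>', '</post>'

for fname in listdir(blog_dir):
    with open(blog_dir + fname, 'r') as f:
        contents = f.read()
    # Locate the first <post> ... </post> pair
    seek1 = contents.find(open_tag)
    seek2 = contents.find(close_tag, seek1 + 1)
    while seek1 != -1 and seek2 != -1:
        # Slice out only the text between the two tags
        post = contents[seek1 + len(open_tag):seek2]
        # Move on to the next pair, searching after the current closing tag
        seek1 = contents.find(open_tag, seek2)
        seek2 = contents.find(close_tag, seek1 + 1)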
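As an alternative sketch of the same idea (not the code from the answer), a non-greedy regular expression can pull out every `<post>` body in one call, and the `csv` module can then write the results to a single file, which was the goal stated in the question. The function names, the column layout, and the sample input are assumptions for illustration.

```python
import csv
import html
import re

# Non-greedy match so each <post>...</post> pair is captured separately;
# DOTALL lets a post span several lines.
POST_RE = re.compile(r'<post>(.*?)</post>', re.DOTALL)

def extract_posts(contents):
    """Return the stripped, entity-decoded text of every <post> element."""
    return [html.unescape(body).strip() for body in POST_RE.findall(contents)]

def write_posts_csv(rows, csv_path):
    """Write (file name, post text) pairs to one CSV file."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['file', 'post'])  # header row (assumed layout)
        writer.writerows(rows)

# Made-up sample input containing an entity ElementTree would reject.
sample = ("<Blog>\n<date> some date </date>\n"
          "<post> first &amp; second </post>\n"
          "<post>R&D&nbsp;notes</post>\n</Blog>")
print(extract_posts(sample))
```

The rows for `write_posts_csv` could be built by calling `extract_posts` on each file's contents inside the directory loop. Like the hand-written search, this treats the file as plain text, so it would not handle nested `<post>` tags or tags with attributes.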