Parsing non-standard XML (CDATA tag)


Problem description


When I tried to parse an XML document in Python using the BeautifulSoup library, I ran into a problem. The XML document that I want to parse:

<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>

As you can see above, the <link /> tag is a little strange. In my opinion, that (the <link /> tag) is not standard XML, right? How can I parse this terrible form?

Solution

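You don't need BeautifulStoneSoup or lxml. Python's included batteries do the job just fine, and there doesn't seem to be anything non-compliant about your XML: <link /> is simply an empty element, and the CDATA section that follows it is ordinary character data inside <item>, which ElementTree exposes as the tail of the link element.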

>>> content='''
... <item>
... <title><![CDATA[Title Sample]]></title>
... <link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
... <time_start>2011-10-10 09:00:00</time_start>
... <time_end>2011-10-17 09:00:00</time_end>
... <price_original>35000</price_original>
... <price_now>20000</price_now>
... </item>'''
>>> # Python 3: cElementTree was removed in 3.9, and print is a function
>>> import xml.etree.ElementTree as et
>>> foo = et.XML(content)
>>> for e in foo:
...     print(e.tag, e.text, repr(e.tail))
...
title Title Sample '\n'
link None 'http://banhada.kr/?cateCode=09&viewCode=S0941580\n'
time_start 2011-10-10 09:00:00 '\n'
time_end 2011-10-17 09:00:00 '\n'
price_original 35000 '\n'
price_now 20000 '\n'
>>>
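
Building on the session above, here is a minimal sketch of one way to collect the fields into a dictionary. It is not part of the original answer: the helper name item_to_dict and the rule "use the tail when an element has no text of its own" are assumptions made for illustration, and that rule only suits documents shaped like this one.

```python
import xml.etree.ElementTree as ET

# Same sample document as in the question.
CONTENT = """
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>"""


def item_to_dict(xml_text):
    """Map each child tag to its text, falling back to its tail (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        text = (child.text or "").strip()
        if not text:
            # Empty element such as <link />: the CDATA section that follows
            # it is stored in .tail rather than .text.
            text = (child.tail or "").strip()
        fields[child.tag] = text
    return fields


fields = item_to_dict(CONTENT)
print(fields["link"])
```

The fallback assumes that an element with no text of its own is immediately followed by its value, as <link /> is here; a feed that mixes both conventions would need a stricter rule.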
