XML parsing in Python using BeautifulSoup


Question

Thank you in advance for your attention to my question.

I ran into some problems while parsing an XML document in Python with the BeautifulSoup library.

The XML document that I want to parse looks like this:

<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>

As you can see above, the <link /> tag is a little strange. In my opinion, that (the <link /> tag) is not standard XML, right? (If I am wrong, please let me know.)

Anyway, I have to parse this format, since my customer sends it like that and they can't change it.

How can I parse this terrible format? Please let me know the best way to solve this problem.

Thank you, gurus.

Recommended answer

You could use BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20XML) to parse XML:

import bs4 as bs
content='''\
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>'''    

soup = bs.BeautifulSoup(content, 'xml')

title = soup.title
print(title.string)
# Title Sample

# The URL is a text node that follows the empty <link /> element
link = soup.link.next_sibling
print(link.strip())
# http://banhada.kr/?cateCode=09&viewCode=S0941580

Under the hood, BeautifulSoup uses lxml for parsing XML. Although it's not needed here, you might want to use lxml directly, since it gives you more succinct ways to navigate through XML using XPath:

import lxml.etree as ET

content='''\
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>'''    

doc = ET.fromstring(content)

title = doc.find('title')
print(title.text)
# Title Sample

# <link /> is empty, so the URL is stored as the element's tail text
link = doc.find('link')
print(link.tail.strip())
# http://banhada.kr/?cateCode=09&viewCode=S0941580
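
If you need every field rather than just the title and link, the same text/tail idea can be wrapped in a small helper. The following is only a sketch: it swaps in the standard library's xml.etree.ElementTree instead of lxml, the name item_to_dict is made up for illustration, and it assumes that an element with no text of its own (like <link /> here) carries its value in the tail text that follows it.

```python
import xml.etree.ElementTree as ET

content = '''\
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>'''


def item_to_dict(item):
    """Map each child tag to its text (hypothetical helper).

    Falls back to the tail text when the element itself is empty,
    which is the case for <link /> in this feed.
    """
    fields = {}
    for child in item:
        text = (child.text or '').strip()
        if not text:
            # Empty element: assume the value sits after the tag, in .tail
            text = (child.tail or '').strip()
        fields[child.tag] = text
    return fields


fields = item_to_dict(ET.fromstring(content))
print(fields['title'])
print(fields['link'])
```

All values come back as strings, so the prices and timestamps would still need converting (for example with int() or datetime.strptime) if you want typed data.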
