XPath for img src within an element
Question
How would I modify the code below so that it picks out the source of any images found within the description element, which contains HTML? At the moment it just gets the full text from inside the element, and I'm not sure how to modify it to get the src of any img tags.
>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text
At the very end I also tried:
print(description.xpath("//img/@src"))
which gives me nothing.
The XML structure is:
<guides>
  <guide>
    <id>guide 1</id>
    <group>
      <id></id>
      <type></type>
      <name></name>
    </group>
    <pages>
      <page>
        <id>page 1</id>
        <name></name>
        <description>&lt;p&gt;Some text. &lt;br /&gt;&lt;img width="81" src="http://www.example.com/img.jpg" alt="wave" height="63" style="float: right;" /&gt;&lt;/p&gt;</description>
        <boxes>
          <box>
            <id></id>
            <name></name>
            <type></type>
            <map_id></map_id>
            <column></column>
            <position></position>
            <hidden></hidden>
            <created></created>
            <updated></updated>
            <assets>
              <asset>
                <id></id>
                <name></name>
                <type></type>
                <description>&lt;img src="https://www.example.com/image.jpg" alt="image" height="42" width="42"&gt;</description>
                <url/>
                <owner>
                  <id></id>
                  <email></email>
                  <first_name></first_name>
                  <last_name></last_name>
                </owner>
              </asset>
            </assets>
          </box>
        </boxes>
      </page>
    </pages>
  </guide>
</guides>
Recommended answer
The content of the description element is HTML stored as text, not as child XML elements. There are various ways of parsing it; one of them is the html module from lxml.
>>> description.text
'<img src="https://www.example.com/image.jpg" alt="image" height="42" width="42">'
>>> from lxml import html
>>> img = html.fromstring(description.text)
>>> img.attrib['src']
'https://www.example.com/image.jpg'
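This also suggests why the description.xpath("//img/@src") attempt in the question came back empty: as far as the XML tree is concerned, the description element has text but no img child to match. The following is a minimal sketch of that difference, assuming the markup is stored escaped (as the description.text output above implies); the one-element document and the xml_text name are made up for illustration.

```python
from lxml import etree, html

# Hypothetical one-asset document; the HTML is escaped, so it is plain text to the XML parser.
xml_text = (
    '<asset><description>'
    '&lt;img src="https://www.example.com/image.jpg" alt="image"&gt;'
    '</description></asset>'
)
description = etree.fromstring(xml_text).find('description')

# The XML tree sees no child elements under description, only text.
print(len(description))
print(description.xpath('.//img/@src'))

# Parsing the text as HTML produces a tree that XPath can search.
fragment = html.fromstring(description.text)
# descendant-or-self also matches when the img is the fragment's root element.
print(fragment.xpath('descendant-or-self::img/@src'))
```

Using descendant-or-self rather than .//img matters here because html.fromstring returns the img element itself when the fragment consists of a single tag.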
Edit in response to a comment:
>>> from lxml import etree, html
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', html.fromstring(description.text).attrib['src']
...
('---', 'guide 1')
('------', 'page 1')
('---------', 'https://www.example.com/image.jpg')
Handling the exception raised when the parsed element has no src attribute.
Replace
    '---------', html.fromstring(description.text).attrib['src']
with
    try:
        '---------', html.fromstring(description.text).attrib['src']
    except KeyError:
        '--------- No image URL present'
Edit, responding to the 9 Nov comment:
from lxml import etree, html
tree = etree.parse('guides.xml')
for guide in tree.xpath('guide'):
    print('---', guide.xpath('id')[0].text)
    for pages in guide.xpath('.//pages'):
        for page in pages:
            print('------', page.xpath('id')[0].text)
            for description in page.xpath('.//asset/description'):
                try:
                    print('---------', html.fromstring(description.text).attrib['src'])
                except (TypeError, KeyError):
                    # TypeError: description has no text; KeyError: no src attribute
                    print('--------- no src identifiable')
Output for an XML file in which the 2nd guide element contains no HTML at all and the 3rd contains HTML without a src attribute:
--- guide 1
------ page 1
--------- https://www.example.com/image.jpg
--- guide 2
------ page 1
--------- no src identifiable
--- guide 3
------ page 1
--------- no src identifiable
--- guide 4
------ page 1
--------- https://www.example.com/image.jpg
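The loops above read src from the root of each parsed fragment, which works when the fragment is a single img tag. For fragments where the image is nested (for example inside a p) or where there are several images, one possible variation is to collect every src with XPath and return an empty list when nothing is found. This is only a sketch: image_sources is a hypothetical helper, and the inline XML is a cut-down stand-in for the real file with the markup assumed to be escaped.

```python
from lxml import etree, html

def image_sources(text):
    """Return every img src found in an HTML fragment (hypothetical helper)."""
    if not text or not text.strip():
        return []  # the element had no text at all
    try:
        fragment = html.fromstring(text)
    except etree.ParserError:
        return []  # lxml could not build a fragment from the text
    # descendant-or-self matches a root-level img as well as nested ones
    return [str(src) for src in fragment.xpath('descendant-or-self::img/@src')]

# Cut-down sample data (an assumption, not the original file).
xml_text = """<guides>
  <guide><id>guide 1</id><pages><page><id>page 1</id>
    <description>&lt;p&gt;Some text. &lt;img src="http://www.example.com/img.jpg" /&gt;&lt;/p&gt;</description>
    <assets><asset><description>&lt;img src="https://www.example.com/image.jpg"&gt;</description></asset>
    <asset><description></description></asset>
    <asset><description>plain text, no markup</description></asset></assets>
  </page></pages></guide>
</guides>"""

root = etree.fromstring(xml_text)
# .//description visits the page-level and asset-level descriptions in document order
results = [image_sources(d.text) for d in root.xpath('.//description')]
for srcs in results:
    print(srcs or 'no src identifiable')
```

Returning a list instead of raising keeps the calling loop free of try/except blocks, at the cost of no longer distinguishing "no text" from "no img tag".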