xpath for img src within element


Question

How would I modify the code below so that it picks out the source of any images found within the description element, which contains HTML? At the moment it just gets the full text from inside the element, and I'm not sure how to modify it to get the src of any img tags.

>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text

I have also tried this at the end:

print(description.xpath("//img/@src"))

This gives me nothing.

The XML structure is:

<guides>
<guide>
    <id>guide 1</id>
    <group>
    <id></id> 
    <type></type>
    <name></name>
    </group>
    <pages>
        <page>
            <id>page 1</id>
            <name></name>
            <description>&lt;p&gt;Some text. &lt;br /&gt;&lt;img 
            width=&quot;81&quot; 
            src=&quot;http://www.example.com/img.jpg&quot; 
             alt=&quot;wave&quot; height=&quot;63&quot; style=&quot;float: 
              right;&quot; /&gt;&lt;/p&gt;</description>
            <boxes>
                <box>
                    <id></id>
                    <name></name>
                    <type></type>
                    <map_id></map_id>
                    <column></column>
                    <position></position>
                    <hidden></hidden>
                    <created></created>
                    <updated></updated>
                    <assets>
                        <asset>
                            <id></id>
                            <name></name>
                            <type></type>
                       <description>&lt;img src=&quot;https://www.example.com/image.jpg&quot; alt=&quot;image&quot; height=&quot;42&quot; width=&quot;42&quot;&gt;</description>
                            <url/>
                            <owner>
                                <id></id>
                                <email></email>
                                <first_name></first_name>
                                <last_name></last_name>
                            </owner>
                        </asset>
                    </assets>
                </box>
            </boxes>
        </page>
    </pages>
</guide>
</guides>

Recommended answer

The content of the description element is HTML. There are various ways of parsing it; one of them is the html module from lxml.

>>> description.text
'<img src="https://www.example.com/image.jpg" alt="image" height="42" width="42">'
>>> from lxml import html
>>> img = html.fromstring(description.text)
>>> img.attrib['src']
'https://www.example.com/image.jpg'
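The reason the XPath attempt in the question finds nothing can be sketched as follows. The markup inside description is entity-escaped, so the XML parser sees it as one text node rather than as elements, and there is no img element in the XML tree to match. The trimmed-down document and the variable names below are assumptions made for illustration, not part of the original file.

```python
from lxml import etree, html

# Hypothetical, trimmed-down version of the document, for illustration only.
xml = (
    "<guides><guide><pages><page><boxes><box><assets><asset>"
    "<description>&lt;img src=&quot;https://www.example.com/image.jpg&quot;"
    " alt=&quot;image&quot;&gt;</description>"
    "</asset></assets></box></boxes></page></pages></guide></guides>"
)
root = etree.fromstring(xml)
description = root.xpath(".//asset/description")[0]

# The escaped markup is a single text node, so description has no child
# elements and the XML tree contains no img element at all.
child_count = len(description)
xpath_on_xml = description.xpath("//img/@src")

# Parsing the text as HTML yields a real img element with attributes.
src_from_html = html.fromstring(description.text).get("src")

print(child_count, xpath_on_xml, src_from_html)
```

Note also that an expression starting with // searches from the document root, not from the description element, so even with unescaped markup it would not be limited to that element.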

Edit, in response to a comment:

>>> from lxml import etree, html
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', html.fromstring(description.text).attrib['src']
... 
('---', 'guide 1')
('------', 'page 1')
('---------', 'https://www.example.com/image.jpg')


Handling exceptions.

Replace

'---------', html.fromstring(description.text).attrib['src']

with

try:
    '---------', html.fromstring(description.text).attrib['src']
except KeyError:
    '--------- No image URL present'
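As a possible alternative to catching KeyError, the element's get() method returns None when an attribute is missing. The sketch below assumes two made-up snippets, one with a src attribute and one without; it is meant only to show the difference in behaviour, not to replace the approach above.

```python
from lxml import html

# Hypothetical snippets: one img with a src attribute and one without.
snippets = [
    '<img src="https://www.example.com/image.jpg" alt="image">',
    '<img alt="image">',
]

results = []
for snippet in snippets:
    element = html.fromstring(snippet)
    # get() returns None when the attribute is absent, so no KeyError is raised.
    src = element.get("src")
    results.append(src if src is not None else "No image URL present")

for line in results:
    print("---------", line)
```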

Edit, responding to the 9 Nov comment:

from lxml import etree, html

tree = etree.parse('guides.xml')
for guide in tree.xpath('guide'):
    print('---', guide.xpath('id')[0].text)
    for pages in guide.xpath('.//pages'):
        for page in pages:
            print('------', page.xpath('id')[0].text)
            for description in page.xpath('.//asset/description'):
                try:
                    print('---------', html.fromstring(description.text).attrib['src'])
                # TypeError: description.text is None (the element is empty).
                # KeyError: the parsed element has no src attribute.
                except (TypeError, KeyError):
                    print('--------- no src identifiable')

Output for an XML file in which the 2nd guide element contains no HTML at all and the 3rd contains HTML without a src attribute:

--- guide 1
------ page 1
--------- https://www.example.com/image.jpg
--- guide 2
------ page 1
--------- no src identifiable
--- guide 3
------ page 1
--------- no src identifiable
--- guide 4
------ page 1
--------- https://www.example.com/image.jpg
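The code above reads src only from the root element of the parsed snippet, so an image nested inside other markup (such as the paragraph in the page-level description of the sample XML) would be reported as having no src. A minimal sketch for collecting every image source in a snippet follows; image_sources is a hypothetical helper name, and wrapping the snippet in a div via fragment_fromstring is one option among several.

```python
from lxml import html


def image_sources(text):
    """Return the src of every img element found in an HTML snippet."""
    if not text or not text.strip():
        return []  # the element was empty or held only whitespace
    # Wrapping the snippet in a div gives a single root element even when
    # the snippet is plain text or has several top-level elements.
    wrapper = html.fragment_fromstring(text, create_parent="div")
    # img elements without a src attribute contribute nothing to the list.
    return [str(src) for src in wrapper.xpath(".//img/@src")]


nested = (
    '<p>Some text. <br /><img width="81" '
    'src="http://www.example.com/img.jpg" alt="wave" height="63" /></p>'
)
print(image_sources(nested))
print(image_sources("plain text, no markup"))
print(image_sources(None))
```

Inside the loop, image_sources(description.text) could then stand in for the try/except block, with an empty list meaning that no src was identifiable.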
