How to find all guide IDs and pages with IMG tags in an XML export with lxml/xpath?

Problem description

How can I parse the XML below to find, for each GUIDE, its ID and URL, and then, for each PAGE inside that GUIDE, the page ID and any images that appear inside BOXES / BOX / ASSETS / DESCRIPTION? The images are embedded as HTML, so I need to grab the source (src) of each image.

  <guide>
    <id></id>
   <url></url>
  <group>
   <id></id> 
<type></type>
<name></name>
   </group>
   <pages>
    <page>
 <id></id>
 <name></name>
 <description></description>
 <boxes>
  <box>
   <id></id>
   <name></name>
   <type></type>
   <map_id></map_id>
   <column></column>
   <position></position>
   <hidden></hidden>
   <created></created>
   <updated></updated>
   <assets>
    <asset>
     <id></id>
     <name></name>
     <type></type>
     <description></description>
     <url/>
     <owner>
      <id></id>
      <email></email>
      <first_name></first_name>
      <last_name></last_name>
     </owner>
    </asset>
      </assets>
     </box>
    </boxes>
   </page>
   </pages>
    </guide>

This gives me the pages with their IDs and descriptions, but what I need are the descriptions inside the asset elements, together with the guide and page each one belongs to.

from lxml import etree
tree = etree.parse('temp.xml')
for page in tree.xpath('.//page'):
    # print each page's ID and its own (page-level) description
    print(page.xpath('id')[0].text, page.xpath('description')[0].text)

Recommended answer

The pattern of the code is probably similar to the following, but I can't verify it against your data because I don't have your full XML.

>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text
... 
('---', 'guide 1')
('------', 'page 1')
('---------', 'description')

I assumed that your XML would contain multiple guide elements, so I wrapped them in a guides root element. This is what I parsed:

<guides>
    <guide>
        <id>guide 1</id>
        <url></url>
        <group>
        <id></id> 
        <type></type>
        <name></name>
        </group>
        <pages>
            <page>
                <id>page 1</id>
                <name></name>
                <description></description>
                <boxes>
                    <box>
                        <id></id>
                        <name></name>
                        <type></type>
                        <map_id></map_id>
                        <column></column>
                        <position></position>
                        <hidden></hidden>
                        <created></created>
                        <updated></updated>
                        <assets>
                            <asset>
                                <id></id>
                                <name></name>
                                <type></type>
                                <description>description</description>
                                <url/>
                                <owner>
                                    <id></id>
                                    <email></email>
                                    <first_name></first_name>
                                    <last_name></last_name>
                                </owner>
                            </asset>
                        </assets>
                    </box>
                </boxes>
            </page>
        </pages>
    </guide>
</guides>
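The question also asks for the image sources inside each asset description, which the loop above does not extract. A minimal sketch of one way to do that follows. It assumes the HTML is stored as escaped text (or CDATA) in the description element, so it has to be parsed a second time with lxml.html. The sample data, the file names a.png and b.jpg, the example.com URL, and the helper name find_images are all made up for illustration.

```python
from lxml import etree, html

# Hypothetical sample: the asset description holds escaped HTML (an assumption).
SAMPLE = b"""<guides>
  <guide>
    <id>guide 1</id>
    <url>http://example.com/guide1</url>
    <pages>
      <page>
        <id>page 1</id>
        <boxes><box><assets><asset>
          <description>&lt;p&gt;&lt;img src="a.png"/&gt; text &lt;img src="b.jpg"&gt;&lt;/p&gt;</description>
        </asset></assets></box></boxes>
      </page>
    </pages>
  </guide>
</guides>"""


def find_images(root):
    """Return (guide id, guide url, page id, img src) tuples."""
    rows = []
    for guide in root.iter('guide'):
        guide_id = guide.findtext('id')
        guide_url = guide.findtext('url')
        for page in guide.xpath('pages/page'):
            page_id = page.findtext('id')
            for desc in page.xpath('boxes/box/assets/asset/description'):
                # Skip empty descriptions; lxml.html cannot parse blank input.
                if not (desc.text and desc.text.strip()):
                    continue
                fragment = html.fromstring(desc.text)
                # iter() visits the fragment itself and all of its descendants.
                for img in fragment.iter('img'):
                    src = img.get('src')
                    if src:
                        rows.append((guide_id, guide_url, page_id, src))
    return rows


root = etree.fromstring(SAMPLE)
for row in find_images(root):
    print(row)
```

If the export instead embeds the img elements as real XML children of description, the second parse is unnecessary and something like desc.xpath('.//img/@src') should be enough.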

I made life easier for myself by indenting the XML so that I could discern its structure.
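The answer does not say how the indenting was done. One possible way, sketched here as an assumption rather than the method actually used, is to let lxml re-serialize the document with pretty_print=True. The RAW string below is a made-up, shortened stand-in for the export.

```python
from lxml import etree

# Hypothetical, shortened stand-in for the real export.
RAW = b"<guide><id>guide 1</id><pages><page><id>page 1</id></page></pages></guide>"

# Dropping ignorable whitespace first lets pretty_print re-indent consistently.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(RAW, parser)  # for a file: etree.parse('temp.xml', parser)

pretty = etree.tostring(root, pretty_print=True, encoding='unicode')
print(pretty)
```

Each nesting level then appears on its own indented line, which makes it easier to see which XPath steps are needed.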
