How do I pick up text values of child nodes when parsing XML with ElementTree


Question

I have an XML shopping feed with a bunch of products (see below). If I were parsing this with Beautiful Soup to create a pandas DataFrame, I'd use something like this:

import pandas as pd
from bs4 import BeautifulSoup

def parse_shopping_feed(feed_xml):
    #response = requests.get(feed_url)
    soup = BeautifulSoup(feed_xml, "xml")
    all_products = []
    for item in soup.find_all("item"):
        new_product = {
            "id": item.id.string,
            "title": item.title.string,
            "description": item.description.string,
            "google_product_category": item.google_product_category.string,
            # item.product_type is None when the tag is missing; `"product_type" in item`
            # only compares against the tag's direct children, so it never matched
            "product_type": item.product_type.string if item.product_type is not None else "",
            "link": item.link.string,
            "availability": item.availability.string,
            "price": item.price.string,
            "brand": item.brand.string
        }
        all_products.append(new_product)
    feed_df = pd.DataFrame(all_products)
    return feed_df

Beautiful Soup is too slow for one of these feeds (around 300 MB), so I have started looking at ElementTree, which is supposed to be faster. However, I can't for the life of me figure out how I would recreate this code with ET.

How do I loop through all of the item tags and grab their ID and title, for example?

My current best guess is something like this, but I don't get how to pick up each ID and title.

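import xml.etree.ElementTree as ET

# With events=None, iterparse reports only "end" events, for every element in the feed
with open('shopping_feed.xml', 'rb') as xml_file:
    for event, element in ET.iterparse(xml_file, events=None):
        for child in element:
            print(child)
        element.clear()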

Any suggestions? To be clear, my end goal is the DataFrame, so if there's a way to convert the feed directly, that'd be great!

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
    <title>Feed XYZ</title>
    <description></description>
    <link></link>
    <item>
        <g:id>10000005</g:id>
        <title><![CDATA[TEst Item XYZ                                           ]]></title>
        <g:google_product_category>Food and stuff</g:google_product_category>
        <g:product_type><![CDATA[Details &gt; Food and stuff]]></g:product_type>
        <g:adwords_grouping><![CDATA[Food and stuff]]></g:adwords_grouping>
        <link>https://www.abc.se/abc/abc</link>
        <g:image_link>https://www.abc.se/bilder/artiklar/10000005.jpg</g:image_link>
        <g:additional_image_link>https://www.abc.se/bilder/artiklar/zoom/10000005_1.jpg</g:additional_image_link>
        <g:condition>new</g:condition>
        <g:availability>out of stock</g:availability>
        <g:price>155 SEK</g:price>
        <g:buyprice>68.00</g:buyprice>
        <g:brand>ABC</g:brand>
        <g:gtin>8003299920846</g:gtin>
        <g:mpn>ABC01 AZ</g:mpn> 
        <g:weight>0 g</g:weight> 
        <g:item_group_id>10000008r</g:item_group_id>
        <g:color>Blue</g:color>
//100s of thousands of products

Recommended answer

Found a solution using lxml, whose etree module mirrors most of the ElementTree API:

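import lxml.etree as et
import pandas as pd

# Read the feed as bytes: lxml rejects str input that carries an encoding declaration
with open('feed.xml', 'rb') as xml_file:
    data = et.fromstring(xml_file.read())
items = data.xpath('//item')

all_products = []
# Tags written with the g: prefix are addressed as {namespace-uri}tag
prefix = "{http://base.google.com/ns/1.0}"
for item in items:
    # findtext returns the default instead of raising when a child tag is missing
    new_product = {
        "id": item.findtext(prefix + 'id', default=""),
        "title": item.findtext('title', default=""),
        "google_product_category": item.findtext(prefix + 'google_product_category', default=""),
        "product_type": item.findtext(prefix + 'product_type', default=""),
        "link": item.findtext('link', default=""),
        "availability": item.findtext(prefix + 'availability', default=""),
        "price": item.findtext(prefix + 'price', default=""),
        "brand": item.findtext(prefix + 'brand', default="")
    }
    all_products.append(new_product)

# The end goal from the question: one row per product
feed_df = pd.DataFrame(all_products)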
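
For a feed of around 300 MB, a streaming variant of the same idea may be worth trying. The following is a minimal sketch, not the answerer's method, that uses the standard library's xml.etree.ElementTree.iterparse so the feed is read item by item instead of being loaded in one go. The names iter_products and G and the small inline SAMPLE feed are illustrative assumptions, and only a few of the fields are shown.

```python
import io
import xml.etree.ElementTree as ET

# Namespace bound to the g: prefix in the feed's <rss> element
G = "{http://base.google.com/ns/1.0}"

def iter_products(source):
    """Yield one dict per <item>, reading the feed as a stream."""
    # "end" events fire once an element and all of its children are complete
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        row = {
            "id": elem.findtext(G + "id", default=""),
            "title": elem.findtext("title", default="").strip(),
            "link": elem.findtext("link", default=""),
            "price": elem.findtext(G + "price", default=""),
            "brand": elem.findtext(G + "brand", default=""),
        }
        elem.clear()  # free the item's children once they have been read
        yield row

# Tiny in-memory feed standing in for shopping_feed.xml (illustrative data only)
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
    <title>Feed XYZ</title>
    <item>
        <g:id>10000005</g:id>
        <title><![CDATA[TEst Item XYZ   ]]></title>
        <link>https://www.abc.se/abc/abc</link>
        <g:price>155 SEK</g:price>
        <g:brand>ABC</g:brand>
    </item>
</channel>
</rss>"""

products = list(iter_products(io.BytesIO(SAMPLE)))
print(products)
```

The resulting list of dicts can be passed to pd.DataFrame(...) exactly as in the question's code. Note that elem.clear() empties each item but leaves the empty element attached to its parent, so memory use still grows slowly with the number of items.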

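As an alternative to concatenating the {namespace-uri} prefix by hand, the find-style methods in ElementTree (and lxml) accept a namespaces mapping, so child tags can be addressed as g:id, matching how they appear in the feed. This is a small sketch on made-up sample data; NS, SAMPLE, and rows are illustrative names, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI, as declared on the feed's <rss> element
NS = {"g": "http://base.google.com/ns/1.0"}

# Two-item feed for illustration; the second item has no availability tag
SAMPLE = """<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
    <item>
        <g:id>10000005</g:id>
        <title>Test item</title>
        <g:availability>out of stock</g:availability>
    </item>
    <item>
        <g:id>10000006</g:id>
        <title>Another item</title>
    </item>
</channel>
</rss>"""

root = ET.fromstring(SAMPLE)
rows = []
for item in root.iter("item"):
    rows.append({
        # "g:id" is resolved through NS to {http://base.google.com/ns/1.0}id
        "id": item.findtext("g:id", default="", namespaces=NS),
        "title": item.findtext("title", default="", namespaces=NS),
        # A missing child yields the default rather than raising AttributeError
        "availability": item.findtext("g:availability", default="", namespaces=NS),
    })
print(rows)
```

Using findtext with a default also covers optional tags such as product_type, which the question's Beautiful Soup code treated as possibly absent.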