Parsing text from XML node in Python

Problem description

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The structure looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
    <priority>0.0</priority>
</url>

I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but struggling to get it working right.

Per the documentation, I'm trying something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()

value = root.findall(".//loc")

However, nothing gets loaded into value. My goal is to extract all of the URLs between the loc nodes and write them out to a new flat file. Where am I going wrong?

Recommended answer

We can iterate through the URLs, toss them into a list and write them to a file as such:

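from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# Every element in the sitemap belongs to this default namespace,
# so each tag name has to be qualified with it ({uri}tag form).
name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

# Collect the text of each <loc> inside every <url> block under the root.
urls = []
for block in root.findall('{}url'.format(name_space)):
    for url in block.findall('{}loc'.format(name_space)):
        urls.append('{}\n'.format(url.text))

# Write one URL per line to a flat file.
with open('sample_urls.txt', 'w') as f:
    f.writelines(urls)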




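  • Note that each tag name must be prefixed with the namespace declared in the opening urlset element for the lookups to match. This is also why the unqualified findall(".//loc") in the question returns an empty list.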
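
As a possible alternative, findall also accepts a prefix-to-URI mapping as its second argument, which avoids formatting the namespace into every tag name. The following is only a sketch: the sm prefix, the extract_locs helper, and the small inline sample are assumptions made for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix: any name works as long as it maps to the sitemap URI.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def extract_locs(root):
    """Return the text of every <loc> element in the sitemap namespace."""
    return [loc.text for loc in root.findall('.//sm:loc', NS)]


# Small inline sample standing in for the real sitemap file.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
</urlset>"""

root = ET.fromstring(sample)
unqualified = root.findall('.//loc')  # no namespace given, so nothing matches
qualified = extract_locs(root)        # namespace-aware lookup
print(len(unqualified), qualified)
```

The same helper should work on the root returned by ET.parse(...).getroot() for the saved file; the resulting list can then be written out one URL per line as in the answer above.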
