Parsing text from XML node in Python

Views: 105 Published: 2020/10/28 20:36:27 Tags: python xml python-3.x elementtree

Problem description

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped the .xml.gz file and saved it as an .xml file. The structure looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
    <priority>0.0</priority>
</url>
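
The decompression step described above can also be done in Python instead of by hand. The following is a minimal sketch, assuming the downloaded file is stored locally; the helper name parse_gzipped_xml and the file name in the usage comment are illustrative and not part of the original question.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_gzipped_xml(path):
    """Parse a gzip-compressed XML file and return its root element.

    Hypothetical helper for illustration; `path` is assumed to point
    at a local .xml.gz file.
    """
    # gzip.open returns a binary file object, which ET.parse accepts directly
    with gzip.open(path, 'rb') as f:
        return ET.parse(f).getroot()


# Usage (assumes the sitemap was downloaded to the working directory):
# root = parse_gzipped_xml('sitemap_c_0.xml.gz')
```

This avoids keeping a second, uncompressed copy of the sitemap on disk.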

I'm attempting to use ElementTree to extract all of the URLs within the loc nodes throughout this file, but I'm struggling to get it working right.

Per the documentation, I'm trying something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()
value = root.findall(".//loc")

However, nothing gets loaded into value. My goal is to extract all of the URLs from the loc nodes and write them out to a new flat file. Where am I going wrong?

Recommended answer

We can iterate through the URLs, collect them in a list, and write them to a file like this:

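from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# Namespace from the urlset element, in the {uri} form ElementTree expects
name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

urls = []
for child in root.iter():
    # Find the url blocks directly under the current element
    for block in child.findall('{}url'.format(name_space)):
        # Collect the text of each loc node, one URL per line
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))

# Write the collected URLs to a flat text file
with open('sample_urls.txt', 'w+') as f:
    f.writelines(urls)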

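Note that we need to prepend the namespace declared in the opening urlset element to each tag name in order to parse the XML correctly. The urlset element sets a default namespace, so every child element belongs to it, and an unqualified search such as ".//loc" matches nothing.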
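
The namespace point can be illustrated with a short, self-contained sketch. It assumes a small inline sample that mirrors the sitemap excerpt from the question, and it uses the namespaces argument of findall with a prefix map instead of string formatting. The prefix sm is an arbitrary label chosen for this example, not something defined by the sitemap itself.

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the structure shown in the question (assumed data)
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
</urlset>"""

# 'sm' is a hypothetical prefix; only the URI has to match the document
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

root = ET.fromstring(SAMPLE)

# Without a namespace the search finds nothing, as in the question
unqualified = root.findall('.//loc')

# With the prefix map, the loc nodes are matched
urls = [loc.text for loc in root.findall('.//sm:loc', NS)]

print(len(unqualified))  # 0
print(len(urls))         # 2
```

A single path expression replaces the nested loops here, which may be easier to read when only the loc values are needed.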
