Scraping XML data with BS4 "lxml"


Problem description

I'm trying to solve a problem very similar to this one:

Scraping XML element attributes with beautifulsoup

I have the following code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.usda.gov/oce/commodity/wasde/latest.xml')
data = r.text
soup = BeautifulSoup(data, "lxml")
for ce in soup.find_all("Cell"):
    print(ce["cell_value1"])

The code runs without error but does not print any values to the terminal.

I want to extract the "cell_value1" data noted above for the whole page so I have something like this:

2468.58
3061.58
376.64
and so on...

The format of my XML file is the same as the sample in the solution to the question noted above, and I identified the specific attribute I want to scrape. Why are the values not printing to the terminal?

Solution

The problem is that you're parsing this file in HTML mode, which means the tags end up named 'cell' instead of 'Cell'. So, you could just search with 'cell'—but the right answer is to parse in XML mode.

To do this, just use 'xml' as your parser instead of 'lxml'. (It's a little non-obvious that 'lxml' means "lxml in HTML mode" and 'xml' means "lxml in XML mode", but it is documented.)

This is explained in Other parser problems:

Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you’ll need to parse the document as XML.
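
To see the case handling in isolation, here is a minimal sketch that parses the same markup with both parsers. The `sample` string is a made-up stand-in for the real file (only the `Cell` tag and `cell_value1` attribute names come from the question), and it assumes both bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-cell sample; the real WASDE file is much larger.
sample = (
    '<Report>'
    '<Cell cell_value1="2468.58"></Cell>'
    '<Cell cell_value1="3061.58"></Cell>'
    '</Report>'
)

html_soup = BeautifulSoup(sample, "lxml")  # lxml in HTML mode: tag names are lowercased
xml_soup = BeautifulSoup(sample, "xml")    # lxml in XML mode: case is preserved

# Count matches for each spelling under each parser.
html_upper = len(html_soup.find_all("Cell"))
html_lower = len(html_soup.find_all("cell"))
xml_upper = len(xml_soup.find_all("Cell"))
xml_lower = len(xml_soup.find_all("cell"))

print("HTML mode:", html_upper, "for 'Cell',", html_lower, "for 'cell'")
print("XML mode: ", xml_upper, "for 'Cell',", xml_lower, "for 'cell'")
```

In HTML mode only the lowercase search finds anything, which is why the original loop printed nothing and raised no error.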


Your code will still fail because of a second problem: some of the Cell nodes are empty and do not have a cell_value1 attribute, but you're trying to print it unconditionally.

So, what you want is something like this:

soup = BeautifulSoup(data, "xml")
for ce in soup.find_all("Cell"):
    try:
        print(ce["cell_value1"])
    except KeyError:
        pass
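
If you would rather avoid the try/except, two alternatives are sketched below: reading the attribute with `Tag.get`, which returns None when the attribute is missing, or asking `find_all` to return only tags that have the attribute. The `sample` string is again a hypothetical stand-in for the real file, with one empty Cell included on purpose.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with one empty Cell, mimicking the real file's gaps.
sample = (
    '<Report>'
    '<Cell cell_value1="2468.58"></Cell>'
    '<Cell></Cell>'
    '<Cell cell_value1="376.64"></Cell>'
    '</Report>'
)
soup = BeautifulSoup(sample, "xml")

# Option 1: Tag.get returns None instead of raising KeyError.
values_get = []
for ce in soup.find_all("Cell"):
    value = ce.get("cell_value1")
    if value is not None:
        values_get.append(value)

# Option 2: match only Cell tags that carry the attribute at all.
values_filtered = [
    ce["cell_value1"]
    for ce in soup.find_all("Cell", attrs={"cell_value1": True})
]

print(values_get)
print(values_filtered)
```

Both lists hold the same values; the second form keeps the filtering inside the search, which can read more cleanly when the loop body grows.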
