Python beautifulsoup extract value without identifier

Views: 55 Published: 2021/4/15 19:03:00 Tags: python regex beautifulsoup


Question

I am facing a problem and don't know how to solve it properly. I want to extract the price (130 € in both the first and the second example).

The problem is that the attributes change all the time. I am scraping hundreds of sites, and on each site the first two characters of the "id" attribute may differ, so I cannot do something like this:

tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'(07_content$)')})

Even something like this won't work, because nothing ties the match to the price, so I would probably get some other value:

tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'([0-9]{2}_content$)')})

Example HTML:

<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130  €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000  €</span>


<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130  €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000  €</span>

The only thing I can think of at the moment is to identify the price label with something like "text = 'Price:'", then get .next_sibling and extract the string. But I am not sure whether there is a better way to do it. Any suggestions? :-)
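
One detail about the .next_sibling idea is worth checking before relying on it. The sketch below assumes the markup looks like the example above, with a line break between the two spans, and uses the built-in html.parser. In that case the plain .next_sibling attribute returns the whitespace text node between the tags, while find_next_sibling('span') skips ahead to the next span tag. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130  €</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label span by its exact text.
label = soup.find("span", string="Price:")

# .next_sibling is an attribute; here it is the newline between the two tags.
raw_sibling = label.next_sibling

# find_next_sibling("span") ignores text nodes and returns the next <span> tag.
tag_sibling = label.find_next_sibling("span")

print(repr(raw_sibling))
print(tag_sibling.get_text())
```

If the two spans were written with no whitespace between them, .next_sibling would return the content span directly, which is why the tag-based lookup is the more predictable choice across many sites.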

Recommended answer

Here is how to extract only the price values, along the lines you had in mind in your original post.

html = """
        <span id="07_lbl" class="lbl">Price:</span>
        <span id="07_content" class="content">130  €</span>
        <span id="08_lbl" class="lbl">Value:</span>
        <span id="08_content" class="content">90000  €</span>


        <span id="03_lbl" class="lbl">Price:</span>
        <span id="03_content" class="content">130  €</span>
        <span id="04_lbl" class="lbl">Value:</span>
        <span id="04_content" class="content">90000  €</span>
"""

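from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')

# Match the label spans by their exact text (string= is the current name of the older text= argument).
price_labels = soup.find_all('span', string='Price:')
for element in price_labels:
    # find_next_sibling('span') skips the whitespace text node between the tags;
    # the plain .next_sibling attribute would return that whitespace instead.
    price_value = element.find_next_sibling('span')
    print(price_value.get_text())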

# It prints:
# 130  €
# 130  €

This solution needs less code and, IMO, is clearer.
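
If the content span is not always the very next sibling, a variation is to use the ids themselves as the link between label and value. This is only a sketch, and it assumes that every site keeps the pairing shown in the example HTML, where a label with id "NN_lbl" belongs to the value with id "NN_content". The helper name extract_by_label is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130  €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000  €</span>

<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130  €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000  €</span>
"""


def extract_by_label(soup, label):
    """Return the text of each NN_content span whose NN_lbl span shows `label`."""
    results = []
    for lbl in soup.find_all("span", id=re.compile(r"_lbl$")):
        if lbl.get_text(strip=True) != label:
            continue
        # Assumption: swapping the "_lbl" suffix for "_content" gives the value's id.
        content_id = lbl["id"][: -len("_lbl")] + "_content"
        content = soup.find("span", id=content_id)
        if content is not None:
            results.append(content.get_text(strip=True))
    return results


soup = BeautifulSoup(html, "html.parser")
prices = extract_by_label(soup, "Price:")
print(prices)
```

This avoids guessing the numeric prefix up front: the prefix is read from whichever label carries the text "Price:", so it can differ from site to site.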
