Having problems understanding BeautifulSoup filtering

Question

Could someone please explain how the filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all `class="g"` elements to grabbing just the items of interest in that specific div, but I just get None returns or nothing printed.

Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.

I have attempted stepping through the divs and grabbing the relevant fields:

 soup = BeautifulSoup(response.text)   

 main = soup.find('div', {'class': 'srg'})
 result = main.find('div', {'class': 'g'})
 data = result.find('div', {'class': 's'})
 data2 = data.find('div')
 for item in data2:
     site = item.find('cite')
     comment = item.find('span', {'class': 'st'})

 print site
 print comment

I have also attempted stepping into the initial div and finding all:

 soup = BeautifulSoup(response.text) 

 s = soup.findAll('div', {'class': 's'})

 for result in s:
     site = result.find('cite')
     comment = result.find('span', {'class': 'st'})

 print site
 print comment

Test data

<div class="srg">
    <div class="g">
    <div class="g">
    <div class="g">
    <div class="g">
        <!--m-->
        <div class="rc" data="30">
            <div class="s">
                <div>
                    <div class="f kv _SWb" style="white-space:nowrap">
                        <cite class="_Rm">http://www.url.com.stuff/here</cite>
                    <span class="st">http://www.url.com. Some info on url etc etc
                    </span>
                </div>
            </div>
        </div>
        <!--n-->
    </div>
    <div class="g">
    <div class="g">
    <div class="g">
</div>

Update

After Alecxe's solution I took another stab at getting it right but still was not getting anything printed. So I decided to take another look at the soup, and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text, or I somehow just got the sample completely wrong the first time (not sure how). Below is a new sample based on what I am seeing from a soup print, and below that my attempt to get to the element data I am after.

<li class="g">
<h3 class="r">
    <a href="/url?q=url">context</a>
</h3>
<div class="s">
    <div class="kv" style="margin-bottom:2px">
        <cite>www.url.com/index.html</cite> #Data I am looking to grab
        <div class="_nBb">‎
            <div style="display:inline"snipped">
                <span class="_O0"></span>
            </div>
            <div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
                <ul>
                    <li class="_Ykb">
                        <a class="_Zkb" href="/url?/search">Cached</a>
                    </li>
                </ul>
            </div>
        </div>
    </div>
    <span class="st">Details about URI </span> #Data I am looking to grab

Updated attempt

I have tried taking Alecxe's approach with no success so far; am I going down the right road with this?

soup = BeautifulSoup(response.text)

for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")

    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))

Answer

You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
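To see that failure mode in isolation, here is a minimal sketch with hypothetical markup: find() returns None when nothing matches, so chaining another call onto the result would raise AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical: a div with class="s" that has no <cite> inside it
soup = BeautifulSoup('<div class="s"><span>no cite here</span></div>',
                     "html.parser")

site = soup.find("div", {"class": "s"}).find("cite")
print(site)  # None - find() returns None when nothing matches
```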

Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g", themselves located inside the div element with class="srg" - the div.srg div.g cite CSS selector would find exactly what we are asking for:

for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")

    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))

Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.

For the provided sample data, it prints:

http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
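If some result blocks turn out to lack the span sibling, a defensive variant of the loop above (a sketch, not part of the original answer; the markup is a trimmed reconstruction of the test data plus one block without a span) can guard against None instead of raising AttributeError:

```python
from bs4 import BeautifulSoup

# Trimmed reconstruction of the test data, plus one block without a span
html = """
<div class="srg">
  <div class="g"><div class="s"><div>
    <cite>http://www.url.com.stuff/here</cite>
    <span class="st">http://www.url.com. Some info on url etc etc</span>
  </div></div></div>
  <div class="g"><div class="s"><div>
    <cite>no-snippet.example</cite>
  </div></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    # The sibling may be missing - fall back instead of crashing
    print(span.get_text(strip=True) if span else "(no snippet)")
```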


The updated code for the updated input data:

for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")

    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
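The switch from find_next_sibling() to find_next() matters here: in the updated markup the <span class="st"> is no longer a direct sibling of the <cite> (the cite sits inside div.kv), and find_next() walks the whole parse tree forward instead of only the siblings. A minimal sketch of the difference, using a trimmed reconstruction of the updated sample:

```python
from bs4 import BeautifulSoup

# Trimmed reconstruction of the updated sample: <cite> is nested inside
# div.kv, while <span class="st"> is a sibling of div.kv, not of <cite>
html = """
<li class="g">
  <div class="s">
    <div class="kv"><cite>www.url.com/index.html</cite></div>
    <span class="st">Details about URI</span>
  </div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
cite = soup.select_one("li.g div.s div.kv cite")

print(cite.find_next_sibling("span", class_="st"))  # None - not a sibling
print(cite.find_next("span", class_="st").get_text(strip=True))
```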


Also, make sure you are using BeautifulSoup version 4:

pip install --upgrade beautifulsoup4

And the import statement should be:

from bs4 import BeautifulSoup
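One quick way to confirm which version is actually installed (assuming the bs4 package imports cleanly) is to print its version string:

```python
import bs4

print(bs4.__version__)  # should start with "4."
```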
