Beautiful Soup can't find the part of the HTML I want


Question

I've been using BeautifulSoup for web scraping for a while, and this is the first time I've encountered a problem like this. I am trying to select the number 101,172 in the code below, but whether I use .find or .select, the output is always only the tag, not the number. I have done similar data collection before and never had any problems.

<div class="legend-block legend-block--pageviews">
      <h5>Pageviews</h5><hr>
      <div class="legend-block--body">
        <div class="linear-legend--counts">
          Pageviews:
          <span class="pull-right">
            101,172
          </span>
        </div>
        <div class="linear-legend--counts">
          Daily average:
          <span class="pull-right">
            4,818
          </span>
        </div></div></div>

I used:

import bs4
import requests

# wiki_page holds the URL of the page being scraped
res = requests.get(wiki_page, timeout=None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab = soup.select('span[class="pull-right"]')
print(ab)

Output:

[<span class="pull-right">\n<label class="logarithmic-scale">\n<input class="logarithmic-scale-option" type="checkbox"/>\n        Logarithmic scale      </label>\n</span>,
 <span class="pull-right">\n<label class="begin-at-zero">\n<input class="begin-at-zero-option" type="checkbox"/>\n        Begin at zero      </label>\n</span>,
 <span class="pull-right">\n<label class="show-labels">\n<input class="show-labels-option" type="checkbox"/>\n        Show values      </label>\n</span>]

Additionally, the number I am looking for is dynamic, so I am not sure whether JavaScript affects what BeautifulSoup sees.

Recommended answer

Try this:

from bs4 import BeautifulSoup as bs

html='''<div class="legend-block legend-block--pageviews">
      <h5>Pageviews</h5><hr>
      <div class="legend-block--body">
        <div class="linear-legend--counts">
          Pageviews:
          <span class="pull-right">101,172
          </span>
        </div>
        <div class="linear-legend--counts">
          Daily average:
          <span class="pull-right">
            4,818
          </span>
        </div></div></div>'''
soup = bs(html, 'html.parser')
div = soup.find("div", {"class": "linear-legend--counts"})
span = div.find('span')
text = span.get_text()
print(text)

Output:

101,172

As a one-liner:

soup = bs(html, 'html.parser')
result = soup.find("div", {"class": "linear-legend--counts"}).find('span').get_text()
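
A slightly more targeted variant, offered only as a sketch: it assumes the rendered HTML fragment from the question is already available as a string, scopes a CSS selector to the pageviews block, strips the whitespace around the number, and converts it to an integer. The names `pageviews_text` and `pageviews` are illustrative and not from the original answer.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (assumed to be already rendered).
html = '''<div class="legend-block legend-block--pageviews">
  <h5>Pageviews</h5><hr>
  <div class="legend-block--body">
    <div class="linear-legend--counts">
      Pageviews:
      <span class="pull-right">
        101,172
      </span>
    </div>
    <div class="linear-legend--counts">
      Daily average:
      <span class="pull-right">
        4,818
      </span>
    </div></div></div>'''

soup = BeautifulSoup(html, 'html.parser')
# select_one returns the first match, or None if nothing matches.
span = soup.select_one('.legend-block--pageviews .linear-legend--counts span.pull-right')
pageviews_text = span.get_text(strip=True)        # text without surrounding whitespace
pageviews = int(pageviews_text.replace(',', ''))  # drop the thousands separator
print(pageviews_text, pageviews)
```

Scoping the selector to `.legend-block--pageviews` avoids matching the other `span.pull-right` elements on the page, which is what the output in the question shows.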

The OP has since posted another question that is a possible duplicate of this one, and found an answer there. For anyone looking for an answer to a similar problem, the accepted answer to that question is reproduced here.

The JavaScript code is not executed when you retrieve the page with requests.get, so Selenium should be used instead. Selenium mimics user behaviour by opening the page in a real browser, so the JavaScript code is executed.
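
One way to confirm that this is what is happening, sketched under assumptions: run the selector against the HTML that requests.get returned and check whether the element exists at all. The `static_html` string below is a made-up stand-in for an un-rendered page (it is not the real response), `rendered_html` is a reduced version of the fragment from the question, and `has_pageviews` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

SELECTOR = '.legend-block--pageviews span.pull-right'

def has_pageviews(html):
    """Return True if the pageviews span is present in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select_one(SELECTOR) is not None

# Hypothetical stand-in for what requests.get might return before any script has run.
static_html = '<div class="summary-column--container"></div><script src="app.js"></script>'
# Reduced version of the fragment from the question, as seen after rendering in a browser.
rendered_html = ('<div class="legend-block legend-block--pageviews">'
                 '<span class="pull-right">101,172</span></div>')

print(has_pageviews(static_html))    # False
print(has_pageviews(rendered_html))  # True
```

If the check fails on `res.text` but the element is visible in the browser's inspector, the value is most likely filled in by JavaScript, and a browser-driven tool is needed.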

To get started with Selenium, install it with pip install selenium. Then use the code below to retrieve your item:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
# List of (page URL, CSS selector of the element to retrieve) pairs.
wiki_pages = [("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
               ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),]
for url, selector in wiki_pages:
    browser.get(url)  # pass the URL string, not the whole tuple
    # Selenium 4 removed find_element_by_css_selector; use find_element with By.CSS_SELECTOR.
    page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
    print(page_views_count.text)
browser.quit()

NOTE: If you need to run a headless browser, consider using PyVirtualDisplay (a wrapper for Xvfb) to run headless WebDriver tests; see 'How do I run Selenium in Xvfb?' for more information.
